Qwantize
Optimal quantization methods for block-scaled formats.
Installation
pip install qwantize
Requires PyTorch (>=2.0) and Triton (>=3.0).
Repository
GitHub: github.com/ayghri/qwantize
Formats
INT8 – Symmetric INT8 with FP8 E4M3 scales (block sizes 32, 64, 128, 256)
NVFP4 – FP4 E2M1 with FP8 E4M3 scales (block sizes 16, 32)
MXFP4 – FP4 E2M1 with UE8M0 (power-of-2) scales (block sizes 16, 32)
Quick Start
from qwantize import nvfp4_naive, nvfp4_optimal, nvfp4_dequantize, compute_metrics
# W has shape (..., block_size) where block_size is 16 or 32
# dim specifies which dimension is the block dimension (default: -1)
W_blocked = W.reshape(M, K // 32, 32)
# Quantize: returns (scales, quants)
scales, quants = nvfp4_optimal(W_blocked, dim=-1)
# Dequantize separately
W_dq = nvfp4_dequantize(scales, quants, dim=-1)
# Or get dequantized output directly
scales, quants, W_dq = nvfp4_optimal(W_blocked, dim=-1, return_dequant=True)
metrics = compute_metrics(W, W_dq.reshape(M, K), X)
Methods
Results