NVFP4
Reference Implementation
- qwantize.nvfp4.reference.nvfp4_naive(W, dim=-1, return_dequant=False)
Naive NVFP4 quantization: s = FP8_E4M3(max|x_i| / 6) per block.
- Parameters:
W – Input tensor. W.shape[dim] must be 16 or 32 (the block size).
dim – Dimension along which to quantize (default: -1).
return_dequant – If True, also return the dequantized tensor.
- Returns:
(scales, quants) by default, or (scales, quants, dequant) if return_dequant is True.
scales: Per-block FP8 E4M3 scales. Shape is W.shape with dimension dim removed.
quants: Signed FP4 codebook values. Same shape as W.
dequant: quants * scales, with scales broadcast along dim. Same shape as W.
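The naive rule can be illustrated for a single block in plain Python. This is only a sketch and not the library's implementation: the helper names (round_to_fp8, quantize, naive_block) are hypothetical, the FP8 E4M3 variant is assumed to be the one with no infinities and a single NaN encoding, and rounding the scale to the nearest FP8 value is an assumption.

```python
import bisect

FP4_CODEBOOK = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_BOUNDARIES = [0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0]

# Positive finite FP8 E4M3 values (assumed: bias 7, exp=15/man=7 is NaN).
FP8_SCALES = sorted(
    (man / 8.0) * 2.0 ** -6 if exp == 0 else (1.0 + man / 8.0) * 2.0 ** (exp - 7)
    for exp in range(16)
    for man in range(8)
    if (exp, man) not in ((0, 0), (15, 7))
)


def round_to_fp8(v):
    """Nearest positive FP8 E4M3 value (assumed rounding mode)."""
    return min(FP8_SCALES, key=lambda s: abs(s - v))


def quantize(x, s):
    """Map each x_i / s to the nearest signed FP4 codebook value."""
    out = []
    for v in x:
        q = FP4_CODEBOOK[bisect.bisect_left(FP4_BOUNDARIES, abs(v) / s)]
        out.append(-q if v < 0 else q)
    return out


def naive_block(x):
    """s = FP8_E4M3(max|x_i| / 6), then quantize and dequantize the block."""
    amax = max(abs(v) for v in x)
    s = round_to_fp8(amax / 6.0)
    q = quantize(x, s)
    return s, q, [qi * s for qi in q]


# One block of 16 values; max|x_i| = 3.0, so the scale is exactly 0.5.
x = [3.0, -1.5, 0.75, 0.1] + [0.0] * 12
s, q, dequant = naive_block(x)
```

Here the largest element maps to the top codebook value 6, and the small value 0.1 falls below the first decision boundary and rounds to zero.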
- qwantize.nvfp4.reference.nvfp4_optimal(W, dim=-1, return_dequant=False)
Optimal NVFP4 quantization via bounded search over FP8 E4M3 scales.
Uses clipping and dead-zone bounds to reduce the search from 126 FP8 candidates to ~4-8, with a fast-fail clipping check per candidate. See Optimal Scale Search for the algorithm.
- Parameters:
W – Input tensor. W.shape[dim] must be 16 or 32 (the block size).
dim – Dimension along which to quantize (default: -1).
return_dequant – If True, also return the dequantized tensor.
- Returns:
(scales, quants) by default, or (scales, quants, dequant) if return_dequant is True.
scales: Per-block optimal FP8 E4M3 scales. Shape is W.shape with dimension dim removed.
quants: Signed FP4 codebook values. Same shape as W.
dequant: quants * scales, with scales broadcast along dim. Same shape as W.
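For intuition about what "optimal" means here, the sketch below does an exhaustive search over all 126 candidate scales for one block and keeps the one with the lowest SSE. It is a brute-force baseline in plain Python, not the bounded search described above: it skips the clipping and dead-zone pruning entirely, and every helper name in it is hypothetical. The FP8 E4M3 variant (no infinities, one NaN encoding) and nearest-value rounding for the naive scale are assumptions.

```python
import bisect

FP4_CODEBOOK = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_BOUNDARIES = [0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0]

# Positive finite FP8 E4M3 values (assumed: bias 7, exp=15/man=7 is NaN).
FP8_SCALES = sorted(
    (man / 8.0) * 2.0 ** -6 if exp == 0 else (1.0 + man / 8.0) * 2.0 ** (exp - 7)
    for exp in range(16)
    for man in range(8)
    if (exp, man) not in ((0, 0), (15, 7))
)


def block_sse(x, s):
    """Sum of squared errors after quantizing block x with scale s."""
    total = 0.0
    for v in x:
        q = FP4_CODEBOOK[bisect.bisect_left(FP4_BOUNDARIES, abs(v) / s)]
        total += (abs(v) - q * s) ** 2  # sign is preserved, so compare magnitudes
    return total


def optimal_scale_bruteforce(x):
    """Try every FP8 candidate and keep the one with the smallest SSE."""
    return min(FP8_SCALES, key=lambda s: block_sse(x, s))


x = [0.93, -0.41, 0.27, 0.12, -0.88, 0.05, 0.61, -0.33,
     0.19, 0.74, -0.02, 0.48, -0.56, 0.31, 0.09, -0.67]

amax = max(abs(v) for v in x)
s_naive = min(FP8_SCALES, key=lambda s: abs(s - amax / 6.0))
s_opt = optimal_scale_bruteforce(x)

sse_naive = block_sse(x, s_naive)
sse_opt = block_sse(x, s_opt)
```

Because the naive scale is itself one of the candidates, the exhaustive result can never have a higher SSE than the naive one; a bounded search is expected to reach the same minimum while evaluating far fewer candidates.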
- qwantize.nvfp4.reference.nvfp4_optimal_hessian(W, dim=-1, return_dequant=False, X=None, H_blocks=None)
Hessian-aware optimal NVFP4 scale search.
Like nvfp4_optimal(), searches over FP8 E4M3 scale candidates using SSE bounds for pruning, but selects the scale minimizing the Hessian-weighted error (x - sq)^T H (x - sq) instead of raw SSE. This directly minimizes each block's contribution to the output error ||W_q X - WX||_F^2. See Hessian-Aware Optimal Scale Search for the math.
- Parameters:
W – Input tensor. W.shape[dim] must be 16 or 32 (the block size).
dim – Dimension along which to quantize (default: -1).
return_dequant – If True, also return the dequantized tensor.
X – Activation tensor of shape (T, K). H is computed as X_j^T @ X_j.
H_blocks – Pre-computed block Hessians of shape (num_col_blocks, bs, bs). If provided, X is ignored.
- Returns:
(scales, quants) by default, or (scales, quants, dequant) if return_dequant is True.
scales: Per-block optimal FP8 E4M3 scales. Shape is W.shape with dimension dim removed.
quants: Signed FP4 codebook values. Same shape as W.
dequant: quants * scales, with scales broadcast along dim. Same shape as W.
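The Hessian-weighted objective can be checked numerically on a toy example. The sketch below is plain Python with hypothetical helper names and a block size of 4 chosen only for readability (the functions documented here require 16 or 32). X stands in for one activation slice X_j, and e stands in for the quantization error x - sq of one block. It illustrates that e^T H e with H = X^T X equals the squared norm of X e, which is why minimizing it targets the output error rather than the raw weight error.

```python
import math


def gram(X):
    """H = X^T X for X given as a list of T rows, each of length bs."""
    bs = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(bs)]
            for i in range(bs)]


def hessian_weighted_error(e, H):
    """Quadratic form e^T H e."""
    n = len(e)
    return sum(e[i] * H[i][j] * e[j] for i in range(n) for j in range(n))


# Toy activations: T = 3 tokens, block size 4.
X = [[1.0, 0.0, 2.0, -1.0],
     [0.5, 1.0, 0.0, 3.0],
     [-2.0, 1.0, 1.0, 0.0]]

# Toy quantization error for one block (x - s*q).
e = [0.1, -0.2, 0.05, 0.0]

H = gram(X)
weighted = hessian_weighted_error(e, H)

# Same quantity computed directly as ||X e||^2.
direct = sum(sum(r * ei for r, ei in zip(row, e)) ** 2 for row in X)

# Raw SSE ignores the activations entirely.
raw_sse = sum(ei ** 2 for ei in e)
```

The two scale-selection criteria can disagree: a candidate with a slightly larger raw SSE may still give a smaller weighted error if its error falls in directions where the activations are small.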
- qwantize.nvfp4.reference.nvfp4_dequantize(scales, quants, dim=-1)
Dequantize NVFP4: dequant = quants * scales.
- Parameters:
scales – Per-block scales. Shape is the original W.shape with dimension dim removed.
quants – Signed FP4 codebook values. Same shape as the original W.
dim – Block dimension in quants (default: -1).
- Returns:
Dequantized tensor with the same shape as quants.
- qwantize.nvfp4.reference.build_fp8_e4m3_scales(device='cpu')
Return sorted tensor of all 126 positive FP8 E4M3 representable values.
- Parameters:
device – Torch device for the output tensor.
- Returns:
Tensor of shape (126,) with sorted positive FP8 E4M3 values as float32.
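The count of 126 can be reproduced by enumerating the bit patterns directly. The sketch below is plain Python rather than the library's tensor-returning function, and it assumes the E4M3 variant with exponent bias 7, no infinities, and a single NaN encoding (all exponent and mantissa bits set); the function name is hypothetical.

```python
def fp8_e4m3_positive_values():
    """Enumerate the positive finite FP8 E4M3 values in ascending order."""
    values = []
    for exp in range(16):        # 4 exponent bits
        for man in range(8):     # 3 mantissa bits
            if exp == 0 and man == 0:
                continue         # zero is not a usable scale
            if exp == 15 and man == 7:
                continue         # NaN encoding in this variant
            if exp == 0:
                # Subnormal: no implicit leading 1, fixed exponent of -6.
                values.append((man / 8.0) * 2.0 ** -6)
            else:
                # Normal: implicit leading 1, exponent bias 7.
                values.append((1.0 + man / 8.0) * 2.0 ** (exp - 7))
    return sorted(values)


scales = fp8_e4m3_positive_values()
```

Under these assumptions there are 7 subnormals and 119 normals, ranging from 2^-9 up to 448.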
- qwantize.nvfp4.reference.fp4_quantize(x, s)
Quantize to FP4 E2M1 codebook values given a per-block scale.
Maps each element to the nearest value in {0, 0.5, 1, 1.5, 2, 3, 4, 6} (with sign preserved).
- Parameters:
x – Input tensor of shape (..., block_size).
s – Per-block scale of shape (..., 1), broadcastable to x.
- Returns:
Signed codebook values with the same shape as x.
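The nearest-value mapping can be sketched with a binary search over the decision boundaries. This is a plain-Python illustration for a single block, not the library's tensor code; the function name is hypothetical, and sending exact ties to the lower-magnitude value is an assumption, since the tie-breaking rule is not stated here.

```python
import bisect

FP4_CODEBOOK = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_BOUNDARIES = [0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0]


def fp4_quantize_block(x, s):
    """Map each x_i / s to the nearest signed codebook value."""
    out = []
    for v in x:
        mag = abs(v) / s
        # Count how many boundaries lie strictly below mag; that count
        # indexes the codebook. Ties go to the lower value (assumption).
        idx = bisect.bisect_left(FP4_BOUNDARIES, mag)
        q = FP4_CODEBOOK[idx]
        out.append(-q if v < 0 else q)
    return out


quants = fp4_quantize_block([0.1, -0.7, 2.6, 9.0], 1.0)
# quants == [0.0, -0.5, 3.0, 6.0]
```

Magnitudes above the last boundary all map to 6, so values larger than 6 * s are clipped.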
- qwantize.nvfp4.reference.fp4_dequantize(quants, s)
Dequantize FP4 codebook values back to float: dequant = quants * s.
- Parameters:
quants – Signed codebook values of shape (..., block_size).
s – Per-block scale of shape (..., 1), broadcastable to quants.
- Returns:
Dequantized tensor with the same shape as quants.
- qwantize.nvfp4.reference.compute_block_sse(x, s)
Compute per-block sum of squared quantization error.
- Parameters:
x – Block values of shape (num_blocks, block_size).
s – Per-block scales of shape (num_blocks,) or (num_blocks, 1).
- Returns:
Tensor of shape (num_blocks,) with the SSE for each block.
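The shape contract (one SSE value per block) can be shown with nested lists standing in for tensors. This is a hedged sketch with a hypothetical function name and a toy block size of 4; it quantizes each value to the nearest codebook entry, dequantizes, and accumulates the squared error per block.

```python
import bisect

FP4_CODEBOOK = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_BOUNDARIES = [0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0]


def block_sse(x_blocks, scales):
    """Return one sum of squared quantization errors per block."""
    result = []
    for block, s in zip(x_blocks, scales):
        total = 0.0
        for v in block:
            q = FP4_CODEBOOK[bisect.bisect_left(FP4_BOUNDARIES, abs(v) / s)]
            # Sign is preserved by quantization, so magnitudes suffice.
            total += (abs(v) - q * s) ** 2
        result.append(total)
    return result


x_blocks = [[1.0, 2.0, 0.5, 6.0],   # every value is on the codebook
            [1.0, 2.0, 0.5, 8.0]]   # 8.0 clips to 6.0
sse = block_sse(x_blocks, [1.0, 1.0])
# sse == [0.0, 4.0]
```

The second block shows the clipping cost: the single out-of-range value contributes (8 - 6)^2 = 4 to its block's SSE.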
Triton Kernels
- qwantize.nvfp4.kernels.nvfp4_naive_triton(W, dim=-1, return_dequant=False)
Naive NVFP4 quantization using Triton kernel with inline PTX ASM.
GPU-accelerated version of nvfp4_naive().
- Parameters:
W – Input tensor. W.shape[dim] must be 16 or 32 (the block size).
dim – Dimension along which to quantize (default: -1).
return_dequant – If True, also return the dequantized tensor.
- Returns:
(scales, quants) by default, or (scales, quants, dequant) if return_dequant is True. See nvfp4_naive() for shape details.
- qwantize.nvfp4.kernels.nvfp4_optimal_triton(W, dim=-1, return_dequant=False)
Optimal NVFP4 quantization using Triton kernel with inline PTX ASM.
GPU-accelerated version of nvfp4_optimal().
- Parameters:
W – Input tensor. W.shape[dim] must be 16 or 32 (the block size).
dim – Dimension along which to quantize (default: -1).
return_dequant – If True, also return the dequantized tensor.
- Returns:
(scales, quants) by default, or (scales, quants, dequant) if return_dequant is True. See nvfp4_optimal() for shape details.
- qwantize.nvfp4.kernels.nvfp4_optimal_hessian_triton(W, dim=-1, return_dequant=False, X=None)
Hessian-aware optimal NVFP4 quantization using Triton kernel.
GPU-accelerated version of nvfp4_optimal_hessian().
- Parameters:
W – Input tensor. W.shape[dim] must be 16 or 32 (the block size).
dim – Dimension along which to quantize (default: -1).
return_dequant – If True, also return the dequantized tensor.
X – Activation tensor of shape (T, K). Required.
- Returns:
(scales, quants) by default, or (scales, quants, dequant) if return_dequant is True. See nvfp4_optimal_hessian() for shape details.
Constants
- qwantize.nvfp4.Q_MAX = 6.0 – Maximum FP4 E2M1 codebook value
- qwantize.nvfp4.D_0 = 0.25 – Decision boundary for rounding to zero
- qwantize.nvfp4.FP4_CODEBOOK = [0, 0.5, 1, 1.5, 2, 3, 4, 6] – FP4 E2M1 codebook
- qwantize.nvfp4.FP4_BOUNDARIES = [0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0] – Decision boundaries
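The constants are related to one another: each decision boundary is the midpoint between two adjacent codebook values, D_0 is the first boundary, and Q_MAX is the last codebook entry. The short check below uses local copies of the values listed above rather than importing them from the package.

```python
# Local copies of the documented values (not imported from the package).
Q_MAX = 6.0
D_0 = 0.25
FP4_CODEBOOK = [0, 0.5, 1, 1.5, 2, 3, 4, 6]
FP4_BOUNDARIES = [0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0]

# Midpoint between each pair of neighbouring codebook values.
midpoints = [(lo + hi) / 2 for lo, hi in zip(FP4_CODEBOOK, FP4_CODEBOOK[1:])]

boundaries_are_midpoints = midpoints == FP4_BOUNDARIES
d0_is_first_boundary = D_0 == FP4_BOUNDARIES[0]
qmax_is_last_code = Q_MAX == FP4_CODEBOOK[-1]
```

All three checks hold for the listed values, which is what makes a boundary lookup equivalent to nearest-value rounding on this codebook.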