NVFP4

Reference Implementation

qwantize.nvfp4.reference.nvfp4_naive(W, dim=-1, return_dequant=False)

Naive NVFP4 quantization: s = FP8_E4M3(max|x_i| / 6) per block.

Parameters:
  • W – Input tensor. W.shape[dim] must be 16 or 32 (the block size).

  • dim – Dimension along which to quantize (default: -1).

  • return_dequant – If True, also return the dequantized tensor.

Returns:

(scales, quants) by default, or (scales, quants, dequant) if return_dequant is True.

  • scales: Per-block FP8 E4M3 scales. Shape is W.shape with dimension dim removed.

  • quants: Signed FP4 codebook values. Same shape as W.

  • dequant: quants * scales broadcast. Same shape as W.
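
The scale formula above can be sketched for a single block in pure Python. This is an illustration under stated assumptions, not the library's code: the helper names round_to_fp8_e4m3 and naive_scale are hypothetical, and round-to-nearest-even with saturation at 448 is assumed for the FP8 E4M3 cast. How the library treats an all-zero block (scale 0 here) is not specified in this reference.

```python
import math

def round_to_fp8_e4m3(v):
    """Round a non-negative float to the nearest FP8 E4M3 value.

    Assumes round-to-nearest-even and saturation at 448 (largest finite value).
    """
    if v <= 0.0:
        return 0.0
    _, ex = math.frexp(v)        # v = m * 2**ex with 0.5 <= m < 1
    e = max(ex - 1, -6)          # unbiased exponent, clamped to the subnormal range
    step = 2.0 ** (e - 3)        # spacing between neighbours with 3 mantissa bits
    return min(round(v / step) * step, 448.0)

def naive_scale(block, q_max=6.0):
    """s = FP8_E4M3(max|x_i| / q_max) for one block."""
    return round_to_fp8_e4m3(max(abs(v) for v in block) / q_max)

block = [0.1, -0.7, 2.4, -3.0] * 4   # one block of 16 values
s = naive_scale(block)               # 3.0 / 6 = 0.5, exactly representable
```

Dividing by 6 maps the largest magnitude in the block onto the largest codebook value, so no element clips before the scale itself is rounded.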

qwantize.nvfp4.reference.nvfp4_optimal(W, dim=-1, return_dequant=False)

Optimal NVFP4 quantization via bounded search over FP8 E4M3 scales.

Clipping and dead-zone bounds narrow the search from all 126 positive FP8 E4M3 candidates to roughly 4 to 8 per block, with a fast-fail clipping check applied to each candidate. See Optimal Scale Search for the algorithm.

Parameters:
  • W – Input tensor. W.shape[dim] must be 16 or 32 (the block size).

  • dim – Dimension along which to quantize (default: -1).

  • return_dequant – If True, also return the dequantized tensor.

Returns:

(scales, quants) by default, or (scales, quants, dequant) if return_dequant is True.

  • scales: Per-block optimal FP8 E4M3 scales. Shape is W.shape with dimension dim removed.

  • quants: Signed FP4 codebook values. Same shape as W.

  • dequant: quants * scales broadcast. Same shape as W.
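
As a point of comparison, the exhaustive baseline that the bounded search is meant to shortcut can be sketched as follows. This is deliberately the brute-force version over all 126 candidates, not the pruned algorithm the function implements; the names FP8_SCALES, quantize, block_sse, and brute_force_scale are hypothetical, and ties between codebook values are resolved toward the smaller magnitude as an assumption.

```python
CODEBOOK = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

# All 126 positive finite FP8 E4M3 values: 7 subnormals, then the normals
# (the exponent=15, mantissa=7 pattern is NaN and is skipped).
FP8_SCALES = [m / 8 * 2.0 ** -6 for m in range(1, 8)] + [
    (1 + m / 8) * 2.0 ** (e - 7)
    for e in range(1, 16) for m in range(8) if (e, m) != (15, 7)
]

def quantize(v, s):
    """Nearest signed codebook value to v / s (values beyond 6 clip to 6)."""
    q = min(CODEBOOK, key=lambda c: abs(abs(v) / s - c))
    return -q if v < 0 else q

def block_sse(block, s):
    """Sum of squared errors between the block and its dequantized form."""
    return sum((v - s * quantize(v, s)) ** 2 for v in block)

def brute_force_scale(block):
    """Try every FP8 candidate and keep the one with the lowest SSE."""
    return min(FP8_SCALES, key=lambda s: block_sse(block, s))

block = [0.9, -1.1, 0.05, 2.9, -0.4, 1.6, 0.0, -2.2] * 2   # one block of 16
s_best = brute_force_scale(block)
```

A bounded search should return a scale with the same SSE as this baseline while evaluating far fewer candidates.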

qwantize.nvfp4.reference.nvfp4_optimal_hessian(W, dim=-1, return_dequant=False, X=None, H_blocks=None)

Hessian-aware optimal NVFP4 scale search.

Like nvfp4_optimal(), this searches over FP8 E4M3 scale candidates and prunes them with SSE bounds, but it selects the scale that minimizes the Hessian-weighted error (x - sq)^T H (x - sq) instead of the raw SSE, where x is the block, s the candidate scale, and q the quantized codebook values. This directly minimizes each block’s contribution to the output error ||W_q X - WX||_F^2.

See Hessian-Aware Optimal Scale Search for the math.

Parameters:
  • W – Input tensor. W.shape[dim] must be 16 or 32 (the block size).

  • dim – Dimension along which to quantize (default: -1).

  • return_dequant – If True, also return the dequantized tensor.

  • X – Activation tensor of shape (T, K). The Hessian of block j is computed as X_j^T @ X_j, where X_j holds the columns of X that belong to block j.

  • H_blocks – Pre-computed block Hessians of shape (num_col_blocks, bs, bs). If provided, X is ignored.

Returns:

(scales, quants) by default, or (scales, quants, dequant) if return_dequant is True.

  • scales: Per-block optimal FP8 E4M3 scales. Shape is W.shape with dimension dim removed.

  • quants: Signed FP4 codebook values. Same shape as W.

  • dequant: quants * scales broadcast. Same shape as W.
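
The Hessian-weighted error for one block can be sketched in pure Python. The function names are hypothetical and the block is shortened to 3 columns for readability (the documented block size is 16 or 32); the sketch only evaluates the objective for one given dequantized block and does not perform the scale search.

```python
def block_hessian(X_j):
    """H = X_j^T X_j, with X_j given as T rows of block_size activations."""
    n = len(X_j[0])
    return [[sum(row[a] * row[b] for row in X_j) for b in range(n)]
            for a in range(n)]

def hessian_error(x, dequant, H):
    """(x - dequant)^T H (x - dequant) for one block."""
    e = [xi - di for xi, di in zip(x, dequant)]
    n = len(e)
    return sum(e[a] * H[a][b] * e[b] for a in range(n) for b in range(n))

def output_error(x, dequant, X_j):
    """||X_j e||^2: the same quantity computed straight from the activations."""
    e = [xi - di for xi, di in zip(x, dequant)]
    return sum(sum(r * ei for r, ei in zip(row, e)) ** 2 for row in X_j)

X_j = [[1.0, 0.0, 2.0],
       [0.0, 3.0, 1.0]]            # T = 2 tokens, toy block of 3 columns
x = [0.9, -1.1, 2.9]               # original block values
dequant = [1.0, -1.0, 3.0]         # s * q for some candidate scale

H = block_hessian(X_j)
err = hessian_error(x, dequant, H)
direct = output_error(x, dequant, X_j)
```

The two results agree up to floating-point rounding, which is why weighting the error by H targets the layer's output error rather than the weight error alone.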

qwantize.nvfp4.reference.nvfp4_dequantize(scales, quants, dim=-1)

Dequantize NVFP4: dequant = quants * scales.

Parameters:
  • scales – Per-block scales. Shape is the original W.shape with dimension dim removed.

  • quants – Signed FP4 codebook values. Same shape as the original W.

  • dim – Block dimension in quants (default: -1).

Returns:

Dequantized tensor with the same shape as quants.

qwantize.nvfp4.reference.build_fp8_e4m3_scales(device='cpu')

Return a sorted tensor of all 126 positive representable FP8 E4M3 values.

Parameters:
  • device – Torch device for the output tensor.

Returns:

Tensor of shape (126,) with sorted positive FP8 E4M3 values as float32.
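
The count of 126 follows from the bit layout, which the following pure-Python sketch enumerates. It assumes the E4M3 variant with no infinities and a single NaN pattern per sign (bias 7), which is consistent with the 126 figure; decode_e4m3 is a hypothetical helper, and the result is a plain list rather than a tensor.

```python
def decode_e4m3(byte):
    """Decode a positive FP8 E4M3 bit pattern (sign bit clear) to a float."""
    exp, man = (byte >> 3) & 0xF, byte & 0x7
    if exp == 0:
        return man / 8 * 2.0 ** -6           # subnormal: no implicit leading 1
    return (1 + man / 8) * 2.0 ** (exp - 7)  # normal: implicit leading 1, bias 7

# Patterns 0x01..0x7E: 0x00 encodes zero and 0x7F encodes NaN,
# which leaves 128 - 2 = 126 positive finite values.
scales = [decode_e4m3(b) for b in range(0x01, 0x7F)]
```

Because positive bit patterns increase monotonically with the value they encode, the list comes out already sorted, from 2^-9 up to 448.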

qwantize.nvfp4.reference.fp4_quantize(x, s)

Quantize to FP4 E2M1 codebook values given a per-block scale.

Maps each scaled element x / s to the nearest value in {0, 0.5, 1, 1.5, 2, 3, 4, 6} by magnitude, preserving the sign.

Parameters:
  • x – Input tensor of shape (..., block_size).

  • s – Per-block scale of shape (..., 1), broadcastable to x.

Returns:

Signed codebook values with the same shape as x.
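
Nearest-codebook rounding can be sketched with a binary search over the decision boundaries. The codebook and boundary values are the ones listed under Constants; fp4_quantize_block is a hypothetical single-block helper on Python lists, and sending an exact tie to the smaller magnitude is an assumption, since this reference does not state the tie rule.

```python
import math
from bisect import bisect_left

FP4_CODEBOOK = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_BOUNDARIES = [0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0]

def fp4_quantize_block(x, s):
    """Map each x_i / s to the nearest codebook magnitude and keep the sign."""
    out = []
    for v in x:
        # Number of boundaries strictly below |v| / s is the codebook index.
        # Magnitudes above 5.0 land on index 7, so they clip to 6.
        idx = bisect_left(FP4_BOUNDARIES, abs(v) / s)
        out.append(math.copysign(FP4_CODEBOOK[idx], v))
    return out

quants = fp4_quantize_block([0.1, -0.7, 2.4, -3.0], 0.5)
dequant = [q * 0.5 for q in quants]   # dequant = quants * s
```

With s = 0.5 the scaled magnitudes are 0.2, 1.4, 4.8, and 6.0, so the example yields the codebook values 0, -1.5, 4, and -6.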

qwantize.nvfp4.reference.fp4_dequantize(quants, s)

Dequantize FP4 codebook values back to float: dequant = quants * s.

Parameters:
  • quants – Signed codebook values of shape (..., block_size).

  • s – Per-block scale of shape (..., 1), broadcastable to quants.

Returns:

Dequantized tensor with the same shape as quants.

qwantize.nvfp4.reference.compute_block_sse(x, s)

Compute per-block sum of squared quantization error.

Parameters:
  • x – Block values of shape (num_blocks, block_size).

  • s – Per-block scales of shape (num_blocks,) or (num_blocks, 1).

Returns:

Tensor of shape (num_blocks,) with the SSE for each block.

Triton Kernels

qwantize.nvfp4.kernels.nvfp4_naive_triton(W, dim=-1, return_dequant=False)

Naive NVFP4 quantization using a Triton kernel with inline PTX assembly.

GPU-accelerated version of nvfp4_naive().

Parameters:
  • W – Input tensor. W.shape[dim] must be 16 or 32 (the block size).

  • dim – Dimension along which to quantize (default: -1).

  • return_dequant – If True, also return the dequantized tensor.

Returns:

(scales, quants) by default, or (scales, quants, dequant) if return_dequant is True. See nvfp4_naive() for shape details.

qwantize.nvfp4.kernels.nvfp4_optimal_triton(W, dim=-1, return_dequant=False)

Optimal NVFP4 quantization using a Triton kernel with inline PTX assembly.

GPU-accelerated version of nvfp4_optimal().

Parameters:
  • W – Input tensor. W.shape[dim] must be 16 or 32 (the block size).

  • dim – Dimension along which to quantize (default: -1).

  • return_dequant – If True, also return the dequantized tensor.

Returns:

(scales, quants) by default, or (scales, quants, dequant) if return_dequant is True. See nvfp4_optimal() for shape details.

qwantize.nvfp4.kernels.nvfp4_optimal_hessian_triton(W, dim=-1, return_dequant=False, X=None)

Hessian-aware optimal NVFP4 quantization using a Triton kernel.

GPU-accelerated version of nvfp4_optimal_hessian().

Parameters:
  • W – Input tensor. W.shape[dim] must be 16 or 32 (the block size).

  • dim – Dimension along which to quantize (default: -1).

  • return_dequant – If True, also return the dequantized tensor.

  • X – Activation tensor of shape (T, K). Required.

Returns:

(scales, quants) by default, or (scales, quants, dequant) if return_dequant is True. See nvfp4_optimal_hessian() for shape details.

Constants

  • qwantize.nvfp4.Q_MAX = 6.0 – Maximum FP4 E2M1 codebook value

  • qwantize.nvfp4.D_0 = 0.25 – Decision boundary for rounding to zero

  • qwantize.nvfp4.FP4_CODEBOOK = [0, 0.5, 1, 1.5, 2, 3, 4, 6] – FP4 E2M1 codebook

  • qwantize.nvfp4.FP4_BOUNDARIES = [0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0] – Decision boundaries
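
The constants are related to one another, which the short check below makes explicit: each decision boundary is the midpoint between two adjacent codebook values, D_0 is the first boundary, and Q_MAX is the last codebook entry. The values are copied from the list above rather than imported, so the snippet runs without the package.

```python
Q_MAX = 6.0
D_0 = 0.25
FP4_CODEBOOK = [0, 0.5, 1, 1.5, 2, 3, 4, 6]
FP4_BOUNDARIES = [0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0]

# Midpoint between each pair of neighbouring codebook values.
midpoints = [(lo + hi) / 2 for lo, hi in zip(FP4_CODEBOOK, FP4_CODEBOOK[1:])]

boundaries_are_midpoints = midpoints == FP4_BOUNDARIES
d0_is_first_boundary = D_0 == FP4_BOUNDARIES[0]
q_max_is_last_code = Q_MAX == FP4_CODEBOOK[-1]
```

All three flags evaluate to True for the values listed here.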