Results
Reproduce with:
python bench/full_bench.py
Note
Benchmarked on the down_proj weight of the first decoder layer from Qwen3-4B (W: 2560x9728, bfloat16), with activations collected from WikiText-2 (max_seq_len=512, num_samples=2048, X: 244449x9728, bfloat16).
Weight error: \(\lVert Q(W) - W \rVert_F / \lVert W \rVert_F\)
Output error: \(\lVert X W_q^T - X W^T \rVert_F / \lVert X W^T \rVert_F\)
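The two metrics above can be sketched in a few lines. This is a minimal plain-Python illustration on toy matrices, not the benchmark code; the helper names (`frobenius`, `matmul_t`, `weight_error`, `output_error`) are hypothetical, and a real run would use torch tensors.

```python
import math

def frobenius(mat):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(v * v for row in mat for v in row))

def matmul_t(x, w):
    """X @ W^T for list-of-rows matrices (X: n x k, W: m x k)."""
    return [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]

def sub(a, b):
    """Elementwise difference of two equally shaped matrices."""
    return [[p - q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def weight_error(w_q, w):
    """||Q(W) - W||_F / ||W||_F"""
    return frobenius(sub(w_q, w)) / frobenius(w)

def output_error(x, w_q, w):
    """||X W_q^T - X W^T||_F / ||X W^T||_F"""
    ref = matmul_t(x, w)
    return frobenius(sub(matmul_t(x, w_q), ref)) / frobenius(ref)

# Toy example: only one weight (second input column) is perturbed.
w = [[1.0, -2.0], [0.5, 4.0]]
w_q = [[1.0, -2.0], [0.5, 3.0]]
x_id = [[1.0, 0.0], [0.0, 1.0]]   # activations that excite every input column
x_col0 = [[1.0, 0.0]]             # activations that only excite input column 0

print(weight_error(w_q, w))
print(output_error(x_id, w_q, w))
print(output_error(x_col0, w_q, w))
```

With identity activations the two metrics coincide; with activations that never touch the perturbed column the output error is zero even though the weight error is not, which is why the tables report both.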
INT8 (FP8 E4M3 scales)
Symmetric INT8 quantization ([-127, 127]) with per-block amax stored in FP8 E4M3.
The effective scale is amax_fp8 / 127, keeping the stored value within FP8 range
while the division by 127 is performed in float32.
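As a rough illustration of the scheme just described, the sketch below enumerates the positive FP8 E4M3 values, snaps a block's amax to the nearest one, and quantizes with the effective scale amax_fp8 / 127. Round-to-nearest snapping and clamping to [-127, 127] are assumptions made for this example, and the function names are hypothetical; Python floats are doubles, whereas the text above specifies float32 for the division.

```python
def e4m3_positive_values():
    """Positive finite FP8 E4M3 (FN variant) values: 4 exponent bits, 3 mantissa bits, bias 7."""
    vals = []
    for e in range(16):
        for m in range(8):
            if e == 15 and m == 7:
                continue  # this code is NaN in E4M3FN
            v = (m / 8) * 2.0 ** -6 if e == 0 else (1 + m / 8) * 2.0 ** (e - 7)
            if v > 0:
                vals.append(v)
    return sorted(vals)

GRID = e4m3_positive_values()

def snap_nearest(x, grid=GRID):
    """Assumed naive snap: nearest representable FP8 value."""
    return min(grid, key=lambda g: abs(g - x))

def quantize_block_int8(block):
    amax = max(abs(v) for v in block)
    amax_fp8 = snap_nearest(amax)      # stored per-block value
    scale = amax_fp8 / 127.0           # effective scale, computed outside FP8
    q = [max(-127, min(127, round(v / scale))) for v in block]
    return q, amax_fp8, scale

block = [0.9, -0.31, 0.05, 0.47]
q, amax_fp8, scale = quantize_block_int8(block)
deq = [qi * scale for qi in q]
print(len(GRID), GRID[-1], amax_fp8, q, deq)
```

In this toy block the amax of 0.9 snaps down to 0.875, so the largest element clips at 127; that kind of snapping loss is what the optimal searches below try to avoid.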
| Implementation | Block Size | Weight Error | Output Error | Time |
|---|---|---|---|---|
| Naive (torch) | 32 | 1.01% | 0.79% | 1.7 ms |
| SSE-Optimal (torch) | 32 | 0.57% | 0.40% | 236 ms |
| H-Optimal (torch) | 32 | 0.60% | 0.37% | 1.2 s |
| Naive (torch) | 64 | 0.93% | 0.72% | 1.7 ms |
| SSE-Optimal (torch) | 64 | 0.64% | 0.45% | 204 ms |
| H-Optimal (torch) | 64 | 0.66% | 0.42% | 1.5 s |
| Naive (torch) | 128 | 0.88% | 0.68% | 1.6 ms |
| SSE-Optimal (torch) | 128 | 0.71% | 0.49% | 173 ms |
| H-Optimal (torch) | 128 | 0.73% | 0.48% | 2.8 s |
| Naive (torch) | 256 | 0.87% | 0.66% | 1.6 ms |
| SSE-Optimal (torch) | 256 | 0.77% | 0.54% | 165 ms |
| H-Optimal (torch) | 256 | 0.79% | 0.52% | 4.9 s |
SSE-Optimal vs Naive (output error reduction):
Block size 32: -49.9% (0.79% \(\to\) 0.40%)
Block size 64: -38.1% (0.72% \(\to\) 0.45%)
Block size 128: -27.6% (0.68% \(\to\) 0.49%)
Block size 256: -18.6% (0.66% \(\to\) 0.54%)
H-Optimal vs SSE-Optimal (further output error reduction):
Block size 32: +7.0% further reduction (0.40% \(\to\) 0.37%)
Block size 64: +4.8% further reduction (0.45% \(\to\) 0.42%)
Block size 128: +3.3% further reduction (0.49% \(\to\) 0.48%)
Block size 256: +2.4% further reduction (0.54% \(\to\) 0.52%)
H-Optimal vs Naive (total output error reduction):
Block size 32: -53.4% (0.79% \(\to\) 0.37%)
Block size 64: -41.1% (0.72% \(\to\) 0.42%)
Block size 128: -29.9% (0.68% \(\to\) 0.48%)
Block size 256: -20.6% (0.66% \(\to\) 0.52%)
The massive naive-to-optimal improvement (up to 50%) is driven by the FP8 E4M3
scale grid: with only 126 discrete scale values, the naive amax snap often
lands on a scale that is significantly suboptimal, and the bounded search finds
a much better candidate. This is analogous to NVFP4’s scale search, but the effect
is even stronger because INT8’s 127 quantization levels amplify scale misalignment
(a scale error of \(\delta\) causes \(127\delta\) in the worst case, vs \(6\delta\) for FP4).
H-Optimal provides a further 2–7% reduction over SSE-Optimal by prioritizing output-sensitive weights.
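The gap between the naive snap and a searched scale can be reproduced on a toy block. The sketch below is an assumption-laden illustration: the search window of 8 grid neighbors on each side and the names `sse_for`, `naive_amax`, and `sse_optimal_amax` are hypothetical, and the actual bounded search in the repository may use different bounds.

```python
def e4m3_grid():
    """Positive finite FP8 E4M3 (FN) values, bias 7, with the NaN code skipped."""
    vals = []
    for e in range(16):
        for m in range(8):
            if e == 15 and m == 7:
                continue
            v = (m / 8) * 2.0 ** -6 if e == 0 else (1 + m / 8) * 2.0 ** (e - 7)
            if v > 0:
                vals.append(v)
    return sorted(vals)

def sse_for(block, amax_fp8):
    """Sum of squared errors when the block is quantized with scale amax_fp8 / 127."""
    s = amax_fp8 / 127.0
    return sum((v - max(-127, min(127, round(v / s))) * s) ** 2 for v in block)

def naive_amax(block, grid):
    """Assumed naive rule: snap the true amax to the nearest grid value."""
    amax = max(abs(v) for v in block)
    return min(grid, key=lambda g: abs(g - amax))

def sse_optimal_amax(block, grid, window=8):
    """Search a bounded window of grid neighbors around the naive snap (window is hypothetical)."""
    i = grid.index(naive_amax(block, grid))
    candidates = grid[max(0, i - window): i + window + 1]
    return min(candidates, key=lambda g: sse_for(block, g))

grid = e4m3_grid()
block = [0.9, -0.31, 0.05, 0.47]
naive = naive_amax(block, grid)
opt = sse_optimal_amax(block, grid)
print(naive, sse_for(block, naive))
print(opt, sse_for(block, opt))
```

Because the naive candidate is always inside the window, the searched scale can never be worse in SSE; in this block it is strictly better because the naive snap clips the largest element.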
NVFP4 (FP8 E4M3 scales)
| Implementation | Block Size | Weight Error | Output Error | Time | Speedup |
|---|---|---|---|---|---|
| Naive (torch) | 16 | 10.05% | 6.89% | 2.8 ms | |
| Naive (Triton) | 16 | 10.05% | 6.89% | 1.9 ms | 1.5x |
| SSE-Optimal (torch) | 16 | 8.74% | 6.04% | 234 ms | |
| SSE-Optimal (Triton) | 16 | 8.74% | 6.04% | 33 ms | 7.0x |
| H-Optimal (torch) | 16 | 9.35% | 5.31% | 866 ms | |
| H-Optimal (Triton) | 16 | 9.35% | 5.31% | 470 ms | 1.8x |
| Naive (torch) | 32 | 10.42% | 7.15% | 2.9 ms | |
| Naive (Triton) | 32 | 10.42% | 7.15% | 1.2 ms | 2.4x |
| SSE-Optimal (torch) | 32 | 9.57% | 6.61% | 179 ms | |
| SSE-Optimal (Triton) | 32 | 9.57% | 6.61% | 18 ms | 10.2x |
| H-Optimal (torch) | 32 | 10.12% | 5.95% | 676 ms | |
| H-Optimal (Triton) | 32 | 10.12% | 5.95% | 236 ms | 2.9x |
H-Optimal vs SSE-Optimal (output error reduction):
Block size 16: +12.0% further reduction (6.04% \(\to\) 5.31%)
Block size 32: +10.0% further reduction (6.61% \(\to\) 5.95%)
H-Optimal vs Naive (total output error reduction):
Block size 16: -22.9% (6.89% \(\to\) 5.31%)
Block size 32: -16.7% (7.15% \(\to\) 5.95%)
Weight error increases slightly relative to SSE-Optimal (by 0.5–0.6pp) because H-Optimal optimizes for output error rather than weight error. This is the correct trade-off: a model’s quality depends on output error, not weight error.
MXFP4 (UE8M0 power-of-2 scales)
| Implementation | Block Size | Weight Error | Output Error | Time | Speedup |
|---|---|---|---|---|---|
| Naive (torch) | 16 | 11.77% | 8.48% | 3.0 ms | |
| Naive (Triton) | 16 | 11.77% | 8.48% | 1.8 ms | 1.7x |
| SSE-Optimal (torch) | 16 | 11.02% | 7.67% | 86 ms | |
| SSE-Optimal (Triton) | 16 | 11.02% | 7.67% | 2.6 ms | 33.6x |
| H-Optimal (torch) | 16 | 11.10% | 7.62% | 545 ms | |
| Naive (torch) | 32 | 11.75% | 8.37% | 3.0 ms | |
| Naive (Triton) | 32 | 11.75% | 8.37% | 1.2 ms | 2.6x |
| SSE-Optimal (torch) | 32 | 11.32% | 7.91% | 74 ms | |
| SSE-Optimal (Triton) | 32 | 11.32% | 7.91% | 1.6 ms | 45.7x |
| H-Optimal (torch) | 32 | 11.42% | 7.80% | 361 ms | |
H-Optimal vs SSE-Optimal (output error reduction):
Block size 16: +0.7% further reduction (7.67% \(\to\) 7.62%)
Block size 32: +1.4% further reduction (7.91% \(\to\) 7.80%)
The improvement is much smaller for MXFP4 because UE8M0 scales are powers of 2 – consecutive scales differ by a factor of 2, leaving only 1–2 candidates near the optimum. With so few choices, the Hessian criterion rarely selects a different scale than SSE.
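To make the "only 1–2 candidates" point concrete, here is a small sketch of a power-of-2 scale search for one FP4 block. The naive rule used here (scale = 2^(floor(log2(amax)) - 2), where 2 is the exponent of the largest FP4 value 6 = 1.5 x 2^2) and the three-candidate set {s0/2, s0, 2*s0} are assumptions for illustration; the function names are hypothetical.

```python
import math

FP4 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 magnitudes, as listed in this document

def quant_fp4(v, s):
    """Round |v|/s to the nearest FP4 magnitude (saturating at 6) and rescale."""
    mag = min(FP4, key=lambda c: abs(abs(v) / s - c))
    return math.copysign(mag * s, v)

def sse(block, s):
    return sum((v - quant_fp4(v, s)) ** 2 for v in block)

def naive_pow2_scale(block):
    """Assumed naive UE8M0 rule: power of 2 derived from the block amax."""
    amax = max(abs(v) for v in block)
    return 2.0 ** (math.floor(math.log2(amax)) - 2)

def sse_optimal_pow2_scale(block):
    """Compare the naive scale with its two power-of-2 neighbors."""
    s0 = naive_pow2_scale(block)
    return min((s0 / 2, s0, 2 * s0), key=lambda s: sse(block, s))

block = [7.5, 1.0, -2.0, 0.6]
s0 = naive_pow2_scale(block)
s_opt = sse_optimal_pow2_scale(block)
print(s0, sse(block, s0))        # naive scale clips 7.5 down to 6
print(s_opt, sse(block, s_opt))  # doubling the scale avoids the clip
```

With neighbors a full factor of 2 apart, a Hessian-weighted loss has very little room to prefer a different candidate than plain SSE does.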
Why NVFP4 benefits much more from Hessian-awareness
FP8 E4M3 has 126 finely-spaced positive scale values with non-uniform spacing. The SSE-optimal and H-optimal scales can differ by several FP8 steps, because the Hessian re-weights the importance of each element. With UE8M0’s coarse power-of-2 grid, this re-weighting almost always lands on the same scale.
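The difference between the two criteria can be sketched as follows. This example assumes that the H-optimal loss for a block is the quadratic form e^T H e with H = X^T X restricted to the block's input columns and e the quantization error; the document does not spell out the exact criterion, so treat this as one plausible reading. The candidate scale list is also hypothetical (a real search would walk the FP8 grid).

```python
import math

FP4 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quant_fp4(v, s):
    mag = min(FP4, key=lambda c: abs(abs(v) / s - c))
    return math.copysign(mag * s, v)

def errors(block, s):
    return [v - quant_fp4(v, s) for v in block]

def sse_loss(block, s):
    """Plain sum of squared weight errors."""
    return sum(e * e for e in errors(block, s))

def gram(x):
    """H = X^T X for activations X given as a list of rows."""
    k = len(x[0])
    return [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]

def h_loss(block, s, h):
    """Assumed Hessian-weighted loss e^T H e."""
    e = errors(block, s)
    n = len(e)
    return sum(e[i] * h[i][j] * e[j] for i in range(n) for j in range(n))

block = [0.9, -0.31, 0.05, 0.47]
# Toy activations: the second input column carries most of the energy.
x = [[0.1, 2.0, 0.0, 0.3], [0.0, -1.5, 0.2, 0.1], [0.2, 1.0, -0.1, 0.0]]
h = gram(x)

amax = max(abs(v) for v in block)
candidates = [amax / 6.0 * (0.7 + 0.05 * i) for i in range(13)]  # hypothetical window

s_sse = min(candidates, key=lambda s: sse_loss(block, s))
s_h = min(candidates, key=lambda s: h_loss(block, s, h))
print(s_sse, s_h)
```

Each criterion is at least as good as the other on its own loss; whether they pick different scales depends on how unevenly H weights the elements and on how fine the candidate grid is.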
Correctness notes
NVFP4 Triton vs Python reference: Scale computation matches exactly (0 disagreements). The max element-level abs diff (~5e-2) comes from FP4 decision-boundary tie-breaking: when \(|x|/s\) lands exactly on a codebook boundary (e.g. 0.75, 1.75, 3.5), PTX div.full.f32 and PyTorch’s / operator can produce quotients that round to different FP4 values. This affects ~0.01% of elements and does not affect the error metrics.
MXFP4 Triton vs Python reference: Naive kernel matches exactly (0.00 max abs diff). For the optimal kernel, in rare tie-breaking cases (1 in ~800k blocks), tl.sum tree reduction and PyTorch sequential .sum() accumulate float32 rounding differently, causing one to pick s0 and the other 2*s0 when their SSEs are identical. This produces a max abs diff of one scale step but does not affect the error metrics.
GPTQ Quantization
Reproduce with:
python experiments/quant_gptq_strided.py
GPTQ (Frantar et al., 2022) applies Optimal Brain Surgeon error compensation to sequential column-block quantization. After quantizing each block of columns, the quantization error is propagated to remaining columns using the inverse Hessian, minimizing the total output error.
Our implementation uses torch.as_strided for zero-copy sub-matrix views during
error propagation. The GPTQ block size equals the quantization block size, so each
column block is quantized and its error immediately compensated across all remaining
columns:
# After quantizing columns [cs:ce], propagate error via as_strided views:
h_cross = torch.as_strided(H_inv, (bs, rem), (K, 1), storage_offset=cs*K + ce)
w_rem = torch.as_strided(W, (M, rem), (K, 1), storage_offset=ce)
w_rem.sub_(err @ h_cross) # in-place, zero-copy
Three modes are compared: baseline (no GPTQ), sequential GPTQ (natural column order), and ordered GPTQ (column blocks sorted by descending Hessian-weighted quantization loss, so the highest-error blocks are quantized first and their error is compensated across the most remaining columns).
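The index arithmetic behind the strided views can be checked without torch. The sketch below emulates a 2-D strided view over a flat row-major buffer and confirms that size (bs, rem), stride (K, 1), and offset cs*K + ce select the sub-matrix H_inv[cs:ce, ce:], while offset ce selects W[:, ce:]. The `strided_view` helper is hypothetical and only mirrors the view semantics; it assumes contiguous row-major storage starting at offset 0.

```python
def strided_view(flat, size, stride, storage_offset):
    """Materialize the 2-D view flat[storage_offset + i*stride[0] + j*stride[1]]."""
    rows, cols = size
    return [[flat[storage_offset + i * stride[0] + j * stride[1]] for j in range(cols)]
            for i in range(rows)]

K, M, bs = 6, 2, 2        # toy sizes: K input columns, M output rows, block size bs
cs, ce = 2, 4             # current column block is [cs:ce]
rem = K - ce              # number of remaining columns

# Toy matrices whose entries encode their own (row, col) position.
H_inv = [[10 * r + c for c in range(K)] for r in range(K)]
W = [[100 * (r + 1) + c for c in range(K)] for r in range(M)]
h_flat = [v for row in H_inv for v in row]
w_flat = [v for row in W for v in row]

h_cross = strided_view(h_flat, (bs, rem), (K, 1), cs * K + ce)
w_rem = strided_view(w_flat, (M, rem), (K, 1), ce)

print(h_cross)  # rows cs..ce-1, columns ce..K-1 of H_inv
print(w_rem)    # all rows, columns ce..K-1 of W
```

In torch the same views share memory with the original tensors, which is what allows the in-place update to modify W directly.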
Block Size 16
| Format | Approach | GPTQ | Weight Error | Output Error | Time |
|---|---|---|---|---|---|
| NVFP4 | Naive | – | 10.05% | 6.89% | 95 ms |
| NVFP4 | GPTQ+Naive | Seq | 12.58% | 5.53% | 402 ms |
| NVFP4 | GPTQ-Ord+Naive | Ord | 13.18% | 5.18% | 490 ms |
| NVFP4 | Optimal | – | 8.74% | 6.04% | 7.4 s |
| NVFP4 | GPTQ+Optimal | Seq | 10.94% | 4.82% | 7.7 s |
| NVFP4 | GPTQ-Ord+Optimal | Ord | 11.45% | 4.52% | 15.1 s |
| NVFP4 | H-Optimal | – | 9.37% | 5.34% | 7.7 s |
| NVFP4 | GPTQ+H-Optimal | Seq | 11.14% | 4.37% | 8.0 s |
| NVFP4 | GPTQ-Ord+H-Optimal | Ord | 11.53% | 4.21% | 15.9 s |
| MXFP4 | Naive | – | 11.77% | 8.48% | 102 ms |
| MXFP4 | GPTQ+Naive | Seq | 14.61% | 6.67% | 400 ms |
| MXFP4 | GPTQ-Ord+Naive | Ord | 15.27% | 6.20% | 517 ms |
| MXFP4 | Optimal | – | 11.02% | 7.67% | 6.7 s |
| MXFP4 | GPTQ+Optimal | Seq | 13.79% | 6.13% | 7.0 s |
| MXFP4 | GPTQ-Ord+Optimal | Ord | 14.43% | 5.72% | 13.8 s |
| MXFP4 | H-Optimal | – | 11.10% | 7.62% | 6.9 s |
| MXFP4 | GPTQ+H-Optimal | Seq | 13.82% | 6.10% | 7.1 s |
| MXFP4 | GPTQ-Ord+H-Optimal | Ord | 14.45% | 5.71% | 14.0 s |
| NVINT4 | Naive | – | 9.46% | 6.55% | 65 ms |
| NVINT4 | GPTQ+Naive | Seq | 11.84% | 5.23% | 376 ms |
| NVINT4 | GPTQ-Ord+Naive | Ord | 12.37% | 4.89% | 414 ms |
| NVINT4 | Optimal | – | 9.20% | 6.40% | 5.6 s |
| NVINT4 | GPTQ+Optimal | Seq | 11.54% | 5.12% | 5.9 s |
| NVINT4 | GPTQ-Ord+Optimal | Ord | 12.06% | 4.76% | 11.5 s |
| NVINT4 | H-Optimal | – | 9.60% | 6.04% | 5.9 s |
| NVINT4 | GPTQ+H-Optimal | Seq | 11.73% | 4.88% | 6.1 s |
| NVINT4 | GPTQ-Ord+H-Optimal | Ord | 12.20% | 4.65% | 12.0 s |
Block Size 32
| Format | Approach | GPTQ | Weight Error | Output Error | Time |
|---|---|---|---|---|---|
| NVFP4 | Naive | – | 10.42% | 7.15% | 37 ms |
| NVFP4 | GPTQ+Naive | Seq | 13.04% | 5.74% | 272 ms |
| NVFP4 | GPTQ-Ord+Naive | Ord | 13.53% | 5.43% | 320 ms |
| NVFP4 | Optimal | – | 9.57% | 6.61% | 3.6 s |
| NVFP4 | GPTQ+Optimal | Seq | 11.98% | 5.29% | 3.8 s |
| NVFP4 | GPTQ-Ord+Optimal | Ord | 12.42% | 5.01% | 7.3 s |
| NVFP4 | H-Optimal | – | 10.16% | 6.02% | 3.7 s |
| NVFP4 | GPTQ+H-Optimal | Seq | 12.21% | 4.91% | 4.0 s |
| NVFP4 | GPTQ-Ord+H-Optimal | Ord | 12.57% | 4.75% | 7.7 s |
| MXFP4 | Naive | – | 11.75% | 8.37% | 47 ms |
| MXFP4 | GPTQ+Naive | Seq | 14.62% | 6.62% | 273 ms |
| MXFP4 | GPTQ-Ord+Naive | Ord | 15.14% | 6.24% | 335 ms |
| MXFP4 | Optimal | – | 11.32% | 7.91% | 3.4 s |
| MXFP4 | GPTQ+Optimal | Seq | 14.16% | 6.32% | 3.5 s |
| MXFP4 | GPTQ-Ord+Optimal | Ord | 14.66% | 5.95% | 6.8 s |
| MXFP4 | H-Optimal | – | 11.42% | 7.80% | 3.4 s |
| MXFP4 | GPTQ+H-Optimal | Seq | 14.19% | 6.25% | 3.6 s |
| MXFP4 | GPTQ-Ord+H-Optimal | Ord | 14.68% | 5.92% | 7.0 s |
| NVINT4 | Naive | – | 10.36% | 7.18% | 24 ms |
| NVINT4 | GPTQ+Naive | Seq | 13.00% | 5.72% | 248 ms |
| NVINT4 | GPTQ-Ord+Naive | Ord | 13.45% | 5.42% | 282 ms |
| NVINT4 | Optimal | – | 10.13% | 7.10% | 2.8 s |
| NVINT4 | GPTQ+Optimal | Seq | 12.71% | 5.65% | 3.0 s |
| NVINT4 | GPTQ-Ord+Optimal | Ord | 13.14% | 5.33% | 5.8 s |
| NVINT4 | H-Optimal | – | 10.59% | 6.92% | 2.9 s |
| NVINT4 | GPTQ+H-Optimal | Seq | 13.12% | 5.57% | 3.1 s |
| NVINT4 | GPTQ-Ord+H-Optimal | Ord | 13.54% | 5.34% | 6.0 s |
Ordered vs Sequential GPTQ
Additional output error reduction from reordering (pp over sequential):
| Format | Approach | BS=16 | BS=32 |
|---|---|---|---|
| NVFP4 | Naive | -0.34pp | -0.30pp |
| NVFP4 | Optimal | -0.31pp | -0.28pp |
| NVFP4 | H-Optimal | -0.16pp | -0.15pp |
| MXFP4 | Naive | -0.47pp | -0.38pp |
| MXFP4 | Optimal | -0.41pp | -0.37pp |
| MXFP4 | H-Optimal | -0.39pp | -0.33pp |
| NVINT4 | Naive | -0.34pp | -0.31pp |
| NVINT4 | Optimal | -0.36pp | -0.32pp |
| NVINT4 | H-Optimal | -0.24pp | -0.23pp |
Ordered GPTQ (quantizing highest-loss blocks first) consistently outperforms sequential GPTQ by 0.15–0.47pp of output error. The gain is largest for MXFP4 (coarser scales create bigger per-block errors to redistribute) and for naive/optimal approaches (H-Optimal already concentrates error where it matters least, leaving less room for reordering to help). Weight error increases slightly more (~0.4–0.7pp over sequential) as a natural consequence of the stronger output-error optimization.
Exotic Scales
Reproduce with:
python experiments/quant_exotic_scales.py (no GPTQ)
python experiments/quant_gptq_exotic_scales.py (with GPTQ-Seq, GPTQ-Ord)
NVFP4 stores per-block scales in FP8 E4M3 (signed, 1+4+3 bits, 126 positive values). Scales are always non-negative, so the sign bit is wasted. We try two unsigned 8-bit alternatives that re-purpose the sign bit:
UE4M4 – 4-exp, 4-mantissa, bias 7. Trades the sign for one extra mantissa bit. Similar dynamic range to E4M3 (max 496 vs 448), but a 2x denser scale grid (255 distinct positive values vs 126).
UE5M3 – 5-exp, 3-mantissa, bias 15. Same mantissa precision as E4M3 but much wider dynamic range (max \(\approx\) 122880). Also 255 positive values.
All codes are treated as finite (no NaN/Inf reserved). The FP4 codebook \(\{0, 0.5, 1, 1.5, 2, 3, 4, 6\}\) is unchanged; only the per-block scale representation differs. Each table below crosses {Naive, SSE-Optimal, H-Optimal} with {no-GPTQ, GPTQ-Seq, GPTQ-Ord} for each scale grid.
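The value counts and maxima quoted above can be reproduced by enumerating each grid. The sketch below assumes IEEE-style subnormals at exponent code 0 for all three formats (the text does not state how exponent 0 is decoded for UE4M4/UE5M3, though this choice matches the quoted 255 positive values) and treats only E4M3's NaN code as reserved; `scale_grid` is a hypothetical helper.

```python
def scale_grid(exp_bits, man_bits, bias, reserved=()):
    """Sorted positive values of an unsigned minifloat with the given layout."""
    vals = set()
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            if (e, m) in reserved:
                continue  # e.g. the NaN code of FP8 E4M3
            frac = m / 2 ** man_bits
            # Assumed decoding: subnormals at e == 0, normals otherwise.
            v = frac * 2.0 ** (1 - bias) if e == 0 else (1 + frac) * 2.0 ** (e - bias)
            if v > 0:
                vals.add(v)
    return sorted(vals)

def per_octave(grid):
    """Number of grid points in [1, 2), a simple density measure."""
    return sum(1 for v in grid if 1.0 <= v < 2.0)

e4m3 = scale_grid(4, 3, 7, reserved={(15, 7)})
ue4m4 = scale_grid(4, 4, 7)
ue5m3 = scale_grid(5, 3, 15)

for name, g in (("E4M3", e4m3), ("UE4M4", ue4m4), ("UE5M3", ue5m3)):
    print(name, len(g), g[-1], per_octave(g))
```

Under these assumptions UE4M4 has 16 values per octave, while E4M3 and UE5M3 both have 8; UE5M3 spends its extra codes on range rather than per-octave density.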
Block Size 16
| Scale | Approach | GPTQ | Weight Error | Output Error | Time |
|---|---|---|---|---|---|
| E4M3 | Naive | – | 10.05% | 6.89% | 87 ms |
| E4M3 | GPTQ+Naive | Seq | 12.58% | 5.53% | 402 ms |
| E4M3 | GPTQ-Ord+Naive | Ord | 13.18% | 5.18% | 492 ms |
| E4M3 | Optimal | – | 8.74% | 6.04% | 7.4 s |
| E4M3 | GPTQ+Optimal | Seq | 10.94% | 4.82% | 7.5 s |
| E4M3 | GPTQ-Ord+Optimal | Ord | 11.45% | 4.52% | 14.8 s |
| E4M3 | H-Optimal | – | 9.37% | 5.34% | 7.6 s |
| E4M3 | GPTQ+H-Optimal | Seq | 11.14% | 4.37% | 7.9 s |
| E4M3 | GPTQ-Ord+H-Optimal | Ord | 11.53% | 4.21% | 15.5 s |
| UE4M4 | Naive | – | 9.54% | 6.55% | 84 ms |
| UE4M4 | GPTQ+Naive | Seq | 11.97% | 5.25% | 394 ms |
| UE4M4 | GPTQ-Ord+Naive | Ord | 12.54% | 4.93% | 505 ms |
| UE4M4 | Optimal | – | 8.19% | 5.66% | 14.1 s |
| UE4M4 | GPTQ+Optimal | Seq | 10.26% | 4.52% | 14.5 s |
| UE4M4 | GPTQ-Ord+Optimal | Ord | 10.75% | 4.23% | 28.3 s |
| UE4M4 | H-Optimal | – | 8.95% | 4.97% | 14.7 s |
| UE4M4 | GPTQ+H-Optimal | Seq | 10.58% | 4.08% | 15.1 s |
| UE4M4 | GPTQ-Ord+H-Optimal | Ord | 10.94% | 3.94% | 29.9 s |
| UE5M3 | Naive | – | 9.47% | 6.51% | 85 ms |
| UE5M3 | GPTQ+Naive | Seq | 11.89% | 5.22% | 393 ms |
| UE5M3 | GPTQ-Ord+Naive | Ord | 12.46% | 4.89% | 504 ms |
| UE5M3 | Optimal | – | 8.13% | 5.63% | 12.5 s |
| UE5M3 | GPTQ+Optimal | Seq | 10.19% | 4.49% | 12.7 s |
| UE5M3 | GPTQ-Ord+Optimal | Ord | 10.67% | 4.21% | 25.1 s |
| UE5M3 | H-Optimal | – | 8.92% | 4.99% | 12.8 s |
| UE5M3 | GPTQ+H-Optimal | Seq | 10.56% | 4.09% | 13.2 s |
| UE5M3 | GPTQ-Ord+H-Optimal | Ord | 10.92% | 3.95% | 25.9 s |
Block Size 32
| Scale | Approach | GPTQ | Weight Error | Output Error | Time |
|---|---|---|---|---|---|
| E4M3 | Naive | – | 10.42% | 7.15% | 37 ms |
| E4M3 | GPTQ+Naive | Seq | 13.04% | 5.74% | 271 ms |
| E4M3 | GPTQ-Ord+Naive | Ord | 13.53% | 5.43% | 318 ms |
| E4M3 | Optimal | – | 9.57% | 6.61% | 3.5 s |
| E4M3 | GPTQ+Optimal | Seq | 11.98% | 5.29% | 3.7 s |
| E4M3 | GPTQ-Ord+Optimal | Ord | 12.42% | 5.01% | 7.2 s |
| E4M3 | H-Optimal | – | 10.16% | 6.02% | 3.7 s |
| E4M3 | GPTQ+H-Optimal | Seq | 12.21% | 4.91% | 3.9 s |
| E4M3 | GPTQ-Ord+H-Optimal | Ord | 12.57% | 4.75% | 7.5 s |
| UE4M4 | Naive | – | 10.18% | 6.99% | 42 ms |
| UE4M4 | GPTQ+Naive | Seq | 12.76% | 5.61% | 271 ms |
| UE4M4 | GPTQ-Ord+Naive | Ord | 13.24% | 5.31% | 326 ms |
| UE4M4 | Optimal | – | 9.16% | 6.32% | 6.8 s |
| UE4M4 | GPTQ+Optimal | Seq | 11.47% | 5.06% | 7.0 s |
| UE4M4 | GPTQ-Ord+Optimal | Ord | 11.88% | 4.79% | 13.8 s |
| UE4M4 | H-Optimal | – | 9.90% | 5.73% | 7.1 s |
| UE4M4 | GPTQ+H-Optimal | Seq | 11.85% | 4.70% | 7.3 s |
| UE4M4 | GPTQ-Ord+H-Optimal | Ord | 12.19% | 4.56% | 14.4 s |
| UE5M3 | Naive | – | 10.16% | 6.98% | 42 ms |
| UE5M3 | GPTQ+Naive | Seq | 12.74% | 5.61% | 271 ms |
| UE5M3 | GPTQ-Ord+Naive | Ord | 13.22% | 5.31% | 327 ms |
| UE5M3 | Optimal | – | 9.14% | 6.31% | 5.9 s |
| UE5M3 | GPTQ+Optimal | Seq | 11.44% | 5.06% | 6.2 s |
| UE5M3 | GPTQ-Ord+Optimal | Ord | 11.86% | 4.78% | 12.1 s |
| UE5M3 | H-Optimal | – | 9.89% | 5.75% | 6.1 s |
| UE5M3 | GPTQ+H-Optimal | Seq | 11.86% | 4.71% | 6.3 s |
| UE5M3 | GPTQ-Ord+H-Optimal | Ord | 12.20% | 4.58% | 12.3 s |
Best output error per scale
| Scale | BS=16 best | BS=32 best |
|---|---|---|
| E4M3 | 4.21% | 4.75% |
| UE4M4 | 3.94% | 4.56% |
| UE5M3 | 3.95% | 4.58% |
(All bests are achieved by GPTQ-Ord+H-Optimal.)
Output error reduction vs E4M3 (same approach + mode)
| Approach + mode | BS=16: UE4M4 | BS=16: UE5M3 | BS=32: UE4M4 | BS=32: UE5M3 |
|---|---|---|---|---|
| Naive (no GPTQ) | -0.34pp | -0.38pp | -0.16pp | -0.17pp |
| Optimal (no GPTQ) | -0.38pp | -0.41pp | -0.29pp | -0.30pp |
| H-Optimal (no GPTQ) | -0.37pp | -0.35pp | -0.29pp | -0.27pp |
| GPTQ+Naive | -0.28pp | -0.31pp | -0.13pp | -0.13pp |
| GPTQ+Optimal | -0.30pp | -0.33pp | -0.23pp | -0.23pp |
| GPTQ+H-Optimal | -0.29pp | -0.28pp | -0.21pp | -0.20pp |
| GPTQ-Ord+Naive | -0.25pp | -0.29pp | -0.12pp | -0.12pp |
| GPTQ-Ord+Optimal | -0.29pp | -0.31pp | -0.22pp | -0.23pp |
| GPTQ-Ord+H-Optimal | -0.27pp | -0.26pp | -0.19pp | -0.17pp |
Both unsigned formats beat E4M3 across every approach × mode × block size. The gain in pp shrinks somewhat once GPTQ is layered on (GPTQ already compensates for some of the per-block scale-snapping loss), but the output error itself keeps falling – the best result of every scale grid is GPTQ-Ord+H-Optimal, and UE4M4/UE5M3 still beat E4M3 there by 0.17–0.27pp.
UE4M4 and UE5M3 perform almost identically (within 0.04pp) across the full grid, even though UE5M3 has \(\sim\)250x more dynamic range. Weight magnitudes in this layer fall well within E4M3’s range, so extra range at the top is wasted – what matters is how many scale values are available near the optimal scale, and both formats provide roughly twice as many codes as E4M3 (255 vs 126).
Caveat: standard FP8 E4M3 hardware support exists on Hopper/Ada; UE4M4 and UE5M3 do not have hardware encoders, so naive-mode quantization is slower in production (the snap requires a table lookup rather than a hardware cast). Optimal/H-Optimal modes are unaffected since they iterate over the scale table either way; the ~2x slowdown there is purely from doubling the candidate count (255 vs 126).
Larger blocks: bs=64 (E4M3 scales) and bs=128 (FP16 scales)
Same layer_0 setup; W = 2560x9728, X = 244449x9728. Layouts here trade
scale precision and block size for the same total bits/weight:
| Config | Block | Scale | Scale b/w | Total b/w |
|---|---|---|---|---|
| Baseline NVFP4 | 16 | FP8 E4M3 | 0.500 | 4.500 |
| – | 32 | FP8 E4M3 | 0.250 | 4.250 |
| New | 64 | FP8 E4M3 | 0.125 | 4.125 |
| New | 128 | FP16 E5M10 | 0.125 | 4.125 |
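The bits-per-weight column follows from simple arithmetic: each 4-bit element carries an equal share of its block's scale. A minimal sketch (the function name is hypothetical):

```python
def bits_per_weight(elem_bits, scale_bits, block_size):
    """Element bits plus the per-element share of one scale per block."""
    return elem_bits + scale_bits / block_size

configs = [
    ("bs=16, FP8 E4M3", 4, 8, 16),
    ("bs=32, FP8 E4M3", 4, 8, 32),
    ("bs=64, FP8 E4M3", 4, 8, 64),
    ("bs=128, FP16", 4, 16, 128),
]
for name, elem, scale, bs in configs:
    print(f"{name}: {bits_per_weight(elem, scale, bs):.3f} b/w")
```

Doubling the scale width while doubling the block size leaves the budget unchanged, which is why bs=64 with an 8-bit scale and bs=128 with a 16-bit scale are directly comparable.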
For bs=128 + FP16, the snapped continuous optimum is essentially the true SSE / H-optimal minimum, so per-block scales are found by iterative alternation (q -> closed-form continuous s -> fp16 snap) rather than grid search.
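The alternation can be sketched for the SSE case as follows. For fixed codes q, the continuous scale minimizing \(\lVert w - s q \rVert^2\) is \(s = \langle w, q \rangle / \langle q, q \rangle\); the loop re-quantizes, recomputes that scale, and snaps it to FP16. The starting point (amax / 6), the iteration cap, and the best-so-far bookkeeping are assumptions of this sketch, not details taken from the repository, and the Hessian-weighted variant is not shown.

```python
import math
import struct

FP4 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def codes(block, s):
    """Signed FP4 codes nearest to v / s for each element."""
    return [math.copysign(min(FP4, key=lambda c: abs(abs(v) / s - c)), v) for v in block]

def sse(block, s):
    return sum((v - c * s) ** 2 for v, c in zip(block, codes(block, s)))

def alternate_scale(block, iters=10):
    """q -> closed-form continuous s -> FP16 snap, keeping the best scale seen."""
    s = to_fp16(max(abs(v) for v in block) / 6.0)  # assumed naive start; block must be nonzero
    best = s
    for _ in range(iters):
        q = codes(block, s)
        qq = sum(c * c for c in q)
        if qq == 0:
            break
        s_new = to_fp16(sum(v * c for v, c in zip(block, q)) / qq)
        if s_new <= 0 or s_new == s:
            break  # converged (or degenerate)
        s = s_new
        if sse(block, s) < sse(block, best):
            best = s
    return best

block = [0.9, -0.31, 0.05, 0.47, 0.12, -0.66]
s_naive = to_fp16(max(abs(v) for v in block) / 6.0)
s_opt = alternate_scale(block)
print(s_naive, sse(block, s_naive))
print(s_opt, sse(block, s_opt))
```

Because FP16 is so much finer than the FP4 codebook, the snap after each closed-form step moves the scale only slightly, which is the reason a grid search is unnecessary at this operating point.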
bs=64, FP8 E4M3 scales (4.125 b/w)
| Codebook | Approach | GPTQ | Weight Error | Output Error | Time |
|---|---|---|---|---|---|
| FP4 | Naive | – | 10.77% | 7.41% | 42 ms |
| FP4 | Optimal | – | 10.19% | 7.05% | 1.0 s |
| FP4 | H-Optimal | – | 10.66% | 6.48% | 21.8 s |
| FP4 | GPTQ-Ord+H-Optimal | Ord | 13.34% | 5.22% | 202 s |
| FP4 | GPTQ-Ord+H-Opt+SPGL1 | Ord | 13.28% | 4.65% | 373 s |
| INT4 | Naive | – | 11.37% | 7.88% | 9 ms |
| INT4 | Optimal | – | 10.89% | 7.71% | 687 ms |
| INT4 | H-Optimal | – | 11.37% | 7.23% | 21.6 s |
| INT4 | GPTQ-Ord+H-Optimal | Ord | 14.77% | 5.99% | 145 s |
| INT4 | GPTQ-Ord+H-Opt+SPGL1 | Ord | 15.24% | 5.22% | 367 s |
bs=128, FP16 E5M10 scales (4.125 b/w)
| Codebook | Approach | GPTQ | Weight Error | Output Error | Time |
|---|---|---|---|---|---|
| FP4 | Naive | – | 11.00% | 7.56% | 8 ms |
| FP4 | Optimal (iter) | – | 10.56% | 7.33% | 61 ms |
| FP4 | H-Optimal (iter) | – | 10.76% | 6.88% | 158 ms |
| FP4 | GPTQ-Seq+Naive | Seq | 13.76% | 6.08% | 706 ms |
| FP4 | GPTQ-Ord+H-Optimal | Ord | 13.47% | 5.47% | 621 ms |
| FP4 | GPTQ-Ord+H-Opt+SPGL1 | Ord | 13.85% | 4.77% | 30.4 s |
| INT4 | Naive | – | 12.31% | 8.54% | 6 ms |
| INT4 | Optimal (iter) | – | 11.49% | 8.27% | 80 ms |
| INT4 | H-Optimal (iter) | – | 11.79% | 7.87% | 10.1 s |
| INT4 | GPTQ-Seq+Naive | Seq | 15.50% | 6.84% | 361 ms |
| INT4 | GPTQ-Ord+H-Optimal | Ord | 14.93% | 6.17% | 8.3 s |
| INT4 | GPTQ-Ord+H-Opt+SPGL1 | Ord | 15.37% | 5.35% | 110 s |
Same-budget head-to-head (4.125 b/w)
| Codebook | Config | GPTQ-Ord+H-Opt Output Error (%) | + SPGL1 Output Error (%) | Δ from SPGL1 (pp) |
|---|---|---|---|---|
| FP4 | bs=64, E4M3 | 5.22 | 4.65 | -0.57 |
| FP4 | bs=128, FP16 | 5.47 | 4.77 | -0.70 |
| INT4 | bs=64, E4M3 | 5.99 | 5.22 | -0.77 |
| INT4 | bs=128, FP16 | 6.17 | 5.35 | -0.82 |
At the same 4.125 b/w budget, bs=64 + E4M3 beats bs=128 + FP16 for both codebooks (FP4: -0.12pp, INT4: -0.13pp) – coarser scale precision but tighter per-block fit wins out. SPGL1 contributes a larger absolute gain at the bs=128/FP16 point (~0.7-0.8pp) than at bs=64/E4M3 (~0.6-0.8pp), but not enough to flip the ordering.
Where these land vs the bs=16 best
| Config | b/w | H-Opt Output Error (%) | +SPGL1 Output Error (%) |
|---|---|---|---|
| FP4 bs=16, E4M3 (NVFP4) | 4.500 | 5.31 | 3.64 |
| FP4 bs=64, E4M3 | 4.125 | 6.48 | 4.65 |
| FP4 bs=128, FP16 | 4.125 | 6.88 | 4.77 |
| INT4 bs=16, E4M3 (NVINT4) | 4.500 | 5.60 | (not run) |
| INT4 bs=64, E4M3 | 4.125 | 7.23 | 5.22 |
| INT4 bs=128, FP16 | 4.125 | 7.87 | 5.35 |
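FP4 beats INT4 at every operating point in the table: by ~0.6–1.0pp output error at 4.125 b/w, and by ~0.3pp at bs=16 (H-Opt only). The FP4 codebook’s wider dynamic range (largest-to-smallest nonzero magnitude of 6/0.5 = 12, vs 7/1 = 7 for symmetric INT4) more efficiently captures the long-tailed per-block weight distribution at larger block sizes.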
SPGL1 compensation method (recap)
After each block is snapped (in descending H-loss order), instead of
GPTQ’s unconstrained H_inv error propagation, an L1-constrained SPGL1
LASSO is solved on the not-yet-snapped columns to minimize
\(\lVert X (\Delta_{\text{eff}} + \delta)^T \rVert_2\) subject to \(\lVert \delta \rVert_1 \le \tau\). Solved in
reduced (Gram) form – no \(H^{-1}\), no Cholesky – robust to ill-conditioned
\(H\). See experiments/spgl1_gptq_*.py for the implementations and
notes/progress_track_spgl1.md for the full research log.