Results

Reproduce with: python bench/full_bench.py

Note

Benchmarked on the down_proj weight of the first decoder layer from Qwen3-4B (W: 2560x9728, bfloat16), with activations collected from WikiText-2 (max_seq_len=512, num_samples=2048, X: 244449x9728, bfloat16).

  • Weight error: \(\lVert Q(W) - W \rVert_F / \lVert W \rVert_F\)

  • Output error: \(\lVert X W_q^T - X W^T \rVert_F / \lVert X W^T \rVert_F\)
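Both metrics are relative Frobenius norms. A minimal sketch on nested lists, in plain Python for illustration only (the benchmark itself runs on tensors; the helper names `fro_norm`, `matmul_t`, and `relative_error` are hypothetical):

```python
import math

def fro_norm(A):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(v * v for row in A for v in row))

def sub(A, B):
    """Element-wise difference A - B."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def matmul_t(X, W):
    """Compute X @ W^T for X of shape (n, K) and W of shape (M, K)."""
    return [[sum(x * w for x, w in zip(xr, wr)) for wr in W] for xr in X]

def relative_error(approx, exact):
    """||approx - exact||_F / ||exact||_F."""
    return fro_norm(sub(approx, exact)) / fro_norm(exact)

# Toy data: W is (M=2, K=3), X is (n=2, K=3); W_q perturbs one weight of W.
W = [[1.0, -2.0, 0.5], [0.25, 4.0, -1.0]]
W_q = [[1.0, -2.0, 0.75], [0.25, 4.0, -1.0]]
X = [[1.0, 0.0, 2.0], [0.5, 1.0, -1.0]]

weight_error = relative_error(W_q, W)
output_error = relative_error(matmul_t(X, W_q), matmul_t(X, W))
```

The same perturbation gives different weight and output errors because the output error weights each element's error by how strongly the activations excite it.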

INT8 (FP8 E4M3 scales)

Symmetric INT8 quantization ([-127, 127]) with per-block amax stored in FP8 E4M3. The effective scale is amax_fp8 / 127, keeping the stored value within FP8 range while the division by 127 is performed in float32.
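A minimal sketch of the naive scheme for a single block. It assumes the amax is snapped to the nearest representable E4M3 value; the rounding direction used by the actual implementation is not stated here, and `quantize_int8_block` is a hypothetical name:

```python
def e4m3_positive_values():
    """All positive finite FP8 E4M3 values (bias 7, top code reserved for NaN)."""
    vals = []
    for e in range(16):
        for m in range(8):
            if e == 15 and m == 7:
                continue  # NaN encoding in the E4M3 "FN" variant
            if e == 0:
                v = (m / 8) * 2.0 ** -6           # subnormals (m == 0 is zero)
            else:
                v = (1 + m / 8) * 2.0 ** (e - 7)  # normals
            if v > 0:
                vals.append(v)
    return sorted(vals)

E4M3 = e4m3_positive_values()

def quantize_int8_block(block):
    """Naive symmetric INT8 quantization of one block with an FP8-stored amax."""
    amax = max(abs(v) for v in block)
    # Assumption: snap amax to the nearest representable E4M3 value.
    amax_fp8 = min(E4M3, key=lambda g: abs(g - amax))
    scale = amax_fp8 / 127.0  # effective scale, computed in full precision
    q = [max(-127, min(127, round(v / scale))) for v in block]
    return q, scale

block = [0.031, -0.12, 0.007, 0.09]
q, scale = quantize_int8_block(block)
dequant = [qi * scale for qi in q]
```

In this toy block the amax 0.12 snaps down to 0.1171875, so the largest element is clamped to -127. This is the kind of scale misalignment that a search over neighbouring grid values can avoid.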

| Implementation | Block Size | Weight Error | Output Error | Time |
|---|---|---|---|---|
| Naive (torch) | 32 | 1.01% | 0.79% | 1.7 ms |
| SSE-Optimal (torch) | 32 | 0.57% | 0.40% | 236 ms |
| H-Optimal (torch) | 32 | 0.60% | 0.37% | 1.2 s |
| Naive (torch) | 64 | 0.93% | 0.72% | 1.7 ms |
| SSE-Optimal (torch) | 64 | 0.64% | 0.45% | 204 ms |
| H-Optimal (torch) | 64 | 0.66% | 0.42% | 1.5 s |
| Naive (torch) | 128 | 0.88% | 0.68% | 1.6 ms |
| SSE-Optimal (torch) | 128 | 0.71% | 0.49% | 173 ms |
| H-Optimal (torch) | 128 | 0.73% | 0.48% | 2.8 s |
| Naive (torch) | 256 | 0.87% | 0.66% | 1.6 ms |
| SSE-Optimal (torch) | 256 | 0.77% | 0.54% | 165 ms |
| H-Optimal (torch) | 256 | 0.79% | 0.52% | 4.9 s |

SSE-Optimal vs Naive (output error reduction):

  • Block size 32: -49.9% (0.79% \(\to\) 0.40%)

  • Block size 64: -38.1% (0.72% \(\to\) 0.45%)

  • Block size 128: -27.6% (0.68% \(\to\) 0.49%)

  • Block size 256: -18.6% (0.66% \(\to\) 0.54%)

H-Optimal vs SSE-Optimal (further output error reduction):

  • Block size 32: +7.0% further reduction (0.40% \(\to\) 0.37%)

  • Block size 64: +4.8% further reduction (0.45% \(\to\) 0.42%)

  • Block size 128: +3.3% further reduction (0.49% \(\to\) 0.48%)

  • Block size 256: +2.4% further reduction (0.54% \(\to\) 0.52%)

H-Optimal vs Naive (total output error reduction):

  • Block size 32: -53.4% (0.79% \(\to\) 0.37%)

  • Block size 64: -41.1% (0.72% \(\to\) 0.42%)

  • Block size 128: -29.9% (0.68% \(\to\) 0.48%)

  • Block size 256: -20.6% (0.66% \(\to\) 0.52%)

The massive naive-to-optimal improvement (up to 50%) is driven by the FP8 E4M3 scale grid: with only 126 discrete scale values, the naive amax snap often lands on a scale that is significantly suboptimal, and the bounded search finds a much better candidate. This is analogous to NVFP4’s scale search, but the effect is even stronger because INT8’s 127 quantization levels amplify scale misalignment (a scale error of \(\delta\) causes \(127\delta\) in the worst case, vs \(6\delta\) for FP4).
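The bounded search can be sketched as follows. The window of `radius` grid steps on either side of the naive snap is an assumption made for illustration (the actual search bounds are not given here), and `search_amax` is a hypothetical name:

```python
# Positive finite FP8 E4M3 values (bias 7; code e=15, m=7 is NaN and is skipped).
E4M3 = sorted(
    (m / 8) * 2.0 ** -6 if e == 0 else (1 + m / 8) * 2.0 ** (e - 7)
    for e in range(16) for m in range(8)
    if (e, m) not in ((0, 0), (15, 7))
)

def block_sse(block, amax_fp8):
    """Squared error of symmetric INT8 quantization with scale amax_fp8 / 127."""
    scale = amax_fp8 / 127.0
    total = 0.0
    for v in block:
        q = max(-127, min(127, round(v / scale)))
        total += (v - q * scale) ** 2
    return total

def search_amax(block, radius=4):
    """Return (naive_amax_fp8, best_amax_fp8) over a window of grid neighbours."""
    amax = max(abs(v) for v in block)
    i = min(range(len(E4M3)), key=lambda k: abs(E4M3[k] - amax))  # naive snap
    lo, hi = max(0, i - radius), min(len(E4M3), i + radius + 1)
    best = min(E4M3[lo:hi], key=lambda g: block_sse(block, g))
    return E4M3[i], best

block = [0.031, -0.12, 0.007, 0.09, -0.055, 0.118]
naive, best = search_amax(block)
```

Because the naive snap is always inside the window, the searched scale can never have a higher SSE than the naive one.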

H-Optimal provides a further 2–7% reduction over SSE-Optimal by prioritizing output-sensitive weights.

NVFP4 (FP8 E4M3 scales)

| Implementation | Block Size | Weight Error | Output Error | Time | Speedup |
|---|---|---|---|---|---|
| Naive (torch) | 16 | 10.05% | 6.89% | 2.8 ms | |
| Naive (Triton) | 16 | 10.05% | 6.89% | 1.9 ms | 1.5x |
| SSE-Optimal (torch) | 16 | 8.74% | 6.04% | 234 ms | |
| SSE-Optimal (Triton) | 16 | 8.74% | 6.04% | 33 ms | 7.0x |
| H-Optimal (torch) | 16 | 9.35% | 5.31% | 866 ms | |
| H-Optimal (Triton) | 16 | 9.35% | 5.31% | 470 ms | 1.8x |
| Naive (torch) | 32 | 10.42% | 7.15% | 2.9 ms | |
| Naive (Triton) | 32 | 10.42% | 7.15% | 1.2 ms | 2.4x |
| SSE-Optimal (torch) | 32 | 9.57% | 6.61% | 179 ms | |
| SSE-Optimal (Triton) | 32 | 9.57% | 6.61% | 18 ms | 10.2x |
| H-Optimal (torch) | 32 | 10.12% | 5.95% | 676 ms | |
| H-Optimal (Triton) | 32 | 10.12% | 5.95% | 236 ms | 2.9x |

H-Optimal vs SSE-Optimal (output error reduction):

  • Block size 16: +12.0% further reduction (6.04% \(\to\) 5.31%)

  • Block size 32: +10.0% further reduction (6.61% \(\to\) 5.95%)

H-Optimal vs Naive (total output error reduction):

  • Block size 16: -22.9% (6.89% \(\to\) 5.31%)

  • Block size 32: -16.7% (7.15% \(\to\) 5.95%)

Relative to SSE-Optimal, H-Optimal's weight error is slightly higher (by about 0.6pp at block size 16 and 0.55pp at block size 32) because H-Optimal optimizes for output error rather than weight error. This is the intended trade-off: a model’s quality depends on output error, not weight error.
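The difference between the two selection criteria can be sketched as below. This assumes the Hessian-aware loss is the quadratic form e^T H e with H = X^T X restricted to the block, which is one common choice and not necessarily the exact criterion used here. The candidate scales and all helper names are hypothetical:

```python
FP4 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def dequant_fp4(w, s):
    """Quantize each w/s to the nearest FP4 magnitude and scale back."""
    out = []
    for v in w:
        mag = min(FP4, key=lambda c: abs(abs(v) / s - c))
        out.append((mag if v >= 0 else -mag) * s)
    return out

def gram(X):
    """H = X^T X for activations X of shape (n, k)."""
    k = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]

def sse_loss(e):
    """Plain squared weight error e^T e."""
    return sum(x * x for x in e)

def h_loss(e, H):
    """Quadratic form e^T H e, i.e. ||X e||^2 when H = X^T X."""
    return sum(e[i] * H[i][j] * e[j] for i in range(len(e)) for j in range(len(e)))

def error(w, s):
    """Quantization error vector for scale s."""
    return [a - b for a, b in zip(w, dequant_fp4(w, s))]

w = [0.9, -0.1, 0.3, 0.05]
X = [[0.1, 2.0, 0.2, 1.5], [0.0, -1.0, 0.1, 2.5], [0.2, 1.5, -0.3, -2.0]]
H = gram(X)
candidates = [0.125, 0.15, 0.175, 0.2, 0.25]  # stand-in for nearby grid scales

s_sse = min(candidates, key=lambda s: sse_loss(error(w, s)))
s_h = min(candidates, key=lambda s: h_loss(error(w, s), H))
```

By construction the Hessian-selected scale has output-side loss no higher than the SSE-selected scale, and weight-side SSE no lower, which mirrors the trade-off in the tables.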

MXFP4 (UE8M0 power-of-2 scales)

| Implementation | Block Size | Weight Error | Output Error | Time | Speedup |
|---|---|---|---|---|---|
| Naive (torch) | 16 | 11.77% | 8.48% | 3.0 ms | |
| Naive (Triton) | 16 | 11.77% | 8.48% | 1.8 ms | 1.7x |
| SSE-Optimal (torch) | 16 | 11.02% | 7.67% | 86 ms | |
| SSE-Optimal (Triton) | 16 | 11.02% | 7.67% | 2.6 ms | 33.6x |
| H-Optimal (torch) | 16 | 11.10% | 7.62% | 545 ms | |
| Naive (torch) | 32 | 11.75% | 8.37% | 3.0 ms | |
| Naive (Triton) | 32 | 11.75% | 8.37% | 1.2 ms | 2.6x |
| SSE-Optimal (torch) | 32 | 11.32% | 7.91% | 74 ms | |
| SSE-Optimal (Triton) | 32 | 11.32% | 7.91% | 1.6 ms | 45.7x |
| H-Optimal (torch) | 32 | 11.42% | 7.80% | 361 ms | |

H-Optimal vs SSE-Optimal (output error reduction):

  • Block size 16: +0.7% further reduction (7.67% \(\to\) 7.62%)

  • Block size 32: +1.4% further reduction (7.91% \(\to\) 7.80%)

The improvement is much smaller for MXFP4 because UE8M0 scales are powers of 2 – consecutive scales differ by a factor of 2, leaving only 1–2 candidates near the optimum. With so few choices, the Hessian criterion rarely selects a different scale than SSE.
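A sketch of how small the power-of-2 candidate set is. It assumes the naive scale follows the OCP MX convention s0 = 2^(floor(log2(amax)) - 2) for FP4 E2M1, with 2*s0 as the only alternative considered; whether this matches the kernels exactly is an assumption:

```python
import math

FP4 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def block_sse(block, s):
    """SSE of FP4 quantization with scale s (values above 6*s saturate at 6)."""
    total = 0.0
    for v in block:
        mag = min(FP4, key=lambda c: abs(abs(v) / s - c))
        total += (abs(v) - mag * s) ** 2
    return total

def mx_scale_candidates(block):
    """Naive power-of-2 scale s0 and its only nearby neighbour 2*s0."""
    amax = max(abs(v) for v in block)
    s0 = 2.0 ** (math.floor(math.log2(amax)) - 2)  # assumed OCP MX rule for E2M1
    return [s0, 2 * s0]

block = [0.9, -0.1, 0.3, 0.05]
candidates = mx_scale_candidates(block)
best = min(candidates, key=lambda s: block_sse(block, s))
```

With only two candidates a factor of 2 apart, a re-weighted loss has little opportunity to change which one wins.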

Why NVFP4 benefits much more from Hessian-awareness

FP8 E4M3 has 126 finely-spaced positive scale values with non-uniform spacing. The SSE-optimal and H-optimal scales can differ by several FP8 steps, because the Hessian re-weights the importance of each element. With UE8M0’s coarse power-of-2 grid, this re-weighting almost always lands on the same scale.

Correctness notes

NVFP4 Triton vs Python reference: Scale computation matches exactly (0 disagreements). The max element-level abs diff (~5e-2) comes from FP4 decision-boundary tie-breaking: when \(|x|/s\) lands exactly on a codebook boundary (e.g. 0.75, 1.75, 3.5), the PTX div.full.f32 instruction and PyTorch’s / operator can produce quotients that round to different FP4 values. This affects ~0.01% of elements and does not change the reported error metrics.
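The boundary sensitivity is easy to reproduce in isolation. The sketch below uses double precision and `math.nextafter` (Python 3.9+) to stand in for a one-ulp difference between two division implementations; it illustrates the mechanism and does not reproduce the kernels:

```python
import math

FP4 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

# Decision boundaries are the midpoints between adjacent codebook magnitudes.
boundaries = [(a + b) / 2 for a, b in zip(FP4, FP4[1:])]

def nearest_fp4(r):
    """Nearest codebook magnitude to the quotient r = |x| / s."""
    return min(FP4, key=lambda c: abs(r - c))

# Two quotients that differ by one ulp on either side of the 0.75 boundary.
just_below = math.nextafter(0.75, 0.0)
just_above = math.nextafter(0.75, 1.0)
```

Here `nearest_fp4(just_below)` is 0.5 and `nearest_fp4(just_above)` is 1.0, so a last-bit difference in the quotient moves the dequantized value by half a scale step.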

MXFP4 Triton vs Python reference: Naive kernel matches exactly (0.00 max abs diff). For the optimal kernel, in rare tie-breaking cases (1 in ~800k blocks), tl.sum tree reduction and PyTorch sequential .sum() accumulate float32 rounding differently, causing one to pick s0 and the other 2*s0 when their SSEs are identical. This produces a max abs diff of one scale step but does not affect the error metrics.

GPTQ Quantization

Reproduce with: python experiments/quant_gptq_strided.py

GPTQ (Frantar et al., 2022) applies Optimal Brain Surgeon error compensation to sequential column-block quantization. After quantizing each block of columns, the quantization error is propagated to remaining columns using the inverse Hessian, minimizing the total output error.

Our implementation uses torch.as_strided for zero-copy sub-matrix views during error propagation. The GPTQ block size equals the quantization block size, so each column block is quantized and its error immediately compensated across all remaining columns:

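# After quantizing columns [cs:ce], propagate error via as_strided views.
# bs = ce - cs (block width), rem = K - ce (remaining columns).
# H_inv is (K, K) and W is (M, K), both contiguous row-major.
h_cross = torch.as_strided(H_inv, (bs, rem), (K, 1), storage_offset=cs*K + ce)  # H_inv[cs:ce, ce:]
w_rem   = torch.as_strided(W,     (M, rem),  (K, 1), storage_offset=ce)         # W[:, ce:]
w_rem.sub_(err @ h_cross)  # in-place, zero-copy; err is (M, bs)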
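The stride and offset arithmetic can be checked without a GPU. This plain-Python sketch materializes the same (size, stride, offset) mapping from a flat row-major buffer and compares it against ordinary slicing. `strided_view` is a hypothetical helper that copies for illustration, whereas the real view shares storage:

```python
def strided_view(flat, size, stride, offset):
    """Element (i, j) of the view reads flat[offset + i*stride[0] + j*stride[1]]."""
    rows, cols = size
    row_stride, col_stride = stride
    return [
        [flat[offset + i * row_stride + j * col_stride] for j in range(cols)]
        for i in range(rows)
    ]

K, bs = 6, 2
cs = 2             # first column of the block just quantized
ce = cs + bs       # one past its last column
rem = K - ce       # number of remaining columns

# Toy K x K matrix whose entries encode their own (row, col) position.
H_inv = [[10 * r + c for c in range(K)] for r in range(K)]
flat = [v for row in H_inv for v in row]

h_cross = strided_view(flat, (bs, rem), (K, 1), cs * K + ce)
expected = [row[ce:] for row in H_inv[cs:ce]]  # H_inv[cs:ce, ce:]
```

The offset cs*K + ce points at row cs, column ce, and the row stride K steps down one full row, so the view covers rows cs..ce-1 and columns ce..K-1.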

Three modes are compared: baseline (no GPTQ), sequential GPTQ (natural column order), and ordered GPTQ (column blocks sorted by descending Hessian-weighted quantization loss, so the highest-error blocks are quantized first and their error is compensated across the most remaining columns).
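The ordering step of the third mode can be sketched as below, assuming the per-block loss is the sum over weight rows of e^T H_bb e (the exact loss definition is an assumption, and the helper names are hypothetical). A real implementation would also have to permute W and the Hessian consistently, which this sketch omits:

```python
def h_weighted_loss(err_rows, H_block):
    """Sum over weight rows of e^T H_bb e for one column block."""
    n = len(H_block)
    return sum(
        e[i] * H_block[i][j] * e[j]
        for e in err_rows for i in range(n) for j in range(n)
    )

def block_order(losses):
    """Block indices sorted by descending loss (highest-loss block first)."""
    return sorted(range(len(losses)), key=lambda b: losses[b], reverse=True)

# Toy example: three column blocks of width 2, two weight rows each.
errs = [
    [[0.1, 0.0], [0.0, 0.1]],    # block 0
    [[0.3, -0.2], [0.1, 0.4]],   # block 1
    [[0.2, 0.1], [-0.1, 0.2]],   # block 2
]
H_blocks = [[[1.0, 0.2], [0.2, 1.0]]] * 3
losses = [h_weighted_loss(e, H) for e, H in zip(errs, H_blocks)]
order = block_order(losses)
```

For this toy input the order is block 1, then block 2, then block 0, so the block with the largest loss has the most remaining columns available to absorb its error.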

Block Size 16

| Format | Approach | GPTQ | Weight Error | Output Error | Time |
|---|---|---|---|---|---|
| NVFP4 | Naive | | 10.05% | 6.89% | 95 ms |
| NVFP4 | GPTQ+Naive | Seq | 12.58% | 5.53% | 402 ms |
| NVFP4 | GPTQ-Ord+Naive | Ord | 13.18% | 5.18% | 490 ms |
| NVFP4 | Optimal | | 8.74% | 6.04% | 7.4 s |
| NVFP4 | GPTQ+Optimal | Seq | 10.94% | 4.82% | 7.7 s |
| NVFP4 | GPTQ-Ord+Optimal | Ord | 11.45% | 4.52% | 15.1 s |
| NVFP4 | H-Optimal | | 9.37% | 5.34% | 7.7 s |
| NVFP4 | GPTQ+H-Optimal | Seq | 11.14% | 4.37% | 8.0 s |
| NVFP4 | GPTQ-Ord+H-Optimal | Ord | 11.53% | 4.21% | 15.9 s |
| MXFP4 | Naive | | 11.77% | 8.48% | 102 ms |
| MXFP4 | GPTQ+Naive | Seq | 14.61% | 6.67% | 400 ms |
| MXFP4 | GPTQ-Ord+Naive | Ord | 15.27% | 6.20% | 517 ms |
| MXFP4 | Optimal | | 11.02% | 7.67% | 6.7 s |
| MXFP4 | GPTQ+Optimal | Seq | 13.79% | 6.13% | 7.0 s |
| MXFP4 | GPTQ-Ord+Optimal | Ord | 14.43% | 5.72% | 13.8 s |
| MXFP4 | H-Optimal | | 11.10% | 7.62% | 6.9 s |
| MXFP4 | GPTQ+H-Optimal | Seq | 13.82% | 6.10% | 7.1 s |
| MXFP4 | GPTQ-Ord+H-Optimal | Ord | 14.45% | 5.71% | 14.0 s |
| NVINT4 | Naive | | 9.46% | 6.55% | 65 ms |
| NVINT4 | GPTQ+Naive | Seq | 11.84% | 5.23% | 376 ms |
| NVINT4 | GPTQ-Ord+Naive | Ord | 12.37% | 4.89% | 414 ms |
| NVINT4 | Optimal | | 9.20% | 6.40% | 5.6 s |
| NVINT4 | GPTQ+Optimal | Seq | 11.54% | 5.12% | 5.9 s |
| NVINT4 | GPTQ-Ord+Optimal | Ord | 12.06% | 4.76% | 11.5 s |
| NVINT4 | H-Optimal | | 9.60% | 6.04% | 5.9 s |
| NVINT4 | GPTQ+H-Optimal | Seq | 11.73% | 4.88% | 6.1 s |
| NVINT4 | GPTQ-Ord+H-Optimal | Ord | 12.20% | 4.65% | 12.0 s |

Block Size 32

| Format | Approach | GPTQ | Weight Error | Output Error | Time |
|---|---|---|---|---|---|
| NVFP4 | Naive | | 10.42% | 7.15% | 37 ms |
| NVFP4 | GPTQ+Naive | Seq | 13.04% | 5.74% | 272 ms |
| NVFP4 | GPTQ-Ord+Naive | Ord | 13.53% | 5.43% | 320 ms |
| NVFP4 | Optimal | | 9.57% | 6.61% | 3.6 s |
| NVFP4 | GPTQ+Optimal | Seq | 11.98% | 5.29% | 3.8 s |
| NVFP4 | GPTQ-Ord+Optimal | Ord | 12.42% | 5.01% | 7.3 s |
| NVFP4 | H-Optimal | | 10.16% | 6.02% | 3.7 s |
| NVFP4 | GPTQ+H-Optimal | Seq | 12.21% | 4.91% | 4.0 s |
| NVFP4 | GPTQ-Ord+H-Optimal | Ord | 12.57% | 4.75% | 7.7 s |
| MXFP4 | Naive | | 11.75% | 8.37% | 47 ms |
| MXFP4 | GPTQ+Naive | Seq | 14.62% | 6.62% | 273 ms |
| MXFP4 | GPTQ-Ord+Naive | Ord | 15.14% | 6.24% | 335 ms |
| MXFP4 | Optimal | | 11.32% | 7.91% | 3.4 s |
| MXFP4 | GPTQ+Optimal | Seq | 14.16% | 6.32% | 3.5 s |
| MXFP4 | GPTQ-Ord+Optimal | Ord | 14.66% | 5.95% | 6.8 s |
| MXFP4 | H-Optimal | | 11.42% | 7.80% | 3.4 s |
| MXFP4 | GPTQ+H-Optimal | Seq | 14.19% | 6.25% | 3.6 s |
| MXFP4 | GPTQ-Ord+H-Optimal | Ord | 14.68% | 5.92% | 7.0 s |
| NVINT4 | Naive | | 10.36% | 7.18% | 24 ms |
| NVINT4 | GPTQ+Naive | Seq | 13.00% | 5.72% | 248 ms |
| NVINT4 | GPTQ-Ord+Naive | Ord | 13.45% | 5.42% | 282 ms |
| NVINT4 | Optimal | | 10.13% | 7.10% | 2.8 s |
| NVINT4 | GPTQ+Optimal | Seq | 12.71% | 5.65% | 3.0 s |
| NVINT4 | GPTQ-Ord+Optimal | Ord | 13.14% | 5.33% | 5.8 s |
| NVINT4 | H-Optimal | | 10.59% | 6.92% | 2.9 s |
| NVINT4 | GPTQ+H-Optimal | Seq | 13.12% | 5.57% | 3.1 s |
| NVINT4 | GPTQ-Ord+H-Optimal | Ord | 13.54% | 5.34% | 6.0 s |

Ordered vs Sequential GPTQ

Additional output error reduction from reordering (pp over sequential):

| Format | Approach | BS=16 | BS=32 |
|---|---|---|---|
| NVFP4 | Naive | -0.34pp | -0.30pp |
| NVFP4 | Optimal | -0.31pp | -0.28pp |
| NVFP4 | H-Optimal | -0.16pp | -0.15pp |
| MXFP4 | Naive | -0.47pp | -0.38pp |
| MXFP4 | Optimal | -0.41pp | -0.37pp |
| MXFP4 | H-Optimal | -0.39pp | -0.33pp |
| NVINT4 | Naive | -0.34pp | -0.31pp |
| NVINT4 | Optimal | -0.36pp | -0.32pp |
| NVINT4 | H-Optimal | -0.24pp | -0.23pp |

Ordered GPTQ (quantizing highest-loss blocks first) consistently outperforms sequential GPTQ by 0.15–0.47pp. The gain is largest for MXFP4 (coarser scales create bigger per-block errors to redistribute) and for naive/optimal approaches (H-Optimal already concentrates error where it matters least, leaving less room for reordering to help). Weight error increases slightly more (~0.4–0.7pp over sequential) as a natural consequence of the stronger output-error optimization.

Exotic Scales

Reproduce with:

  • python experiments/quant_exotic_scales.py (no GPTQ)

  • python experiments/quant_gptq_exotic_scales.py (with GPTQ-Seq, GPTQ-Ord)

NVFP4 stores per-block scales in FP8 E4M3 (signed, 1+4+3 bits, 126 positive values). Scales are always non-negative, so the sign bit is wasted. We try two unsigned 8-bit alternatives that re-purpose the sign bit:

  • UE4M4 – 4-exp, 4-mantissa, bias 7. Trades the sign for one extra mantissa bit. Same dynamic range as E4M3 (max \(\approx\) 496 vs 448), but 2x denser scale grid (255 distinct positive values).

  • UE5M3 – 5-exp, 3-mantissa, bias 15. Same mantissa precision as E4M3 but much wider dynamic range (max \(\approx\) 122880). Also 255 positive values.

All codes are treated as finite (no NaN/Inf reserved). The FP4 codebook \(\{0, 0.5, 1, 1.5, 2, 3, 4, 6\}\) is unchanged; only the per-block scale representation differs. Each table below crosses {Naive, SSE-Optimal, H-Optimal} with {no-GPTQ, GPTQ-Seq, GPTQ-Ord} for each scale grid.
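The value counts and maxima quoted above can be reproduced by enumerating every code. The sketch assumes IEEE-style subnormals for exponent field 0; the subnormal convention is an assumption, although the counts and maxima do not depend on it:

```python
def unsigned_minifloat_values(exp_bits, man_bits, bias):
    """All positive values of an unsigned minifloat with every code finite."""
    vals = []
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            frac = m / 2 ** man_bits
            if e == 0:
                v = frac * 2.0 ** (1 - bias)        # subnormal (assumed IEEE-style)
            else:
                v = (1 + frac) * 2.0 ** (e - bias)  # normal
            if v > 0:
                vals.append(v)
    return sorted(vals)

ue4m4 = unsigned_minifloat_values(4, 4, 7)
ue5m3 = unsigned_minifloat_values(5, 3, 15)
```

Each list has 255 entries (256 codes minus zero), with maxima 496 for UE4M4 and 122880 for UE5M3.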

Block Size 16

| Scale | Approach | GPTQ | Weight Error | Output Error | Time |
|---|---|---|---|---|---|
| E4M3 | Naive | | 10.05% | 6.89% | 87 ms |
| E4M3 | GPTQ+Naive | Seq | 12.58% | 5.53% | 402 ms |
| E4M3 | GPTQ-Ord+Naive | Ord | 13.18% | 5.18% | 492 ms |
| E4M3 | Optimal | | 8.74% | 6.04% | 7.4 s |
| E4M3 | GPTQ+Optimal | Seq | 10.94% | 4.82% | 7.5 s |
| E4M3 | GPTQ-Ord+Optimal | Ord | 11.45% | 4.52% | 14.8 s |
| E4M3 | H-Optimal | | 9.37% | 5.34% | 7.6 s |
| E4M3 | GPTQ+H-Optimal | Seq | 11.14% | 4.37% | 7.9 s |
| E4M3 | GPTQ-Ord+H-Optimal | Ord | 11.53% | 4.21% | 15.5 s |
| UE4M4 | Naive | | 9.54% | 6.55% | 84 ms |
| UE4M4 | GPTQ+Naive | Seq | 11.97% | 5.25% | 394 ms |
| UE4M4 | GPTQ-Ord+Naive | Ord | 12.54% | 4.93% | 505 ms |
| UE4M4 | Optimal | | 8.19% | 5.66% | 14.1 s |
| UE4M4 | GPTQ+Optimal | Seq | 10.26% | 4.52% | 14.5 s |
| UE4M4 | GPTQ-Ord+Optimal | Ord | 10.75% | 4.23% | 28.3 s |
| UE4M4 | H-Optimal | | 8.95% | 4.97% | 14.7 s |
| UE4M4 | GPTQ+H-Optimal | Seq | 10.58% | 4.08% | 15.1 s |
| UE4M4 | GPTQ-Ord+H-Optimal | Ord | 10.94% | 3.94% | 29.9 s |
| UE5M3 | Naive | | 9.47% | 6.51% | 85 ms |
| UE5M3 | GPTQ+Naive | Seq | 11.89% | 5.22% | 393 ms |
| UE5M3 | GPTQ-Ord+Naive | Ord | 12.46% | 4.89% | 504 ms |
| UE5M3 | Optimal | | 8.13% | 5.63% | 12.5 s |
| UE5M3 | GPTQ+Optimal | Seq | 10.19% | 4.49% | 12.7 s |
| UE5M3 | GPTQ-Ord+Optimal | Ord | 10.67% | 4.21% | 25.1 s |
| UE5M3 | H-Optimal | | 8.92% | 4.99% | 12.8 s |
| UE5M3 | GPTQ+H-Optimal | Seq | 10.56% | 4.09% | 13.2 s |
| UE5M3 | GPTQ-Ord+H-Optimal | Ord | 10.92% | 3.95% | 25.9 s |

Block Size 32

| Scale | Approach | GPTQ | Weight Error | Output Error | Time |
|---|---|---|---|---|---|
| E4M3 | Naive | | 10.42% | 7.15% | 37 ms |
| E4M3 | GPTQ+Naive | Seq | 13.04% | 5.74% | 271 ms |
| E4M3 | GPTQ-Ord+Naive | Ord | 13.53% | 5.43% | 318 ms |
| E4M3 | Optimal | | 9.57% | 6.61% | 3.5 s |
| E4M3 | GPTQ+Optimal | Seq | 11.98% | 5.29% | 3.7 s |
| E4M3 | GPTQ-Ord+Optimal | Ord | 12.42% | 5.01% | 7.2 s |
| E4M3 | H-Optimal | | 10.16% | 6.02% | 3.7 s |
| E4M3 | GPTQ+H-Optimal | Seq | 12.21% | 4.91% | 3.9 s |
| E4M3 | GPTQ-Ord+H-Optimal | Ord | 12.57% | 4.75% | 7.5 s |
| UE4M4 | Naive | | 10.18% | 6.99% | 42 ms |
| UE4M4 | GPTQ+Naive | Seq | 12.76% | 5.61% | 271 ms |
| UE4M4 | GPTQ-Ord+Naive | Ord | 13.24% | 5.31% | 326 ms |
| UE4M4 | Optimal | | 9.16% | 6.32% | 6.8 s |
| UE4M4 | GPTQ+Optimal | Seq | 11.47% | 5.06% | 7.0 s |
| UE4M4 | GPTQ-Ord+Optimal | Ord | 11.88% | 4.79% | 13.8 s |
| UE4M4 | H-Optimal | | 9.90% | 5.73% | 7.1 s |
| UE4M4 | GPTQ+H-Optimal | Seq | 11.85% | 4.70% | 7.3 s |
| UE4M4 | GPTQ-Ord+H-Optimal | Ord | 12.19% | 4.56% | 14.4 s |
| UE5M3 | Naive | | 10.16% | 6.98% | 42 ms |
| UE5M3 | GPTQ+Naive | Seq | 12.74% | 5.61% | 271 ms |
| UE5M3 | GPTQ-Ord+Naive | Ord | 13.22% | 5.31% | 327 ms |
| UE5M3 | Optimal | | 9.14% | 6.31% | 5.9 s |
| UE5M3 | GPTQ+Optimal | Seq | 11.44% | 5.06% | 6.2 s |
| UE5M3 | GPTQ-Ord+Optimal | Ord | 11.86% | 4.78% | 12.1 s |
| UE5M3 | H-Optimal | | 9.89% | 5.75% | 6.1 s |
| UE5M3 | GPTQ+H-Optimal | Seq | 11.86% | 4.71% | 6.3 s |
| UE5M3 | GPTQ-Ord+H-Optimal | Ord | 12.20% | 4.58% | 12.3 s |

Best output error per scale

| Scale | BS=16 best | BS=32 best |
|---|---|---|
| E4M3 | 4.21% | 4.75% |
| UE4M4 | 3.94% | 4.56% |
| UE5M3 | 3.95% | 4.58% |

(All bests are achieved by GPTQ-Ord+H-Optimal.)

Output error reduction vs E4M3 (same approach + mode)

| Approach + mode | BS=16: UE4M4 | BS=16: UE5M3 | BS=32: UE4M4 | BS=32: UE5M3 |
|---|---|---|---|---|
| Naive (no GPTQ) | -0.34pp | -0.38pp | -0.16pp | -0.17pp |
| Optimal (no GPTQ) | -0.38pp | -0.41pp | -0.29pp | -0.30pp |
| H-Optimal (no GPTQ) | -0.37pp | -0.35pp | -0.29pp | -0.27pp |
| GPTQ+Naive | -0.28pp | -0.31pp | -0.13pp | -0.13pp |
| GPTQ+Optimal | -0.30pp | -0.33pp | -0.23pp | -0.23pp |
| GPTQ+H-Optimal | -0.29pp | -0.28pp | -0.21pp | -0.20pp |
| GPTQ-Ord+Naive | -0.25pp | -0.29pp | -0.12pp | -0.12pp |
| GPTQ-Ord+Optimal | -0.29pp | -0.31pp | -0.22pp | -0.23pp |
| GPTQ-Ord+H-Optimal | -0.27pp | -0.26pp | -0.19pp | -0.17pp |

Both unsigned formats beat E4M3 across every approach × mode × block size. The relative gain shrinks somewhat once GPTQ is layered on (GPTQ already compensates for some of the per-block scale-snapping loss), but the absolute output error keeps falling – the best result of every scale grid is GPTQ-Ord+H-Optimal, and UE4M4/UE5M3 still beat E4M3 there by 0.17–0.27pp.

UE4M4 and UE5M3 perform almost identically (within 0.04pp) across the full grid, even though UE5M3 has \(\sim\)250x more dynamic range. Weight magnitudes in this layer fall well within E4M3’s range, so extra range is wasted – what matters is grid density near the optimal scale, and both formats offer roughly twice as many positive scale codes as E4M3 (255 vs 126).

Caveat: standard FP8 E4M3 hardware support exists on Hopper/Ada; UE4M4 and UE5M3 do not have hardware encoders, so naive-mode quantization is slower in production (the snap requires a table lookup rather than a hardware cast). Optimal/H-Optimal modes are unaffected since they iterate over the scale table either way; the ~2x slowdown there is purely from doubling the candidate count (255 vs 126).

Larger blocks: bs=64 (E4M3 scales) and bs=128 (FP16 scales)

Same layer_0 setup; W = 2560x9728, X = 244449x9728. Layouts here trade scale precision and block size for the same total bits/weight:

| Config | Block | Scale | Scale b/w | Total b/w |
|---|---|---|---|---|
| Baseline NVFP4 | 16 | FP8 E4M3 | 0.500 | 4.500 |
| | 32 | FP8 E4M3 | 0.250 | 4.250 |
| New | 64 | FP8 E4M3 | 0.125 | 4.125 |
| New | 128 | FP16 E5M10 | 0.125 | 4.125 |
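The bits-per-weight figures follow from amortizing one scale over each block. A minimal sketch, which ignores any per-tensor scale (an assumption, since its cost is negligible at this matrix size):

```python
def bits_per_weight(elem_bits, scale_bits, block_size):
    """Element bits plus the per-block scale amortized over the block."""
    return elem_bits + scale_bits / block_size

layouts = {
    "bs=16, FP8 scale": bits_per_weight(4, 8, 16),
    "bs=32, FP8 scale": bits_per_weight(4, 8, 32),
    "bs=64, FP8 scale": bits_per_weight(4, 8, 64),
    "bs=128, FP16 scale": bits_per_weight(4, 16, 128),
}
```

Doubling the scale width while doubling the block size leaves the budget unchanged, which is why bs=64 with FP8 scales and bs=128 with FP16 scales both land at 4.125 b/w.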

For bs=128 + FP16, the snapped continuous optimum is essentially the true SSE / H-optimal minimum, so per-block scales are found by iterative alternation (q -> closed-form continuous s -> fp16 snap) rather than grid search.
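A sketch of the alternation for the SSE case on one FP4 block, assuming the closed-form step is the least-squares scale s = <w, q> / <q, q> and using `struct`'s half-precision format for the FP16 snap. The iteration count, the naive starting point, and the function names are assumptions, and the block is assumed not to be all zeros:

```python
import struct

FP4 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def codes(w, s):
    """Nearest signed FP4 code for each w / s."""
    out = []
    for v in w:
        mag = min(FP4, key=lambda c: abs(abs(v) / s - c))
        out.append(mag if v >= 0 else -mag)
    return out

def sse(w, q, s):
    """Squared error between w and its dequantized codes q * s."""
    return sum((a - b * s) ** 2 for a, b in zip(w, q))

def alternate(w, iters=5):
    """Alternate q -> closed-form s -> fp16 snap, keeping the best (sse, s, q)."""
    s = to_fp16(max(abs(v) for v in w) / 6.0)  # naive start: amax maps to 6
    q = codes(w, s)
    best = (sse(w, q, s), s, q)
    for _ in range(iters):
        qq = sum(b * b for b in q)
        if qq == 0:
            break
        s = to_fp16(sum(a * b for a, b in zip(w, q)) / qq)  # LS scale, then snap
        if s <= 0:
            break
        q = codes(w, s)
        best = min(best, (sse(w, q, s), s, q))
    return best

w = [0.9, -0.1, 0.3, 0.05, -0.6, 0.2]
naive_s = to_fp16(max(abs(v) for v in w) / 6.0)
naive_sse = sse(w, codes(w, naive_s), naive_s)
best_sse, best_s, best_q = alternate(w)
```

Keeping the best iterate guards against the FP16 snap occasionally making a step slightly worse. For the H-optimal variant the inner products would be replaced by their Hessian-weighted counterparts.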

bs=64, FP8 E4M3 scales (4.125 b/w)

| Codebook | Approach | GPTQ | Weight Error | Output Error | Time |
|---|---|---|---|---|---|
| FP4 | Naive | | 10.77% | 7.41% | 42 ms |
| FP4 | Optimal | | 10.19% | 7.05% | 1.0 s |
| FP4 | H-Optimal | | 10.66% | 6.48% | 21.8 s |
| FP4 | GPTQ-Ord+H-Optimal | Ord | 13.34% | 5.22% | 202 s |
| FP4 | GPTQ-Ord+H-Opt+SPGL1 | Ord | 13.28% | 4.65% | 373 s |
| INT4 | Naive | | 11.37% | 7.88% | 9 ms |
| INT4 | Optimal | | 10.89% | 7.71% | 687 ms |
| INT4 | H-Optimal | | 11.37% | 7.23% | 21.6 s |
| INT4 | GPTQ-Ord+H-Optimal | Ord | 14.77% | 5.99% | 145 s |
| INT4 | GPTQ-Ord+H-Opt+SPGL1 | Ord | 15.24% | 5.22% | 367 s |

bs=128, FP16 E5M10 scales (4.125 b/w)

| Codebook | Approach | GPTQ | Weight Error | Output Error | Time |
|---|---|---|---|---|---|
| FP4 | Naive | | 11.00% | 7.56% | 8 ms |
| FP4 | Optimal (iter) | | 10.56% | 7.33% | 61 ms |
| FP4 | H-Optimal (iter) | | 10.76% | 6.88% | 158 ms |
| FP4 | GPTQ-Seq+Naive | Seq | 13.76% | 6.08% | 706 ms |
| FP4 | GPTQ-Ord+H-Optimal | Ord | 13.47% | 5.47% | 621 ms |
| FP4 | GPTQ-Ord+H-Opt+SPGL1 | Ord | 13.85% | 4.77% | 30.4 s |
| INT4 | Naive | | 12.31% | 8.54% | 6 ms |
| INT4 | Optimal (iter) | | 11.49% | 8.27% | 80 ms |
| INT4 | H-Optimal (iter) | | 11.79% | 7.87% | 10.1 s |
| INT4 | GPTQ-Seq+Naive | Seq | 15.50% | 6.84% | 361 ms |
| INT4 | GPTQ-Ord+H-Optimal | Ord | 14.93% | 6.17% | 8.3 s |
| INT4 | GPTQ-Ord+H-Opt+SPGL1 | Ord | 15.37% | 5.35% | 110 s |

Same-budget head-to-head (4.125 b/w)

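| Codebook | Config | GPTQ-Ord+H-Opt Output Error (%) | +SPGL1 Output Error (%) | Δ from SPGL1 (pp) |
|---|---|---|---|---|
| FP4 | bs=64, E4M3 | 5.22 | 4.65 | -0.57 |
| FP4 | bs=128, FP16 | 5.47 | 4.77 | -0.70 |
| INT4 | bs=64, E4M3 | 5.99 | 5.22 | -0.77 |
| INT4 | bs=128, FP16 | 6.17 | 5.35 | -0.82 |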

At the same 4.125 b/w budget, bs=64 + E4M3 beats bs=128 + FP16 for both codebooks (with SPGL1, FP4: -0.12pp, INT4: -0.13pp; without it, -0.25pp and -0.18pp) – coarser scale precision but tighter per-block fit wins out. SPGL1 contributes a slightly larger absolute gain at the bs=128/FP16 point (0.70–0.82pp) than at bs=64/E4M3 (0.57–0.77pp), but not enough to flip the ordering.

Where these land vs the bs=16 best

| Config | b/w | H-Opt Output Error (%) | +SPGL1 Output Error (%) |
|---|---|---|---|
| FP4 bs=16, E4M3 (NVFP4) | 4.500 | 5.31 | 3.64 |
| FP4 bs=64, E4M3 | 4.125 | 6.48 | 4.65 |
| FP4 bs=128, FP16 | 4.125 | 6.88 | 4.77 |
| INT4 bs=16, E4M3 (NVINT4) | 4.500 | 5.60 | (not run) |
| INT4 bs=64, E4M3 | 4.125 | 7.23 | 5.22 |
| INT4 bs=128, FP16 | 4.125 | 7.87 | 5.35 |

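FP4 beats INT4 at every operating point in the table. At 4.125 b/w the gap is ~0.6–1.0pp output error (0.75–0.99pp with H-Opt alone, 0.57–0.58pp with SPGL1); at bs=16 the H-Opt gap is smaller (~0.3pp). FP4's non-uniform codebook spans a wider dynamic range (largest-to-smallest nonzero magnitude 6/0.5 = 12, vs 7/1 = 7 for the symmetric INT4 levels 0..7), which more efficiently captures the long-tailed per-block weight distribution at larger block sizes.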

SPGL1 compensation method (recap)

After each block is snapped (in descending H-loss order), instead of GPTQ’s unconstrained H_inv error propagation, an L1-constrained SPGL1 LASSO is solved on the not-yet-snapped columns to minimize \(\lVert X (\Delta_{\text{eff}} + \delta)^T \rVert_2\) subject to \(\lVert \delta \rVert_1 \le \tau\). It is solved in reduced (Gram) form, with no \(H^{-1}\) and no Cholesky factorization, which makes it robust to ill-conditioned \(H\). See experiments/spgl1_gptq_*.py for the implementations and notes/progress_track_spgl1.md for the full research log.
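The Gram-form reduction can be read off from expanding the squared norm. The identity below assumes \(H = X^T X\) and is written for a single row; the symbol \(d\) (one row of Delta_eff, with \(\delta\) supported on the not-yet-snapped columns) is introduced here for illustration:

```latex
\begin{aligned}
\lVert X (d + \delta)^T \rVert_2^2 &= (d + \delta)\, H\, (d + \delta)^T, \qquad H = X^T X, \\
\nabla_{\delta}\, \tfrac{1}{2} \lVert X (d + \delta)^T \rVert_2^2 &= (d + \delta)\, H .
\end{aligned}
```

Both the objective and its gradient need only products with the K x K matrix \(H\), so a projected-gradient solver never touches the full activation matrix or an inverse of \(H\).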