# Results

> Reproduce with: `python bench/full_bench.py`

```{note}
Benchmarked on the `down_proj` weight of the first decoder layer from Qwen3-4B
(W: 2560x9728, bfloat16), with activations collected from WikiText-2
(max_seq_len=512, num_samples=2048, X: 244449x9728, bfloat16).
```

- **Weight error**: $\lVert Q(W) - W \rVert_F / \lVert W \rVert_F$
- **Output error**: $\lVert X W_q^T - X W^T \rVert_F / \lVert X W^T \rVert_F$

## INT8 (FP8 E4M3 scales)

Symmetric INT8 quantization ([-127, 127]) with per-block amax stored in FP8 E4M3. The effective scale is ``amax_fp8 / 127``, keeping the stored value within FP8 range while the division by 127 is performed in float32.

| Implementation | Block Size | Weight Error | Output Error | Time |
|:--|:--:|:--:|:--:|--:|
| Naive (torch) | 32 | 1.01% | 0.79% | 1.7 ms |
| SSE-Optimal (torch) | 32 | 0.57% | 0.40% | 236 ms |
| H-Optimal (torch) | 32 | 0.60% | **0.37%** | 1.2 s |
| Naive (torch) | 64 | 0.93% | 0.72% | 1.7 ms |
| SSE-Optimal (torch) | 64 | 0.64% | 0.45% | 204 ms |
| H-Optimal (torch) | 64 | 0.66% | **0.42%** | 1.5 s |
| Naive (torch) | 128 | 0.88% | 0.68% | 1.6 ms |
| SSE-Optimal (torch) | 128 | 0.71% | 0.49% | 173 ms |
| H-Optimal (torch) | 128 | 0.73% | **0.48%** | 2.8 s |
| Naive (torch) | 256 | 0.87% | 0.66% | 1.6 ms |
| SSE-Optimal (torch) | 256 | 0.77% | 0.54% | 165 ms |
| H-Optimal (torch) | 256 | 0.79% | **0.52%** | 4.9 s |

**SSE-Optimal vs Naive** (output error reduction):

- Block size 32: **-49.9%** (0.79% $\to$ 0.40%)
- Block size 64: **-38.1%** (0.72% $\to$ 0.45%)
- Block size 128: **-27.6%** (0.68% $\to$ 0.49%)
- Block size 256: **-18.6%** (0.66% $\to$ 0.54%)

**H-Optimal vs SSE-Optimal** (further output error reduction):

- Block size 32: **+7.0%** further reduction (0.40% $\to$ 0.37%)
- Block size 64: **+4.8%** further reduction (0.45% $\to$ 0.42%)
- Block size 128: **+3.3%** further reduction (0.49% $\to$ 0.48%)
- Block size 256: **+2.4%** further reduction (0.54% $\to$ 0.52%)

**H-Optimal vs Naive** (total
output error reduction):

- Block size 32: **-53.4%** (0.79% $\to$ 0.37%)
- Block size 64: **-41.1%** (0.72% $\to$ 0.42%)
- Block size 128: **-29.9%** (0.68% $\to$ 0.48%)
- Block size 256: **-20.6%** (0.66% $\to$ 0.52%)

The massive naive-to-optimal improvement (up to 50%) is driven by the FP8 E4M3 scale grid: with only 126 discrete scale values, the naive ``amax`` snap often lands on a scale that is significantly suboptimal, and the bounded search finds a much better candidate. This is analogous to NVFP4's scale search, but the effect is even stronger because INT8's 127 quantization levels amplify scale misalignment (a scale error of $\delta$ causes $127\delta$ in the worst case, vs $6\delta$ for FP4). H-Optimal provides a further 2--7% reduction over SSE-Optimal by prioritizing output-sensitive weights.

## NVFP4 (FP8 E4M3 scales)

| Implementation | Block Size | Weight Error | Output Error | Time | Speedup |
|:--|:--:|:--:|:--:|--:|--:|
| Naive (torch) | 16 | 10.05% | 6.89% | 2.8 ms | |
| Naive (Triton) | 16 | 10.05% | 6.89% | 1.9 ms | 1.5x |
| SSE-Optimal (torch) | 16 | 8.74% | 6.04% | 234 ms | |
| SSE-Optimal (Triton) | 16 | 8.74% | 6.04% | 33 ms | **7.0x** |
| H-Optimal (torch) | 16 | 9.35% | **5.31%** | 866 ms | |
| H-Optimal (Triton) | 16 | 9.35% | **5.31%** | 470 ms | 1.8x |
| Naive (torch) | 32 | 10.42% | 7.15% | 2.9 ms | |
| Naive (Triton) | 32 | 10.42% | 7.15% | 1.2 ms | 2.4x |
| SSE-Optimal (torch) | 32 | 9.57% | 6.61% | 179 ms | |
| SSE-Optimal (Triton) | 32 | 9.57% | 6.61% | 18 ms | **10.2x** |
| H-Optimal (torch) | 32 | 10.12% | **5.95%** | 676 ms | |
| H-Optimal (Triton) | 32 | 10.12% | **5.95%** | 236 ms | 2.9x |

**H-Optimal vs SSE-Optimal** (output error reduction):

- Block size 16: **+12.0%** further reduction (6.04% $\to$ 5.31%)
- Block size 32: **+10.0%** further reduction (6.61% $\to$ 5.95%)

**H-Optimal vs Naive** (total output error reduction):

- Block size 16: **-22.9%** (6.89% $\to$ 5.31%)
- Block size 32: **-16.7%** (7.15% $\to$ 5.95%)

Weight
error increases slightly relative to SSE-Optimal (by 0.5--0.6pp) because H-Optimal optimizes for output error rather than weight error. This is the correct trade-off: a model's quality depends on output error, not weight error.

## MXFP4 (UE8M0 power-of-2 scales)

| Implementation | Block Size | Weight Error | Output Error | Time | Speedup |
|:--|:--:|:--:|:--:|--:|--:|
| Naive (torch) | 16 | 11.77% | 8.48% | 3.0 ms | |
| Naive (Triton) | 16 | 11.77% | 8.48% | 1.8 ms | 1.7x |
| SSE-Optimal (torch) | 16 | 11.02% | 7.67% | 86 ms | |
| SSE-Optimal (Triton) | 16 | 11.02% | 7.67% | 2.6 ms | **33.6x** |
| H-Optimal (torch) | 16 | 11.10% | **7.62%** | 545 ms | |
| Naive (torch) | 32 | 11.75% | 8.37% | 3.0 ms | |
| Naive (Triton) | 32 | 11.75% | 8.37% | 1.2 ms | 2.6x |
| SSE-Optimal (torch) | 32 | 11.32% | 7.91% | 74 ms | |
| SSE-Optimal (Triton) | 32 | 11.32% | 7.91% | 1.6 ms | **45.7x** |
| H-Optimal (torch) | 32 | 11.42% | **7.80%** | 361 ms | |

**H-Optimal vs SSE-Optimal** (output error reduction):

- Block size 16: **+0.7%** further reduction (7.67% $\to$ 7.62%)
- Block size 32: **+1.4%** further reduction (7.91% $\to$ 7.80%)

The improvement is much smaller for MXFP4 because UE8M0 scales are powers of 2 -- consecutive scales differ by a factor of 2, leaving only 1--2 candidates near the optimum. With so few choices, the Hessian criterion rarely selects a different scale than SSE.

## Why NVFP4 benefits much more from Hessian-awareness

FP8 E4M3 has 126 finely-spaced positive scale values with non-uniform spacing. The SSE-optimal and H-optimal scales can differ by several FP8 steps, because the Hessian re-weights the importance of each element. With UE8M0's coarse power-of-2 grid, this re-weighting almost always lands on the same scale.

## Correctness notes

**NVFP4 Triton vs Python reference**: Scale computation matches exactly (0 disagreements). The max element-level abs diff (~5e-2) comes from FP4 decision-boundary tie-breaking: when $|x|/s$ lands exactly on a codebook boundary (e.g.
0.75, 1.75, 3.5), the PTX ``div.full.f32`` and PyTorch ``/`` produce results that round to different FP4 values. This occurs in ~0.01% of elements and does not change the reported error metrics.

**MXFP4 Triton vs Python reference**: Naive kernel matches exactly (0.00 max abs diff). For the optimal kernel, in rare tie-breaking cases (1 in ~800k blocks), ``tl.sum`` tree reduction and PyTorch sequential ``.sum()`` accumulate float32 rounding differently, causing one to pick ``s0`` and the other ``2*s0`` when their SSEs are identical. This produces a max abs diff of one scale step but does not change the reported error metrics.

## GPTQ Quantization

> Reproduce with: `python experiments/quant_gptq_strided.py`

GPTQ (Frantar et al., 2022) applies Optimal Brain Surgeon error compensation to sequential column-block quantization. After quantizing each block of columns, the quantization error is propagated to the remaining columns using the inverse Hessian, minimizing the total output error.

Our implementation uses `torch.as_strided` for zero-copy sub-matrix views during error propagation. The GPTQ block size equals the quantization block size, so each column block is quantized and its error immediately compensated across all remaining columns:

```python
# After quantizing columns [cs:ce], propagate error via as_strided views.
# Assumes H_inv (K x K) and W (M x K) are contiguous row-major,
# with bs = ce - cs and rem = K - ce.
h_cross = torch.as_strided(H_inv, (bs, rem), (K, 1), storage_offset=cs*K + ce)
w_rem = torch.as_strided(W, (M, rem), (K, 1), storage_offset=ce)
w_rem.sub_(err @ h_cross)  # in-place, zero-copy
```

Three modes are compared: **baseline** (no GPTQ), **sequential** GPTQ (natural column order), and **ordered** GPTQ (column blocks sorted by descending Hessian-weighted quantization loss, so the highest-error blocks are quantized first and their error is compensated across the most remaining columns).
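As a rough illustration of the block-wise error propagation described above, here is a minimal NumPy sketch. It is not the code in `experiments/quant_gptq_strided.py`: the function names, the toy round-to-nearest quantizer, the damping term, and the explicit Schur-complement downdate of the inverse Hessian are all assumptions made for this example.

```python
import numpy as np

def quantize_block(w_blk, levels=7):
    """Toy symmetric round-to-nearest quantizer: one amax scale per row of the block."""
    amax = np.abs(w_blk).max(axis=1, keepdims=True)
    scale = np.where(amax > 0, amax / levels, 1.0)
    return np.clip(np.round(w_blk / scale), -levels, levels) * scale

def gptq_blockwise(W, H, bs, damp=1e-2):
    """Sequential block GPTQ sketch (hypothetical helper, natural column order)."""
    W = W.astype(np.float64).copy()
    K = W.shape[1]
    # Damping keeps the Hessian well conditioned (an assumption of this sketch).
    H = H + damp * np.mean(np.diag(H)) * np.eye(K)
    Hinv = np.linalg.inv(H)
    Q = np.empty_like(W)
    for cs in range(0, K, bs):
        ce = min(cs + bs, K)
        Q[:, cs:ce] = quantize_block(W[:, cs:ce])
        if ce == K:
            break
        h_bb = Hinv[cs:ce, cs:ce]      # inverse-Hessian block of the quantized columns
        h_cross = Hinv[cs:ce, ce:]     # cross block to the remaining columns
        # err = (W_blk - Q_blk) @ inv(h_bb); h_bb is symmetric, so solve on the transpose.
        err = np.linalg.solve(h_bb, (W[:, cs:ce] - Q[:, cs:ce]).T).T
        W[:, ce:] -= err @ h_cross     # push the error onto the remaining columns
        # Downdate so Hinv[ce:, ce:] becomes the inverse of H[ce:, ce:].
        Hinv[ce:, ce:] -= h_cross.T @ np.linalg.solve(h_bb, h_cross)
    return Q

def rel_out_err(X, W, Q):
    """Output error as defined at the top of this page."""
    return np.linalg.norm(X @ (Q - W).T) / np.linalg.norm(X @ W.T)

# Synthetic, strongly correlated activations (illustrative data only).
rng = np.random.default_rng(0)
N, K, M, bs = 256, 64, 16, 4
X = rng.standard_normal((N, 4)) @ rng.standard_normal((4, K))
X += 0.05 * rng.standard_normal((N, K))
W = rng.standard_normal((M, K))

Q_rtn = np.concatenate([quantize_block(W[:, c:c + bs]) for c in range(0, K, bs)], axis=1)
Q_gptq = gptq_blockwise(W, X.T @ X, bs)
print("baseline output error:", rel_out_err(X, W, Q_rtn))
print("GPTQ output error:    ", rel_out_err(X, W, Q_gptq))
```

The ordered variant would only change the order in which column blocks are visited (and permute `H` accordingly); the compensation step itself stays the same.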
### Block Size 16

| Format | Approach | GPTQ | Weight Error | Output Error | Time |
|:--|:--|:--:|:--:|:--:|--:|
| NVFP4 | Naive | -- | 10.05% | 6.89% | 95ms |
| NVFP4 | GPTQ+Naive | Seq | 12.58% | 5.53% | 402ms |
| NVFP4 | GPTQ-Ord+Naive | Ord | 13.18% | 5.18% | 490ms |
| NVFP4 | Optimal | -- | 8.74% | 6.04% | 7.4s |
| NVFP4 | GPTQ+Optimal | Seq | 10.94% | 4.82% | 7.7s |
| NVFP4 | GPTQ-Ord+Optimal | Ord | 11.45% | 4.52% | 15.1s |
| NVFP4 | H-Optimal | -- | 9.37% | 5.34% | 7.7s |
| NVFP4 | GPTQ+H-Optimal | Seq | 11.14% | 4.37% | 8.0s |
| NVFP4 | GPTQ-Ord+H-Optimal | Ord | 11.53% | **4.21%** | 15.9s |
| MXFP4 | Naive | -- | 11.77% | 8.48% | 102ms |
| MXFP4 | GPTQ+Naive | Seq | 14.61% | 6.67% | 400ms |
| MXFP4 | GPTQ-Ord+Naive | Ord | 15.27% | 6.20% | 517ms |
| MXFP4 | Optimal | -- | 11.02% | 7.67% | 6.7s |
| MXFP4 | GPTQ+Optimal | Seq | 13.79% | 6.13% | 7.0s |
| MXFP4 | GPTQ-Ord+Optimal | Ord | 14.43% | 5.72% | 13.8s |
| MXFP4 | H-Optimal | -- | 11.10% | 7.62% | 6.9s |
| MXFP4 | GPTQ+H-Optimal | Seq | 13.82% | 6.10% | 7.1s |
| MXFP4 | GPTQ-Ord+H-Optimal | Ord | 14.45% | **5.71%** | 14.0s |
| NVINT4 | Naive | -- | 9.46% | 6.55% | 65ms |
| NVINT4 | GPTQ+Naive | Seq | 11.84% | 5.23% | 376ms |
| NVINT4 | GPTQ-Ord+Naive | Ord | 12.37% | 4.89% | 414ms |
| NVINT4 | Optimal | -- | 9.20% | 6.40% | 5.6s |
| NVINT4 | GPTQ+Optimal | Seq | 11.54% | 5.12% | 5.9s |
| NVINT4 | GPTQ-Ord+Optimal | Ord | 12.06% | 4.76% | 11.5s |
| NVINT4 | H-Optimal | -- | 9.60% | 6.04% | 5.9s |
| NVINT4 | GPTQ+H-Optimal | Seq | 11.73% | 4.88% | 6.1s |
| NVINT4 | GPTQ-Ord+H-Optimal | Ord | 12.20% | **4.65%** | 12.0s |

### Block Size 32

| Format | Approach | GPTQ | Weight Error | Output Error | Time |
|:--|:--|:--:|:--:|:--:|--:|
| NVFP4 | Naive | -- | 10.42% | 7.15% | 37ms |
| NVFP4 | GPTQ+Naive | Seq | 13.04% | 5.74% | 272ms |
| NVFP4 | GPTQ-Ord+Naive | Ord | 13.53% | 5.43% | 320ms |
| NVFP4 | Optimal | -- | 9.57% | 6.61% | 3.6s |
| NVFP4 | GPTQ+Optimal | Seq | 11.98% | 5.29% | 3.8s |
| NVFP4 | GPTQ-Ord+Optimal | Ord | 12.42% | 5.01% | 7.3s |
| NVFP4 | H-Optimal | -- | 10.16% | 6.02% | 3.7s |
| NVFP4 | GPTQ+H-Optimal | Seq | 12.21% | 4.91% | 4.0s |
| NVFP4 | GPTQ-Ord+H-Optimal | Ord | 12.57% | **4.75%** | 7.7s |
| MXFP4 | Naive | -- | 11.75% | 8.37% | 47ms |
| MXFP4 | GPTQ+Naive | Seq | 14.62% | 6.62% | 273ms |
| MXFP4 | GPTQ-Ord+Naive | Ord | 15.14% | 6.24% | 335ms |
| MXFP4 | Optimal | -- | 11.32% | 7.91% | 3.4s |
| MXFP4 | GPTQ+Optimal | Seq | 14.16% | 6.32% | 3.5s |
| MXFP4 | GPTQ-Ord+Optimal | Ord | 14.66% | 5.95% | 6.8s |
| MXFP4 | H-Optimal | -- | 11.42% | 7.80% | 3.4s |
| MXFP4 | GPTQ+H-Optimal | Seq | 14.19% | 6.25% | 3.6s |
| MXFP4 | GPTQ-Ord+H-Optimal | Ord | 14.68% | **5.92%** | 7.0s |
| NVINT4 | Naive | -- | 10.36% | 7.18% | 24ms |
| NVINT4 | GPTQ+Naive | Seq | 13.00% | 5.72% | 248ms |
| NVINT4 | GPTQ-Ord+Naive | Ord | 13.45% | 5.42% | 282ms |
| NVINT4 | Optimal | -- | 10.13% | 7.10% | 2.8s |
| NVINT4 | GPTQ+Optimal | Seq | 12.71% | 5.65% | 3.0s |
| NVINT4 | GPTQ-Ord+Optimal | Ord | 13.14% | **5.33%** | 5.8s |
| NVINT4 | H-Optimal | -- | 10.59% | 6.92% | 2.9s |
| NVINT4 | GPTQ+H-Optimal | Seq | 13.12% | 5.57% | 3.1s |
| NVINT4 | GPTQ-Ord+H-Optimal | Ord | 13.54% | 5.34% | 6.0s |

### Ordered vs Sequential GPTQ

Additional output error reduction from reordering (pp over sequential):

| Format | Approach | BS=16 | BS=32 |
|:--|:--|:--:|:--:|
| NVFP4 | Naive | **-0.34pp** | -0.30pp |
| NVFP4 | Optimal | **-0.31pp** | -0.28pp |
| NVFP4 | H-Optimal | -0.16pp | -0.15pp |
| MXFP4 | Naive | **-0.47pp** | **-0.38pp** |
| MXFP4 | Optimal | **-0.41pp** | **-0.37pp** |
| MXFP4 | H-Optimal | **-0.39pp** | -0.33pp |
| NVINT4 | Naive | **-0.34pp** | -0.31pp |
| NVINT4 | Optimal | **-0.36pp** | -0.32pp |
| NVINT4 | H-Optimal | -0.24pp | -0.23pp |

Ordered GPTQ (quantizing highest-loss blocks first) consistently outperforms sequential GPTQ by 0.15--0.47pp.
The gain is largest for MXFP4 (coarser scales create bigger per-block errors to redistribute) and for naive/optimal approaches (H-Optimal already concentrates error where it matters least, leaving less room for reordering to help). Weight error increases slightly more (~0.4--0.7pp over sequential) as a natural consequence of the stronger output-error optimization.

## Exotic Scales

> Reproduce with:
> - `python experiments/quant_exotic_scales.py` (no GPTQ)
> - `python experiments/quant_gptq_exotic_scales.py` (with GPTQ-Seq, GPTQ-Ord)

NVFP4 stores per-block scales in **FP8 E4M3** (signed, 1+4+3 bits, 126 positive values). Scales are always non-negative, so the sign bit is wasted. We try two unsigned 8-bit alternatives that re-purpose the sign bit:

- **UE4M4** -- 4-exp, 4-mantissa, bias 7. Trades the sign for one extra mantissa bit. Nearly the same dynamic range as E4M3 (max $\approx$ 496 vs 448), but **2x denser** scale grid (255 distinct positive values).
- **UE5M3** -- 5-exp, 3-mantissa, bias 15. Same mantissa precision as E4M3 but **much wider** dynamic range (max $\approx$ 122880). Also 255 positive values.

All codes are treated as finite (no NaN/Inf reserved). The FP4 codebook $\{0, 0.5, 1, 1.5, 2, 3, 4, 6\}$ is unchanged; only the per-block scale representation differs.

Each table below crosses {Naive, SSE-Optimal, H-Optimal} with {no-GPTQ, GPTQ-Seq, GPTQ-Ord} for each scale grid.
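The value counts and maxima quoted above can be checked by enumerating each grid. The sketch below is illustrative only: `minifloat_grid` is a hypothetical helper, and it assumes IEEE-style subnormals at exponent code 0 (consistent with one code being zero, which leaves 255 positive values). For E4M3 it drops the single top magnitude code, which the standard FP8 E4M3 encoding reserves for NaN.

```python
def minifloat_grid(exp_bits, man_bits, bias, reserved_top=0):
    """Sorted positive values of an unsigned minifloat (hypothetical helper).

    Assumes subnormals at exponent code 0. `reserved_top` removes that many
    of the largest codes (1 for E4M3, whose top magnitude code encodes NaN).
    """
    steps = 2 ** man_bits
    values = []
    for e in range(2 ** exp_bits):
        for m in range(steps):
            if e == 0:
                values.append((m / steps) * 2.0 ** (1 - bias))       # subnormal or zero
            else:
                values.append((1 + m / steps) * 2.0 ** (e - bias))   # normal
    values.sort()
    if reserved_top:
        values = values[:-reserved_top]
    return [v for v in values if v > 0]

grids = {
    "E4M3": minifloat_grid(4, 3, bias=7, reserved_top=1),
    "UE4M4": minifloat_grid(4, 4, bias=7),
    "UE5M3": minifloat_grid(5, 3, bias=15),
}
for name, grid in grids.items():
    print(name, len(grid), max(grid))
# E4M3 126 448.0
# UE4M4 255 496.0
# UE5M3 255 122880.0
```

In the Optimal and H-Optimal modes, a table like `grids["UE4M4"]` would serve as the candidate list for the per-block scale search.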
### Block Size 16

| Scale | Approach | GPTQ | Weight Error | Output Error | Time |
|:--|:--|:--:|:--:|:--:|--:|
| E4M3 | Naive | -- | 10.05% | 6.89% | 87ms |
| E4M3 | GPTQ+Naive | Seq | 12.58% | 5.53% | 402ms |
| E4M3 | GPTQ-Ord+Naive | Ord | 13.18% | 5.18% | 492ms |
| E4M3 | Optimal | -- | 8.74% | 6.04% | 7.4s |
| E4M3 | GPTQ+Optimal | Seq | 10.94% | 4.82% | 7.5s |
| E4M3 | GPTQ-Ord+Optimal | Ord | 11.45% | 4.52% | 14.8s |
| E4M3 | H-Optimal | -- | 9.37% | 5.34% | 7.6s |
| E4M3 | GPTQ+H-Optimal | Seq | 11.14% | 4.37% | 7.9s |
| E4M3 | GPTQ-Ord+H-Optimal | Ord | 11.53% | 4.21% | 15.5s |
| UE4M4 | Naive | -- | 9.54% | 6.55% | 84ms |
| UE4M4 | GPTQ+Naive | Seq | 11.97% | 5.25% | 394ms |
| UE4M4 | GPTQ-Ord+Naive | Ord | 12.54% | 4.93% | 505ms |
| UE4M4 | Optimal | -- | 8.19% | 5.66% | 14.1s |
| UE4M4 | GPTQ+Optimal | Seq | 10.26% | 4.52% | 14.5s |
| UE4M4 | GPTQ-Ord+Optimal | Ord | 10.75% | 4.23% | 28.3s |
| UE4M4 | H-Optimal | -- | 8.95% | 4.97% | 14.7s |
| UE4M4 | GPTQ+H-Optimal | Seq | 10.58% | 4.08% | 15.1s |
| UE4M4 | GPTQ-Ord+H-Optimal | Ord | 10.94% | **3.94%** | 29.9s |
| UE5M3 | Naive | -- | 9.47% | 6.51% | 85ms |
| UE5M3 | GPTQ+Naive | Seq | 11.89% | 5.22% | 393ms |
| UE5M3 | GPTQ-Ord+Naive | Ord | 12.46% | 4.89% | 504ms |
| UE5M3 | Optimal | -- | 8.13% | 5.63% | 12.5s |
| UE5M3 | GPTQ+Optimal | Seq | 10.19% | 4.49% | 12.7s |
| UE5M3 | GPTQ-Ord+Optimal | Ord | 10.67% | 4.21% | 25.1s |
| UE5M3 | H-Optimal | -- | 8.92% | 4.99% | 12.8s |
| UE5M3 | GPTQ+H-Optimal | Seq | 10.56% | 4.09% | 13.2s |
| UE5M3 | GPTQ-Ord+H-Optimal | Ord | 10.92% | **3.95%** | 25.9s |

### Block Size 32

| Scale | Approach | GPTQ | Weight Error | Output Error | Time |
|:--|:--|:--:|:--:|:--:|--:|
| E4M3 | Naive | -- | 10.42% | 7.15% | 37ms |
| E4M3 | GPTQ+Naive | Seq | 13.04% | 5.74% | 271ms |
| E4M3 | GPTQ-Ord+Naive | Ord | 13.53% | 5.43% | 318ms |
| E4M3 | Optimal | -- | 9.57% | 6.61% | 3.5s |
| E4M3 | GPTQ+Optimal | Seq | 11.98% | 5.29% | 3.7s |
| E4M3 | GPTQ-Ord+Optimal | Ord | 12.42% | 5.01% | 7.2s |
| E4M3 | H-Optimal | -- | 10.16% | 6.02% | 3.7s |
| E4M3 | GPTQ+H-Optimal | Seq | 12.21% | 4.91% | 3.9s |
| E4M3 | GPTQ-Ord+H-Optimal | Ord | 12.57% | 4.75% | 7.5s |
| UE4M4 | Naive | -- | 10.18% | 6.99% | 42ms |
| UE4M4 | GPTQ+Naive | Seq | 12.76% | 5.61% | 271ms |
| UE4M4 | GPTQ-Ord+Naive | Ord | 13.24% | 5.31% | 326ms |
| UE4M4 | Optimal | -- | 9.16% | 6.32% | 6.8s |
| UE4M4 | GPTQ+Optimal | Seq | 11.47% | 5.06% | 7.0s |
| UE4M4 | GPTQ-Ord+Optimal | Ord | 11.88% | 4.79% | 13.8s |
| UE4M4 | H-Optimal | -- | 9.90% | 5.73% | 7.1s |
| UE4M4 | GPTQ+H-Optimal | Seq | 11.85% | 4.70% | 7.3s |
| UE4M4 | GPTQ-Ord+H-Optimal | Ord | 12.19% | **4.56%** | 14.4s |
| UE5M3 | Naive | -- | 10.16% | 6.98% | 42ms |
| UE5M3 | GPTQ+Naive | Seq | 12.74% | 5.61% | 271ms |
| UE5M3 | GPTQ-Ord+Naive | Ord | 13.22% | 5.31% | 327ms |
| UE5M3 | Optimal | -- | 9.14% | 6.31% | 5.9s |
| UE5M3 | GPTQ+Optimal | Seq | 11.44% | 5.06% | 6.2s |
| UE5M3 | GPTQ-Ord+Optimal | Ord | 11.86% | 4.78% | 12.1s |
| UE5M3 | H-Optimal | -- | 9.89% | 5.75% | 6.1s |
| UE5M3 | GPTQ+H-Optimal | Seq | 11.86% | 4.71% | 6.3s |
| UE5M3 | GPTQ-Ord+H-Optimal | Ord | 12.20% | **4.58%** | 12.3s |

### Best output error per scale

| Scale | BS=16 best | BS=32 best |
|:--|:--:|:--:|
| E4M3 | 4.21% | 4.75% |
| UE4M4 | **3.94%** | **4.56%** |
| UE5M3 | 3.95% | 4.58% |

(All bests are achieved by GPTQ-Ord+H-Optimal.)
### Output error reduction vs E4M3 (same approach + mode)

| Approach + mode | BS=16: UE4M4 | BS=16: UE5M3 | BS=32: UE4M4 | BS=32: UE5M3 |
|:--|:--:|:--:|:--:|:--:|
| Naive (no GPTQ) | -0.34pp | -0.38pp | -0.16pp | -0.17pp |
| Optimal (no GPTQ) | -0.38pp | -0.41pp | -0.29pp | -0.30pp |
| H-Optimal (no GPTQ) | -0.37pp | -0.35pp | -0.29pp | -0.27pp |
| GPTQ+Naive | -0.28pp | -0.31pp | -0.13pp | -0.13pp |
| GPTQ+Optimal | -0.30pp | -0.33pp | -0.23pp | -0.23pp |
| GPTQ+H-Optimal | -0.29pp | -0.28pp | -0.21pp | -0.20pp |
| GPTQ-Ord+Naive | -0.25pp | -0.29pp | -0.12pp | -0.12pp |
| GPTQ-Ord+Optimal | -0.29pp | -0.31pp | -0.22pp | -0.23pp |
| GPTQ-Ord+H-Optimal | **-0.27pp** | **-0.26pp** | **-0.19pp** | **-0.17pp** |

Both unsigned formats beat E4M3 across every approach × mode × block size. The gain over E4M3 shrinks somewhat in absolute terms once GPTQ is layered on (GPTQ already compensates for some of the per-block scale-snapping loss), but the absolute output error keeps falling -- the **best result of every scale grid is GPTQ-Ord+H-Optimal**, and UE4M4/UE5M3 still beat E4M3 there by 0.17--0.27pp.

UE4M4 and UE5M3 perform almost identically (within 0.04pp) across the full grid, even though UE5M3's maximum is $\sim$250x larger. Weight magnitudes in this layer fall well within E4M3's range, so extra range is wasted -- what matters is **grid density near the optimal scale**, and both formats offer roughly twice as many positive scale values as E4M3 (255 vs 126).

Caveat: standard FP8 E4M3 hardware support exists on Hopper/Ada; UE4M4 and UE5M3 do not have hardware encoders, so naive-mode quantization is slower in production (the snap requires a table lookup rather than a hardware cast). Optimal/H-Optimal modes are unaffected since they iterate over the scale table either way; the ~2x slowdown there is purely from doubling the candidate count (255 vs 126).

## Larger blocks: bs=64 (E4M3 scales) and bs=128 (FP16 scales)

Same `layer_0` setup; W = 2560x9728, X = 244449x9728.
Layouts here trade scale precision and block size for the same total bits/weight:

| Config | Block | Scale | Scale b/w | Total b/w |
|:--|:--:|:--:|:--:|:--:|
| Baseline NVFP4 | 16 | FP8 E4M3 | 0.500 | 4.500 |
| -- | 32 | FP8 E4M3 | 0.250 | 4.250 |
| **New** | **64** | **FP8 E4M3** | **0.125** | **4.125** |
| **New** | **128** | **FP16 E5M10** | **0.125** | **4.125** |

For bs=128 + FP16, the snapped continuous optimum is essentially the true SSE / H-optimal minimum, so per-block scales are found by iterative alternation (q -> closed-form continuous s -> fp16 snap) rather than grid search.

### bs=64, FP8 E4M3 scales (4.125 b/w)

| Codebook | Approach | GPTQ | Weight Error | Output Error | Time |
|:--|:--|:--:|:--:|:--:|--:|
| FP4 | Naive | -- | 10.77% | 7.41% | 42ms |
| FP4 | Optimal | -- | 10.19% | 7.05% | 1.0s |
| FP4 | H-Optimal | -- | 10.66% | 6.48% | 21.8s |
| FP4 | GPTQ-Ord+H-Optimal | Ord | 13.34% | 5.22% | 202s |
| FP4 | GPTQ-Ord+H-Opt+SPGL1 | Ord | 13.28% | **4.65%** | 373s |
| INT4 | Naive | -- | 11.37% | 7.88% | 9ms |
| INT4 | Optimal | -- | 10.89% | 7.71% | 687ms |
| INT4 | H-Optimal | -- | 11.37% | 7.23% | 21.6s |
| INT4 | GPTQ-Ord+H-Optimal | Ord | 14.77% | 5.99% | 145s |
| INT4 | GPTQ-Ord+H-Opt+SPGL1 | Ord | 15.24% | **5.22%** | 367s |

### bs=128, FP16 E5M10 scales (4.125 b/w)

| Codebook | Approach | GPTQ | Weight Error | Output Error | Time |
|:--|:--|:--:|:--:|:--:|--:|
| FP4 | Naive | -- | 11.00% | 7.56% | 8ms |
| FP4 | Optimal (iter) | -- | 10.56% | 7.33% | 61ms |
| FP4 | H-Optimal (iter) | -- | 10.76% | 6.88% | 158ms |
| FP4 | GPTQ-Seq+Naive | Seq | 13.76% | 6.08% | 706ms |
| FP4 | GPTQ-Ord+H-Optimal | Ord | 13.47% | 5.47% | 621ms |
| FP4 | GPTQ-Ord+H-Opt+SPGL1 | Ord | 13.85% | **4.77%** | 30.4s |
| INT4 | Naive | -- | 12.31% | 8.54% | 6ms |
| INT4 | Optimal (iter) | -- | 11.49% | 8.27% | 80ms |
| INT4 | H-Optimal (iter) | -- | 11.79% | 7.87% | 10.1s |
| INT4 | GPTQ-Seq+Naive | Seq | 15.50% | 6.84% | 361ms |
| INT4 | GPTQ-Ord+H-Optimal | Ord | 14.93% | 6.17% | 8.3s |
| INT4 | GPTQ-Ord+H-Opt+SPGL1 | Ord | 15.37% | **5.35%** | 110s |

### Same-budget head-to-head (4.125 b/w)

| Codebook | Config | GPTQ-Ord+H-Opt output error (%) | + SPGL1 output error (%) | Δ from SPGL1 (pp) |
|:--|:--|:--:|:--:|:--:|
| FP4 | bs=64, E4M3 | 5.22 | **4.65** | -0.57 |
| FP4 | bs=128, FP16 | 5.47 | 4.77 | -0.70 |
| INT4 | bs=64, E4M3 | 5.99 | **5.22** | -0.77 |
| INT4 | bs=128, FP16 | 6.17 | 5.35 | -0.82 |

At the same 4.125 b/w budget, bs=64 + E4M3 beats bs=128 + FP16 for both codebooks (FP4: -0.12pp, INT4: -0.13pp with SPGL1) -- coarser scale precision but tighter per-block fit wins out. SPGL1 contributes a larger absolute gain at the bs=128/FP16 point (~0.7-0.8pp) than at bs=64/E4M3 (~0.6-0.8pp), but not enough to flip the ordering.

### Where these land vs the bs=16 best

| Config | b/w | H-Opt output error (%) | +SPGL1 output error (%) |
|:--|:--:|:--:|:--:|
| FP4 bs=16, E4M3 (NVFP4) | 4.500 | 5.31 | **3.64** |
| FP4 bs=64, E4M3 | 4.125 | 6.48 | 4.65 |
| FP4 bs=128, FP16 | 4.125 | 6.88 | 4.77 |
| INT4 bs=16, E4M3 (NVINT4) | 4.500 | 5.60 | (not run) |
| INT4 bs=64, E4M3 | 4.125 | 7.23 | 5.22 |
| INT4 bs=128, FP16 | 4.125 | 7.87 | 5.35 |

FP4 beats INT4 in output error at every operating point in this table; at 4.125 b/w the gap is ~0.6-1.0pp (0.75pp and 0.99pp for H-Opt, 0.57pp and 0.58pp with SPGL1). The FP4 codebook's wider dynamic range (largest-to-smallest nonzero magnitude of 6/0.5 = 12, versus 7/1 = 7 for symmetric INT4) more efficiently captures the long-tailed per-block weight distribution at larger block sizes.

### SPGL1 compensation method (recap)

After each block is snapped (in descending H-loss order), instead of GPTQ's unconstrained `H_inv` error propagation, an L1-constrained SPGL1 LASSO is solved on the not-yet-snapped columns to minimize `||X*(Delta_eff + delta)^T||_2` subject to `||delta||_1 <= tau`. It is solved in reduced (Gram) form -- no `H^-1`, no Cholesky -- which makes it robust to ill-conditioned H.

See `experiments/spgl1_gptq_*.py` for the implementations and `notes/progress_track_spgl1.md` for the full research log.
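One plausible reading of "reduced (Gram) form" is that the objective is evaluated through the K x K Gram matrix `H = X^T X` rather than the full activation matrix, using the identity $\lVert X \Delta^T \rVert_F^2 = \operatorname{tr}(\Delta H \Delta^T)$. The sketch below checks that identity on random data for the output-error metric used throughout this page. The function names and the crude rounding "quantizer" are made up for the example, and this is not the SPGL1 solver itself.

```python
import numpy as np

def output_error_direct(X, W, Wq):
    """Relative output error computed from the full activation matrix X."""
    return np.linalg.norm(X @ (Wq - W).T) / np.linalg.norm(X @ W.T)

def output_error_gram(H, W, Wq):
    """Same metric from the Gram matrix H = X^T X only (X is never touched)."""
    D = Wq - W
    num = np.einsum("ik,kl,il->", D, H, D)   # tr(D H D^T) = ||X D^T||_F^2
    den = np.einsum("ik,kl,il->", W, H, W)   # tr(W H W^T) = ||X W^T||_F^2
    return np.sqrt(num / den)

rng = np.random.default_rng(0)
X = rng.standard_normal((512, 32))           # stand-in activations
W = rng.standard_normal((8, 32))             # stand-in weights
Wq = np.round(W * 4) / 4                     # crude rounding in place of a real quantizer
H = X.T @ X

print(output_error_direct(X, W, Wq), output_error_gram(H, W, Wq))
```

Working from `H` keeps the per-step cost independent of the number of calibration tokens (244449 rows of X here), which is presumably why the compensation is formulated this way.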