# Results

> Reproduce with: `python bench/full_bench.py`

```{note}
Benchmarked on the `down_proj` weight of the first decoder layer from Qwen3-4B (W: 2560x9728, bfloat16), with activations collected from WikiText-2 (max_seq_len=512, num_samples=2048, X: 244449x9728, bfloat16).
```

- **Weight error**: $\lVert Q(W) - W \rVert_F / \lVert W \rVert_F$
- **Output error**: $\lVert X W_q^T - X W^T \rVert_F / \lVert X W^T \rVert_F$
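
As a minimal sketch of the two metrics above, the following pure-Python snippet evaluates them on a tiny example. The helper names (`frobenius`, `matmul_t`, `weight_error`, `output_error`) are illustrative assumptions and are not taken from `bench/full_bench.py`.

```python
import math

def frobenius(a):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(v * v for row in a for v in row))

def matmul_t(x, w):
    """Compute X @ W^T for X (n x k) and W (m x k)."""
    return [[sum(xi * wi for xi, wi in zip(xr, wr)) for wr in w] for xr in x]

def sub(a, b):
    """Element-wise difference of two equally shaped matrices."""
    return [[p - q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def weight_error(w, wq):
    return frobenius(sub(wq, w)) / frobenius(w)

def output_error(x, w, wq):
    y = matmul_t(x, w)
    return frobenius(sub(matmul_t(x, wq), y)) / frobenius(y)

W = [[1.0, -2.0], [0.5, 4.0]]
Wq = [[1.0, -2.0], [0.5, 3.0]]   # one perturbed entry, in input column 1
X = [[1.0, 0.0], [0.0, 0.0]]     # activations that never excite column 1

print(weight_error(W, Wq))       # nonzero: the weights differ
print(output_error(X, W, Wq))    # zero: the perturbed column is never used
```

The toy activations are chosen so that the perturbed weight never contributes to the output, which is why a nonzero weight error can coexist with zero output error.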

## INT8 (FP8 E4M3 scales)

Symmetric INT8 quantization ([-127, 127]) with per-block amax stored in FP8 E4M3.
The effective scale is ``amax_fp8 / 127``, keeping the stored value within FP8 range
while the division by 127 is performed in float32.

| Implementation | Block Size | Weight Error | Output Error | Time |
|:--|:--:|:--:|:--:|--:|
| Naive (torch) | 32 | 1.01% | 0.79% | 1.7 ms |
| SSE-Optimal (torch) | 32 | 0.57% | 0.40% | 236 ms |
| H-Optimal (torch) | 32 | 0.60% | **0.37%** | 1.2 s |
| Naive (torch) | 64 | 0.93% | 0.72% | 1.7 ms |
| SSE-Optimal (torch) | 64 | 0.64% | 0.45% | 204 ms |
| H-Optimal (torch) | 64 | 0.66% | **0.42%** | 1.5 s |
| Naive (torch) | 128 | 0.88% | 0.68% | 1.6 ms |
| SSE-Optimal (torch) | 128 | 0.71% | 0.49% | 173 ms |
| H-Optimal (torch) | 128 | 0.73% | **0.48%** | 2.8 s |
| Naive (torch) | 256 | 0.87% | 0.66% | 1.6 ms |
| SSE-Optimal (torch) | 256 | 0.77% | 0.54% | 165 ms |
| H-Optimal (torch) | 256 | 0.79% | **0.52%** | 4.9 s |

**SSE-Optimal vs Naive** (output error reduction):
- Block size 32: **-49.9%** (0.79% $\to$ 0.40%)
- Block size 64: **-38.1%** (0.72% $\to$ 0.45%)
- Block size 128: **-27.6%** (0.68% $\to$ 0.49%)
- Block size 256: **-18.6%** (0.66% $\to$ 0.54%)

**H-Optimal vs SSE-Optimal** (further output error reduction):
- Block size 32: **+7.0%** further reduction (0.40% $\to$ 0.37%)
- Block size 64: **+4.8%** further reduction (0.45% $\to$ 0.42%)
- Block size 128: **+3.3%** further reduction (0.49% $\to$ 0.48%)
- Block size 256: **+2.4%** further reduction (0.54% $\to$ 0.52%)

**H-Optimal vs Naive** (total output error reduction):
- Block size 32: **-53.4%** (0.79% $\to$ 0.37%)
- Block size 64: **-41.1%** (0.72% $\to$ 0.42%)
- Block size 128: **-29.9%** (0.68% $\to$ 0.48%)
- Block size 256: **-20.6%** (0.66% $\to$ 0.52%)

The large naive-to-optimal improvement (up to ~50% lower output error) is driven by the FP8 E4M3
scale grid: with only 126 discrete scale values, the naive ``amax`` snap often
lands on a scale that is significantly suboptimal, and the bounded search finds
a much better candidate. This is analogous to NVFP4's scale search, but the effect
is even stronger because INT8's 127 quantization levels amplify scale misalignment
(a scale error of $\delta$ shifts the reconstructed value by up to $127\delta$ at the largest code, vs $6\delta$ for FP4).

H-Optimal provides a further 2--7% reduction over SSE-Optimal by prioritizing
output-sensitive weights.
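
The scale-grid argument can be illustrated with a small pure-Python sketch. It enumerates the 126 positive E4M3 values, snaps a block's amax to the nearest one (the "naive" choice), and then searches a few neighbouring grid points for a lower SSE. The round-to-nearest snap, the search window of 4 grid steps, and all function names are assumptions made for illustration; the benchmark's actual search bounds may differ.

```python
def e4m3_positive_values():
    """All positive finite FP8 E4M3 values (codes 0x01..0x7E; 0x7F is NaN)."""
    vals = []
    for code in range(1, 127):
        exp, man = code >> 3, code & 0x7
        if exp == 0:
            vals.append(man * 2.0 ** -9)                    # subnormals
        else:
            vals.append((1 + man / 8) * 2.0 ** (exp - 7))   # normals, bias 7
    return vals

GRID = e4m3_positive_values()   # ascending, 126 entries, max 448

def quant_sse(block, amax_fp8):
    """SSE of symmetric INT8 quantization with scale amax_fp8 / 127."""
    s = amax_fp8 / 127.0
    sse = 0.0
    for w in block:
        q = max(-127, min(127, round(w / s)))   # Python rounds ties to even
        sse += (q * s - w) ** 2
    return sse

def naive_index(block):
    """Grid index closest to the block's amax (assumed naive snap)."""
    amax = max(abs(w) for w in block)
    return min(range(len(GRID)), key=lambda i: abs(GRID[i] - amax))

def sse_optimal_index(block, window=4):
    """Best grid index within +/- window steps of the naive choice."""
    i0 = naive_index(block)
    lo, hi = max(0, i0 - window), min(len(GRID), i0 + window + 1)
    return min(range(lo, hi), key=lambda i: quant_sse(block, GRID[i]))

block = [0.013, -0.207, 0.151, 0.094, -0.066, 0.189, -0.012, 0.04]
i_naive, i_opt = naive_index(block), sse_optimal_index(block)
print("naive:", GRID[i_naive], quant_sse(block, GRID[i_naive]))
print("best :", GRID[i_opt], quant_sse(block, GRID[i_opt]))
```

Because the naive candidate is always inside the search window, the searched scale can never have a higher SSE than the naive one.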

## NVFP4 (FP8 E4M3 scales)

| Implementation | Block Size | Weight Error | Output Error | Time | Speedup |
|:--|:--:|:--:|:--:|--:|--:|
| Naive (torch) | 16 | 10.05% | 6.89% | 2.8 ms | |
| Naive (Triton) | 16 | 10.05% | 6.89% | 1.9 ms | 1.5x |
| SSE-Optimal (torch) | 16 | 8.74% | 6.04% | 234 ms | |
| SSE-Optimal (Triton) | 16 | 8.74% | 6.04% | 33 ms | **7.0x** |
| H-Optimal (torch) | 16 | 9.35% | **5.31%** | 866 ms | |
| H-Optimal (Triton) | 16 | 9.35% | **5.31%** | 470 ms | 1.8x |
| Naive (torch) | 32 | 10.42% | 7.15% | 2.9 ms | |
| Naive (Triton) | 32 | 10.42% | 7.15% | 1.2 ms | 2.4x |
| SSE-Optimal (torch) | 32 | 9.57% | 6.61% | 179 ms | |
| SSE-Optimal (Triton) | 32 | 9.57% | 6.61% | 18 ms | **10.2x** |
| H-Optimal (torch) | 32 | 10.12% | **5.95%** | 676 ms | |
| H-Optimal (Triton) | 32 | 10.12% | **5.95%** | 236 ms | 2.9x |

**H-Optimal vs SSE-Optimal** (output error reduction):
- Block size 16: **+12.0%** further reduction (6.04% $\to$ 5.31%)
- Block size 32: **+10.0%** further reduction (6.61% $\to$ 5.95%)

**H-Optimal vs Naive** (total output error reduction):
- Block size 16: **-22.9%** (6.89% $\to$ 5.31%)
- Block size 32: **-16.7%** (7.15% $\to$ 5.95%)

Weight error is 0.5--0.6pp higher than SSE-Optimal (though still below naive) because
H-Optimal optimizes for output error rather than weight error. This is the intended
trade-off: a model's quality depends on output error, not weight error.

## MXFP4 (UE8M0 power-of-2 scales)

| Implementation | Block Size | Weight Error | Output Error | Time | Speedup |
|:--|:--:|:--:|:--:|--:|--:|
| Naive (torch) | 16 | 11.77% | 8.48% | 3.0 ms | |
| Naive (Triton) | 16 | 11.77% | 8.48% | 1.8 ms | 1.7x |
| SSE-Optimal (torch) | 16 | 11.02% | 7.67% | 86 ms | |
| SSE-Optimal (Triton) | 16 | 11.02% | 7.67% | 2.6 ms | **33.6x** |
| H-Optimal (torch) | 16 | 11.10% | **7.62%** | 545 ms | |
| Naive (torch) | 32 | 11.75% | 8.37% | 3.0 ms | |
| Naive (Triton) | 32 | 11.75% | 8.37% | 1.2 ms | 2.6x |
| SSE-Optimal (torch) | 32 | 11.32% | 7.91% | 74 ms | |
| SSE-Optimal (Triton) | 32 | 11.32% | 7.91% | 1.6 ms | **45.7x** |
| H-Optimal (torch) | 32 | 11.42% | **7.80%** | 361 ms | |

**H-Optimal vs SSE-Optimal** (output error reduction):
- Block size 16: **+0.7%** further reduction (7.67% $\to$ 7.62%)
- Block size 32: **+1.4%** further reduction (7.91% $\to$ 7.80%)

The improvement is much smaller for MXFP4 because UE8M0 scales are powers of 2 --
consecutive scales differ by a factor of 2, leaving only 1--2 candidates near the optimum.
With so few choices, the Hessian criterion rarely selects a different scale than SSE.
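
To make the "only 1--2 candidates" point concrete, here is a hedged sketch of a power-of-2 scale search for one FP4 block. The starting scale `s0 = 2^(floor(log2(amax)) - 2)` follows the usual MX-style shared-exponent rule and is an assumption about the setup, as are the function names; the block must contain at least one nonzero value.

```python
import math

FP4 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]   # magnitudes of the FP4 codebook

def fp4_round(v):
    """Nearest FP4 magnitude (ties go to the smaller one), sign preserved."""
    mag = min(FP4, key=lambda c: abs(c - abs(v)))
    return math.copysign(mag, v)

def block_sse(block, s):
    return sum((fp4_round(w / s) * s - w) ** 2 for w in block)

def pow2_candidates(block):
    """The two power-of-2 scales bracketing amax / 6 (assumed search set)."""
    amax = max(abs(w) for w in block)
    s0 = 2.0 ** (math.floor(math.log2(amax)) - 2)
    return [s0, 2 * s0]

block = [0.9, -0.3, 0.12, 0.05, -0.7, 0.33, 0.02, -0.48]
cands = pow2_candidates(block)
best = min(cands, key=lambda s: block_sse(block, s))
print(cands, best)
```

With only two candidates a factor of 2 apart, any re-weighting of the per-element errors has to be large before it changes which candidate wins.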

## Why NVFP4 benefits much more from Hessian-awareness

FP8 E4M3 has 126 finely-spaced positive scale values with non-uniform spacing.
The SSE-optimal and H-optimal scales can differ by several FP8 steps, because the
Hessian re-weights the importance of each element. With UE8M0's coarse power-of-2
grid, this re-weighting almost always lands on the same scale.
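
The re-weighting effect can be sketched with a diagonal stand-in for the Hessian. This is an illustration only: the per-element weights `h_diag`, the dense candidate list (standing in for a fine FP8 grid), and the function names are assumptions, and the benchmark's H-Optimal criterion may use more than the diagonal.

```python
import math

FP4 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def fp4_round(v):
    mag = min(FP4, key=lambda c: abs(c - abs(v)))
    return math.copysign(mag, v)

def weighted_loss(block, s, h_diag):
    """Sum of h_i * (dequantized_i - w_i)^2; h_diag = all ones gives plain SSE."""
    return sum(h * (fp4_round(w / s) * s - w) ** 2 for w, h in zip(block, h_diag))

def best_scale(block, candidates, h_diag):
    return min(candidates, key=lambda s: weighted_loss(block, s, h_diag))

block = [0.9, -0.3, 0.12, 0.05, -0.7, 0.33, 0.02, -0.48]
h_diag = [0.1, 5.0, 8.0, 9.0, 0.2, 4.0, 7.0, 3.0]   # hypothetical sensitivities
ones = [1.0] * len(block)

amax = max(abs(w) for w in block)
# Finely spaced candidates around amax / 6 (0.70x .. 1.20x in 2% steps).
candidates = [amax / 6 * (0.70 + 0.02 * i) for i in range(26)]

s_sse = best_scale(block, candidates, ones)
s_h = best_scale(block, candidates, h_diag)
print("SSE-optimal scale:", s_sse)
print("H-weighted scale :", s_h)
```

On a fine grid the two criteria are free to pick different scales; on a power-of-2 grid the same comparison would be restricted to one or two candidates.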

## Correctness notes

**NVFP4 Triton vs Python reference**: Scale computation matches exactly (0 disagreements). The max element-level abs diff (~5e-2) comes from FP4 decision-boundary tie-breaking: when $|x|/s$ lands exactly on a codebook boundary (e.g. 0.75, 1.75, 3.5), the PTX ``div.full.f32`` and PyTorch ``/`` produce results that round to different FP4 values. This affects ~0.01% of elements and does not affect the error metrics.
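
The boundary sensitivity described above can be reproduced in a few lines. The sketch below lists the FP4 decision boundaries and shows that a one-ulp difference in the quotient at a boundary changes the selected code. The tie convention (ties go to the smaller magnitude) is an assumption for illustration, not necessarily what either implementation does.

```python
import math

FP4 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
# Midpoints between adjacent codebook magnitudes.
BOUNDARIES = [(a + b) / 2 for a, b in zip(FP4, FP4[1:])]

def fp4_round_down_on_tie(v):
    """Round |v| to the FP4 codebook; exact ties go to the smaller magnitude."""
    idx = sum(1 for b in BOUNDARIES if abs(v) > b)
    return math.copysign(FP4[idx], v)

x = 1.75                                   # exactly on a boundary
just_above = math.nextafter(x, math.inf)   # one ulp higher (Python 3.9+)

print(BOUNDARIES)
print(fp4_round_down_on_tie(x), fp4_round_down_on_tie(just_above))
```

Two division routines that disagree by a single ulp at such a point therefore produce codes that differ by a full codebook step, even though both are correctly rounded to within their own error bounds.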

**MXFP4 Triton vs Python reference**: Naive kernel matches exactly (0.00 max abs diff). For the optimal kernel, in rare tie-breaking cases (1 in ~800k blocks), ``tl.sum`` tree reduction and PyTorch sequential ``.sum()`` accumulate float32 rounding differently, causing one to pick ``s0`` and the other ``2*s0`` when their SSEs are identical. This produces a max abs diff of one scale step but does not affect the error metrics.
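
The order dependence of float32 accumulation is easy to demonstrate. The following sketch emulates float32 rounding with the standard `struct` module and compares a sequential sum with a pairwise (tree) sum. The input values are deliberately extreme to make the effect visible; they are not taken from the benchmark.

```python
import struct

def f32(v):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", v))[0]

def sequential_sum(xs):
    """Left-to-right accumulation, rounding to float32 after every add."""
    acc = 0.0
    for v in xs:
        acc = f32(acc + f32(v))
    return acc

def tree_sum(xs):
    """Pairwise reduction, rounding to float32 after every add."""
    xs = [f32(v) for v in xs]
    while len(xs) > 1:
        if len(xs) % 2:
            xs.append(0.0)   # pad odd-length levels
        xs = [f32(a + b) for a, b in zip(xs[0::2], xs[1::2])]
    return xs[0]

values = [1e8, 1.0, -1e8, 1.0]
print(sequential_sum(values))   # 1.0
print(tree_sum(values))         # 0.0
```

In the MXFP4 case the discrepancy is far smaller, but when two candidate SSEs are mathematically equal, any such last-bit difference is enough to flip the comparison.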

## GPTQ Quantization

> Reproduce with: `python experiments/quant_gptq_strided.py`

GPTQ (Frantar et al., 2022) applies Optimal Brain Surgeon error compensation
to sequential column-block quantization. After quantizing each block of columns,
the quantization error is propagated to remaining columns using the inverse Hessian,
minimizing the total output error.

Our implementation uses `torch.as_strided` for zero-copy sub-matrix views during
error propagation. The GPTQ block size equals the quantization block size, so each
column block is quantized and its error immediately compensated across all remaining
columns:

```python
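# After quantizing columns [cs:ce], propagate error via as_strided views.
# Assumes contiguous row-major W (M x K) and H_inv (K x K), with bs = ce - cs and rem = K - ce.
h_cross = torch.as_strided(H_inv, (bs, rem), (K, 1), storage_offset=cs * K + ce)  # H_inv[cs:ce, ce:]
w_rem   = torch.as_strided(W,     (M, rem),  (K, 1), storage_offset=ce)           # W[:, ce:]
w_rem.sub_(err @ h_cross)  # in-place update of W through the view, zero-copy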
```
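
The offset arithmetic in the snippet above can be checked without a GPU. The sketch below emulates a strided 2-D view over a flat row-major buffer in pure Python and confirms that offset `cs*K + ce` with strides `(K, 1)` selects the `[cs:ce, ce:]` sub-matrix. `strided_view` is a hypothetical helper that copies values; it only mirrors the indexing and is not zero-copy like `torch.as_strided`.

```python
def strided_view(flat, size, stride, storage_offset):
    """Materialize a 2-D strided view over a flat buffer (indexing demo only)."""
    rows, cols = size
    row_stride, col_stride = stride
    return [
        [flat[storage_offset + i * row_stride + j * col_stride] for j in range(cols)]
        for i in range(rows)
    ]

K = 6
H = [[10 * r + c for c in range(K)] for r in range(K)]   # H[r][c] encodes (r, c)
flat = [v for row in H for v in row]                      # row-major storage

cs, ce = 2, 4                 # columns [cs:ce] were just quantized
bs, rem = ce - cs, K - ce     # block size and number of remaining columns

h_cross = strided_view(flat, (bs, rem), (K, 1), cs * K + ce)
expected = [row[ce:] for row in H[cs:ce]]
print(h_cross)
```

The same reasoning applies to the weight view: offset `ce` with strides `(K, 1)` selects all rows and columns `ce` onward.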

Three modes are compared: **baseline** (no GPTQ), **sequential** GPTQ (natural
column order), and **ordered** GPTQ (column blocks sorted by descending
Hessian-weighted quantization loss, so the highest-error blocks are quantized
first and their error is compensated across the most remaining columns).
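
A minimal sketch of the ordering step follows, assuming the per-block loss is the Hessian-weighted quadratic form $e^T H_{bb} e$ summed over weight rows and that the order is computed once up front. Both are assumptions for illustration (as is reusing one `h_block` for every block); see `experiments/quant_gptq_strided.py` for the actual definition.

```python
def block_loss(err_rows, h_block):
    """Sum over weight rows of e^T H_bb e for one column block."""
    total = 0.0
    for e in err_rows:
        n = len(e)
        total += sum(e[i] * h_block[i][j] * e[j] for i in range(n) for j in range(n))
    return total

def ordered_blocks(losses):
    """Block indices sorted by descending loss (highest-loss block first)."""
    return sorted(range(len(losses)), key=lambda b: losses[b], reverse=True)

# Toy example: 3 column blocks of width 2, one weight row each.
h_block = [[2.0, 0.5], [0.5, 1.0]]   # hypothetical diagonal block of H
errs = [
    [[0.1, 0.0]],     # block 0 quantization error
    [[0.3, -0.2]],    # block 1
    [[0.0, 0.2]],     # block 2
]
losses = [block_loss(e, h_block) for e in errs]
order = ordered_blocks(losses)
print(losses, order)
```

Sequential GPTQ corresponds to `order = [0, 1, 2]`; the ordered variant would process block 1 first here.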

### Block Size 16

| Format | Approach | GPTQ | Weight Error | Output Error | Time |
|:--|:--|:--:|:--:|:--:|--:|
| NVFP4 | Naive | — | 10.05% | 6.89% | 95ms |
| NVFP4 | GPTQ+Naive | Seq | 12.58% | 5.53% | 402ms |
| NVFP4 | GPTQ-Ord+Naive | Ord | 13.18% | 5.18% | 490ms |
| NVFP4 | Optimal | — | 8.74% | 6.04% | 7.4s |
| NVFP4 | GPTQ+Optimal | Seq | 10.94% | 4.82% | 7.7s |
| NVFP4 | GPTQ-Ord+Optimal | Ord | 11.45% | 4.52% | 15.1s |
| NVFP4 | H-Optimal | — | 9.37% | 5.34% | 7.7s |
| NVFP4 | GPTQ+H-Optimal | Seq | 11.14% | 4.37% | 8.0s |
| NVFP4 | GPTQ-Ord+H-Optimal | Ord | 11.53% | **4.21%** | 15.9s |
| MXFP4 | Naive | — | 11.77% | 8.48% | 102ms |
| MXFP4 | GPTQ+Naive | Seq | 14.61% | 6.67% | 400ms |
| MXFP4 | GPTQ-Ord+Naive | Ord | 15.27% | 6.20% | 517ms |
| MXFP4 | Optimal | — | 11.02% | 7.67% | 6.7s |
| MXFP4 | GPTQ+Optimal | Seq | 13.79% | 6.13% | 7.0s |
| MXFP4 | GPTQ-Ord+Optimal | Ord | 14.43% | 5.72% | 13.8s |
| MXFP4 | H-Optimal | — | 11.10% | 7.62% | 6.9s |
| MXFP4 | GPTQ+H-Optimal | Seq | 13.82% | 6.10% | 7.1s |
| MXFP4 | GPTQ-Ord+H-Optimal | Ord | 14.45% | **5.71%** | 14.0s |
| NVINT4 | Naive | — | 9.46% | 6.55% | 65ms |
| NVINT4 | GPTQ+Naive | Seq | 11.84% | 5.23% | 376ms |
| NVINT4 | GPTQ-Ord+Naive | Ord | 12.37% | 4.89% | 414ms |
| NVINT4 | Optimal | — | 9.20% | 6.40% | 5.6s |
| NVINT4 | GPTQ+Optimal | Seq | 11.54% | 5.12% | 5.9s |
| NVINT4 | GPTQ-Ord+Optimal | Ord | 12.06% | 4.76% | 11.5s |
| NVINT4 | H-Optimal | — | 9.60% | 6.04% | 5.9s |
| NVINT4 | GPTQ+H-Optimal | Seq | 11.73% | 4.88% | 6.1s |
| NVINT4 | GPTQ-Ord+H-Optimal | Ord | 12.20% | **4.65%** | 12.0s |

### Block Size 32

| Format | Approach | GPTQ | Weight Error | Output Error | Time |
|:--|:--|:--:|:--:|:--:|--:|
| NVFP4 | Naive | — | 10.42% | 7.15% | 37ms |
| NVFP4 | GPTQ+Naive | Seq | 13.04% | 5.74% | 272ms |
| NVFP4 | GPTQ-Ord+Naive | Ord | 13.53% | 5.43% | 320ms |
| NVFP4 | Optimal | — | 9.57% | 6.61% | 3.6s |
| NVFP4 | GPTQ+Optimal | Seq | 11.98% | 5.29% | 3.8s |
| NVFP4 | GPTQ-Ord+Optimal | Ord | 12.42% | 5.01% | 7.3s |
| NVFP4 | H-Optimal | — | 10.16% | 6.02% | 3.7s |
| NVFP4 | GPTQ+H-Optimal | Seq | 12.21% | 4.91% | 4.0s |
| NVFP4 | GPTQ-Ord+H-Optimal | Ord | 12.57% | **4.75%** | 7.7s |
| MXFP4 | Naive | — | 11.75% | 8.37% | 47ms |
| MXFP4 | GPTQ+Naive | Seq | 14.62% | 6.62% | 273ms |
| MXFP4 | GPTQ-Ord+Naive | Ord | 15.14% | 6.24% | 335ms |
| MXFP4 | Optimal | — | 11.32% | 7.91% | 3.4s |
| MXFP4 | GPTQ+Optimal | Seq | 14.16% | 6.32% | 3.5s |
| MXFP4 | GPTQ-Ord+Optimal | Ord | 14.66% | 5.95% | 6.8s |
| MXFP4 | H-Optimal | — | 11.42% | 7.80% | 3.4s |
| MXFP4 | GPTQ+H-Optimal | Seq | 14.19% | 6.25% | 3.6s |
| MXFP4 | GPTQ-Ord+H-Optimal | Ord | 14.68% | **5.92%** | 7.0s |
| NVINT4 | Naive | — | 10.36% | 7.18% | 24ms |
| NVINT4 | GPTQ+Naive | Seq | 13.00% | 5.72% | 248ms |
| NVINT4 | GPTQ-Ord+Naive | Ord | 13.45% | 5.42% | 282ms |
| NVINT4 | Optimal | — | 10.13% | 7.10% | 2.8s |
| NVINT4 | GPTQ+Optimal | Seq | 12.71% | 5.65% | 3.0s |
| NVINT4 | GPTQ-Ord+Optimal | Ord | 13.14% | **5.33%** | 5.8s |
| NVINT4 | H-Optimal | — | 10.59% | 6.92% | 2.9s |
| NVINT4 | GPTQ+H-Optimal | Seq | 13.12% | 5.57% | 3.1s |
| NVINT4 | GPTQ-Ord+H-Optimal | Ord | 13.54% | 5.34% | 6.0s |

### Ordered vs Sequential GPTQ

Additional output error reduction from reordering (pp over sequential):

| Format | Approach | BS=16 | BS=32 |
|:--|:--|:--:|:--:|
| NVFP4 | Naive | **-0.34pp** | -0.30pp |
| NVFP4 | Optimal | **-0.31pp** | -0.28pp |
| NVFP4 | H-Optimal | -0.16pp | -0.15pp |
| MXFP4 | Naive | **-0.47pp** | **-0.38pp** |
| MXFP4 | Optimal | **-0.41pp** | **-0.37pp** |
| MXFP4 | H-Optimal | **-0.39pp** | -0.33pp |
| NVINT4 | Naive | **-0.34pp** | -0.31pp |
| NVINT4 | Optimal | **-0.36pp** | -0.32pp |
| NVINT4 | H-Optimal | -0.24pp | -0.23pp |

Ordered GPTQ (quantizing highest-loss blocks first) consistently outperforms
sequential GPTQ by 0.15--0.47pp. The gain is largest for MXFP4 (coarser scales
create bigger per-block errors to redistribute) and for naive/optimal approaches
(H-Optimal already concentrates error where it matters least, leaving less room
for reordering to help). Weight error increases slightly more (~0.4--0.7pp over
sequential) as a natural consequence of the stronger output-error optimization.

## Exotic Scales

> Reproduce with:
> - `python experiments/quant_exotic_scales.py` (no GPTQ)
> - `python experiments/quant_gptq_exotic_scales.py` (with GPTQ-Seq, GPTQ-Ord)

NVFP4 stores per-block scales in **FP8 E4M3** (signed, 1+4+3 bits, 126 positive
values). Scales are always non-negative, so the sign bit is wasted. We try
two unsigned 8-bit alternatives that re-purpose the sign bit:

- **UE4M4** -- 4-exp, 4-mantissa, bias 7. Trades the sign for one extra
  mantissa bit. Similar dynamic range to E4M3 (max 496 vs 448), but
  **2x denser** scale grid (255 distinct positive values).
- **UE5M3** -- 5-exp, 3-mantissa, bias 15. Same mantissa precision as E4M3
  but **much wider** dynamic range (max $\approx$ 122880). Also 255 positive
  values.

All codes are treated as finite (no NaN/Inf reserved). The FP4 codebook
$\{0, 0.5, 1, 1.5, 2, 3, 4, 6\}$ is unchanged; only the per-block scale
representation differs. Each table below crosses
{Naive, SSE-Optimal, H-Optimal} with {no-GPTQ, GPTQ-Seq, GPTQ-Ord} for
each scale grid.
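
The grid sizes and maxima quoted above can be reproduced by enumerating every code. The sketch assumes IEEE-style subnormals at exponent field 0 and treats all nonzero codes as finite, as stated above; `unsigned_fp8_values` is an illustrative helper, not a function from the experiment scripts.

```python
def unsigned_fp8_values(exp_bits, man_bits, bias):
    """All positive values of an unsigned 8-bit float with no NaN/Inf codes."""
    vals = []
    for code in range(1, 1 << (exp_bits + man_bits)):
        exp = code >> man_bits
        man = code & ((1 << man_bits) - 1)
        frac = man / (1 << man_bits)
        if exp == 0:
            vals.append(frac * 2.0 ** (1 - bias))        # subnormals
        else:
            vals.append((1 + frac) * 2.0 ** (exp - bias))
    return vals

ue4m4 = unsigned_fp8_values(4, 4, 7)
ue5m3 = unsigned_fp8_values(5, 3, 15)

def per_binade(vals, lo=1.0):
    """Number of grid points in [lo, 2*lo)."""
    return sum(1 for v in vals if lo <= v < 2 * lo)

print(len(ue4m4), max(ue4m4), per_binade(ue4m4))
print(len(ue5m3), max(ue5m3), per_binade(ue5m3))
```

UE4M4 has 16 values per binade, while UE5M3 keeps 8 per binade (the same as E4M3's 3-bit mantissa) and spends its extra codes on additional exponents.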

### Block Size 16

| Scale | Approach | GPTQ | Weight Error | Output Error | Time |
|:--|:--|:--:|:--:|:--:|--:|
| E4M3 | Naive | — | 10.05% | 6.89% | 87ms |
| E4M3 | GPTQ+Naive | Seq | 12.58% | 5.53% | 402ms |
| E4M3 | GPTQ-Ord+Naive | Ord | 13.18% | 5.18% | 492ms |
| E4M3 | Optimal | — | 8.74% | 6.04% | 7.4s |
| E4M3 | GPTQ+Optimal | Seq | 10.94% | 4.82% | 7.5s |
| E4M3 | GPTQ-Ord+Optimal | Ord | 11.45% | 4.52% | 14.8s |
| E4M3 | H-Optimal | — | 9.37% | 5.34% | 7.6s |
| E4M3 | GPTQ+H-Optimal | Seq | 11.14% | 4.37% | 7.9s |
| E4M3 | GPTQ-Ord+H-Optimal | Ord | 11.53% | 4.21% | 15.5s |
| UE4M4 | Naive | — | 9.54% | 6.55% | 84ms |
| UE4M4 | GPTQ+Naive | Seq | 11.97% | 5.25% | 394ms |
| UE4M4 | GPTQ-Ord+Naive | Ord | 12.54% | 4.93% | 505ms |
| UE4M4 | Optimal | — | 8.19% | 5.66% | 14.1s |
| UE4M4 | GPTQ+Optimal | Seq | 10.26% | 4.52% | 14.5s |
| UE4M4 | GPTQ-Ord+Optimal | Ord | 10.75% | 4.23% | 28.3s |
| UE4M4 | H-Optimal | — | 8.95% | 4.97% | 14.7s |
| UE4M4 | GPTQ+H-Optimal | Seq | 10.58% | 4.08% | 15.1s |
| UE4M4 | GPTQ-Ord+H-Optimal | Ord | 10.94% | **3.94%** | 29.9s |
| UE5M3 | Naive | — | 9.47% | 6.51% | 85ms |
| UE5M3 | GPTQ+Naive | Seq | 11.89% | 5.22% | 393ms |
| UE5M3 | GPTQ-Ord+Naive | Ord | 12.46% | 4.89% | 504ms |
| UE5M3 | Optimal | — | 8.13% | 5.63% | 12.5s |
| UE5M3 | GPTQ+Optimal | Seq | 10.19% | 4.49% | 12.7s |
| UE5M3 | GPTQ-Ord+Optimal | Ord | 10.67% | 4.21% | 25.1s |
| UE5M3 | H-Optimal | — | 8.92% | 4.99% | 12.8s |
| UE5M3 | GPTQ+H-Optimal | Seq | 10.56% | 4.09% | 13.2s |
| UE5M3 | GPTQ-Ord+H-Optimal | Ord | 10.92% | **3.95%** | 25.9s |

### Block Size 32

| Scale | Approach | GPTQ | Weight Error | Output Error | Time |
|:--|:--|:--:|:--:|:--:|--:|
| E4M3 | Naive | — | 10.42% | 7.15% | 37ms |
| E4M3 | GPTQ+Naive | Seq | 13.04% | 5.74% | 271ms |
| E4M3 | GPTQ-Ord+Naive | Ord | 13.53% | 5.43% | 318ms |
| E4M3 | Optimal | — | 9.57% | 6.61% | 3.5s |
| E4M3 | GPTQ+Optimal | Seq | 11.98% | 5.29% | 3.7s |
| E4M3 | GPTQ-Ord+Optimal | Ord | 12.42% | 5.01% | 7.2s |
| E4M3 | H-Optimal | — | 10.16% | 6.02% | 3.7s |
| E4M3 | GPTQ+H-Optimal | Seq | 12.21% | 4.91% | 3.9s |
| E4M3 | GPTQ-Ord+H-Optimal | Ord | 12.57% | 4.75% | 7.5s |
| UE4M4 | Naive | — | 10.18% | 6.99% | 42ms |
| UE4M4 | GPTQ+Naive | Seq | 12.76% | 5.61% | 271ms |
| UE4M4 | GPTQ-Ord+Naive | Ord | 13.24% | 5.31% | 326ms |
| UE4M4 | Optimal | — | 9.16% | 6.32% | 6.8s |
| UE4M4 | GPTQ+Optimal | Seq | 11.47% | 5.06% | 7.0s |
| UE4M4 | GPTQ-Ord+Optimal | Ord | 11.88% | 4.79% | 13.8s |
| UE4M4 | H-Optimal | — | 9.90% | 5.73% | 7.1s |
| UE4M4 | GPTQ+H-Optimal | Seq | 11.85% | 4.70% | 7.3s |
| UE4M4 | GPTQ-Ord+H-Optimal | Ord | 12.19% | **4.56%** | 14.4s |
| UE5M3 | Naive | — | 10.16% | 6.98% | 42ms |
| UE5M3 | GPTQ+Naive | Seq | 12.74% | 5.61% | 271ms |
| UE5M3 | GPTQ-Ord+Naive | Ord | 13.22% | 5.31% | 327ms |
| UE5M3 | Optimal | — | 9.14% | 6.31% | 5.9s |
| UE5M3 | GPTQ+Optimal | Seq | 11.44% | 5.06% | 6.2s |
| UE5M3 | GPTQ-Ord+Optimal | Ord | 11.86% | 4.78% | 12.1s |
| UE5M3 | H-Optimal | — | 9.89% | 5.75% | 6.1s |
| UE5M3 | GPTQ+H-Optimal | Seq | 11.86% | 4.71% | 6.3s |
| UE5M3 | GPTQ-Ord+H-Optimal | Ord | 12.20% | **4.58%** | 12.3s |

### Best output error per scale

| Scale | BS=16 best | BS=32 best |
|:--|:--:|:--:|
| E4M3 | 4.21% | 4.75% |
| UE4M4 | **3.94%** | **4.56%** |
| UE5M3 | 3.95% | 4.58% |

(All bests are achieved by GPTQ-Ord+H-Optimal.)

### Output error reduction vs E4M3 (same approach + mode)

| Approach + mode | BS=16: UE4M4 | BS=16: UE5M3 | BS=32: UE4M4 | BS=32: UE5M3 |
|:--|:--:|:--:|:--:|:--:|
| Naive (no GPTQ) | -0.34pp | -0.38pp | -0.16pp | -0.17pp |
| Optimal (no GPTQ) | -0.38pp | -0.41pp | -0.29pp | -0.30pp |
| H-Optimal (no GPTQ) | -0.37pp | -0.35pp | -0.29pp | -0.27pp |
| GPTQ+Naive | -0.28pp | -0.31pp | -0.13pp | -0.13pp |
| GPTQ+Optimal | -0.30pp | -0.33pp | -0.23pp | -0.23pp |
| GPTQ+H-Optimal | -0.29pp | -0.28pp | -0.21pp | -0.20pp |
| GPTQ-Ord+Naive | -0.25pp | -0.29pp | -0.12pp | -0.12pp |
| GPTQ-Ord+Optimal | -0.29pp | -0.31pp | -0.22pp | -0.23pp |
| GPTQ-Ord+H-Optimal | **-0.27pp** | **-0.26pp** | **-0.19pp** | **-0.17pp** |

Both unsigned formats beat E4M3 across every approach × mode × block size.
The gain over E4M3 shrinks somewhat once GPTQ is layered on (GPTQ already
compensates for some of the per-block scale-snapping loss), but the absolute
output error keeps falling -- the **best result of every scale grid is
GPTQ-Ord+H-Optimal**, and UE4M4/UE5M3 still beat E4M3 there by 0.17--0.27pp.

UE4M4 and UE5M3 perform almost identically (within 0.04pp) across the
full grid, even though UE5M3 has $\sim$250x more dynamic range. Weight
magnitudes in this layer fall well within E4M3's range, so extra range is
wasted -- what matters is **grid density near the optimal scale**, and both
formats offer roughly twice as many positive scale values as E4M3 (255 vs 126).

Caveat: standard FP8 E4M3 hardware support exists on Hopper/Ada; UE4M4 and
UE5M3 do not have hardware encoders, so naive-mode quantization is slower
in production (the snap requires a table lookup rather than a hardware
cast). Optimal/H-Optimal modes are unaffected since they iterate over the
scale table either way; the ~2x slowdown there is purely from doubling the
candidate count (255 vs 126).

## Larger blocks: bs=64 (E4M3 scales) and bs=128 (FP16 scales)

Same `layer_0` setup; W = 2560x9728, X = 244449x9728. Layouts here trade
scale precision and block size for the same total bits/weight:

| Config | Block | Scale | Scale b/w | Total b/w |
|:--|:--:|:--:|:--:|:--:|
| Baseline NVFP4 | 16 | FP8 E4M3 | 0.500 | 4.500 |
| -- | 32 | FP8 E4M3 | 0.250 | 4.250 |
| **New** | **64** | **FP8 E4M3** | **0.125** | **4.125** |
| **New** | **128** | **FP16 E5M10** | **0.125** | **4.125** |
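
The bits/weight columns follow from a one-line formula: 4 element bits plus the scale bits amortized over the block. A short check (the function name is illustrative):

```python
def bits_per_weight(elem_bits, scale_bits, block_size):
    """Storage cost per weight: element bits plus amortized scale bits."""
    return elem_bits + scale_bits / block_size

for block, scale_bits in [(16, 8), (32, 8), (64, 8), (128, 16)]:
    print(block, scale_bits, bits_per_weight(4, scale_bits, block))
```

Doubling the block size from 64 to 128 frees exactly the 8 extra bits needed to move from an FP8 to an FP16 scale, which is why the two new configurations share a 4.125 b/w budget.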

For bs=128 + FP16, the snapped continuous optimum is essentially the true
SSE / H-optimal minimum, so per-block scales are found by iterative
alternation (q -> closed-form continuous s -> fp16 snap) rather than grid
search.
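
The alternation can be sketched for the SSE case as follows. Given codes $q$, the continuous scale minimizing $\sum_i (w_i - s q_i)^2$ is $s = \langle w, q \rangle / \langle q, q \rangle$, which is then snapped to FP16. The starting point (`amax / 6`), the iteration cap, and the function names are assumptions for illustration; the H-optimal variant would replace the inner products with Hessian-weighted ones.

```python
import math
import struct

FP4 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def fp4_round(v):
    mag = min(FP4, key=lambda c: abs(c - abs(v)))
    return math.copysign(mag, v)

def fp16_snap(s):
    """Round to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", s))[0]

def block_sse(block, s):
    return sum((fp4_round(w / s) * s - w) ** 2 for w in block)

def iterate_scale(block, iters=10):
    """Alternate: q = round(w / s) -> closed-form s -> fp16 snap."""
    amax = max(abs(w) for w in block)
    s = fp16_snap(amax / 6.0)                      # naive starting scale
    for _ in range(iters):
        q = [fp4_round(w / s) for w in block]
        qq = sum(v * v for v in q)
        if qq == 0.0:
            break
        s_new = fp16_snap(sum(w * v for w, v in zip(block, q)) / qq)
        if s_new <= 0.0 or s_new == s:             # underflow or converged
            break
        s = s_new
    return s

block = [0.9, -0.3, 0.12, 0.05, -0.7, 0.33, 0.02, -0.48]
s_naive = fp16_snap(max(abs(w) for w in block) / 6.0)
s_iter = iterate_scale(block)
print(s_naive, block_sse(block, s_naive))
print(s_iter, block_sse(block, s_iter))
```

Each half-step (re-rounding the codes, then re-fitting the scale) cannot increase the SSE, so a handful of iterations is typically enough to converge.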

### bs=64, FP8 E4M3 scales (4.125 b/w)

| Codebook | Approach | GPTQ | Weight Error | Output Error | Time |
|:--|:--|:--:|:--:|:--:|--:|
| FP4 | Naive | -- | 10.77% | 7.41% | 42ms |
| FP4 | Optimal | -- | 10.19% | 7.05% | 1.0s |
| FP4 | H-Optimal | -- | 10.66% | 6.48% | 21.8s |
| FP4 | GPTQ-Ord+H-Optimal | Ord | 13.34% | 5.22% | 202s |
| FP4 | GPTQ-Ord+H-Opt+SPGL1 | Ord | 13.28% | **4.65%** | 373s |
| INT4 | Naive | -- | 11.37% | 7.88% | 9ms |
| INT4 | Optimal | -- | 10.89% | 7.71% | 687ms |
| INT4 | H-Optimal | -- | 11.37% | 7.23% | 21.6s |
| INT4 | GPTQ-Ord+H-Optimal | Ord | 14.77% | 5.99% | 145s |
| INT4 | GPTQ-Ord+H-Opt+SPGL1 | Ord | 15.24% | **5.22%** | 367s |

### bs=128, FP16 E5M10 scales (4.125 b/w)

| Codebook | Approach | GPTQ | Weight Error | Output Error | Time |
|:--|:--|:--:|:--:|:--:|--:|
| FP4 | Naive | -- | 11.00% | 7.56% | 8ms |
| FP4 | Optimal (iter) | -- | 10.56% | 7.33% | 61ms |
| FP4 | H-Optimal (iter) | -- | 10.76% | 6.88% | 158ms |
| FP4 | GPTQ-Seq+Naive | Seq | 13.76% | 6.08% | 706ms |
| FP4 | GPTQ-Ord+H-Optimal | Ord | 13.47% | 5.47% | 621ms |
| FP4 | GPTQ-Ord+H-Opt+SPGL1 | Ord | 13.85% | **4.77%** | 30.4s |
| INT4 | Naive | -- | 12.31% | 8.54% | 6ms |
| INT4 | Optimal (iter) | -- | 11.49% | 8.27% | 80ms |
| INT4 | H-Optimal (iter) | -- | 11.79% | 7.87% | 10.1s |
| INT4 | GPTQ-Seq+Naive | Seq | 15.50% | 6.84% | 361ms |
| INT4 | GPTQ-Ord+H-Optimal | Ord | 14.93% | 6.17% | 8.3s |
| INT4 | GPTQ-Ord+H-Opt+SPGL1 | Ord | 15.37% | **5.35%** | 110s |

### Same-budget head-to-head (4.125 b/w)

| Codebook | Config | GPTQ+H-Opt O% | + SPGL1 O% | ΔO from SPGL1 |
|:--|:--|:--:|:--:|:--:|
| FP4 | bs=64, E4M3 | 5.22 | **4.65** | -0.57 |
| FP4 | bs=128, FP16 | 5.47 | 4.77 | -0.70 |
| INT4 | bs=64, E4M3 | 5.99 | **5.22** | -0.77 |
| INT4 | bs=128, FP16 | 6.17 | 5.35 | -0.82 |

At the same 4.125 b/w budget, bs=64 + E4M3 beats bs=128 + FP16 for both
codebooks (with SPGL1, FP4: -0.12pp, INT4: -0.13pp) -- coarser scale precision but
tighter per-block fit wins out. SPGL1 contributes a larger absolute gain at
the bs=128/FP16 point (0.70--0.82pp) than at bs=64/E4M3 (0.57--0.77pp), but
not enough to flip the ordering.

### Where these land vs the bs=16 best

| Config | b/w | H-Opt O% | +SPGL1 O% |
|:--|:--:|:--:|:--:|
| FP4 bs=16, E4M3 (NVFP4) | 4.500 | 5.31 | **3.64** |
| FP4 bs=64, E4M3 | 4.125 | 6.48 | 4.65 |
| FP4 bs=128, FP16 | 4.125 | 6.88 | 4.77 |
| INT4 bs=16, E4M3 (NVINT4) | 4.500 | 6.04 | (not run) |
| INT4 bs=64, E4M3 | 4.125 | 7.23 | 5.22 |
| INT4 bs=128, FP16 | 4.125 | 7.87 | 5.35 |

FP4 beats INT4 by ~0.6--1.0pp output error at every operating point.
The FP4 codebook's wider dynamic range (largest-to-smallest nonzero magnitude of 6/0.5 = 12x, vs 7/1 = 7x for symmetric INT4) more efficiently
captures the long-tailed per-block weight distribution at larger block sizes.

### SPGL1 compensation method (recap)

After each block is snapped (in descending H-loss order), instead of
GPTQ's unconstrained `H_inv` error propagation, an L1-constrained SPGL1
LASSO is solved on the not-yet-snapped columns to minimize
$\lVert X (\Delta_\text{eff} + \delta)^T \rVert_2$ subject to $\lVert \delta \rVert_1 \le \tau$.
It is solved in reduced (Gram) form, with no $H^{-1}$ and no Cholesky factorization,
which makes it robust to ill-conditioned $H$. See `experiments/spgl1_gptq_*.py` for the implementations and
`notes/progress_track_spgl1.md` for the full research log.