Scale Distance Analysis
This page presents empirical analysis of the distance between naive and optimal scales for both NVFP4 and MXFP4 quantization formats. The results justify using a fixed-window search as a practical alternative to the full bounded search.
Reproduce with:
python bench/nvfp4_scale_distance.pyandpython bench/mxfp4_scale_distance.py
Note
Benchmarked on the down_proj weight of the first decoder layer from Qwen3-4B (W: 2560x9728, bfloat16).
Motivation
The full optimal scale search (Section 5 of Optimal Scale Search) uses data-dependent bounds that vary per block. On GPU, this causes warp divergence: threads within a warp may iterate over different numbers of scale candidates, leaving some threads idle.
If optimal scales always fall within a fixed window of table steps from the naive scale, we can replace the variable-length search with a constant-count loop – achieving zero divergence with no loss in quality.
NVFP4: FP8 E4M3 Scale Distances
NVFP4 uses FP8 E4M3 as the scale format, giving 126 positive representable values. We convert both naive and optimal scales to their FP8 byte representation, map them to indices in the sorted 126-entry FP8 table, and compute the signed index distance \(\delta = \text{idx}\_{\text{optimal}} - \text{idx}\_{\text{naive}}\).
Key property exploited: for positive FP8 E4M3 values, byte ordering equals value ordering, so searchsorted on uint8 bytes works directly.
Distance Distribution
Metric |
Block Size 16 |
Block Size 32 |
|---|---|---|
Total blocks |
1,553,920 |
776,960 |
Blocks changed |
58.7% |
52.8% |
Index distance range |
[-2, +11] |
[-2, +9] |
Mean distance |
+1.2 |
+0.9 |
Median distance |
+1 |
+1 |
Direction |
Mostly upward |
Mostly upward |
The optimal scale tends to be larger than naive. This makes sense: the naive scale \(\text{FP8}(\max/6)\) rounds the continuous scale to FP8, often rounding down. A slightly larger scale can reduce rounding error for the bulk of elements at the cost of marginal clipping on the maximum.
Fixed-Window Search Quality
Window |
Candidates |
BS16 Gap to Optimal |
BS32 Gap to Optimal |
|---|---|---|---|
\(\pm 1\) |
3 |
~30% remaining gap |
~35% |
\(\pm 3\) |
7 |
~2-4% |
~2-4% |
\(\pm 5\) |
11 |
~0% |
~0% |
Full (126) |
126 |
0% (reference) |
0% |
A window of \(\pm 5\) FP8 table steps (11 candidates) captures 100% of the optimal improvement. This is a fixed, data-independent loop count suitable for a zero-divergence GPU kernel.
Quantization Quality
Method |
Block Size |
\(|Q(W)-W|/|W|\) |
\(|W_q X - WX|/|WX|\) |
Weight Error Reduction |
Output Error Reduction |
|---|---|---|---|---|---|
Naive |
16 |
10.05% |
– |
– |
– |
Optimal |
16 |
8.74% |
– |
13.07% |
12.03% |
Naive |
32 |
10.42% |
– |
– |
– |
Optimal |
32 |
9.57% |
– |
8.15% |
7.53% |
MXFP4: UE8M0 Scale Distances
MXFP4 uses UE8M0 (power-of-2) scales: \(s = 2^{e-128}\) for exponent \(e \in \{1, \ldots, 254\}\), giving 254 scale values. Since successive scales differ by exactly a factor of 2, the “distance” between naive and optimal is measured in exponent steps.
Distance Distribution
Metric |
Block Size 16 |
Block Size 32 |
|---|---|---|
Total blocks |
1,553,920 |
776,960 |
Blocks changed |
15.8% |
16.4% |
Exponent distance range |
[0, +1] |
[0, +1] |
Mean distance |
+0.158 |
+0.164 |
The optimal scale for MXFP4 is always within 1 step of the naive scale. This is expected: with power-of-2 scales, consecutive scales differ by a factor of 2, so the naive \(s_0 = 2^{\lfloor \log_2(\max|x_i|) - 2 + 127 \rfloor - 128}\) is already very close to optimal. The only scenario where the optimal differs is when most elements in the block are much smaller than the maximum – making a one-step-larger scale (which doubles the scale) beneficial.
Fixed-Window Search Quality
Window |
Candidates |
BS16 Gap to Optimal |
BS32 Gap to Optimal |
|---|---|---|---|
\(\pm 1\) |
3 |
0% |
0% |
Full (254) |
254 |
0% (reference) |
0% |
A window of \(\pm 1\) UE8M0 exponent step (3 candidates) is sufficient to capture the full optimal improvement. This trivially allows a zero-divergence GPU kernel.
Quantization Quality
Method |
Block Size |
\(|Q(W)-W|/|W|\) |
|---|---|---|
Naive |
16 |
10.34% |
Optimal |
16 |
10.13% |
Naive |
32 |
10.76% |
Optimal |
32 |
10.58% |
The improvement for MXFP4 is smaller than NVFP4 (as expected, since UE8M0 scales are already near-optimal due to the factor-of-2 spacing).
Summary
Format |
Scale Type |
Scale Values |
Max Distance |
Required Window |
|---|---|---|---|---|
NVFP4 |
FP8 E4M3 |
126 |
\(\pm 5\) steps |
\(\pm 5\) (11 candidates) |
MXFP4 |
UE8M0 |
254 |
+1 step |
\(\pm 1\) (3 candidates) |
These results enable fixed-window kernels that achieve optimal quantization quality with zero warp divergence.