Scale Distance Analysis

This page presents empirical analysis of the distance between naive and optimal scales for both NVFP4 and MXFP4 quantization formats. The results justify using a fixed-window search as a practical alternative to the full bounded search.

Reproduce with: python bench/nvfp4_scale_distance.py and python bench/mxfp4_scale_distance.py

Note

Benchmarked on the down_proj weight of the first decoder layer from Qwen3-4B (W: 2560x9728, bfloat16).

Motivation

The full optimal scale search (Section 5 of Optimal Scale Search) uses data-dependent bounds that vary per block. On GPU, this causes warp divergence: threads within a warp may iterate over different numbers of scale candidates, leaving some threads idle.

If optimal scales always fall within a fixed window of table steps from the naive scale, we can replace the variable-length search with a constant-count loop – achieving zero divergence with no loss in quality.

NVFP4: FP8 E4M3 Scale Distances

NVFP4 uses FP8 E4M3 as the scale format, giving 126 positive representable values. We convert both naive and optimal scales to their FP8 byte representation, map them to indices in the sorted 126-entry FP8 table, and compute the signed index distance \(\delta = \text{idx}\_{\text{optimal}} - \text{idx}\_{\text{naive}}\).

Key property exploited: for positive FP8 E4M3 values, byte ordering equals value ordering, so searchsorted on uint8 bytes works directly.

Distance Distribution

Metric

Block Size 16

Block Size 32

Total blocks

1,553,920

776,960

Blocks changed

58.7%

52.8%

Index distance range

[-2, +11]

[-2, +9]

Mean distance

+1.2

+0.9

Median distance

+1

+1

Direction

Mostly upward

Mostly upward

The optimal scale tends to be larger than naive. This makes sense: the naive scale \(\text{FP8}(\max/6)\) rounds the continuous scale to FP8, often rounding down. A slightly larger scale can reduce rounding error for the bulk of elements at the cost of marginal clipping on the maximum.

Fixed-Window Search Quality

Window

Candidates

BS16 Gap to Optimal

BS32 Gap to Optimal

\(\pm 1\)

3

~30% remaining gap

~35%

\(\pm 3\)

7

~2-4%

~2-4%

\(\pm 5\)

11

~0%

~0%

Full (126)

126

0% (reference)

0%

A window of \(\pm 5\) FP8 table steps (11 candidates) captures 100% of the optimal improvement. This is a fixed, data-independent loop count suitable for a zero-divergence GPU kernel.

Quantization Quality

Method

Block Size

\(|Q(W)-W|/|W|\)

\(|W_q X - WX|/|WX|\)

Weight Error Reduction

Output Error Reduction

Naive

16

10.05%

Optimal

16

8.74%

13.07%

12.03%

Naive

32

10.42%

Optimal

32

9.57%

8.15%

7.53%

MXFP4: UE8M0 Scale Distances

MXFP4 uses UE8M0 (power-of-2) scales: \(s = 2^{e-128}\) for exponent \(e \in \{1, \ldots, 254\}\), giving 254 scale values. Since successive scales differ by exactly a factor of 2, the “distance” between naive and optimal is measured in exponent steps.

Distance Distribution

Metric

Block Size 16

Block Size 32

Total blocks

1,553,920

776,960

Blocks changed

15.8%

16.4%

Exponent distance range

[0, +1]

[0, +1]

Mean distance

+0.158

+0.164

The optimal scale for MXFP4 is always within 1 step of the naive scale. This is expected: with power-of-2 scales, consecutive scales differ by a factor of 2, so the naive \(s_0 = 2^{\lfloor \log_2(\max|x_i|) - 2 + 127 \rfloor - 128}\) is already very close to optimal. The only scenario where the optimal differs is when most elements in the block are much smaller than the maximum – making a one-step-larger scale (which doubles the scale) beneficial.

Fixed-Window Search Quality

Window

Candidates

BS16 Gap to Optimal

BS32 Gap to Optimal

\(\pm 1\)

3

0%

0%

Full (254)

254

0% (reference)

0%

A window of \(\pm 1\) UE8M0 exponent step (3 candidates) is sufficient to capture the full optimal improvement. This trivially allows a zero-divergence GPU kernel.

Quantization Quality

Method

Block Size

\(|Q(W)-W|/|W|\)

Naive

16

10.34%

Optimal

16

10.13%

Naive

32

10.76%

Optimal

32

10.58%

The improvement for MXFP4 is smaller than NVFP4 (as expected, since UE8M0 scales are already near-optimal due to the factor-of-2 spacing).

Summary

Format

Scale Type

Scale Values

Max Distance

Required Window

NVFP4

FP8 E4M3

126

\(\pm 5\) steps

\(\pm 5\) (11 candidates)

MXFP4

UE8M0

254

+1 step

\(\pm 1\) (3 candidates)

These results enable fixed-window kernels that achieve optimal quantization quality with zero warp divergence.