Optimal Scale Search
This page describes the mathematical foundation for finding per-block optimal scales in block-scaled FP4 quantization.
1. Problem Formulation
Let \(x \in \mathbb{R}^K\) be a block of weights or activations. Let \(\mathcal{Q}\) be the set of representable values in FP4 E2M1. The representable positive values are:
Maximum representable value: \(q_{\max} = 6\)
Smallest non-zero magnitude: \(q_{\min} = 0.5\)
Decision boundary for zero: \(d_0 = 0.25\) (any value \(|y| < 0.25\) rounds to 0)
Let \(\mathcal{S}\) be the set of representable positive scale values (FP8 E4M3 for NVFP4, UE8M0 for MXFP4). The quantized value of \(x_i\) given a scale \(s\) is \(s \cdot Q(x_i/s)\), where \(Q(\cdot)\) is the nearest-neighbor mapping to \(\mathcal{Q}\).
Our objective is to find the optimal scale \(s^*\) that minimizes the Sum of Squared Errors (SSE):
2. Baseline Error Anchor
Standard practice uses the baseline continuous scale \(s_{\text{base}} = \max_i |x_i| / q_{\max}\). Let \(s_0 \in \mathcal{S}\) be the closest representable scale to \(s_{\text{base}}\). We compute its error: \(E_0 = E(s_0)\).
Since \(s^*\) is the absolute optimum, it must strictly satisfy:
This inequality \(E(s) \le E_0\) is the anchor we use to prove that scales too large or too small cannot be optimal.
Edge case: If \(\sum x_i^2 \le E_0\), quantizing everything to zero is no worse than the baseline. The block is effectively noise; return \(s_0\) immediately.
3. Lower Bound (Clipping Dominance)
If a scale \(s\) is too small, large values in \(x\) will clip to \(s \cdot q_{\max}\), generating large error.
Proof. For any element \(x_i\), the maximum representable magnitude is \(s \cdot q_{\max}\). If \(|x_i| > s \cdot q_{\max}\), it clips. The error for that single element is strictly \((|x_i| - s \cdot q_{\max})^2\). Since the squared error of all other elements is \(\ge 0\):
For \(s\) to be a valid candidate, we must have \(E(s) \le E_0\). Let \(x_{\max} = \max_i |x_i|\). Assuming \(s < x_{\max}/q_{\max}\) (clipping occurs):
Strict algorithmic lower bound. A stronger per-candidate fast-fail uses the full clipping sum:
\(H(s)\) is monotonically decreasing with \(s\). Any scale \(s\) where \(H(s) > E_{\text{best}}\) can be immediately discarded without computing the full quantization SSE.
4. Upper Bound (Dead-Zone Dominance)
If a scale \(s\) is too large, the values \(x_i/s\) become so small that they fall below the decision boundary \(d_0 = 0.25\), causing elements to snap to \(0\).
Proof. If \(|x_i| / s < d_0\), then \(Q(x_i/s) = 0\). The error contributed by this element is exactly \(x_i^2\). Let \(A(s) = \{i \mid |x_i| < s \cdot d_0\}\) be the set of indices quantized to zero.
Because the second term is non-negative:
To ensure \(E(s) \le E_0\), the sum of the squares of the elements quantized to zero cannot exceed \(E_0\). Sort the absolute values in ascending order: \(y_1 \le y_2 \le \dots \le y_K\). Compute the cumulative sum of squares: \(C_k = \sum_{j=1}^k y_j^2\).
Find the maximum index \(k^*\) such that \(C_{k^*} \le E_0\). This tells us we can afford to quantize at most \(k^*\) elements to zero. The \((k^*+1)\)-th smallest element must NOT quantize to zero:
5. The Optimized Search Algorithm
Instead of brute-forcing all representable scales, we execute the following sequence:
Step 1: Setup and Baseline
Extract the block \(x \in \mathbb{R}^K\). Compute \(x_{\max} = \max|x_i|\).
Set analytical baseline: \(s_{\text{cont}} = x_{\max} / q_{\max}\).
Snap \(s_{\text{cont}}\) to the nearest representable scale to get \(s_0\).
Quantize \(x\) using \(s_0\) and calculate the baseline error \(E_0\).
Edge Case Check: If \(\sum x_i^2 \le E_0\), the block is effectively noise. Return \(s_0\) immediately.
Step 2: Calculate Bounds
Upper Bound: Sort \(|x|\) ascending as \(y_1, \dots, y_K\). Find highest \(k^*\) where \(\sum_{j=1}^{k^*} y_j^2 \le E_0\). Compute \(s_{\max} = y_{k^*+1} / d_0\).
Lower Bound: Compute analytical floor: \(s_{\min} = \max(0, (x_{\max} - \sqrt{E_0}) / q_{\max})\).
Step 3: Bounded Search
Filter the scale table to only those in the range \([s_{\min}, s_{\max}]\).
Initialize
best_s = s_0andmin_E = E_0.For each \(s\) in the filtered set:
Fast-Fail Check: Calculate clipping error \(H(s) = \sum \max(|x_i| - q_{\max} \cdot s, 0)^2\). If \(H(s) > \text{min\_E}\), skip.
Otherwise, compute full quantization error \(E(s)\).
If \(E(s) < \text{min\_E}\), update
min_E = E(s)andbest_s = s.
Return
best_s.
Efficiency
Sorting \(K\) elements (where \(K\) is typically 16 or 32) takes negligible time compared to memory bandwidth and repeated quantization math.
Search space reduction: The bounds usually restrict candidates to a window of 4 to 8 scales around \(s_0\). The fast-fail check further skips full evaluations, bringing the heavy \(Q(\cdot)\) operations down to an absolute minimum.