# Optimal Scale Search This page describes the mathematical foundation for finding per-block optimal scales in block-scaled FP4 quantization. ## 1. Problem Formulation Let $x \in \mathbb{R}^K$ be a block of weights or activations. Let $\mathcal{Q}$ be the set of representable values in FP4 E2M1. The representable positive values are: $$\mathcal{Q}^+ = \{0, 0.5, 1, 1.5, 2, 3, 4, 6\}$$ - Maximum representable value: $q_{\max} = 6$ - Smallest non-zero magnitude: $q_{\min} = 0.5$ - Decision boundary for zero: $d_0 = 0.25$ (any value $|y| < 0.25$ rounds to 0) Let $\mathcal{S}$ be the set of representable positive scale values (FP8 E4M3 for NVFP4, UE8M0 for MXFP4). The quantized value of $x_i$ given a scale $s$ is $s \cdot Q(x_i/s)$, where $Q(\cdot)$ is the nearest-neighbor mapping to $\mathcal{Q}$. Our objective is to find the optimal scale $s^*$ that minimizes the Sum of Squared Errors (SSE): $$E(s) = \sum_{i=1}^K \left(x_i - s \cdot Q\left(\frac{x_i}{s}\right)\right)^2$$ $$s^* = \arg\min_{s \in \mathcal{S}} E(s)$$ ## 2. Baseline Error Anchor Standard practice uses the baseline continuous scale $s_{\text{base}} = \max_i |x_i| / q_{\max}$. Let $s_0 \in \mathcal{S}$ be the closest representable scale to $s_{\text{base}}$. We compute its error: $E_0 = E(s_0)$. Since $s^*$ is the absolute optimum, it must strictly satisfy: $$E(s^*) \le E_0$$ This inequality $E(s) \le E_0$ is the anchor we use to prove that scales too large or too small cannot be optimal. **Edge case**: If $\sum x_i^2 \le E_0$, quantizing everything to zero is no worse than the baseline. The block is effectively noise; return $s_0$ immediately. ## 3. Lower Bound (Clipping Dominance) If a scale $s$ is too small, large values in $x$ will clip to $s \cdot q_{\max}$, generating large error. **Proof.** For any element $x_i$, the maximum representable magnitude is $s \cdot q_{\max}$. If $|x_i| > s \cdot q_{\max}$, it clips. The error for that single element is strictly $(|x_i| - s \cdot q_{\max})^2$. Since the squared error of all other elements is $\ge 0$: $$E(s) \ge \sum_{i=1}^K \max(|x_i| - s \cdot q_{\max}, 0)^2 \ge \max_i (|x_i| - s \cdot q_{\max})^2$$ For $s$ to be a valid candidate, we must have $E(s) \le E_0$. Let $x_{\max} = \max_i |x_i|$. Assuming $s < x_{\max}/q_{\max}$ (clipping occurs): $$(x_{\max} - s \cdot q_{\max})^2 \le E_0$$ $$x_{\max} - s \cdot q_{\max} \le \sqrt{E_0}$$ $$s \ge \frac{x_{\max} - \sqrt{E_0}}{q_{\max}} = s_{\min}$$ **Strict algorithmic lower bound.** A stronger per-candidate fast-fail uses the full clipping sum: $$H(s) = \sum_{i=1}^K \max(|x_i| - s \cdot q_{\max}, 0)^2$$ $H(s)$ is monotonically decreasing with $s$. Any scale $s$ where $H(s) > E_{\text{best}}$ can be immediately discarded without computing the full quantization SSE. ## 4. Upper Bound (Dead-Zone Dominance) If a scale $s$ is too large, the values $x_i/s$ become so small that they fall below the decision boundary $d_0 = 0.25$, causing elements to snap to $0$. **Proof.** If $|x_i| / s < d_0$, then $Q(x_i/s) = 0$. The error contributed by this element is exactly $x_i^2$. Let $A(s) = \{i \mid |x_i| < s \cdot d_0\}$ be the set of indices quantized to zero. $$E(s) = \sum_{i \in A(s)} x_i^2 + \sum_{i \notin A(s)} (x_i - s \cdot Q(x_i/s))^2$$ Because the second term is non-negative: $$E(s) \ge \sum_{i \in A(s)} x_i^2$$ To ensure $E(s) \le E_0$, the sum of the squares of the elements quantized to zero cannot exceed $E_0$. Sort the absolute values in ascending order: $y_1 \le y_2 \le \dots \le y_K$. Compute the cumulative sum of squares: $C_k = \sum_{j=1}^k y_j^2$. Find the maximum index $k^*$ such that $C_{k^*} \le E_0$. This tells us we can afford to quantize at most $k^*$ elements to zero. The $(k^*+1)$-th smallest element must NOT quantize to zero: $$|y_{k^*+1}| / s \ge d_0$$ $$s \le \frac{y_{k^*+1}}{d_0} = s_{\max}$$ ## 5. The Optimized Search Algorithm Instead of brute-forcing all representable scales, we execute the following sequence: **Step 1: Setup and Baseline** 1. Extract the block $x \in \mathbb{R}^K$. Compute $x_{\max} = \max|x_i|$. 2. Set analytical baseline: $s_{\text{cont}} = x_{\max} / q_{\max}$. 3. Snap $s_{\text{cont}}$ to the nearest representable scale to get $s_0$. 4. Quantize $x$ using $s_0$ and calculate the baseline error $E_0$. 5. *Edge Case Check:* If $\sum x_i^2 \le E_0$, the block is effectively noise. Return $s_0$ immediately. **Step 2: Calculate Bounds** 1. **Upper Bound:** Sort $|x|$ ascending as $y_1, \dots, y_K$. Find highest $k^*$ where $\sum_{j=1}^{k^*} y_j^2 \le E_0$. Compute $s_{\max} = y_{k^*+1} / d_0$. 2. **Lower Bound:** Compute analytical floor: $s_{\min} = \max(0, (x_{\max} - \sqrt{E_0}) / q_{\max})$. **Step 3: Bounded Search** 1. Filter the scale table to only those in the range $[s_{\min}, s_{\max}]$. 2. Initialize `best_s = s_0` and `min_E = E_0`. 3. For each $s$ in the filtered set: - *Fast-Fail Check:* Calculate clipping error $H(s) = \sum \max(|x_i| - q_{\max} \cdot s, 0)^2$. If $H(s) > \text{min\_E}$, skip. - Otherwise, compute full quantization error $E(s)$. - If $E(s) < \text{min\_E}$, update `min_E = E(s)` and `best_s = s`. 4. Return `best_s`. ### Efficiency - **Sorting $K$ elements** (where $K$ is typically 16 or 32) takes negligible time compared to memory bandwidth and repeated quantization math. - **Search space reduction:** The bounds usually restrict candidates to a window of **4 to 8 scales** around $s_0$. The fast-fail check further skips full evaluations, bringing the heavy $Q(\cdot)$ operations down to an absolute minimum.