Qwantize Logo

Methods

  • Optimal Scale Search
    • 1. Problem Formulation
    • 2. Baseline Error Anchor
    • 3. Lower Bound (Clipping Dominance)
    • 4. Upper Bound (Dead-Zone Dominance)
    • 5. The Optimized Search Algorithm
      • Efficiency
  • Hessian-Aware Optimal Scale Search
  • SPGL1 LASSO Compensation
  • Scale Distance Analysis
  • Triton Kernels
  • Custom Codebook Optimization

Results

  • Results

API Reference

  • INT8
  • NVFP4
  • MXFP4
  • Metrics
Qwantize
  • Optimal Scale Search
  • View page source

Optimal Scale Search

This page describes the mathematical foundation for finding per-block optimal scales in block-scaled FP4 quantization.

1. Problem Formulation

Let \(x \in \mathbb{R}^K\) be a block of weights or activations. Let \(\mathcal{Q}\) be the set of representable values in FP4 E2M1. The representable positive values are:

\[\mathcal{Q}^+ = \{0, 0.5, 1, 1.5, 2, 3, 4, 6\}\]
  • Maximum representable value: \(q_{\max} = 6\)

  • Smallest non-zero magnitude: \(q_{\min} = 0.5\)

  • Decision boundary for zero: \(d_0 = 0.25\) (any value \(|y| < 0.25\) rounds to 0)

Let \(\mathcal{S}\) be the set of representable positive scale values (FP8 E4M3 for NVFP4, UE8M0 for MXFP4). The quantized value of \(x_i\) given a scale \(s\) is \(s \cdot Q(x_i/s)\), where \(Q(\cdot)\) is the nearest-neighbor mapping to \(\mathcal{Q}\).

Our objective is to find the optimal scale \(s^*\) that minimizes the Sum of Squared Errors (SSE):

\[E(s) = \sum_{i=1}^K \left(x_i - s \cdot Q\left(\frac{x_i}{s}\right)\right)^2\]
\[s^* = \arg\min_{s \in \mathcal{S}} E(s)\]

2. Baseline Error Anchor

Standard practice uses the baseline continuous scale \(s_{\text{base}} = \max_i |x_i| / q_{\max}\). Let \(s_0 \in \mathcal{S}\) be the closest representable scale to \(s_{\text{base}}\). We compute its error: \(E_0 = E(s_0)\).

Since \(s^*\) is the absolute optimum, it must strictly satisfy:

\[E(s^*) \le E_0\]

This inequality \(E(s) \le E_0\) is the anchor we use to prove that scales too large or too small cannot be optimal.

Edge case: If \(\sum x_i^2 \le E_0\), quantizing everything to zero is no worse than the baseline. The block is effectively noise; return \(s_0\) immediately.

3. Lower Bound (Clipping Dominance)

If a scale \(s\) is too small, large values in \(x\) will clip to \(s \cdot q_{\max}\), generating large error.

Proof. For any element \(x_i\), the maximum representable magnitude is \(s \cdot q_{\max}\). If \(|x_i| > s \cdot q_{\max}\), it clips. The error for that single element is strictly \((|x_i| - s \cdot q_{\max})^2\). Since the squared error of all other elements is \(\ge 0\):

\[E(s) \ge \sum_{i=1}^K \max(|x_i| - s \cdot q_{\max}, 0)^2 \ge \max_i (|x_i| - s \cdot q_{\max})^2\]

For \(s\) to be a valid candidate, we must have \(E(s) \le E_0\). Let \(x_{\max} = \max_i |x_i|\). Assuming \(s < x_{\max}/q_{\max}\) (clipping occurs):

\[(x_{\max} - s \cdot q_{\max})^2 \le E_0\]
\[x_{\max} - s \cdot q_{\max} \le \sqrt{E_0}\]
\[s \ge \frac{x_{\max} - \sqrt{E_0}}{q_{\max}} = s_{\min}\]

Strict algorithmic lower bound. A stronger per-candidate fast-fail uses the full clipping sum:

\[H(s) = \sum_{i=1}^K \max(|x_i| - s \cdot q_{\max}, 0)^2\]

\(H(s)\) is monotonically decreasing with \(s\). Any scale \(s\) where \(H(s) > E_{\text{best}}\) can be immediately discarded without computing the full quantization SSE.

4. Upper Bound (Dead-Zone Dominance)

If a scale \(s\) is too large, the values \(x_i/s\) become so small that they fall below the decision boundary \(d_0 = 0.25\), causing elements to snap to \(0\).

Proof. If \(|x_i| / s < d_0\), then \(Q(x_i/s) = 0\). The error contributed by this element is exactly \(x_i^2\). Let \(A(s) = \{i \mid |x_i| < s \cdot d_0\}\) be the set of indices quantized to zero.

\[E(s) = \sum_{i \in A(s)} x_i^2 + \sum_{i \notin A(s)} (x_i - s \cdot Q(x_i/s))^2\]

Because the second term is non-negative:

\[E(s) \ge \sum_{i \in A(s)} x_i^2\]

To ensure \(E(s) \le E_0\), the sum of the squares of the elements quantized to zero cannot exceed \(E_0\). Sort the absolute values in ascending order: \(y_1 \le y_2 \le \dots \le y_K\). Compute the cumulative sum of squares: \(C_k = \sum_{j=1}^k y_j^2\).

Find the maximum index \(k^*\) such that \(C_{k^*} \le E_0\). This tells us we can afford to quantize at most \(k^*\) elements to zero. The \((k^*+1)\)-th smallest element must NOT quantize to zero:

\[|y_{k^*+1}| / s \ge d_0\]
\[s \le \frac{y_{k^*+1}}{d_0} = s_{\max}\]

5. The Optimized Search Algorithm

Instead of brute-forcing all representable scales, we execute the following sequence:

Step 1: Setup and Baseline

  1. Extract the block \(x \in \mathbb{R}^K\). Compute \(x_{\max} = \max|x_i|\).

  2. Set analytical baseline: \(s_{\text{cont}} = x_{\max} / q_{\max}\).

  3. Snap \(s_{\text{cont}}\) to the nearest representable scale to get \(s_0\).

  4. Quantize \(x\) using \(s_0\) and calculate the baseline error \(E_0\).

  5. Edge Case Check: If \(\sum x_i^2 \le E_0\), the block is effectively noise. Return \(s_0\) immediately.

Step 2: Calculate Bounds

  1. Upper Bound: Sort \(|x|\) ascending as \(y_1, \dots, y_K\). Find highest \(k^*\) where \(\sum_{j=1}^{k^*} y_j^2 \le E_0\). Compute \(s_{\max} = y_{k^*+1} / d_0\).

  2. Lower Bound: Compute analytical floor: \(s_{\min} = \max(0, (x_{\max} - \sqrt{E_0}) / q_{\max})\).

Step 3: Bounded Search

  1. Filter the scale table to only those in the range \([s_{\min}, s_{\max}]\).

  2. Initialize best_s = s_0 and min_E = E_0.

  3. For each \(s\) in the filtered set:

    • Fast-Fail Check: Calculate clipping error \(H(s) = \sum \max(|x_i| - q_{\max} \cdot s, 0)^2\). If \(H(s) > \text{min\_E}\), skip.

    • Otherwise, compute full quantization error \(E(s)\).

    • If \(E(s) < \text{min\_E}\), update min_E = E(s) and best_s = s.

  4. Return best_s.

Efficiency

  • Sorting \(K\) elements (where \(K\) is typically 16 or 32) takes negligible time compared to memory bandwidth and repeated quantization math.

  • Search space reduction: The bounds usually restrict candidates to a window of 4 to 8 scales around \(s_0\). The fast-fail check further skips full evaluations, bringing the heavy \(Q(\cdot)\) operations down to an absolute minimum.

Previous Next

© Copyright 2026, Qwantize contributors.

Built with Sphinx using a theme provided by Read the Docs.