Custom Codebook Optimization

This page derives an algorithm for learning an optimal 4-bit codebook from data, replacing the fixed FP4 E2M1 values with a custom set of 16 quantization levels that minimize total quantization error.

1. Problem Statement

Let \(W \in \mathbb{R}^{M \times K}\) be a weight matrix, partitioned along the \(K\) dimension into blocks of size \(b\). For row \(m\) and column-block \(j\), denote the block as \(x \in \mathbb{R}^b\) where \(x_i = W_{m, jb+i}\).

We seek a codebook \(\mathcal{C} = \{c_0, c_1, \ldots, c_{15}\}\) of 16 values (4 bits), per-block scales \(s_j \in \mathcal{S}\), and assignments \(k_{j,i} \in \{0, \ldots, 15\}\) that minimize the total Sum of Squared Errors:

\[E(\mathcal{C}, \{s_j\}, \{k_{j,i}\}) = \sum_{j=1}^{MK/b} \sum_{i=1}^{b} \left(x_{j,i} - s_j \cdot c_{k_{j,i}}\right)^2\]

This is a joint optimization over:

  • The 16 codebook values \(\mathcal{C}\)

  • The per-block scales \(\{s_j\}\)

  • The per-element code assignments \(\{k_{j,i}\}\)

The fixed FP4 E2M1 codebook \(\{0, 0.5, 1, 1.5, 2, 3, 4, 6\}\) is one feasible solution. We aim to find a codebook that achieves lower error for a given weight matrix.

2. Sign Reduction

Weights are distributed symmetrically around zero, so we use a sign bit and restrict the codebook to non-negative values:

\[\mathcal{C} = \mathcal{C}^+ \cup (-\mathcal{C}^+), \quad \mathcal{C}^+ = \{0, c_1, c_2, \ldots, c_7\}, \quad 0 < c_1 < c_2 < \cdots < c_7\]

This gives 16 codes: \(\{+0, -0, \pm c_1, \ldots, \pm c_7\}\).

The quantization of \(x_i\) becomes: $\(\hat{x}_i = \text{sign}(x_i) \cdot s_j \cdot Q^+(|x_i|/s_j)\)$

where \(Q^+(\cdot)\) maps to the nearest value in \(\mathcal{C}^+\). The total error decomposes as:

\[E = \sum_{j,i} \left(|x_{j,i}| - s_j \cdot Q^+\!\left(\frac{|x_{j,i}|}{s_j}\right)\right)^2\]

We only need to find 7 positive codebook values (plus zero). The sign is handled separately.

3. Scale Normalization

Each block has a per-block scale \(s_j\), so the quantization operates on the normalized magnitudes \(y_{j,i} = |x_{j,i}|/s_j\). Substituting into the error:

\[E = \sum_j s_j^2 \sum_{i=1}^{b} \left(y_{j,i} - Q^+(y_{j,i})\right)^2\]

The codebook \(\mathcal{C}^+\) determines the quantizer \(Q^+\), and \(y_{j,i} \in [0, c_7]\) (values above \(c_7\) clip). The factor \(s_j^2\) weights blocks by their magnitude, but for a fixed codebook, minimizing the inner sum for each block independently is equivalent to minimizing the total.

Scale invariance

The key observation: for the naive scale \(s_j = \text{snap}(\max_i|x_{j,i}| / c_7)\), the normalized values \(y_{j,i} = |x_{j,i}|/s_j\) all lie in \([0, c_7]\) with the block maximum near \(c_7\). The distribution of normalized values is approximately independent of the block’s absolute scale. This means we can pool normalized values across all blocks to learn a single codebook.

4. The Optimization Problem

Define the normalized magnitudes for all blocks:

\[y_{j,i} = \frac{|x_{j,i}|}{\max_k|x_{j,k}|}, \quad y_{j,i} \in [0, 1]\]

where we normalize by the block maximum (rather than the snapped scale) to obtain a clean \([0,1]\) distribution independent of the scale format.

Pool all normalized magnitudes from all blocks:

\[\mathcal{Y} = \{y_{j,i} : i = 1, \ldots, b, \; j = 1, \ldots, MK/b\}\]

We seek 7 positive centers \(\tilde{c}_1 < \cdots < \tilde{c}_7\) in \([0, 1]\) that minimize:

\[\min_{\tilde{c}_1, \ldots, \tilde{c}_7} \sum_{y \in \mathcal{Y},\, y > d_0} \min_k \left(y - \tilde{c}_k\right)^2\]

where \(d_0 = \tilde{c}_1 / 2\) is the decision boundary between zero and the smallest positive value. Values \(y \le d_0\) are assigned to \(c_0 = 0\) and excluded from the positive-cluster optimization.

This is 1D k-means (Lloyd’s algorithm) with \(k = 7\) on the non-zero portion of \(\mathcal{Y}\).

Algorithm

  1. Normalize: For each block, compute \(y_{j,i} = |x_{j,i}| / \max_k|x_{j,k}|\).

  2. Pool: Collect all \(y_{j,i}\) values into a single set \(\mathcal{Y}\).

  3. Initialize: Set initial centers \(\tilde{c}_1, \ldots, \tilde{c}_7\) at evenly-spaced quantiles of the non-zero values in \(\mathcal{Y}\).

  4. Iterate (Lloyd’s algorithm):

    • Compute decision boundary \(d_0 = \tilde{c}_1 / 2\) and inter-center boundaries \(d_k = (\tilde{c}_k + \tilde{c}_{k+1})/2\).

    • Assign: each \(y > d_0\) to its nearest center.

    • Update: \(\tilde{c}_k \leftarrow \text{mean}(S_k)\) where \(S_k\) is the set of values assigned to center \(k\).

    • Repeat until convergence.

  5. Rescale: Multiply by \(c_7^{\text{target}}\) (e.g., 6.0) to obtain the final codebook in the standard scale range: \(c_k = \tilde{c}_k \cdot c_7^{\text{target}}\).

Why this works

The fixed FP4 E2M1 codebook \(\{0, 0.5, 1, 1.5, 2, 3, 4, 6\}\) allocates half its positive values to the range \([0, 1.5]\) (linear spacing) and the other half to \([2, 6]\) (exponential spacing). This is a compromise designed for general-purpose hardware.

The learned codebook adapts to the actual distribution of normalized weight magnitudes. If the distribution is approximately uniform in \([0, 1]\) (as is typical for Gaussian-like weights after normalization), the optimal codebook will be more evenly spaced — placing more resolution where there is more probability mass.

Complexity

  • Normalization: \(O(MK)\) — one pass over the weight matrix.

  • Sorting for initialization: \(O(N \log N)\) where \(N = MK\).

  • Each k-means iteration: \(O(N)\) (binary search in 7 sorted centers is \(O(1)\)).

  • Total iterations: Typically \(< 100\) for 1D k-means.

For a \(2560 \times 9728\) weight matrix with block size 32, \(N \approx 25\text{M}\) values. The entire optimization runs in seconds on CPU.

5. Per-Block Scale Selection

Given the custom codebook \(\mathcal{C}^+\), per-block scale selection follows the same framework as the Optimal Scale Search:

  • Naive: \(s = \text{snap}(\max_i|x_i| / c_7)\)

  • SSE-Optimal: Bounded search minimizing \(E(s) = \sum_i (x_i - s \cdot Q(x_i/s))^2\), using clipping and dead-zone bounds with \(q_{\max} = c_7\) and \(d_0 = c_1/2\)

  • Hessian-Optimal: Same bounded search minimizing \(r^T H r\) with the custom codebook

The bounds from Optimal Scale Search apply directly — only the codebook values and boundaries change.

6. Discussion

vs. FP4 E2M1. The fixed FP4 E2M1 codebook \(\{0, 0.5, 1, 1.5, 2, 3, 4, 6\}\) is designed for hardware decode efficiency (the bit pattern maps directly to a floating-point value). A learned codebook requires a lookup table for decode, but can achieve lower quantization error by adapting to the actual weight distribution.

Weight-dependent. The optimal codebook depends on the weight distribution. Different layers (or layer types) may benefit from different codebooks. In practice, one could learn a single codebook per model, per layer type, or per individual layer — trading off storage (16 values per codebook) against quantization quality.

Relationship to scalar k-means. This derivation shows that optimal block-scaled 4-bit quantization with a learned codebook reduces to 1D k-means on the pooled normalized magnitudes. Scale normalization provides invariance to per-block magnitude, reducing the problem from a joint codebook-and-scale optimization to a single scalar quantization problem.