Custom Codebook Optimization
This page derives an algorithm for learning an optimal 4-bit codebook from data, replacing the fixed FP4 E2M1 values with a custom set of 16 quantization levels that minimize total quantization error.
1. Problem Statement
Let \(W \in \mathbb{R}^{M \times K}\) be a weight matrix, partitioned along the \(K\) dimension into blocks of size \(b\). For row \(m\) and column-block \(j\), denote the block as \(x \in \mathbb{R}^b\) where \(x_i = W_{m, jb+i}\).
We seek a codebook \(\mathcal{C} = \{c_0, c_1, \ldots, c_{15}\}\) of 16 values (4 bits), per-block scales \(s_j \in \mathcal{S}\), and assignments \(k_{j,i} \in \{0, \ldots, 15\}\) that minimize the total Sum of Squared Errors:
This is a joint optimization over:
The 16 codebook values \(\mathcal{C}\)
The per-block scales \(\{s_j\}\)
The per-element code assignments \(\{k_{j,i}\}\)
The fixed FP4 E2M1 codebook \(\{0, 0.5, 1, 1.5, 2, 3, 4, 6\}\) is one feasible solution. We aim to find a codebook that achieves lower error for a given weight matrix.
2. Sign Reduction
Weights are distributed symmetrically around zero, so we use a sign bit and restrict the codebook to non-negative values:
This gives 16 codes: \(\{+0, -0, \pm c_1, \ldots, \pm c_7\}\).
The quantization of \(x_i\) becomes: $\(\hat{x}_i = \text{sign}(x_i) \cdot s_j \cdot Q^+(|x_i|/s_j)\)$
where \(Q^+(\cdot)\) maps to the nearest value in \(\mathcal{C}^+\). The total error decomposes as:
We only need to find 7 positive codebook values (plus zero). The sign is handled separately.
3. Scale Normalization
Each block has a per-block scale \(s_j\), so the quantization operates on the normalized magnitudes \(y_{j,i} = |x_{j,i}|/s_j\). Substituting into the error:
The codebook \(\mathcal{C}^+\) determines the quantizer \(Q^+\), and \(y_{j,i} \in [0, c_7]\) (values above \(c_7\) clip). The factor \(s_j^2\) weights blocks by their magnitude, but for a fixed codebook, minimizing the inner sum for each block independently is equivalent to minimizing the total.
Scale invariance
The key observation: for the naive scale \(s_j = \text{snap}(\max_i|x_{j,i}| / c_7)\), the normalized values \(y_{j,i} = |x_{j,i}|/s_j\) all lie in \([0, c_7]\) with the block maximum near \(c_7\). The distribution of normalized values is approximately independent of the block’s absolute scale. This means we can pool normalized values across all blocks to learn a single codebook.
4. The Optimization Problem
Define the normalized magnitudes for all blocks:
where we normalize by the block maximum (rather than the snapped scale) to obtain a clean \([0,1]\) distribution independent of the scale format.
Pool all normalized magnitudes from all blocks:
We seek 7 positive centers \(\tilde{c}_1 < \cdots < \tilde{c}_7\) in \([0, 1]\) that minimize:
where \(d_0 = \tilde{c}_1 / 2\) is the decision boundary between zero and the smallest positive value. Values \(y \le d_0\) are assigned to \(c_0 = 0\) and excluded from the positive-cluster optimization.
This is 1D k-means (Lloyd’s algorithm) with \(k = 7\) on the non-zero portion of \(\mathcal{Y}\).
Algorithm
Normalize: For each block, compute \(y_{j,i} = |x_{j,i}| / \max_k|x_{j,k}|\).
Pool: Collect all \(y_{j,i}\) values into a single set \(\mathcal{Y}\).
Initialize: Set initial centers \(\tilde{c}_1, \ldots, \tilde{c}_7\) at evenly-spaced quantiles of the non-zero values in \(\mathcal{Y}\).
Iterate (Lloyd’s algorithm):
Compute decision boundary \(d_0 = \tilde{c}_1 / 2\) and inter-center boundaries \(d_k = (\tilde{c}_k + \tilde{c}_{k+1})/2\).
Assign: each \(y > d_0\) to its nearest center.
Update: \(\tilde{c}_k \leftarrow \text{mean}(S_k)\) where \(S_k\) is the set of values assigned to center \(k\).
Repeat until convergence.
Rescale: Multiply by \(c_7^{\text{target}}\) (e.g., 6.0) to obtain the final codebook in the standard scale range: \(c_k = \tilde{c}_k \cdot c_7^{\text{target}}\).
Why this works
The fixed FP4 E2M1 codebook \(\{0, 0.5, 1, 1.5, 2, 3, 4, 6\}\) allocates half its positive values to the range \([0, 1.5]\) (linear spacing) and the other half to \([2, 6]\) (exponential spacing). This is a compromise designed for general-purpose hardware.
The learned codebook adapts to the actual distribution of normalized weight magnitudes. If the distribution is approximately uniform in \([0, 1]\) (as is typical for Gaussian-like weights after normalization), the optimal codebook will be more evenly spaced — placing more resolution where there is more probability mass.
Complexity
Normalization: \(O(MK)\) — one pass over the weight matrix.
Sorting for initialization: \(O(N \log N)\) where \(N = MK\).
Each k-means iteration: \(O(N)\) (binary search in 7 sorted centers is \(O(1)\)).
Total iterations: Typically \(< 100\) for 1D k-means.
For a \(2560 \times 9728\) weight matrix with block size 32, \(N \approx 25\text{M}\) values. The entire optimization runs in seconds on CPU.
5. Per-Block Scale Selection
Given the custom codebook \(\mathcal{C}^+\), per-block scale selection follows the same framework as the Optimal Scale Search:
Naive: \(s = \text{snap}(\max_i|x_i| / c_7)\)
SSE-Optimal: Bounded search minimizing \(E(s) = \sum_i (x_i - s \cdot Q(x_i/s))^2\), using clipping and dead-zone bounds with \(q_{\max} = c_7\) and \(d_0 = c_1/2\)
Hessian-Optimal: Same bounded search minimizing \(r^T H r\) with the custom codebook
The bounds from Optimal Scale Search apply directly — only the codebook values and boundaries change.
6. Discussion
vs. FP4 E2M1. The fixed FP4 E2M1 codebook \(\{0, 0.5, 1, 1.5, 2, 3, 4, 6\}\) is designed for hardware decode efficiency (the bit pattern maps directly to a floating-point value). A learned codebook requires a lookup table for decode, but can achieve lower quantization error by adapting to the actual weight distribution.
Weight-dependent. The optimal codebook depends on the weight distribution. Different layers (or layer types) may benefit from different codebooks. In practice, one could learn a single codebook per model, per layer type, or per individual layer — trading off storage (16 values per codebook) against quantization quality.
Relationship to scalar k-means. This derivation shows that optimal block-scaled 4-bit quantization with a learned codebook reduces to 1D k-means on the pooled normalized magnitudes. Scale normalization provides invariance to per-block magnitude, reducing the problem from a joint codebook-and-scale optimization to a single scalar quantization problem.