# Custom Codebook Optimization

This page derives an algorithm for learning an optimal 4-bit codebook from data, replacing the fixed FP4 E2M1 values with a custom set of 16 quantization levels that minimize total quantization error.

## 1. Problem Statement

Let $W \in \mathbb{R}^{M \times K}$ be a weight matrix, partitioned along the $K$ dimension into blocks of size $b$.
For row $m$ and column-block $j$, denote the block as $x \in \mathbb{R}^b$ where $x_i = W_{m, jb+i}$.

We seek a **codebook** $\mathcal{C} = \{c_0, c_1, \ldots, c_{15}\}$ of 16 values (4 bits), per-block scales $s_j \in \mathcal{S}$, and assignments $k_{j,i} \in \{0, \ldots, 15\}$ that minimize the total Sum of Squared Errors:

$$E(\mathcal{C}, \{s_j\}, \{k_{j,i}\}) = \sum_{j=1}^{MK/b} \sum_{i=1}^{b} \left(x_{j,i} - s_j \cdot c_{k_{j,i}}\right)^2$$

This is a joint optimization over:
- The 16 codebook values $\mathcal{C}$
- The per-block scales $\{s_j\}$
- The per-element code assignments $\{k_{j,i}\}$

The fixed FP4 E2M1 codebook $\{0, 0.5, 1, 1.5, 2, 3, 4, 6\}$ is one feasible solution.
We aim to find a codebook that achieves lower error for a given weight matrix.

## 2. Sign Reduction

Weights are distributed symmetrically around zero, so we use a **sign bit** and restrict
the codebook to non-negative values:

$$\mathcal{C} = \mathcal{C}^+ \cup (-\mathcal{C}^+), \quad \mathcal{C}^+ = \{0, c_1, c_2, \ldots, c_7\}, \quad 0 < c_1 < c_2 < \cdots < c_7$$

This gives 16 codes: $\{+0, -0, \pm c_1, \ldots, \pm c_7\}$.

The quantization of $x_i$ becomes:
$$\hat{x}_i = \text{sign}(x_i) \cdot s_j \cdot Q^+(|x_i|/s_j)$$

where $Q^+(\cdot)$ maps to the nearest value in $\mathcal{C}^+$. The total error decomposes as:

$$E = \sum_{j,i} \left(|x_{j,i}| - s_j \cdot Q^+\!\left(\frac{|x_{j,i}|}{s_j}\right)\right)^2$$

**We only need to find 7 positive codebook values** (plus zero). The sign is handled separately.

## 3. Scale Normalization

Each block has a per-block scale $s_j$, so the quantization operates on the **normalized
magnitudes** $y_{j,i} = |x_{j,i}|/s_j$. Substituting into the error:

$$E = \sum_j s_j^2 \sum_{i=1}^{b} \left(y_{j,i} - Q^+(y_{j,i})\right)^2$$

The codebook $\mathcal{C}^+$ determines the quantizer $Q^+$, and $y_{j,i} \in [0, c_7]$
(values above $c_7$ clip). The factor $s_j^2$ weights blocks by their magnitude, but for
a fixed codebook, minimizing the inner sum for each block independently is equivalent to
minimizing the total.

### Scale invariance

The key observation: for the naive scale $s_j = \text{snap}(\max_i|x_{j,i}| / c_7)$, the
normalized values $y_{j,i} = |x_{j,i}|/s_j$ all lie in $[0, c_7]$ with the block maximum
near $c_7$. **The distribution of normalized values is approximately independent of the
block's absolute scale.** This means we can pool normalized values across all blocks to
learn a single codebook.

## 4. The Optimization Problem

Define the normalized magnitudes for all blocks:

$$y_{j,i} = \frac{|x_{j,i}|}{\max_k|x_{j,k}|}, \quad y_{j,i} \in [0, 1]$$

where we normalize by the block maximum (rather than the snapped scale) to obtain a
clean $[0,1]$ distribution independent of the scale format.

Pool all normalized magnitudes from all blocks:

$$\mathcal{Y} = \{y_{j,i} : i = 1, \ldots, b, \; j = 1, \ldots, MK/b\}$$

We seek 7 positive centers $\tilde{c}_1 < \cdots < \tilde{c}_7$ in $[0, 1]$ that minimize:

$$\min_{\tilde{c}_1, \ldots, \tilde{c}_7} \sum_{y \in \mathcal{Y},\, y > d_0} \min_k \left(y - \tilde{c}_k\right)^2$$

where $d_0 = \tilde{c}_1 / 2$ is the decision boundary between zero and the smallest positive
value. Values $y \le d_0$ are assigned to $c_0 = 0$ and excluded from the positive-cluster
optimization.

This is **1D k-means** (Lloyd's algorithm) with $k = 7$ on the non-zero portion of $\mathcal{Y}$.

### Algorithm

1. **Normalize**: For each block, compute $y_{j,i} = |x_{j,i}| / \max_k|x_{j,k}|$.
2. **Pool**: Collect all $y_{j,i}$ values into a single set $\mathcal{Y}$.
3. **Initialize**: Set initial centers $\tilde{c}_1, \ldots, \tilde{c}_7$ at evenly-spaced
   quantiles of the non-zero values in $\mathcal{Y}$.
4. **Iterate** (Lloyd's algorithm):
   - Compute decision boundary $d_0 = \tilde{c}_1 / 2$ and inter-center boundaries
     $d_k = (\tilde{c}_k + \tilde{c}_{k+1})/2$.
   - **Assign**: each $y > d_0$ to its nearest center.
   - **Update**: $\tilde{c}_k \leftarrow \text{mean}(S_k)$ where $S_k$ is the set of values
     assigned to center $k$.
   - Repeat until convergence.
5. **Rescale**: Multiply by $c_7^{\text{target}}$ (e.g., 6.0) to obtain the final codebook
   in the standard scale range: $c_k = \tilde{c}_k \cdot c_7^{\text{target}}$.

### Why this works

The fixed FP4 E2M1 codebook $\{0, 0.5, 1, 1.5, 2, 3, 4, 6\}$ allocates half its positive
values to the range $[0, 1.5]$ (linear spacing) and the other half to $[2, 6]$ (exponential
spacing). This is a compromise designed for general-purpose hardware.

The learned codebook adapts to the **actual distribution of normalized weight magnitudes**.
If the distribution is approximately uniform in $[0, 1]$ (as is typical for Gaussian-like
weights after normalization), the optimal codebook will be more evenly spaced — placing
more resolution where there is more probability mass.

### Complexity

- **Normalization**: $O(MK)$ — one pass over the weight matrix.
- **Sorting for initialization**: $O(N \log N)$ where $N = MK$.
- **Each k-means iteration**: $O(N)$ (binary search in 7 sorted centers is $O(1)$).
- **Total iterations**: Typically $< 100$ for 1D k-means.

For a $2560 \times 9728$ weight matrix with block size 32, $N \approx 25\text{M}$ values.
The entire optimization runs in seconds on CPU.

## 5. Per-Block Scale Selection

Given the custom codebook $\mathcal{C}^+$, per-block scale selection follows the same framework
as the [Optimal Scale Search](optimal_scale_search.md):

- **Naive**: $s = \text{snap}(\max_i|x_i| / c_7)$
- **SSE-Optimal**: Bounded search minimizing $E(s) = \sum_i (x_i - s \cdot Q(x_i/s))^2$,
  using clipping and dead-zone bounds with $q_{\max} = c_7$ and $d_0 = c_1/2$
- **Hessian-Optimal**: Same bounded search minimizing $r^T H r$ with the custom codebook

The bounds from [Optimal Scale Search](optimal_scale_search.md) apply directly — only the
codebook values and boundaries change.

## 6. Discussion

**vs. FP4 E2M1.** The fixed FP4 E2M1 codebook $\{0, 0.5, 1, 1.5, 2, 3, 4, 6\}$ is designed
for hardware decode efficiency (the bit pattern maps directly to a floating-point value).
A learned codebook requires a lookup table for decode, but can achieve lower quantization error
by adapting to the actual weight distribution.

**Weight-dependent.** The optimal codebook depends on the weight distribution. Different layers
(or layer types) may benefit from different codebooks. In practice, one could learn a single
codebook per model, per layer type, or per individual layer — trading off storage (16 values
per codebook) against quantization quality.

**Relationship to scalar k-means.** This derivation shows that optimal block-scaled 4-bit
quantization with a learned codebook reduces to 1D k-means on the pooled normalized magnitudes.
Scale normalization provides invariance to per-block magnitude, reducing the problem from a
joint codebook-and-scale optimization to a single scalar quantization problem.