INT8

Reference Implementation

qwantize.int8.reference.int8_naive(W, dim=-1, return_dequant=False)

Naive INT8 quantization: s = FP8_E4M3(max|x_i|) / 127 per block.

The per-block absolute maximum (amax) is snapped to an FP8 E4M3 value and then divided by 127 to give the effective scale. As a result, the stored amax is always representable in FP8 E4M3.

Parameters:

W – Input tensor. W.shape[dim] must be in {32, 64, 128, 256}.
dim – Dimension along which to quantize (default: -1).
return_dequant – If True, also return the dequantized tensor.

Returns:

(scales, quants) by default, or (scales, quants, dequant) if return_dequant is True.

scales: Per-block effective scales (FP8 amax / 127). Shape is W.shape with dimension dim removed.
quants: Integer values in [-127, 127]. Same shape as W.
dequant: quants * scales broadcast. Same shape as W.
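
As a rough illustration of the formula above, the following pure-Python sketch quantizes a single block. It does not call qwantize: snap_to_e4m3 and int8_naive_block are hypothetical helpers, and round-to-nearest-even snapping with saturation at 448 is an assumption, since the reference does not state the rounding mode.

```python
import math

def snap_to_e4m3(a):
    """Round a non-negative float to an FP8 E4M3 value.

    Assumption: round-to-nearest-even, saturating at 448 (the largest
    finite E4M3 value). The library's actual snapping may differ.
    """
    if a <= 0.0:
        return 0.0
    a = min(a, 448.0)
    _, exp = math.frexp(a)        # a = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, -6)          # -6 is the smallest normal exponent
    step = 2.0 ** (e - 3)         # 3 mantissa bits: 8 steps per binade
    return round(a / step) * step # Python's round() breaks ties to even

def int8_naive_block(block):
    """Return (scale, quants) for one block of floats (hypothetical helper)."""
    amax = snap_to_e4m3(max(abs(v) for v in block))
    scale = amax / 127.0
    if scale == 0.0:              # all-zero (or vanishingly small) block
        return scale, [0] * len(block)
    quants = [round(max(-127.0, min(127.0, v / scale))) for v in block]
    return scale, quants

scale, quants = int8_naive_block([0.5, -1.1, 0.25, 2.0])
```

Under these assumptions, snapping can round the amax down (for example 1.06 snaps to 1.0), in which case the largest values in the block are clipped at ±127.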

qwantize.int8.reference.int8_optimal(W, dim=-1, return_dequant=False)

SSE-Optimal INT8 quantization via bounded search over FP8 E4M3 scales.

The effective scale grid is {a / 127 : a in FP8_E4M3_positive}, giving 126 discrete candidates. The search is pruned using clipping and dead-zone bounds, and each candidate is subject to a fast-fail clipping check.

Parameters:

W – Input tensor. W.shape[dim] must be in {32, 64, 128, 256}.
dim – Dimension along which to quantize (default: -1).
return_dequant – If True, also return the dequantized tensor.

Returns:

(scales, quants) by default, or (scales, quants, dequant) if return_dequant is True.

scales: Per-block optimal effective scales. Shape is W.shape with dimension dim removed.
quants: Integer values in [-127, 127]. Same shape as W.
dequant: quants * scales broadcast. Same shape as W.
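
To make the candidate grid concrete, the sketch below enumerates the 126 positive finite FP8 E4M3 values and picks the SSE-minimizing scale for one block by exhaustive search. This is a brute-force stand-in, not the pruned search described above; e4m3_positive_values, sse_for_scale, and int8_optimal_block_bruteforce are hypothetical names.

```python
def e4m3_positive_values():
    """All positive finite FP8 E4M3 values (7 subnormals + 119 normals)."""
    subnormals = [m * 2.0 ** -9 for m in range(1, 8)]
    normals = [(1 + m / 8) * 2.0 ** e for e in range(-6, 9) for m in range(8)]
    # The top encoding (which would be 480) is NaN in E4M3, so it is excluded.
    return subnormals + [v for v in normals if v <= 448.0]

def sse_for_scale(block, scale):
    """Sum of squared error when quantizing `block` at `scale`."""
    total = 0.0
    for v in block:
        q = round(max(-127.0, min(127.0, v / scale)))
        total += (v - q * scale) ** 2
    return total

def int8_optimal_block_bruteforce(block):
    """Try every candidate scale a / 127 and keep the one with minimal SSE."""
    candidates = [a / 127.0 for a in e4m3_positive_values()]
    return min(candidates, key=lambda s: sse_for_scale(block, s))

block = [0.5, -1.1, 0.25, 2.0]
best_scale = int8_optimal_block_bruteforce(block)
```

Because the naive scale (snapped amax / 127) is itself one of the candidates, the scale found this way can never have a larger SSE than the naive one.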

qwantize.int8.reference.int8_optimal_hessian(W, dim=-1, return_dequant=False, X=None, H_blocks=None)

Hessian-aware optimal INT8 scale search over FP8 E4M3 candidates.

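Like int8_optimal(), this function searches over FP8 E4M3 scale candidates and uses SSE bounds for pruning, but it selects the scale that minimizes the Hessian-weighted error (x - sq)^T H (x - sq) instead of the raw SSE. Here x is a block, s is the candidate scale, q is the quantized block at that scale, and H is the block Hessian.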

Parameters:

W – Input tensor. W.shape[dim] must be in {32, 64, 128, 256}.
dim – Dimension along which to quantize (default: -1).
return_dequant – If True, also return the dequantized tensor.
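X – Activation tensor of shape (T, K). The Hessian for block j is computed as X_j^T @ X_j, where X_j denotes the columns of X that belong to block j.
H_blocks – Pre-computed block Hessians of shape (num_col_blocks, bs, bs), where bs is the block size. If provided, X is ignored.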

Returns:

(scales, quants) by default, or (scales, quants, dequant) if return_dequant is True.

scales: Per-block optimal effective scales. Shape is W.shape with dimension dim removed.
quants: Integer values in [-127, 127]. Same shape as W.
dequant: quants * scales broadcast. Same shape as W.
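
The Hessian-weighted objective can be sketched for a single block as follows. This is a minimal pure-Python illustration, assuming H is the Gram matrix X_j^T @ X_j of the block's activations; gram_matrix and hessian_weighted_error are hypothetical helpers and are not part of qwantize.

```python
def gram_matrix(X):
    """H = X^T X for activations X given as T rows of length bs."""
    bs = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(bs)]
            for i in range(bs)]

def hessian_weighted_error(x, scale, H):
    """Compute (x - s*q)^T H (x - s*q) for one block x at scale s."""
    # Quantization error per element: x_i - s * q_i
    err = [v - scale * round(max(-127.0, min(127.0, v / scale))) for v in x]
    n = len(x)
    return sum(err[i] * H[i][j] * err[j] for i in range(n) for j in range(n))

# With orthonormal activations H is the identity and the objective
# reduces to the plain SSE.
H_identity = gram_matrix([[1.0, 0.0], [0.0, 1.0]])
plain = hessian_weighted_error([0.3, -0.7], 0.25, H_identity)

# With correlated activations, errors of the same sign reinforce each other.
H_corr = gram_matrix([[1.0, 1.0]])
weighted = hessian_weighted_error([0.3, -0.7], 0.25, H_corr)
```

In this sketch both elements have an error of about +0.05, so the correlated Hessian penalizes the block more heavily than the plain SSE does. This is the kind of difference that can lead the Hessian-aware search to pick a different scale.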

qwantize.int8.reference.int8_dequantize(scales, quants, dim=-1)

Dequantize INT8: dequant = quants * scales.

Parameters:

scales – Per-block effective scales (FP8 amax / 127). Shape is the original W.shape with dimension dim removed.
quants – Integer values in [-127, 127]. Same shape as the original W.
dim – Block dimension in quants (default: -1).

Returns:

Dequantized tensor with the same shape as quants.
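
The broadcast implied by dequant = quants * scales can be illustrated with nested lists standing in for a 2-D tensor with dim=-1: one scale per row, applied to every element in that row. dequantize_rows is a hypothetical helper used only for this sketch.

```python
def dequantize_rows(scales, quants):
    """scales: one float per row; quants: list of rows of ints (dim=-1 layout)."""
    return [[q * s for q in row] for s, row in zip(scales, quants)]

# Two blocks (rows) with their own scales.
dequant = dequantize_rows([0.5, 2.0], [[1, -2], [3, 0]])
```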

qwantize.int8.reference.int8_quantize(x, s)

Symmetric INT8 quantization: round(clamp(x / s, -127, 127)).

Parameters:

x – Input tensor of shape (..., block_size).
s – Per-block effective scale of shape (..., 1), broadcastable to x.

Returns:

Integer-valued tensor in [-127, 127], same shape as x.
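
A minimal sketch of round(clamp(x / s, -127, 127)) on a plain list, assuming round-half-to-even (the behaviour of Python's built-in round). quantize_block is a hypothetical name, not the library function.

```python
def quantize_block(x, s, q_max=127):
    """Clamp x / s to [-q_max, q_max], then round to the nearest integer."""
    return [round(max(-q_max, min(q_max, v / s))) for v in x]

# 0.3 / 0.25 = 1.2 rounds to 1, -0.7 / 0.25 = -2.8 rounds to -3,
# 100.0 / 0.25 = 400 is clipped to 127, and 0.1 / 0.25 = 0.4 rounds to 0.
quants = quantize_block([0.3, -0.7, 100.0, 0.1], 0.25)
```

Values that are not clipped are reconstructed to within s / 2 of the original; clipped values can be arbitrarily far off, which is why the choice of scale matters.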

qwantize.int8.reference.int8_dequantize_block(quants, s)

Dequantize INT8 block values: dequant = quants * s.

Parameters:

quants – Integer-valued tensor of shape (..., block_size).
s – Per-block effective scale of shape (..., 1), broadcastable to quants.

Returns:

Dequantized tensor with the same shape as quants.

qwantize.int8.reference.compute_block_sse(x, s)

Compute per-block sum of squared INT8 quantization error.

Parameters:

x – Block values of shape (num_blocks, block_size).
s – Per-block effective scales of shape (num_blocks,) or (num_blocks, 1).

Returns:

Tensor of shape (num_blocks,) with the SSE for each block.
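
For reference, the per-block SSE can be sketched as quantize, dequantize, and sum the squared differences. The helper per_block_sse below is hypothetical and works on lists rather than tensors.

```python
def per_block_sse(blocks, scales):
    """Return one SSE value per block; blocks is a list of rows, scales one per row."""
    out = []
    for x, s in zip(blocks, scales):
        sse = 0.0
        for v in x:
            q = round(max(-127.0, min(127.0, v / s)))  # quantize
            sse += (v - q * s) ** 2                    # squared reconstruction error
        out.append(sse)
    return out

# The second block lies exactly on the grid of its scale, so its SSE is zero.
sse = per_block_sse([[0.3, -0.7], [1.0, 2.0]], [0.25, 1.0])
```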

Constants

qwantize.int8.Q_MAX = 127 – Maximum symmetric INT8 magnitude
qwantize.int8.D_0 = 0.5 – Dead-zone boundary: a value x rounds to zero when |x / s| < 0.5
qwantize.int8.VALID_BLOCK_SIZES = (32, 64, 128, 256) – Supported block sizes