INT8
Reference Implementation
- qwantize.int8.reference.int8_naive(W, dim=-1, return_dequant=False)
Naive INT8 quantization: s = FP8_E4M3(max|x_i|) / 127 per block.
The per-block amax is snapped to FP8 E4M3 and divided by 127 to get the effective scale. This keeps the stored amax within FP8 range.
- Parameters:
W – Input tensor. W.shape[dim] must be in {32, 64, 128, 256}.
dim – Dimension along which to quantize (default: -1).
return_dequant – If True, also return the dequantized tensor.
- Returns:
(scales, quants) by default, or (scales, quants, dequant) if return_dequant is True.
scales: Per-block effective scales (FP8 amax / 127). Shape is W.shape with dimension dim removed.
quants: Integer values in [-127, 127]. Same shape as W.
dequant: quants * scales, broadcast over dim. Same shape as W.
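To make the scale rule concrete, here is a minimal single-block sketch in plain Python, with lists standing in for tensors so it runs without dependencies. The helper names `snap_to_e4m3` and `naive_int8_block` are hypothetical, and round-to-nearest snapping is an assumption: the library may snap the amax differently (for example, upward).

```python
# All positive finite FP8 E4M3 values (exponent bias 7, largest value 448).
# (0, 0) encodes zero and (15, 7) is the reserved NaN pattern, so both are skipped.
E4M3 = [
    (m / 8) * 2.0 ** -6 if e == 0 else (1 + m / 8) * 2.0 ** (e - 7)
    for e in range(16) for m in range(8)
    if (e, m) not in {(0, 0), (15, 7)}
]

def snap_to_e4m3(a):
    """Nearest E4M3 grid value (assumed rounding mode, see lead-in)."""
    return min(E4M3, key=lambda v: abs(v - a))

def naive_int8_block(block):
    """Return (effective scale, integer codes) for one block."""
    amax = max(abs(x) for x in block)
    s = snap_to_e4m3(amax) / 127          # effective scale
    # round(clamp(x / s, -127, 127)); the clamp matters when the
    # snapped amax lands below the true amax.
    quants = [round(max(-127.0, min(127.0, x / s))) for x in block]
    return s, quants

block = [0.3, -1.0, 0.26, 0.0]
s, q = naive_int8_block(block)
dequant = [qi * s for qi in q]
```

Because the snapped amax can fall slightly below the true amax under round-to-nearest, the largest element may clip to ±127 in this sketch.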
- qwantize.int8.reference.int8_optimal(W, dim=-1, return_dequant=False)
SSE-optimal INT8 quantization via bounded search over FP8 E4M3 scales.
The effective scale grid is {a / 127 : a in FP8_E4M3_positive}, giving 126 discrete candidates. Uses clipping and dead-zone bounds to prune the search, with a fast-fail clipping check per candidate.
- Parameters:
W – Input tensor. W.shape[dim] must be in {32, 64, 128, 256}.
dim – Dimension along which to quantize (default: -1).
return_dequant – If True, also return the dequantized tensor.
- Returns:
(scales, quants) by default, or (scales, quants, dequant) if return_dequant is True.
scales: Per-block optimal effective scales. Shape is W.shape with dimension dim removed.
quants: Integer values in [-127, 127]. Same shape as W.
dequant: quants * scales, broadcast over dim. Same shape as W.
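The search objective can be illustrated with an exhaustive scan of the 126-candidate grid. This is only a sketch of what is being minimized: it omits the clipping and dead-zone pruning and the fast-fail check described above, and the names `sse` and `optimal_scale_bruteforce` are hypothetical.

```python
# Positive finite FP8 E4M3 values, then the effective scale grid a / 127.
E4M3 = [
    (m / 8) * 2.0 ** -6 if e == 0 else (1 + m / 8) * 2.0 ** (e - 7)
    for e in range(16) for m in range(8)
    if (e, m) not in {(0, 0), (15, 7)}
]
CANDIDATES = [a / 127 for a in E4M3]

def quantize(block, s):
    return [round(max(-127.0, min(127.0, x / s))) for x in block]

def sse(block, s):
    """Sum of squared errors after quantizing and dequantizing with scale s."""
    return sum((x - q * s) ** 2 for x, q in zip(block, quantize(block, s)))

def optimal_scale_bruteforce(block):
    """Try every candidate scale; no pruning (unlike the real search)."""
    return min(CANDIDATES, key=lambda s: sse(block, s))

block = [0.9, -0.31, 0.05, 0.62]
amax = max(abs(x) for x in block)
s_naive = min(E4M3, key=lambda a: abs(a - amax)) / 127   # nearest-snap naive scale
s_best = optimal_scale_bruteforce(block)
```

Since the naive scale is itself one of the candidates, the brute-force result can never have a larger SSE than the naive one on the same block.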
- qwantize.int8.reference.int8_optimal_hessian(W, dim=-1, return_dequant=False, X=None, H_blocks=None)
Hessian-aware optimal INT8 scale search over FP8 E4M3 candidates.
Like int8_optimal(), searches over FP8 E4M3 scale candidates using SSE bounds for pruning, but selects the scale minimizing the Hessian-weighted error (x - sq)^T H (x - sq) instead of raw SSE.
- Parameters:
W – Input tensor. W.shape[dim] must be in {32, 64, 128, 256}.
dim – Dimension along which to quantize (default: -1).
return_dequant – If True, also return the dequantized tensor.
X – Activation tensor of shape (T, K). H is computed as X_j^T @ X_j for each column block X_j.
H_blocks – Pre-computed block Hessians of shape (num_col_blocks, bs, bs). If provided, X is ignored.
- Returns:
(scales, quants) by default, or (scales, quants, dequant) if return_dequant is True.
scales: Per-block optimal effective scales. Shape is W.shape with dimension dim removed.
quants: Integer values in [-127, 127]. Same shape as W.
dequant: quants * scales, broadcast over dim. Same shape as W.
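The difference between the Hessian-weighted objective and raw SSE can be sketched for a single block as follows. This is an illustrative brute-force scan, not the library's pruned search; `hessian` and `weighted_error` are hypothetical helpers, and X here plays the role of one column block of the activations.

```python
# Effective scale grid: positive finite FP8 E4M3 values divided by 127.
E4M3 = [
    (m / 8) * 2.0 ** -6 if e == 0 else (1 + m / 8) * 2.0 ** (e - 7)
    for e in range(16) for m in range(8)
    if (e, m) not in {(0, 0), (15, 7)}
]
CANDIDATES = [a / 127 for a in E4M3]

def quantize(x, s):
    return [round(max(-127.0, min(127.0, v / s))) for v in x]

def hessian(X):
    """H = X^T @ X for activations X given as T rows of length K."""
    K = len(X[0])
    return [[sum(r[i] * r[j] for r in X) for j in range(K)] for i in range(K)]

def weighted_error(x, s, H):
    """(x - s*q)^T H (x - s*q) for one block."""
    e = [v - q * s for v, q in zip(x, quantize(x, s))]
    n = len(e)
    return sum(e[i] * H[i][j] * e[j] for i in range(n) for j in range(n))

def sse(x, s):
    return sum((v - q * s) ** 2 for v, q in zip(x, quantize(x, s)))

x = [0.9, -0.31, 0.05, 0.62]
X = [[1.0, 0.2, 0.0, 0.0],
     [0.0, 3.0, 0.1, 0.0],
     [0.5, 0.0, 0.2, 1.0]]
H = hessian(X)
s_sse = min(CANDIDATES, key=lambda s: sse(x, s))
s_hess = min(CANDIDATES, key=lambda s: weighted_error(x, s, H))
```

With H equal to the identity matrix the weighted error reduces to plain SSE, so the two selections coincide; they can differ once H weights some coordinates more heavily than others.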
- qwantize.int8.reference.int8_dequantize(scales, quants, dim=-1)
Dequantize INT8: dequant = quants * scales.
- Parameters:
scales – Per-block effective scales (FP8 amax / 127). Shape is the original W.shape with dimension dim removed.
quants – Integer values in [-127, 127]. Same shape as the original W.
dim – Block dimension in quants (default: -1).
- Returns:
Dequantized tensor with the same shape as quants.
- qwantize.int8.reference.int8_quantize(x, s)
Symmetric INT8 quantization: round(clamp(x / s, -127, 127)).
- Parameters:
x – Input tensor of shape (..., block_size).
s – Per-block effective scale of shape (..., 1), broadcastable to x.
- Returns:
Integer-valued tensor in [-127, 127], same shape as x.
- qwantize.int8.reference.int8_dequantize_block(quants, s)
Dequantize INT8 block values: dequant = quants * s.
- Parameters:
quants – Integer-valued tensor of shape (..., block_size).
s – Per-block effective scale of shape (..., 1), broadcastable.
- Returns:
Dequantized tensor with the same shape as quants.
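A quantize/dequantize round trip on one block shows both rounding and clipping behavior. The sketch below follows the two formulas given above; the function names `quantize_block` and `dequantize_block` are stand-ins written for illustration, not the library functions themselves.

```python
Q_MAX = 127

def quantize_block(x, s):
    """round(clamp(x / s, -127, 127)) applied elementwise."""
    return [round(max(-Q_MAX, min(Q_MAX, v / s))) for v in x]

def dequantize_block(quants, s):
    """dequant = quants * s applied elementwise."""
    return [q * s for q in quants]

x = [0.10, -0.42, 0.77, 2.00]   # the last value exceeds 127 * s and clips
s = 0.01
q = quantize_block(x, s)
x_hat = dequantize_block(q, s)
```

Values inside the representable range come back within s / 2 of the input, while the out-of-range value 2.00 saturates at 127 and dequantizes to about 1.27.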
- qwantize.int8.reference.compute_block_sse(x, s)
Compute per-block sum of squared INT8 quantization error.
- Parameters:
x – Block values of shape (num_blocks, block_size).
s – Per-block effective scales of shape (num_blocks,) or (num_blocks, 1).
- Returns:
Tensor of shape (num_blocks,) with the SSE for each block.
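As a rough illustration of the quantity returned, the per-block SSE can be sketched as below, assuming the error is measured between each input value and its quantize-then-dequantize reconstruction. `block_sse` is a hypothetical helper operating on nested lists rather than tensors.

```python
def block_sse(blocks, scales):
    """One SSE value per block; blocks is num_blocks rows, scales one per row."""
    out = []
    for x, s in zip(blocks, scales):
        err = 0.0
        for v in x:
            q = round(max(-127.0, min(127.0, v / s)))   # symmetric INT8 code
            err += (v - q * s) ** 2                      # squared reconstruction error
        out.append(err)
    return out

blocks = [[0.10, -0.42, 0.77, 2.00],
          [0.0, 0.025, -0.045, 0.06]]
scales = [0.01, 0.02]
result = block_sse(blocks, scales)
```

In the first block almost all of the error comes from the clipped value 2.00; in the second it comes from ordinary rounding of 0.025 and -0.045.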
Constants
- qwantize.int8.Q_MAX = 127 – Maximum symmetric INT8 magnitude
- qwantize.int8.D_0 = 0.5 – Dead-zone boundary for rounding to zero
- qwantize.int8.VALID_BLOCK_SIZES = (32, 64, 128, 256) – Supported block sizes
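One way to read Q_MAX and D_0 together: for a given effective scale s, inputs with magnitude below D_0 * s round to zero, and inputs with magnitude above Q_MAX * s saturate. The sketch below redefines the constants locally with the values listed above and is an interpretation, not a description of how the library uses them internally.

```python
Q_MAX = 127   # values copied from the constants listed above
D_0 = 0.5

def quantize(v, s):
    return round(max(-Q_MAX, min(Q_MAX, v / s)))

s = 0.01
dead_zone = D_0 * s    # |v| below this quantizes to 0
clip_at = Q_MAX * s    # |v| above this saturates at +/-127

inside_dead_zone = quantize(0.004, s)    # 0.4 in scaled units
outside_dead_zone = quantize(0.006, s)   # 0.6 in scaled units
clipped_pos = quantize(1.5, s)
clipped_neg = quantize(-1.5, s)
```

Both thresholds shrink as the scale shrinks, which is the trade-off a scale search has to balance: a smaller scale widens clipping but narrows the dead zone.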