INT8

Reference Implementation

qwantize.int8.reference.int8_naive(W, dim=-1, return_dequant=False)

Naive INT8 quantization: s = FP8_E4M3(max|x_i|) / 127 per block.

The per-block amax is snapped to the FP8 E4M3 grid and divided by 127 to give the effective scale, so the stored amax is always representable in FP8.

Parameters:
  • W – Input tensor. W.shape[dim] must be in {32, 64, 128, 256}.

  • dim – Dimension along which to quantize (default: -1).

  • return_dequant – If True, also return the dequantized tensor.

Returns:

(scales, quants) by default, or (scales, quants, dequant) if return_dequant is True.

  • scales: Per-block effective scales (FP8 amax / 127). Shape is W.shape with dimension dim removed.

  • quants: Integer values in [-127, 127]. Same shape as W.

  • dequant: quants * scales, with scales broadcast along dim. Same shape as W.
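
The naive scheme above can be sketched in NumPy as follows. This is a minimal illustration, not the library's code: it assumes the E4M3FN variant of FP8 (bias 7, maximum 448, no infinities), assumes round-to-nearest snapping of the amax, and quantizes along the last axis only. The helper names e4m3_positive_grid, snap_to_e4m3 and naive_int8_sketch are hypothetical.

```python
import numpy as np

Q_MAX = 127

def e4m3_positive_grid():
    """All positive finite FP8 E4M3 values (assumed E4M3FN: bias 7, max 448)."""
    subnormals = [m / 8 * 2.0 ** -6 for m in range(1, 8)]
    normals = [(1 + m / 8) * 2.0 ** (e - 7)
               for e in range(1, 16) for m in range(8)
               if not (e == 15 and m == 7)]  # this bit pattern encodes NaN
    return np.array(subnormals + normals)

def snap_to_e4m3(a, grid):
    """Snap each value to the nearest grid point (assumed rounding mode)."""
    idx = np.abs(a[..., None] - grid).argmin(axis=-1)
    return grid[idx]

def naive_int8_sketch(W):
    """Quantize along the last axis, treating each row as one block."""
    grid = e4m3_positive_grid()
    amax = np.abs(W).max(axis=-1)                 # per-block max magnitude
    scales = snap_to_e4m3(amax, grid) / Q_MAX     # effective scale per block
    quants = np.round(np.clip(W / scales[..., None], -Q_MAX, Q_MAX))
    return scales, quants

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 32))
scales, quants = naive_int8_sketch(W)
```

Because the snapped amax can be slightly smaller than the true amax under round-to-nearest, the clip to [-127, 127] is what keeps the integers in range in this sketch.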

qwantize.int8.reference.int8_optimal(W, dim=-1, return_dequant=False)

SSE-optimal INT8 quantization via bounded search over FP8 E4M3 scales.

The effective scale grid is {a / 127 : a in FP8_E4M3_positive}, giving 126 discrete candidates. The search uses clipping and dead-zone bounds to prune candidates, with a fast-fail clipping check per candidate.

Parameters:
  • W – Input tensor. W.shape[dim] must be in {32, 64, 128, 256}.

  • dim – Dimension along which to quantize (default: -1).

  • return_dequant – If True, also return the dequantized tensor.

Returns:

(scales, quants) by default, or (scales, quants, dequant) if return_dequant is True.

  • scales: Per-block optimal effective scales. Shape is W.shape with dimension dim removed.

  • quants: Integer values in [-127, 127]. Same shape as W.

  • dequant: quants * scales, with scales broadcast along dim. Same shape as W.
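
To make the search space concrete, the sketch below evaluates every one of the 126 candidate scales for a single block and keeps the one with the lowest SSE. It is a plain exhaustive search, not the bounded, pruned search described above, and it assumes the E4M3FN variant of FP8. The names e4m3_positive_grid, sse_for_scale and exhaustive_int8_scale are hypothetical.

```python
import numpy as np

Q_MAX = 127

def e4m3_positive_grid():
    """All positive finite FP8 E4M3 values (assumed E4M3FN: bias 7, max 448)."""
    subnormals = [m / 8 * 2.0 ** -6 for m in range(1, 8)]
    normals = [(1 + m / 8) * 2.0 ** (e - 7)
               for e in range(1, 16) for m in range(8)
               if not (e == 15 and m == 7)]  # this bit pattern encodes NaN
    return np.array(subnormals + normals)

def sse_for_scale(x, s):
    """Sum of squared error after quantizing block x with scale s."""
    q = np.round(np.clip(x / s, -Q_MAX, Q_MAX))
    return float(np.sum((x - q * s) ** 2))

def exhaustive_int8_scale(x):
    """Try every candidate scale a / 127 and return the SSE minimizer."""
    candidates = e4m3_positive_grid() / Q_MAX
    errors = np.array([sse_for_scale(x, s) for s in candidates])
    best = int(errors.argmin())
    return candidates[best], errors[best], candidates, errors

rng = np.random.default_rng(0)
x = rng.standard_normal(32)
best_scale, best_sse, candidates, errors = exhaustive_int8_scale(x)
```

A pruned search should return the same scale as this loop whenever its bounds are valid, which makes the exhaustive version a convenient correctness check.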

qwantize.int8.reference.int8_optimal_hessian(W, dim=-1, return_dequant=False, X=None, H_blocks=None)

Hessian-aware optimal INT8 scale search over FP8 E4M3 candidates.

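Like int8_optimal(), this function searches over FP8 E4M3 scale candidates and uses SSE bounds for pruning, but it selects the scale that minimizes the Hessian-weighted error (x - sq)^T H (x - sq) instead of the raw SSE. Here x is the block, s the candidate scale, q the quantized integers, and H the block Hessian.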

Parameters:
  • W – Input tensor. W.shape[dim] must be in {32, 64, 128, 256}.

  • dim – Dimension along which to quantize (default: -1).

  • return_dequant – If True, also return the dequantized tensor.

  • X – Activation tensor of shape (T, K). The Hessian H of column block j is computed as X_j^T @ X_j.

  • H_blocks – Pre-computed block Hessians of shape (num_col_blocks, bs, bs). If provided, X is ignored.

Returns:

(scales, quants) by default, or (scales, quants, dequant) if return_dequant is True.

  • scales: Per-block optimal effective scales. Shape is W.shape with dimension dim removed.

  • quants: Integer values in [-127, 127]. Same shape as W.

  • dequant: quants * scales, with scales broadcast along dim. Same shape as W.
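
The Hessian-weighted objective can be illustrated for a single block as below. This is a sketch under assumptions: X_j is taken to be the (T, bs) slice of activations belonging to one column block, and the function name hessian_weighted_error is hypothetical.

```python
import numpy as np

Q_MAX = 127

def hessian_weighted_error(x, s, H):
    """Return (x - s*q)^T H (x - s*q) for block x, scale s, Hessian H."""
    q = np.round(np.clip(x / s, -Q_MAX, Q_MAX))
    e = x - s * q                      # quantization residual
    return float(e @ H @ e)

rng = np.random.default_rng(0)
X_j = rng.standard_normal((64, 32))    # assumed (T, bs) activations for one block
H = X_j.T @ X_j                        # (bs, bs), symmetric positive semidefinite
x = rng.standard_normal(32)            # one weight block
s = np.abs(x).max() / Q_MAX            # any positive scale works here

err = hessian_weighted_error(x, s, H)
residual = x - s * np.round(np.clip(x / s, -Q_MAX, Q_MAX))
```

With H = X_j^T @ X_j the objective equals the squared norm of X_j applied to the residual, and with H set to the identity it reduces to the plain SSE used by int8_optimal().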

qwantize.int8.reference.int8_dequantize(scales, quants, dim=-1)

Dequantize INT8: dequant = quants * scales.

Parameters:
  • scales – Per-block effective scales (FP8 amax / 127). Shape is the original W.shape with dimension dim removed.

  • quants – Integer values in [-127, 127]. Same shape as the original W.

  • dim – Block dimension in quants (default: -1).

Returns:

Dequantized tensor with the same shape as quants.
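
Since scales lacks the dim axis that quants has, dequantization amounts to re-inserting that axis before multiplying. A minimal NumPy sketch of this shape handling, using the hypothetical name dequantize_sketch:

```python
import numpy as np

def dequantize_sketch(scales, quants, dim=-1):
    """Re-insert the block axis into scales, then multiply elementwise."""
    return quants * np.expand_dims(scales, axis=dim)

# Blocks along the last axis: quants (4, 32), scales (4,)
quants_last = np.ones((4, 32))
scales_last = np.array([0.1, 0.2, 0.3, 0.4])
deq_last = dequantize_sketch(scales_last, quants_last, dim=-1)

# Blocks along a middle axis: quants (2, 32, 5), scales (2, 5)
quants_mid = np.ones((2, 32, 5))
scales_mid = np.full((2, 5), 0.5)
deq_mid = dequantize_sketch(scales_mid, quants_mid, dim=1)
```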

qwantize.int8.reference.int8_quantize(x, s)

Symmetric INT8 quantization: round(clamp(x / s, -127, 127)).

Parameters:
  • x – Input tensor of shape (..., block_size).

  • s – Per-block effective scale of shape (..., 1), broadcastable to x.

Returns:

Integer-valued tensor in [-127, 127], same shape as x.
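
The formula above can be checked on a small example. The sketch below uses NumPy and the hypothetical name int8_quantize_sketch; note that np.round rounds exact ties to the nearest even integer, which may or may not match the library's tie-breaking.

```python
import numpy as np

Q_MAX = 127

def int8_quantize_sketch(x, s):
    """round(clamp(x / s, -127, 127)), with s broadcast over the block axis."""
    return np.round(np.clip(x / s, -Q_MAX, Q_MAX))

x = np.array([[0.04, -0.26, 1.0, 300.0]])   # one block of four values
s = np.array([[0.1]])                        # shape (..., 1), broadcastable to x
q = int8_quantize_sketch(x, s)
# q holds 0, -3, 10 and 127: the last value saturates at Q_MAX
```

Because the clamp bounds are integers, clamping before rounding gives the same result as rounding before clamping.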

qwantize.int8.reference.int8_dequantize_block(quants, s)

Dequantize INT8 block values: dequant = quants * s.

Parameters:
  • quants – Integer-valued tensor of shape (..., block_size).

  • s – Per-block effective scale of shape (..., 1), broadcastable.

Returns:

Dequantized tensor with the same shape as quants.

qwantize.int8.reference.compute_block_sse(x, s)

Compute per-block sum of squared INT8 quantization error.

Parameters:
  • x – Block values of shape (num_blocks, block_size).

  • s – Per-block effective scales of shape (num_blocks,) or (num_blocks, 1).

Returns:

Tensor of shape (num_blocks,) with the SSE for each block.
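
The per-block SSE combines quantization, dequantization and a squared-error reduction over the block axis. A minimal NumPy sketch, assuming the same round(clamp(x / s, -127, 127)) quantizer as above and using the hypothetical name block_sse_sketch:

```python
import numpy as np

Q_MAX = 127

def block_sse_sketch(x, s):
    """Per-block sum of squared INT8 quantization error."""
    s = np.reshape(s, (-1, 1))                       # accept (n,) or (n, 1)
    q = np.round(np.clip(x / s, -Q_MAX, Q_MAX))      # quantize
    return np.sum((x - q * s) ** 2, axis=-1)         # dequantize and reduce

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 32))                     # (num_blocks, block_size)
s = np.abs(x).max(axis=-1) / Q_MAX                   # unsnapped amax / 127
sse = block_sse_sketch(x, s)
```

With a scale of amax / 127 nothing is clipped, so each element's error is at most s / 2 and the block SSE is bounded by block_size * (s / 2)^2.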

Constants

  • qwantize.int8.Q_MAX = 127 – Maximum symmetric INT8 magnitude

  • qwantize.int8.D_0 = 0.5 – Dead-zone boundary for rounding to zero

  • qwantize.int8.VALID_BLOCK_SIZES = (32, 64, 128, 256) – Supported block sizes
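
The role of D_0 can be seen directly: under round-to-nearest, any element with |x / s| below 0.5 quantizes to zero. The sketch below redefines the constants locally rather than importing them, and is only an illustration of the dead zone, not of how the library's pruning bounds are computed.

```python
import numpy as np

# Local copies of the documented constants
Q_MAX = 127
D_0 = 0.5
VALID_BLOCK_SIZES = (32, 64, 128, 256)

x = np.array([0.02, -0.049, 0.051, 0.3])
s = 0.1

q = np.round(np.clip(x / s, -Q_MAX, Q_MAX))
in_dead_zone = np.abs(x) < D_0 * s       # elements that must round to zero

# A scale above amax / D_0 puts every element in the dead zone
s_big = 2.01 * np.abs(x).max()
q_big = np.round(np.clip(x / s_big, -Q_MAX, Q_MAX))
```

Here the first two elements fall inside the dead zone for s = 0.1, and with s_big the whole block quantizes to zero, so the SSE equals the sum of squares of x.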