Introduction to Model Quantization - Xie Xianbin's Blog

Introduction to Model Quantization

Published: 2025-02-01 Updated: 2025-06-08 Author: Xie Xianbin (谢先斌)

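Model quantization is a technique that lowers the numerical precision of a neural network's parameters and activations (for example, converting 32-bit floating point to 8-bit integers) in order to shrink the model, speed up computation, and reduce power consumption. It is one of the core methods of deep learning model compression and optimization, and is especially useful for deploying models on resource-constrained devices such as phones and embedded systems.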

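Core Principles

Reduced precision:
- The original model usually stores weights and activations as 32-bit floating-point numbers (float32)
- After quantization, values are mapped to a lower-precision representation (such as int8, uint8, or even 4-bit integers), which greatly reduces storage and compute requirements
Mapping:
- A scale factor (scale) and a zero point linearly map the floating-point value range onto the integer range
- For example, floats in [-1.0, 1.0] are mapped to 8-bit integers in 0~255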
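
The scale and zero-point mapping described above can be sketched in a few lines of plain Python. This is a minimal illustration of affine (asymmetric) quantization, not the implementation of any particular framework; the function names `quant_params`, `quantize`, and `dequantize` are hypothetical and chosen only for this example.

```python
def quant_params(x_min, x_max, q_min=0, q_max=255):
    """Derive scale and zero point so that [x_min, x_max] maps onto [q_min, q_max]."""
    scale = (x_max - x_min) / (q_max - q_min)
    zero_point = round(q_min - x_min / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, q_min=0, q_max=255):
    """Map a float to an integer code, clamping to the representable range."""
    q = round(x / scale) + zero_point
    return max(q_min, min(q_max, q))


def dequantize(q, scale, zero_point):
    """Recover an approximation of the original float from its integer code."""
    return (q - zero_point) * scale


# Example from the text: floats in [-1.0, 1.0] mapped to 8-bit codes in 0..255.
scale, zero_point = quant_params(-1.0, 1.0)
for x in (-1.0, -0.5, 0.0, 0.5, 1.0):
    q = quantize(x, scale, zero_point)
    print(x, q, dequantize(q, scale, zero_point))
```

The round trip is lossy: each value comes back within roughly one quantization step (`scale`) of the original, which is the precision loss discussed in the rest of the article.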

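Main Advantages of Quantization

Smaller models:
- Quantizing float32 -> int8 cuts weight storage by about 75%
- For example, a 100MB model shrinks to roughly 25MB (plus a small overhead for scales and zero points)
Faster inference: low-precision arithmetic (such as integer operations) is usually faster than floating-point arithmetic on hardware such as CPUs, GPUs, and NPUs
Lower power consumption: integer operations typically use much less energy than floating-point operations, which suits mobile and IoT devices
Hardware compatibility: the chips in many edge devices (such as phones and cameras) are specifically optimized for low-precision computation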
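
The storage figures above are simple arithmetic on the bit width. The sketch below reproduces them under the assumption that the model consists only of weights and that the per-tensor metadata (scales, zero points) is negligible; `quantized_size_mb` is a hypothetical helper for this example.

```python
def quantized_size_mb(size_fp32_mb, bits):
    """Estimate model size after quantizing float32 weights to `bits` bits each.

    Assumption: metadata such as scales and zero points is ignored.
    """
    return size_fp32_mb * bits / 32


for bits in (16, 8, 4):
    size = quantized_size_mb(100, bits)
    saving = 1 - bits / 32
    print(f"{bits}-bit: {size} MB ({saving:.0%} smaller than float32)")
```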

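Categories of Quantization Methods

Post-Training Quantization (PTQ):
- Quantizes an already trained model directly, with no retraining
- Fast, but may lose some accuracy
- Suitable for: quick deployment, and tasks that are not extremely sensitive to accuracy
Quantization-Aware Training (QAT):
- Simulates quantization during training so that the model adapts to the low-precision representation
- Smaller accuracy loss, but requires retraining and takes longer
- Suitable for: tasks with higher accuracy requirements (such as object detection and semantic segmentation)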
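
To make the two approaches more concrete, here is a minimal sketch in plain Python, assuming symmetric int8 quantization. `calibrate` stands in for the PTQ step of deriving a scale from observed values, and `fake_quantize` stands in for the quantize-then-dequantize operation that QAT typically inserts into the forward pass. Both names are hypothetical; real frameworks provide their own observers and fake-quantization modules, and QAT additionally needs a way to pass gradients through the rounding step (commonly a straight-through estimator), which is not shown here.

```python
def calibrate(samples):
    """PTQ-style calibration: derive a symmetric int8 scale from observed values."""
    max_abs = max(abs(v) for v in samples)
    return max_abs / 127 if max_abs > 0 else 1.0


def fake_quantize(x, scale):
    """QAT-style fake quantization: quantize, then immediately dequantize.

    The result is still a float, but it carries the rounding error that the
    real int8 model would see, so training can adapt to it.
    """
    q = max(-127, min(127, round(x / scale)))
    return q * scale


weights = [0.82, -0.41, 0.05, -1.27, 0.63]
scale = calibrate(weights)
simulated = [fake_quantize(w, scale) for w in weights]

for w, s in zip(weights, simulated):
    print(f"original={w:+.4f} simulated={s:+.4f} error={abs(w - s):.5f}")
```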

Quantization Types

type	source	description
F64	Wikipedia	64-bit standard IEEE 754 double-precision floating-point number.
I64	GH	64-bit fixed-width integer number.
F32	Wikipedia	32-bit standard IEEE 754 single-precision floating-point number.
I32	GH	32-bit fixed-width integer number.
F16	Wikipedia	16-bit standard IEEE 754 half-precision floating-point number.
BF16	Wikipedia	16-bit shortened version of the 32-bit IEEE 754 single-precision floating-point number.
I16	GH	16-bit fixed-width integer number.
Q8_0	GH	8-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale. Legacy quantization method (not used widely as of today).
Q8_1	GH	8-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale + block_minimum. Legacy quantization method (not used widely as of today).
Q8_K	GH	8-bit quantization (q). Each block has 256 weights. Only used for quantizing intermediate results. All 2-6 bit dot products are implemented for this quantization type. Weight formula: w = q * block_scale.
I8	GH	8-bit fixed-width integer number.
Q6_K	GH	6-bit quantization (q). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: w = q * block_scale(8-bit), resulting in 6.5625 bits-per-weight.
Q5_0	GH	5-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale. Legacy quantization method (not used widely as of today).
Q5_1	GH	5-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale + block_minimum. Legacy quantization method (not used widely as of today).
Q5_K	GH	5-bit quantization (q). Super-blocks with 8 blocks, each block has 32 weights. Weight formula: w = q * block_scale(6-bit) + block_min(6-bit), resulting in 5.5 bits-per-weight.
Q4_0	GH	4-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale. Legacy quantization method (not used widely as of today).
Q4_1	GH	4-bit round-to-nearest quantization (q). Each block has 32 weights. Weight formula: w = q * block_scale + block_minimum. Legacy quantization method (not used widely as of today).
Q4_K	GH	4-bit quantization (q). Super-blocks with 8 blocks, each block has 32 weights. Weight formula: w = q * block_scale(6-bit) + block_min(6-bit), resulting in 4.5 bits-per-weight.
Q3_K	GH	3-bit quantization (q). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: w = q * block_scale(6-bit), resulting in 3.4375 bits-per-weight.
Q2_K	GH	2-bit quantization (q). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: w = q * block_scale(4-bit) + block_min(4-bit), resulting in 2.625 bits-per-weight.
IQ4_NL	GH	4-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix.
IQ4_XS	HF	4-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 4.25 bits-per-weight.
IQ3_S	HF	3-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 3.44 bits-per-weight.
IQ3_XXS	HF	3-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 3.06 bits-per-weight.
IQ2_XXS	HF	2-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 2.06 bits-per-weight.
IQ2_S	HF	2-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 2.5 bits-per-weight.
IQ2_XS	HF	2-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 2.31 bits-per-weight.
IQ1_S	HF	1-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 1.56 bits-per-weight.
IQ1_M	GH	1-bit quantization (q). Super-blocks with 256 weights. Weight w is obtained using super_block_scale & importance matrix, resulting in 1.75 bits-per-weight.
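
The fractional bits-per-weight figures for the K-quant rows can be reproduced from the block layout given in the table. The sketch below assumes, beyond what the table states, that each super-block also stores one 16-bit float for the super-block scale, plus a second one when the format has a block minimum. `bits_per_weight` is a hypothetical helper for this example.

```python
FP16_BITS = 16  # assumption: super-block scale (and min) are stored as fp16


def bits_per_weight(q_bits, n_blocks, block_size, scale_bits, min_bits=0):
    """Effective bits per weight for one super-block, including metadata."""
    n_weights = n_blocks * block_size
    quant_bits = n_weights * q_bits                     # the quantized weights
    block_meta = n_blocks * (scale_bits + min_bits)     # per-block scale / min
    super_meta = FP16_BITS * (2 if min_bits else 1)     # per-super-block floats
    return (quant_bits + block_meta + super_meta) / n_weights


formats = {
    "Q6_K": bits_per_weight(6, 16, 16, scale_bits=8),
    "Q5_K": bits_per_weight(5, 8, 32, scale_bits=6, min_bits=6),
    "Q4_K": bits_per_weight(4, 8, 32, scale_bits=6, min_bits=6),
    "Q3_K": bits_per_weight(3, 16, 16, scale_bits=6),
    "Q2_K": bits_per_weight(2, 16, 16, scale_bits=4, min_bits=4),
}

for name, bpw in formats.items():
    print(name, bpw)
```

Under these assumptions the results agree with the table: 6.5625, 5.5, 4.5, 3.4375, and 2.625 bits-per-weight respectively.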

Q4_K_M

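Q4_K_M is a quantization format for large language models (LLMs) that is especially popular in the llama.cpp project and its derivatives
- 4-bit quantization: Q4 means the model weights are quantized to 4-bit integers
  - A common choice for greatly reducing model size and memory usage while retaining reasonable accuracy
- K-quantization: K means it uses the K-quantization scheme from llama.cpp.
  - Weights are split into small blocks, and each block has its own quantization parameters (such as a scale factor and a minimum)
- M usually denotes a medium, balanced configuration within K-quantization
Concretely, Q4_K_M and Q4_K_S are both 4-bit K-quantizations. In llama.cpp they differ mainly in which tensors are kept at higher precision: Q4_K_M stores some attention and feed-forward tensors as Q6_K and the rest as Q4_K, while Q4_K_S uses Q4_K throughout, so the two trade model size against accuracy differently
In general, Q4_K_M offers a good compromise between performance and accuracy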
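
The idea of giving each block its own quantization parameters can be sketched as follows. This is a simplified illustration using the `w = q * block_scale + block_min` formula from the table with 4-bit codes and blocks of 32 weights. It is not the actual Q4_K layout: real K-quants also quantize the per-block scale and minimum and group blocks into super-blocks, which this sketch omits. All function names here are hypothetical.

```python
import math

BLOCK_SIZE = 32


def quantize_block(block, bits=4):
    """Quantize one block to `bits`-bit codes with its own scale and minimum."""
    levels = (1 << bits) - 1          # 15 for 4-bit codes
    lo, hi = min(block), max(block)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in block]
    return codes, scale, lo


def dequantize_block(codes, scale, lo):
    """Reconstruct weights with w = q * block_scale + block_min."""
    return [q * scale + lo for q in codes]


def quantize_blockwise(weights, block_size=BLOCK_SIZE):
    """Split the weights into fixed-size blocks and quantize each independently."""
    return [
        quantize_block(weights[i:i + block_size])
        for i in range(0, len(weights), block_size)
    ]


# Two blocks with different value ranges, so each gets a different scale.
weights = [math.sin(i) * (1 + i // BLOCK_SIZE) for i in range(64)]
blocks = quantize_blockwise(weights)

for index, (codes, scale, lo) in enumerate(blocks):
    start = index * BLOCK_SIZE
    restored = dequantize_block(codes, scale, lo)
    original = weights[start:start + BLOCK_SIZE]
    max_err = max(abs(a - b) for a, b in zip(original, restored))
    print(f"block {index}: scale={scale:.4f} min={lo:.4f} max_error={max_err:.4f}")
```

Because each block adapts its scale to its own range, a block of small weights is not forced to share a coarse step size with a block of large ones.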

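Challenges of Quantization

Accuracy loss: low precision can introduce errors in the model output, especially for extreme values or in sensitive tasks
Dynamic range fitting: how to choose a scale factor and zero point that minimize information loss
Differences in hardware support: different hardware may support different quantization formats (for example, whether int4 is supported)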
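
The dynamic range problem can be illustrated with a small experiment. The sketch below assumes symmetric int8 quantization and compares two ways of choosing the range: taking the maximum absolute value (which a single outlier can inflate) versus clipping at a smaller threshold. The data and the threshold of 1.0 are invented for this example, and `quant_dequant` and `mean_error` are hypothetical helpers.

```python
def quant_dequant(value, clip):
    """Symmetric int8 round trip with the representable range set to [-clip, clip]."""
    scale = clip / 127
    q = max(-127, min(127, round(value / scale)))
    return q * scale


def mean_error(values, clip):
    """Average absolute round-trip error over `values` for a given clipping range."""
    return sum(abs(v - quant_dequant(v, clip)) for v in values) / len(values)


# 1001 typical values spread evenly over [-1, 1], plus one extreme outlier.
bulk = [(i - 500) / 500 for i in range(1001)]
outlier = 100.0

# Range taken from the maximum absolute value, which the outlier dominates.
err_minmax = mean_error(bulk, clip=outlier)
# Range clipped to the typical values; the outlier saturates instead.
err_clipped = mean_error(bulk, clip=1.0)

print("mean error on typical values, max-abs range:", err_minmax)
print("mean error on typical values, clipped range:", err_clipped)
print("outlier after clipping:", quant_dequant(outlier, clip=1.0))
```

Covering the outlier spends almost all integer codes on values that rarely occur, so the typical values are represented very coarsely. Clipping keeps them precise but saturates the outlier, and which trade-off is acceptable depends on the task.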

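Application Scenarios

Mobile deployment: for example, image classification and speech recognition in phone apps
Edge computing: real-time inference on devices such as drones and smart cameras
Large-scale serving: lower server compute costs and faster response times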

Tool Support

TensorFlow: TensorFlow Lite, TensorFlow Model Optimization Toolkit
PyTorch: PyTorch Quantization (supports both QAT and PTQ)
ONNX: inference on quantized models is supported through ONNX Runtime

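Summary

Model quantization trades precision for efficiency, making deep learning models lighter and faster, and it is an indispensable optimization in real-world deployments. Choosing a quantization strategy (such as PTQ or QAT) requires weighing the task requirements, the hardware, and the tolerance for accuracy loss together.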
