https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overview_of_gguf_quantization_methods/
https://old.reddit.com/r/LocalLLaMA/comments/14gjz8h/i_have_multiple_doubts_about_kquant_models_and/
https://huggingface.co/docs/hub/gguf
The K stands for the k-quant quantization method (not the same thing as k-means clustering quants, even if the naming feels cryptic).
k-quant is not clustering; "k-means clustering" is the k-means clustering algorithm, a different thing.
In llama.cpp or GGUF (GGML's unified file format), "k-quant" refers to a quantization method; it is not directly related to k-means clustering. Both involve some form of grouping or simplification, but they are different techniques.
- k-quant (quantization) generally refers to a quantization method that reduces model parameters (e.g., weights) from high precision (such as floating point) to low precision (such as integers or fixed point), cutting memory and compute requirements. In the neural-network context, k-quant usually means vector quantization or low-bit quantization, where the model's weights are approximated by a small set of values or "buckets" (k distinct values).
- k-means clustering is a clustering algorithm that partitions data into k groups based on similarity. Although k-means also splits data into k groups, its goal is not quantization: quantization is about lowering numerical precision while preserving the key characteristics of the data.
In short, in the llama.cpp or GGUF context, k-quant is a quantization technique, not a clustering algorithm like k-means. They share the idea of grouping (buckets vs. clusters), but their goals and implementations differ significantly.
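To make the distinction concrete, here is a small numpy sketch (purely illustrative, not llama.cpp code): scale-based quantization snaps every weight onto a low-precision grid defined by a scale, while k-means picks k centroids and replaces each weight with its nearest centroid.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=16).astype(np.float32)   # toy "weights"

# Scale-based quantization (the k-quant family works roughly like this):
# map every weight onto a low-precision signed grid defined by one scale.
scale = np.abs(w).max() / 7.0                # symmetric 4-bit range: -7..7
q = np.clip(np.round(w / scale), -7, 7)      # integer codes
w_quant = q * scale                          # dequantized approximation

# k-means clustering (a different technique: grouping, not precision reduction):
# replace each weight with the centroid of the cluster it is assigned to.
k = 4
centroids = np.sort(rng.choice(w, size=k, replace=False))
for _ in range(20):                          # plain Lloyd iterations
    assign = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
    centroids = np.array([w[assign == j].mean() if np.any(assign == j) else centroids[j]
                          for j in range(k)])
w_kmeans = centroids[np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)]

print("quantization error:", np.abs(w - w_quant).mean())
print("k-means error     :", np.abs(w - w_kmeans).mean())
```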
GGUF's Q8_K is essentially the same idea as the original GGML Q8_0: parameters are stored quantized to 8 bits.
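As a rough sketch of what a Q8_0-style layout looks like, assuming blocks of 32 weights with one scale per block (a simplification of the actual llama.cpp structs):

```python
import numpy as np

BLOCK = 32  # Q8_0 groups weights into blocks of 32, with one scale per block

def q8_0_quantize(w: np.ndarray):
    """Toy Q8_0-style quantization: per-block scale + int8 codes."""
    w = w.reshape(-1, BLOCK)
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0     # symmetric 8-bit range
    scale = np.where(scale == 0, 1.0, scale)                  # avoid divide-by-zero
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float16)

def q8_0_dequantize(q, scale):
    return (q.astype(np.float32) * scale.astype(np.float32)).reshape(-1)

w = np.random.default_rng(1).normal(size=256).astype(np.float32)
q, s = q8_0_quantize(w)
w_hat = q8_0_dequantize(q, s)
print("max abs error:", np.abs(w - w_hat).max())
```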
Q4_K_M stores parameters quantized to 4 bits. New k-quant method: uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K.
Quoted tweet (reference):
https://x.com/rohanpaul_ai/status/1782371166021144995?mx=2
K-quants (models will be identified as "q3_K_S", "q3_K_L", and so on)
The k-quant system uses value representations of different bit widths depending on the chosen quant method. First, the model's weights are divided into blocks of 32, with each block having a scaling factor based on the largest weight value (i.e., the weight with the largest magnitude in that block).
Depending on the selected quant method, the most important weights are quantized to a higher-precision data type, while the rest are assigned a lower-precision type. For example, the q2_K quant method converts the largest weights to 4-bit integers and the remaining weights to 2-bit. In contrast, the q5_0 and q8_0 quant methods convert all weights to 5-bit and 8-bit integer representations, respectively.
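The bit width is the main knob here. A toy experiment (reusing the per-block scale idea above, not the real k-quant code) shows how the reconstruction error grows as the grid gets coarser:

```python
import numpy as np

def quantize_block(w, bits, block=32):
    """Symmetric per-block quantization to a given bit width (toy version)."""
    qmax = 2 ** (bits - 1) - 1
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return (q * scale).reshape(-1)

w = np.random.default_rng(2).normal(size=4096).astype(np.float32)
for bits in (2, 3, 4, 5, 8):
    err = np.abs(w - quantize_block(w, bits)).mean()
    print(f"{bits}-bit mean abs error: {err:.5f}")
```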
Examples of models named with the k-quant method:
q2_K => Uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors.
q3_K_S => Uses GGML_TYPE_Q3_K for all tensors
q3_K_M => Uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K
q3_K_L => Uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K
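One way to read this table is as a per-tensor lookup: the _S/_M/_L variants only differ in which tensors get bumped to a higher-precision type. A hypothetical mapping function (tensor names and logic are illustrative, not llama.cpp's actual implementation):

```python
# Illustrative only: roughly mirrors the q3_K_S / q3_K_M / q3_K_L table above.
BUMPED_TENSORS = ("attention.wv", "attention.wo", "feed_forward.w2")

def quant_type_for(tensor_name: str, variant: str) -> str:
    """Pick a GGML quant type for a tensor under a q3_K_* variant (sketch)."""
    if variant == "q3_K_S":
        return "GGML_TYPE_Q3_K"                        # everything stays 3-bit
    if any(part in tensor_name for part in BUMPED_TENSORS):
        # _M bumps the sensitive tensors to 4-bit, _L bumps them to 5-bit
        return "GGML_TYPE_Q4_K" if variant == "q3_K_M" else "GGML_TYPE_Q5_K"
    return "GGML_TYPE_Q3_K"

print(quant_type_for("layers.0.attention.wv.weight", "q3_K_M"))  # GGML_TYPE_Q4_K
print(quant_type_for("layers.0.attention.wq.weight", "q3_K_L"))  # GGML_TYPE_Q3_K
```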
- attention.wv (Value Weight Matrix): In the transformer's attention mechanism, input embeddings are transformed into three different vectors: key (K), query (Q), and value (V) vectors. attention.wv refers to the weights that linearly transform input embeddings into value vectors. The attention scores, computed from the query and key vectors, are then used to take a weighted combination of these value vectors, which determines how much each part of the input contributes to the output.
- attention.wo (Output Weight Matrix): After the attention scores have been computed and used to build a weighted combination of the value vectors, the result is passed through the final linear transformation represented by attention.wo. This transformation is applied to the aggregated output of the attention mechanism before it is handed to subsequent layers or further processed within the same layer. Essentially, attention.wo shapes the attention output so it is suitable for the model's next steps.
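A minimal single-head attention sketch (illustrative shapes and names) shows where these two matrices sit in the computation:

```python
import numpy as np

def single_head_attention(x, w_q, w_k, w_v, w_o):
    """x: (seq_len, d_model). w_q/w_k/w_v project the inputs; w_o projects the result."""
    q = x @ w_q                      # queries
    k = x @ w_k                      # keys
    v = x @ w_v                      # values  <- the attention.wv projection
    scores = q @ k.T / np.sqrt(k.shape[-1])
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax over keys
    context = attn @ v               # weighted combination of value vectors
    return context @ w_o             # attention.wo maps it back to the model dimension

d = 8
rng = np.random.default_rng(3)
x = rng.normal(size=(5, d))
out = single_head_attention(x, *(rng.normal(size=(d, d)) for _ in range(4)))
print(out.shape)  # (5, 8)
```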
▶️ The bitsandbytes library quantizes on the fly (to 8-bit or 4-bit), which is also known as dynamic quantization.
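With the transformers integration, this typically looks something like the snippet below (the model id is a placeholder; check the bitsandbytes/transformers docs for the current options):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize on the fly while loading: weights are stored in 4-bit (NF4 here)
# and dequantized layer by layer during the forward pass.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",        # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
```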
▶️ And there’s some other formats like AWQ: Activation-aware Weight Quantization - which is a quantization method similar to GPTQ. It protects salient weights by observing activations rather than the weights themselves. AWQ achieves excellent quantization performance, especially for instruction-tuned LMs and multi-modal LMs.
There are several differences between AWQ and GPTQ as methods but the most important one is that AWQ assumes that not all weights are equally important for an LLM’s performance. For AWQ, best to use the vLLM package
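A minimal vLLM sketch for serving an AWQ-quantized checkpoint (the model id is a placeholder):

```python
from vllm import LLM, SamplingParams

# Load an AWQ-quantized checkpoint; vLLM picks the quantized kernels for inference.
llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ", quantization="awq")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain k-quants in one paragraph."], params)
print(outputs[0].outputs[0].text)
```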