Quantization


Quantization: Basic Theory

  1. The core idea is to compress the model's parameters.

  2. Pruning compresses a model by removing connections.

  3. Knowledge Distillation compresses a model by using the original (teacher) model to train a smaller (student) model.

  4. The basic principle of quantization is to store model parameters in a smaller data type, i.e., at lower precision (see the short sketch after this list).

  5. Floating-point formats are expressed by bit width as FP32, FP16, and BF16; integer formats as int32, int16, and int8.
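
The following minimal sketch (an addition, not code from the lesson) shows how much storage the same one-million-parameter tensor takes in different formats; the tensor itself is illustrative only.

import torch

# a hypothetical tensor of one million parameters
n = 1_000_000
params_fp32 = torch.rand(n, dtype=torch.float32)

# storage per element: 4 bytes (fp32), 2 bytes (fp16/bf16), 1 byte (int8)
for t in [params_fp32,
          params_fp32.to(torch.float16),
          params_fp32.to(torch.bfloat16),
          params_fp32.to(torch.int8)]:   # naive cast, only to show storage size
    print(t.dtype, t.element_size() * t.numel(), "bytes")
> torch.float32 4000000 bytes
torch.float16 2000000 bytes
torch.bfloat16 2000000 bytes
torch.int8 1000000 bytes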

PyTorch: Data Types and Sizes (Hands-on)

Let's look at the common data types used to store the parameters of machine-learning models.

import torch

Integers

# Information of `8-bit unsigned integer`
torch.iinfo(torch.uint8)
> iinfo(min=0, max=255, dtype=uint8)
# Information of `8-bit (signed) integer`
torch.iinfo(torch.int8)
> iinfo(min=-128, max=127, dtype=int8)
# Information of `16-bit (signed) integer`
torch.iinfo(torch.int16)
> iinfo(min=-32768, max=32767, dtype=int16)
# Information of `32-bit (signed) integer`
torch.iinfo(torch.int32)
> iinfo(min=-2.14748e+09, max=2.14748e+09, dtype=int32)
# Information of `64-bit (signed) integer`
torch.iinfo(torch.int64)
> iinfo(min=-9.22337e+18, max=9.22337e+18, dtype=int64)

Floating Points

# by default, python stores float data in fp64
value = 1/3
format(value, '.60f')
> '0.333333333333333314829616256247390992939472198486328125000000'
# 64-bit floating point
tensor_fp64 = torch.tensor(value, dtype = torch.float64)
print(f"fp64 tensor: {format(tensor_fp64.item(), '.60f')}")
> fp64 tensor: 0.333333333333333314829616256247390992939472198486328125000000
tensor_fp32 = torch.tensor(value, dtype = torch.float32)
tensor_fp16 = torch.tensor(value, dtype = torch.float16)
tensor_bf16 = torch.tensor(value, dtype = torch.bfloat16)
print(f"fp64 tensor: {format(tensor_fp64.item(), '.60f')}")
print(f"fp32 tensor: {format(tensor_fp32.item(), '.60f')}")
print(f"fp16 tensor: {format(tensor_fp16.item(), '.60f')}")
print(f"bf16 tensor: {format(tensor_bf16.item(), '.60f')}")
> fp64 tensor: 0.333333333333333314829616256247390992939472198486328125000000
fp32 tensor: 0.333333343267440795898437500000000000000000000000000000000000
fp16 tensor: 0.333251953125000000000000000000000000000000000000000000000000
bf16 tensor: 0.333984375000000000000000000000000000000000000000000000000000
# Information of `16-bit brain floating point`
torch.finfo(torch.bfloat16)
> finfo(resolution=0.01, min=-3.38953e+38, max=3.38953e+38, eps=0.0078125, smallest_normal=1.17549e-38, tiny=1.17549e-38, dtype=bfloat16)
# Information of `16-bit floating point`
torch.finfo(torch.float16)
> finfo(resolution=0.001, min=-65504, max=65504, eps=0.000976562, smallest_normal=6.10352e-05, tiny=6.10352e-05, dtype=float16)
# Information of `32-bit floating point`
torch.finfo(torch.float32)
> finfo(resolution=1e-06, min=-3.40282e+38, max=3.40282e+38, eps=1.19209e-07, smallest_normal=1.17549e-38, tiny=1.17549e-38, dtype=float32)
# Information of `64-bit floating point`
torch.finfo(torch.float64)
> finfo(resolution=1e-15, min=-1.79769e+308, max=1.79769e+308, eps=2.22045e-16, smallest_normal=2.22507e-308, tiny=2.22507e-308, dtype=float64)

Downcasting

# random pytorch tensor: float32, size=1000
tensor_fp32 = torch.rand(1000, dtype = torch.float32)
# first 5 elements of the random tensor
tensor_fp32[:5]
> tensor([0.6183, 0.6456, 0.0934, 0.3034, 0.5854])
# downcast the tensor to bfloat16 using the "to" method
tensor_fp32_to_bf16 = tensor_fp32.to(dtype = torch.bfloat16)
tensor_fp32_to_bf16[:5]
> tensor([0.6172, 0.6445, 0.0933, 0.3027, 0.5859], dtype=torch.bfloat16)
# tensor_fp32 x tensor_fp32
m_float32 = torch.dot(tensor_fp32, tensor_fp32)
m_float32
> tensor(344.0361)
# tensor_fp32_to_bf16 x tensor_fp32_to_bf16
m_bfloat16 = torch.dot(tensor_fp32_to_bf16, tensor_fp32_to_bf16)
m_bfloat16
> tensor(344., dtype=torch.bfloat16)

ML Models with Different Data Types

Load an ML model with different data types.

import torch
import torch.nn as nn
import requests
from PIL import Image

import warnings
# Ignore specific UserWarnings related to max_length in transformers
warnings.filterwarnings("ignore", 
    message=".*Using the model-agnostic default `max_length`.*")
class DummyModel(nn.Module):
  """
  A dummy model that consists of an embedding layer
  with two blocks of a linear layer followed by a layer
  norm layer.
  """
  def __init__(self):
    super().__init__()

    torch.manual_seed(123)

    self.token_embedding = nn.Embedding(2, 2)

    # Block 1
    self.linear_1 = nn.Linear(2, 2)
    self.layernorm_1 = nn.LayerNorm(2)

    # Block 2
    self.linear_2 = nn.Linear(2, 2)
    self.layernorm_2 = nn.LayerNorm(2)

    self.head = nn.Linear(2, 2)

  def forward(self, x):
    hidden_states = self.token_embedding(x)

    # Block 1
    hidden_states = self.linear_1(hidden_states)
    hidden_states = self.layernorm_1(hidden_states)

    # Block 2
    hidden_states = self.linear_2(hidden_states)
    hidden_states = self.layernorm_2(hidden_states)

    logits = self.head(hidden_states)
    return logits
model = DummyModel()
model
> DummyModel(
  (token_embedding): Embedding(2, 2)
  (linear_1): Linear(in_features=2, out_features=2, bias=True)
  (layernorm_1): LayerNorm((2,), eps=1e-05, elementwise_affine=True)
  (linear_2): Linear(in_features=2, out_features=2, bias=True)
  (layernorm_2): LayerNorm((2,), eps=1e-05, elementwise_affine=True)
  (head): Linear(in_features=2, out_features=2, bias=True)
)

Create a function that inspects the data type of each parameter in the model.

def print_param_dtype(model):
    for name, param in model.named_parameters():
        print(f"{name} is loaded in {param.dtype}")
print_param_dtype(model)
> token_embedding.weight is loaded in torch.float32
linear_1.weight is loaded in torch.float32
linear_1.bias is loaded in torch.float32
layernorm_1.weight is loaded in torch.float32
layernorm_1.bias is loaded in torch.float32
linear_2.weight is loaded in torch.float32
linear_2.bias is loaded in torch.float32
layernorm_2.weight is loaded in torch.float32
layernorm_2.bias is loaded in torch.float32
head.weight is loaded in torch.float32
head.bias is loaded in torch.float32

Model Casting: float16

Cast the model into a different precision.

# float 16
model_fp16 = DummyModel().half()

Inspect the data types of the parameters.

print_param_dtype(model_fp16)
> token_embedding.weight is loaded in torch.float16
linear_1.weight is loaded in torch.float16
linear_1.bias is loaded in torch.float16
layernorm_1.weight is loaded in torch.float16
layernorm_1.bias is loaded in torch.float16
linear_2.weight is loaded in torch.float16
linear_2.bias is loaded in torch.float16
layernorm_2.weight is loaded in torch.float16
layernorm_2.bias is loaded in torch.float16
head.weight is loaded in torch.float16
head.bias is loaded in torch.float16
model_fp16
> DummyModel(
  (token_embedding): Embedding(2, 2)
  (linear_1): Linear(in_features=2, out_features=2, bias=True)
  (layernorm_1): LayerNorm((2,), eps=1e-05, elementwise_affine=True)
  (linear_2): Linear(in_features=2, out_features=2, bias=True)
  (layernorm_2): LayerNorm((2,), eps=1e-05, elementwise_affine=True)
  (head): Linear(in_features=2, out_features=2, bias=True)
)

Run a simple inference with the model.

import torch

dummy_input = torch.LongTensor([[1, 0], [0, 1]])

# inference using float32 model
logits_fp32 = model(dummy_input)

logits_fp32
> tensor([[[-0.6872,  0.7132],
         [-0.6872,  0.7132]],

        [[-0.6872,  0.7132],
         [-0.6872,  0.7132]]], grad_fn=<ViewBackward0>)
# inference using float16 model
try:
    logits_fp16 = model_fp16(dummy_input)
    
except Exception as error:
    print("\033[91m", type(error).__name__, ": ", error, "\033[0m")
> RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'
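
The error occurs because this PyTorch build's CPU kernels do not implement LayerNorm for float16. A minimal workaround sketch (an addition, assuming a CUDA GPU is available, where half-precision kernels are implemented):

# run the float16 model on GPU, where half-precision LayerNorm exists
if torch.cuda.is_available():
    model_fp16 = model_fp16.to("cuda")
    logits_fp16 = model_fp16(dummy_input.to("cuda"))
    print(logits_fp16.dtype)   # torch.float16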

Model Casting: bfloat16

from copy import deepcopy

model_bf16 = deepcopy(model)
model_bf16 = model_bf16.to(torch.bfloat16)
print_param_dtype(model_bf16)
logits_bf16 = model_bf16(dummy_input)
> token_embedding.weight is loaded in torch.bfloat16
linear_1.weight is loaded in torch.bfloat16
linear_1.bias is loaded in torch.bfloat16
layernorm_1.weight is loaded in torch.bfloat16
layernorm_1.bias is loaded in torch.bfloat16
linear_2.weight is loaded in torch.bfloat16
linear_2.bias is loaded in torch.bfloat16
layernorm_2.weight is loaded in torch.bfloat16
layernorm_2.bias is loaded in torch.bfloat16
head.weight is loaded in torch.bfloat16
head.bias is loaded in torch.bfloat16

Now let's compare the difference between logits_fp32 and logits_bf16.

mean_diff = torch.abs(logits_bf16 - logits_fp32).mean().item()
max_diff = torch.abs(logits_bf16 - logits_fp32).max().item()

print(f"Mean diff: {mean_diff} | Max diff: {max_diff}")
> Mean diff: 0.000997886061668396 | Max diff: 0.0016907453536987305

LLM Models in Different Data Types

from transformers import BlipForConditionalGeneration

model_name = "Salesforce/blip-image-captioning-base"
model = BlipForConditionalGeneration.from_pretrained(model_name)
# inspect the default data types of the model

# print_param_dtype(model)

Check the model's memory footprint.

fp32_mem_footprint = model.get_memory_footprint()

print("Footprint of the fp32 model in bytes: ",
      fp32_mem_footprint)
print("Footprint of the fp32 model in MBs: ", 
      fp32_mem_footprint/1e+6)
> Footprint of the fp32 model in bytes:  989660400
Footprint of the fp32 model in MBs:  989.6604

Load the same model in bfloat16.

model_bf16 = BlipForConditionalGeneration.from_pretrained(
                                               model_name,
                               torch_dtype=torch.bfloat16
)
bf16_mem_footprint = model_bf16.get_memory_footprint()
# Get the relative difference
relative_diff = bf16_mem_footprint / fp32_mem_footprint

print("Footprint of the bf16 model in MBs: ", 
      bf16_mem_footprint/1e+6)
print(f"Relative diff: {relative_diff}")
> Footprint of the bf16 model in MBs:  494.832248
Relative diff: 0.5000020693967345

Model Performance: float32 vs bfloat16

Now let's compare the generation results of the two models.

def get_generation(model, processor, image, dtype):
  inputs = processor(image, return_tensors="pt").to(dtype)
  out = model.generate(**inputs)
  return processor.decode(out[0], skip_special_tokens=True)


def load_image(img_url):
    image = Image.open(requests.get(
        img_url, stream=True).raw).convert('RGB')

    return image
from transformers import BlipProcessor

processor = BlipProcessor.from_pretrained(model_name)
from IPython.display import display

img_url = 'https://storage.googleapis.com/\
sfr-vision-language-research/BLIP/demo.jpg'

image = load_image(img_url)
display(image.resize((500, 350)))
results_fp32 = get_generation(model, 
                              processor, 
                              image, 
                              torch.float32)
print("fp32 Model Results:\n", results_fp32)
> fp32 Model Results:
 a woman sitting on the beach with her dog
results_bf16 = get_generation(model_bf16, 
                              processor, 
                              image, 
                              torch.bfloat16)
print("bf16 Model Results:\n", results_bf16)
> bf16 Model Results:
 a woman sitting on the beach with a dog

Default Data Type

  • In the Hugging Face Transformers library, the default data type used to load a model is float32.

  • You can set the "default data type" to whatever you want.

desired_dtype = torch.bfloat16
torch.set_default_dtype(desired_dtype)
dummy_model_bf16 = DummyModel()
print_param_dtype(dummy_model_bf16)
> token_embedding.weight is loaded in torch.bfloat16
linear_1.weight is loaded in torch.bfloat16
linear_1.bias is loaded in torch.bfloat16
layernorm_1.weight is loaded in torch.bfloat16
layernorm_1.bias is loaded in torch.bfloat16
linear_2.weight is loaded in torch.bfloat16
linear_2.bias is loaded in torch.bfloat16
layernorm_2.weight is loaded in torch.bfloat16
layernorm_2.bias is loaded in torch.bfloat16
head.weight is loaded in torch.bfloat16
head.bias is loaded in torch.bfloat16

Similarly, you can reset the default data type back to float32. Note that this does not change models that were already created: dummy_model_bf16 stays in bfloat16.

torch.set_default_dtype(torch.float32)
print_param_dtype(dummy_model_bf16)
> token_embedding.weight is loaded in torch.bfloat16
linear_1.weight is loaded in torch.bfloat16
linear_1.bias is loaded in torch.bfloat16
layernorm_1.weight is loaded in torch.bfloat16
layernorm_1.bias is loaded in torch.bfloat16
linear_2.weight is loaded in torch.bfloat16
linear_2.bias is loaded in torch.bfloat16
layernorm_2.weight is loaded in torch.bfloat16
layernorm_2.bias is loaded in torch.bfloat16
head.weight is loaded in torch.bfloat16
head.bias is loaded in torch.bfloat16

Main LLM Quantization Methods:

  1. Linear Quantization (see the sketch after this list)

  2. LLM.INT8 (8-bit only)

  3. QLoRA (4-bit only)

  4. HQQ (down to 2-bit)
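
As a hedged illustration of the first method (an addition, not code from the lesson), here is a minimal per-tensor affine (linear) quantization of a float32 tensor to int8 and back:

import torch

# per-tensor affine (linear) quantization: x is approximately (q - zero_point) * scale
x = torch.randn(5)

qmin, qmax = -128, 127                        # int8 range
scale = (x.max() - x.min()) / (qmax - qmin)   # map the float range onto 256 levels
zero_point = qmin - torch.round(x.min() / scale)

q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.int8)
x_hat = (q.to(torch.float32) - zero_point) * scale   # dequantize

print("quantization error:", (x - x_hat).abs().max().item())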

Types of Fine-tuning with Quantization

Fine-tuning a Quantized Model

  • Accuracy is preserved under quantization

  • The model can be fine-tuned for a specific use case or application at the same time

QAT: Fine-tune with Quantization-Aware Training

  • Fine-tune the model so that its quantized version performs optimally.

  • It is not compatible with Post-Training Quantization (PTQ) techniques.

  • Linear quantization is an example of PTQ.

PEFT (Parameter-Efficient Fine-Tuning)

  • Drastically reduces the number of trainable parameters while maintaining performance on par with full fine-tuning.

  • A representative combination is PEFT + QLoRA: https://pytorch.org/blog/finetune-llms/

The QLoRA Method

  1. QLoRA quantizes the pretrained base weights (blue in the figure) to 4-bit precision.

  2. Their precision is matched to that of the Low-Rank Adapter (LoRA) weights (orange in the figure) during computation.

  3. The model can then add the activations of the pretrained weights (blue) and the adapter weights (orange).

  4. The sum of these two activations is fed as input to the next layer of the network.

-> With QLoRA you get more efficient fine-tuning and a smaller model! (A configuration sketch follows below.)
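
A hedged sketch of how PEFT + QLoRA is typically wired up with the Hugging Face transformers, bitsandbytes, and peft libraries (an illustrative addition; the model id and LoRA hyperparameters below are placeholder choices, not values from this book):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 1) load the pretrained base weights quantized to 4-bit (NF4)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute dtype matched to the LoRA weights
)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",            # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)

# 2) attach small trainable LoRA adapters on top of the frozen 4-bit base
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],     # placeholder target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()           # only the adapter weights are trainable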

Quantization Theory (Hands-on)

Let's take a closer look at linear quantization.

#%pip install sentencepiece
#%pip install quanto==0.0.11
import torch

def named_module_tensors(module, recurse=False):
    # Yield (name, tensor) pairs; quantized (quanto) parameters expose their
    # packed `_data` and `_scale` tensors, which are reported separately.
    for named_parameter in module.named_parameters(recurse=recurse):
      name, val = named_parameter
      if hasattr(val,"_data") or hasattr(val,"_scale"):
        if hasattr(val,"_data"):
          yield name + "._data", val._data
        if hasattr(val,"_scale"):
          yield name + "._scale", val._scale
      else:
        yield named_parameter

    for named_buffer in module.named_buffers(recurse=recurse):
      yield named_buffer

def dtype_byte_size(dtype):
    """
    Returns the size (in bytes) occupied by one parameter of type `dtype`.
    """
    import re
    if dtype == torch.bool:
        return 1 / 8
    bit_search = re.search(r"[^\d](\d+)$", str(dtype))
    if bit_search is None:
        raise ValueError(f"`dtype` is not a valid dtype: {dtype}.")
    bit_size = int(bit_search.groups()[0])
    return bit_size // 8

def compute_module_sizes(model):
    """
    Compute the size of each submodule of a given model.
    """
    from collections import defaultdict
    module_sizes = defaultdict(int)
    for name, tensor in named_module_tensors(model, recurse=True):
      size = tensor.numel() * dtype_byte_size(tensor.dtype)
      name_parts = name.split(".")
      for idx in range(len(name_parts) + 1):
        module_sizes[".".join(name_parts[:idx])] += size

    return module_sizes

Without Quantization

model_name = "google/flan-t5-small"
import sentencepiece as spm  # sentencepiece must be installed for the T5 tokenizer
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
input_text = "Hello, my name is "
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
module_sizes = compute_module_sizes(model)

print(f"The model size is {module_sizes[''] * 1e-9} GB")
> The model size is 0.307844608 GB

Quantize the model (8-bit precision)

from quanto import quantize, freeze
import torch

quantize(model, weights=torch.int8, activations=None)
print(model)

Freeze the model

freeze(model)
module_sizes = compute_module_sizes(model)
print(f"The model size is {module_sizes[''] * 1e-9} GB")
> The model size is 0.12682868 GB

Try running inference on the quantized model

input_text = "Hello, my name is "
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

The differences between linear quantization and downcasting can be summarized as follows:

  • When you downcast a model, you convert its parameters to a more compact data type (bfloat16). During inference the model performs its computations in that data type, and its activations are in that data type as well. Downcasting works reasonably well with bfloat16, but performance can degrade with smaller data types, and it may not work at all when converting to an integer data type (such as the int8 used in this lesson).

  • With linear quantization, the quantized model stays much closer to the original model's performance, because the compressed data type is converted back to the original FP32 data type during inference. So when the model makes a prediction, the matrix multiplications and the activations are computed in FP32. This is what allows the model to be quantized to data types smaller than bfloat16, such as the int8 used in this example. The sketch below contrasts the two compute paths.
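
A minimal sketch of that contrast (an illustrative addition; the weight and input vectors are made up):

import torch

w_fp32 = torch.rand(1000)    # a toy weight vector
x = torch.rand(1000)         # a toy input vector

# Downcasting: weights AND the computation stay in the smaller dtype
w_bf16 = w_fp32.to(torch.bfloat16)
y_downcast = torch.dot(w_bf16, x.to(torch.bfloat16))      # computed in bfloat16

# Linear quantization: weights are stored as int8, but are dequantized
# back to float32 before the computation
scale = w_fp32.abs().max() / 127
w_int8 = torch.clamp(torch.round(w_fp32 / scale), -128, 127).to(torch.int8)
y_quant = torch.dot(w_int8.to(torch.float32) * scale, x)  # computed in float32

print(y_downcast.dtype, y_quant.dtype)
> torch.bfloat16 torch.float32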

[Figure: LoRA, QLoRA, and QA-LoRA: efficient adaptability in large language models through low-rank matrix factorization]