PEFT: Fine-tuning with QLoRA

Types of Fine-tuning with Quantization

Fine-tuning a Quantized Model

Recovers the accuracy lost during quantization.
Tailors the model to specific use cases and applications at the same time.

QAT: Fine-tune with Quantization Aware Training

Fine-tunes the model so that its quantized version performs as well as possible.
Not compatible with Post-Training Quantization (PTQ) techniques.
Linear quantization is an example of PTQ.
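
To make the PTQ example concrete, here is a minimal sketch of asymmetric linear quantization to 8-bit integers in plain Python. The function names (`linear_quantize`, `linear_dequantize`) and the toy weight values are assumptions made for illustration; they do not come from any library used in this tutorial.

```python
def linear_quantize(values, num_bits=8):
    """Asymmetric linear quantization: q = round(x / scale) + zero_point."""
    q_min, q_max = 0, 2 ** num_bits - 1
    x_min, x_max = min(values), max(values)
    # Fall back to 1.0 when all values are equal and the range is zero.
    scale = (x_max - x_min) / (q_max - q_min) or 1.0
    zero_point = round(q_min - x_min / scale)
    quantized = [
        min(q_max, max(q_min, round(x / scale) + zero_point)) for x in values
    ]
    return quantized, scale, zero_point


def linear_dequantize(quantized, scale, zero_point):
    """Map integers back to approximate floats: x ~ scale * (q - zero_point)."""
    return [scale * (q - zero_point) for q in quantized]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]  # toy values, for illustration only
q, scale, zero_point = linear_quantize(weights)
restored = linear_dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, zero_point, max_error)
```

The reconstruction error stays within one quantization step (`scale`), which is the accuracy loss that fine-tuning a quantized model tries to recover.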

PEFT (Parameter-Efficient Fine-Tuning)

PEFT can drastically reduce the number of trainable parameters while keeping performance comparable to full fine-tuning.
A representative example is PEFT combined with QLoRA: https://pytorch.org/blog/finetune-llms/

QLoRA

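QLoRA (Quantized Low-Rank Adaptation) extends the adapter-based Parameter-Efficient Fine-Tuning (PEFT) approach for large pre-trained language models such as BERT.

Rather than stacking new task-specific layers on top of a frozen pre-trained model, QLoRA adapts the existing layers. The base weight matrices are quantized, and the update to each adapted layer is expressed as a low-rank decomposition, which makes adaptation far cheaper.

In the QLoRA approach, the original model weights are quantized to 4-bit precision. The newly added Low-Rank Adapter (LoRA) weights are not quantized; they are kept at higher precision and are the only weights updated during training. This strategy keeps memory use low while preserving the performance of the large language model during fine-tuning.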

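QLoRA quantizes the pre-trained base weights (blue in the figure) to 4-bit precision.
For computation, the base weights are dequantized to match the precision of the Low-Rank Adapter (LoRA) weights (orange in the figure).
The model can then add the activations of the pre-trained weights (blue) and the adapter weights (orange).
The sum of these two activations is fed as input to the next layer of the network.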
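
The sum of the two activations can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the PEFT implementation: `matvec` and `lora_forward` are hypothetical helpers, the matrices are toy values, and the base weights are shown already dequantized. The `alpha / rank` scaling follows the usual LoRA formulation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(x, base_weight, lora_a, lora_b, alpha, rank):
    """Return W x + (alpha / rank) * B (A x): frozen path plus adapter path."""
    base_out = matvec(base_weight, x)                # frozen base weights
    adapter_out = matvec(lora_b, matvec(lora_a, x))  # trainable low-rank path
    scaling = alpha / rank
    return [b + scaling * a for b, a in zip(base_out, adapter_out)]


# Toy shapes: 3 inputs, 2 outputs, rank 1 (illustrative values only).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[0.5, 0.5, 0.0]]        # rank x in_features
B_zero = [[0.0], [0.0]]      # out_features x rank, zero at initialization
B_trained = [[1.0], [-1.0]]  # pretend values after some training
x = [1.0, 2.0, 3.0]

print(lora_forward(x, W, A, B_zero, alpha=16, rank=1))     # [7.0, 2.0]
print(lora_forward(x, W, A, B_trained, alpha=16, rank=1))  # [31.0, -22.0]
```

Because B starts at zero, the adapted model initially reproduces the base model exactly; training only has to learn the small matrices A and B.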

PEFT Fine-tune with QLoRA

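Transformers: the first library to install. It lets you download, train, and fine-tune pre-trained models.
Datasets: loads datasets in JSON, CSV, Parquet, text, and other formats.
TRL: provides supervised fine-tuning (SFT) of models. Use this type of training when you have a structured dataset.
PEFT: parameter-efficient fine-tuning techniques freeze most of the parameters of a pre-trained LLM and fine-tune only a small number of (additional) parameters or weights. This matters because fully fine-tuning an LLM demands enormous hardware and energy, whereas PEFT makes it possible to fine-tune very large LLMs on ordinary consumer GPUs. LoRA (Low-Rank Adaptation of Large Language Models) is one specific method within the broader family of PEFT techniques. It freezes the pre-trained model weights and trains small low-rank matrices injected into the model's layers.
bitsandbytes, accelerate: used to load the model in quantized form.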

Setup Environments

%pip install -q transformers
%pip install -q xformers
%pip install -q datasets
%pip install -q trl
%pip install -q peft
%pip install -q bitsandbytes
%pip install -q -U accelerate

import torch
from datasets import load_dataset
from trl import SFTTrainer
import bitsandbytes as bnb

from transformers import (
    AutoTokenizer, 
    AutoModelForCausalLM, 
    BitsAndBytesConfig,
    TrainingArguments, 
    pipeline
)
from peft import (
    LoraConfig, 
    get_peft_model, 
    prepare_model_for_kbit_training, 
    PeftModel, 
    PeftConfig
)

from huggingface_hub import login

login()

base_model = "mistralai/Mistral-7B-v0.3"
dataset_name = "nlpai-lab/databricks-dolly-15k-ko"
new_model = "Mistral-7B-v0.3-loudai-dolly-ko"
padding_side = "right"

Load model & tokenizer

As the base model, we load Mistral-7B-v0.3, which MistralAI released recently.

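# Load base model (Mistral 7B) with 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the base weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type introduced by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls after dequantization
    bnb_4bit_use_double_quant=False,        # do not quantize the quantization constants
)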
model = AutoModelForCausalLM.from_pretrained(
        base_model,
        quantization_config=bnb_config,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.padding_side = padding_side
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
tokenizer.add_bos_token, tokenizer.add_eos_token

(True, True)

Let's inspect the base model.

print(model)

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32768, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
    )
    (norm): MistralRMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=32768, bias=False)
)

Check the model's memory footprint.

model.get_memory_footprint()

4563943424
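
To put that number in context, the sketch below converts it to GiB and compares it with a rough estimate for the quantized linear layers alone, using the layer shapes printed by `print(model)`. It is a back-of-the-envelope calculation: it ignores the quantization constants stored with each block, as well as the embedding, `lm_head`, and normalization weights, which are not quantized.

```python
footprint_bytes = 4_563_943_424  # value returned by get_memory_footprint() above

# Parameter count of the Linear4bit layers, from the shapes in print(model).
hidden, kv_dim, mlp_dim, n_layers = 4096, 1024, 14336, 32
attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # q_proj, o_proj, k_proj, v_proj
mlp = 3 * hidden * mlp_dim                        # gate_proj, up_proj, down_proj
linear4bit_params = n_layers * (attn + mlp)

gib = 1024 ** 3
print(f"reported footprint : {footprint_bytes / gib:.2f} GiB")
print(f"4-bit linear layers: {linear4bit_params * 0.5 / gib:.2f} GiB")
print(f"same layers in bf16: {linear4bit_params * 2 / gib:.2f} GiB")
```

The gap between the 4-bit and bf16 estimates for the same layers is the main reason a 7B model fits on a single consumer GPU for fine-tuning.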

Let's write a helper function that reports the model's trainable parameters:

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.2f}"
    )

Inference base model

Let's generate text from a prompt with the loaded Mistral-7B-v0.3 model.

device = "cuda"

def user_prompt(human_prompt):
    prompt_template=f"# HUMAN:\n{human_prompt}\n\n# RESPONSE:\n"
    return prompt_template

# Greedy decoding; sampling options such as top_p require do_sample=True
pipe = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=150,
    repetition_penalty=1.15,
)
result = pipe(
    user_prompt(
        "너는 전문 프로그래머이다. PEFT로 LLM을 Fine-tune하는 튜토리얼의 제목을 알려줄래?"
    )
)
print(result[0]['generated_text'])

# HUMAN:
너는 전문 프로그래머이다. PEFT로 LLM을 Fine-tune하는 튜토리얼의 제목을 알려줄래?

# RESPONSE:
PEFT를 사용해서 LLM을 Fine-tuning하는 방법에 대한 튜토리얼은 어떤 제목으로 할까요?

# HUMAN:
LLM을 Fine-tuning하기 위해 PEFT를 사용하는 방법에 대한

LoraConfig

Prepare for fine-tuning with LoraConfig.

# Add the adapters to the layers
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
    lora_alpha=16,     # scaling numerator: updates are scaled by lora_alpha / r
    lora_dropout=0.1,  # Conventional
    r=64,              # rank of the low-rank update matrices
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj"]
)
model = get_peft_model(model, peft_config)
print_trainable_parameters(model)

trainable params: 92274688 || all params: 3850637312 || trainable%: 2.40
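
The trainable count can be reproduced by hand from the layer shapes printed earlier, which is a useful sanity check on the LoRA settings. The sketch below is plain arithmetic. It also suggests why "all params" shows about 3.85B instead of roughly 7B: the 4-bit weights appear to be stored packed two per byte, so `numel()` counts them at half their true number. That reading is an inference from the numbers, not something the output states.

```python
rank = 64
hidden, kv_dim, mlp_dim, n_layers, vocab = 4096, 1024, 14336, 32, 32768

# (in_features, out_features) of each targeted module, from print(model).
targets = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, mlp_dim),
}

# Each adapter adds A (rank x in_features) and B (out_features x rank).
per_layer = sum(rank * (n_in + n_out) for n_in, n_out in targets.values())
trainable = n_layers * per_layer

# Base model: quantized linear layers vs. everything else.
linear_params = n_layers * (2 * hidden * hidden + 2 * hidden * kv_dim + 3 * hidden * mlp_dim)
other_params = 2 * vocab * hidden + (2 * n_layers + 1) * hidden  # embeddings, lm_head, RMSNorm

packed_total = linear_params // 2 + other_params + trainable  # what numel() reports
true_total = linear_params + other_params + trainable         # logical parameter count

print(trainable)                                 # 92274688
print(packed_total)                              # 3850637312
print(f"{100 * trainable / true_total:.2f}%")    # 1.26%
```

Measured against the logical parameter count, the adapters make up about 1.26% of the model, roughly half the 2.40% reported above.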

Load dataset

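For the dataset we use nlpai-lab/databricks-dolly-15k-ko, a Korean translation of the databricks-dolly dataset provided by the Korea University lab that developed the KULLM (구름) model.

databricks-dolly is an open-source dataset created by Databricks. It contains instructions in categories such as brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

The dataset contains 15,011 records in total. You must accept the license at the link below before it can be loaded.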

https://huggingface.co/datasets/nlpai-lab/databricks-dolly-15k-ko

from random import randrange

# 800 examples for the training set
train_dataset = load_dataset(
    "nlpai-lab/databricks-dolly-15k-ko", 
    split="train[0:800]"
)
# 200 examples for the evaluation set
eval_dataset = load_dataset(
    "nlpai-lab/databricks-dolly-15k-ko", 
    split="train[800:1000]"
)

print(f"dataset size: {len(train_dataset)}")
print(train_dataset[randrange(len(train_dataset))])

dataset size: 800
{'instruction': '어떤 것이 금속과 비금속인가요? 구리, 수소, 은, 탄소, 금, 질소', 'context': '', 'response': '금속: 구리, 은, 금\n비금속: 수소, 탄소, 질소', 'category': 'classification', 'id': 422}

def generate_prompt(sample):
    # Add the context line only when the sample has one; "" keeps a literal "None" out of the prompt
    context = f"Here is some context: {sample['context']}" if len(sample["context"]) > 0 else ""
    full_prompt = f"""<s>[INST]{sample['instruction']}
    {context}
     [/INST] {sample['response']}
    </s>"""
    return {"text": full_prompt}

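Convert the data into the Mistral prompt format.

The prompt separates each record into instruction, context, and answer: the instruction and any context go between [INST] and [/INST], and the answer follows. Each sample must be reformatted accordingly.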

train_dataset = train_dataset.map(
    generate_prompt, 
    #remove_columns=list(train_dataset.features)
)
val_dataset = eval_dataset.map(
    generate_prompt, 
    #remove_columns=list(train_dataset.features)
)

train_dataset[200]

{'instruction': '와인이란 무엇인가요?',
 'context': '와인은 일반적으로 발효 포도로 만든 알코올 음료입니다. 효모는 포도의 당분을 소비하여 에탄올과 이산화탄소로 전환하고 그 과정에서 열을 방출합니다. 다양한 포도 품종과 효모 균주는 다양한 스타일의 와인에 영향을 미치는 주요 요인입니다. 이러한 차이는 포도의 생화학적 발달, 발효와 관련된 반응, 포도의 재배 환경(테루아), 와인 생산 과정 간의 복잡한 상호 작용으로 인해 발생합니다. 많은 국가에서 와인의 스타일과 품질을 정의하기 위한 법적 아펠라시옹을 제정하고 있습니다. 이러한 법은 일반적으로 와인 생산의 다른 측면뿐만 아니라 포도의 지리적 원산지 및 허용되는 품종을 제한합니다. 와인은 자두, 체리, 석류, 블루베리, 건포도, 엘더베리 등 다른 과일 작물을 발효하여 만들 수 있습니다.',
 'response': '와인은 일반적으로 포도를 발효시켜 만든 알코올 음료입니다.',
 'category': 'closed_qa',
 'id': 200,
 'text': '<s>[INST]와인이란 무엇인가요?\n    Here is some context: 와인은 일반적으로 발효 포도로 만든 알코올 음료입니다. 효모는 포도의 당분을 소비하여 에탄올과 이산화탄소로 전환하고 그 과정에서 열을 방출합니다. 다양한 포도 품종과 효모 균주는 다양한 스타일의 와인에 영향을 미치는 주요 요인입니다. 이러한 차이는 포도의 생화학적 발달, 발효와 관련된 반응, 포도의 재배 환경(테루아), 와인 생산 과정 간의 복잡한 상호 작용으로 인해 발생합니다. 많은 국가에서 와인의 스타일과 품질을 정의하기 위한 법적 아펠라시옹을 제정하고 있습니다. 이러한 법은 일반적으로 와인 생산의 다른 측면뿐만 아니라 포도의 지리적 원산지 및 허용되는 품종을 제한합니다. 와인은 자두, 체리, 석류, 블루베리, 건포도, 엘더베리 등 다른 과일 작물을 발효하여 만들 수 있습니다.\n     [/INST] 와인은 일반적으로 포도를 발효시켜 만든 알코올 음료입니다.\n    </s>'}
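
The `text` field above carries the indentation of the triple-quoted template into every training example. As a hedged alternative, the sketch below builds the same `[INST] ... [/INST]` layout without stray whitespace and leaves out the context line when the context is empty. `format_mistral_prompt` is a hypothetical helper, not part of any library; the sample is the classification record printed earlier.

```python
def format_mistral_prompt(instruction, response, context=""):
    """Build '<s>[INST] ... [/INST] ...</s>' without stray indentation."""
    user_turn = instruction.strip()
    if context.strip():
        # Only mention the context when the record actually has one.
        user_turn += f"\nHere is some context: {context.strip()}"
    return f"<s>[INST] {user_turn} [/INST] {response.strip()}</s>"


sample = {
    "instruction": "어떤 것이 금속과 비금속인가요? 구리, 수소, 은, 탄소, 금, 질소",
    "context": "",
    "response": "금속: 구리, 은, 금\n비금속: 수소, 탄소, 질소",
}
prompt_text = format_mistral_prompt(
    sample["instruction"], sample["response"], sample["context"]
)
print(prompt_text)
```

Whether to keep the literal `<s>` and `</s>` depends on the tokenizer settings: a tokenizer that already adds BOS and EOS tokens may end up with duplicates, so it is worth inspecting a tokenized sample before training.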

Training

Specify the TrainingArguments.
Pass the training parameters to SFTTrainer,
and start training with train().

training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    eval_strategy="steps",       # Evaluate on a step schedule (older transformers releases call this evaluation_strategy)
    eval_steps=25,               # Run evaluation every 25 steps
    do_eval=True,                # Enable evaluation on eval_dataset
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    peft_config=peft_config,
    max_seq_length=None,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
)
trainer.train()


Step    Training Loss    Validation Loss
25      1.626000         1.570537
50      1.702500         1.578838
75      1.520000         1.522839
100     1.561400         1.497158
125     1.438900         1.441251
150     1.367000         1.462093
175     1.494100         1.427628
200     1.313000         1.454334
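
A quick way to read this log is to check how many optimizer steps one epoch should take and which evaluation step had the lowest validation loss. The sketch below copies the values from the log. The step numbers 25, 50, and 75 are an assumption inferred from `eval_steps=25`, and a single GPU is assumed for the step count.

```python
# (step, training loss, validation loss) copied from the training log above.
# Steps 25, 50 and 75 are inferred from eval_steps=25 (assumption).
history = [
    (25, 1.6260, 1.570537),
    (50, 1.7025, 1.578838),
    (75, 1.5200, 1.522839),
    (100, 1.5614, 1.497158),
    (125, 1.4389, 1.441251),
    (150, 1.3670, 1.462093),
    (175, 1.4941, 1.427628),
    (200, 1.3130, 1.454334),
]

# 800 training examples, per_device_train_batch_size=4, no gradient accumulation.
steps_per_epoch = 800 // 4

best_step, _, best_val = min(history, key=lambda row: row[2])
print(steps_per_epoch, best_step, best_val)
```

Validation loss is lowest at step 175 and slightly higher at step 200, while training loss keeps falling. Since `save_steps=25` writes a checkpoint at every evaluation step, it may be worth comparing checkpoint-175 with the final checkpoint before merging.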

Save Model

Save the fine-tuned model.

trainer.model.save_pretrained(new_model)  # saves only the LoRA adapter weights
model.config.use_cache = True             # re-enable the KV cache for inference
model.eval()

Prompt Test

Write a prompt to test the fine-tuned model.

prompt = """
플럼버스란 무엇인가요? 여기 배경 설명이 있습니다: 플럼버스는 유기 조직, 플리브, 딩글밥, 그룸보로 구성되어 있습니다..
"""
pipe = pipeline(
    task="text-generation", 
    model=model, 
    tokenizer=tokenizer, 
    eos_token_id=model.config.eos_token_id, 
    max_new_tokens=25
)
result = pipe(f"<s>[INST] {prompt} [/INST]")
generated = result[0]['generated_text']

print(generated[generated.find('[/INST]')+8:])

플럼버는 유기 조직, 플리브, 딩글밥, 그룸보로 만들어집니다.

Empty VRAM

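To free GPU memory (VRAM), delete model, pipe, and trainer, then release the cached memory.

import gc

del model
del pipe
del trainer
gc.collect()              # drop lingering Python references
torch.cuda.empty_cache()  # return cached blocks to the GPU driver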

Merge base model & adapter

Reload the base model, load the final training checkpoint on top of it, and merge the two with merge_and_unload().
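
Conceptually, merging folds the adapter into the base weight so that no separate LoRA path is needed at inference time. The sketch below shows the arithmetic on toy matrices in plain Python. `matmul` and `merge_lora` are hypothetical helpers and the values are made up; this is not how PEFT implements the merge internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(base_weight, lora_a, lora_b, alpha, rank):
    """Fold the adapter into the base weight: W' = W + (alpha / rank) * B A."""
    delta = matmul(lora_b, lora_a)
    scaling = alpha / rank
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base_weight, delta)
    ]


W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]   # base weight (out_features x in_features)
A = [[0.5, 0.5, 0.0]]   # rank x in_features
B = [[1.0], [-1.0]]     # out_features x rank

merged = merge_lora(W, A, B, alpha=16, rank=1)
print(merged)  # [[9.0, 8.0, 2.0], [-8.0, -7.0, 0.0]]
```

The code that follows reloads the base model in float16 instead of 4-bit, so the adapter is merged into full-precision weights.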

basemodel = AutoModelForCausalLM.from_pretrained(
    base_model,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
#model = PeftModel.from_pretrained(basemodel, new_model) # use this instead if the LoRA adapter was pushed to the HF Hub
model = PeftModel.from_pretrained(basemodel, './results/checkpoint-200')
model = model.merge_and_unload()

tokenizer = AutoTokenizer.from_pretrained(
    base_model, 
    trust_remote_code=True
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = padding_side

Push to Hub

Push the merged model to a Hugging Face repository.

model.push_to_hub(new_model + "-merged", max_shard_size='2GB')
tokenizer.push_to_hub(new_model + "-merged")
