PaliGemma: Open Vision LLM

PaliGemma: Open Vision Language Model

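PaliGemma, released by Google in May 2024, is a large multimodal model (LMM). It can be used for visual question answering (VQA), for detecting objects in an image, and for generating segmentation masks.

PaliGemma (GitHub) is a family of vision-language models whose architecture pairs SigLIP-So400m as the image encoder with Gemma-2B as the text decoder. SigLIP is a state-of-the-art model that can understand both images and text.

Like CLIP, SigLIP consists of an image encoder and a text encoder that are trained jointly. Like PaLI-3, the combined PaliGemma model is pre-trained on image-text data and can then be fine-tuned easily on downstream tasks such as captioning or referring segmentation.

Gemma is a decoder-only model for text generation. Connecting SigLIP's image encoder to Gemma through a linear adapter yields PaliGemma, a powerful vision-language model.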
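
To make the role of the linear adapter concrete, here is a minimal, dependency-free sketch. The sizes, the random values, and the helper names (rand_matrix, linear) are illustrative assumptions, not PaliGemma's real dimensions or code. It only shows image features being projected to the text embedding width and placed in front of the prompt embeddings, which is roughly what the decoder then consumes.

```python
import random

random.seed(0)

# Toy sizes chosen for readability; the real encoder and decoder are far wider.
VISION_DIM, TEXT_DIM = 6, 4
NUM_IMAGE_TOKENS, NUM_TEXT_TOKENS = 3, 2


def rand_matrix(rows, cols):
    """Hypothetical helper: a rows x cols matrix of random floats."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def linear(x, weight, bias):
    """Apply y = W x + b to every row of x; this is the whole 'adapter' here."""
    return [
        [sum(w * v for w, v in zip(w_row, row)) + b for w_row, b in zip(weight, bias)]
        for row in x
    ]


image_features = rand_matrix(NUM_IMAGE_TOKENS, VISION_DIM)  # stand-in for the image encoder output
text_embeddings = rand_matrix(NUM_TEXT_TOKENS, TEXT_DIM)    # stand-in for the prompt embeddings

adapter_weight = rand_matrix(TEXT_DIM, VISION_DIM)
adapter_bias = [0.0] * TEXT_DIM

# Project image features to the decoder width, then put them before the text.
projected = linear(image_features, adapter_weight, adapter_bias)
decoder_input = projected + text_embeddings  # list concatenation along the sequence axis

print(len(decoder_input), len(decoder_input[0]))  # 5 4
```

The point of the sketch is the shape change: every image token ends up with the same width as a text token, so the decoder can treat both as one sequence.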

Setup Environments

%pip install -q -U accelerate bitsandbytes

from huggingface_hub import notebook_login

notebook_login()

AutoProcessor Inference

You can run inference with any of the released models through the PaliGemmaForConditionalGeneration class.

Preprocess the prompt and the image with the bundled processor, then pass the preprocessed inputs to generate().

import torch
import numpy as np
from PIL import Image
import requests

from transformers import Trainer, AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

image_file = "http://www.gjnews.com/data/newsThumb/1668667541ADD_thumb580.jpg"

raw_image = Image.open(requests.get(image_file, stream=True).raw)
display(raw_image)

prompt = "몇마리의 개가 있어?"  # Korean for "How many dogs are there?"

inputs = processor(
    prompt, 
    raw_image, 
    return_tensors="pt"
)
output = model.generate(**inputs, max_new_tokens=20)

print(processor.decode(output[0], skip_special_tokens=True)[len(prompt):])

세마리 (English: "three")

4-bit Model

Inference also works with a 4-bit quantized model.

from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map={"":0}
)

prompt = "왼쪽부터 순서대로 개의 견종은 뭐야?"  # Korean for "From left to right, what breeds are the dogs?"

# The quantized model lives on the GPU, so move the inputs to the same device.
# Leaving them on the CPU triggers a device-mismatch warning in generate().
inputs = processor(
    prompt, 
    raw_image, 
    return_tensors="pt"
).to(model.device)
output = model.generate(**inputs, max_new_tokens=20)

print(processor.decode(output[0], skip_special_tokens=True)[len(prompt):])

개종은 래브라도, 테리어, 개바라도입니다. (English, roughly: "The breeds are Labrador, Terrier, and Gaebarado.")

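prompt = "개를 object detection과 segmentation으로 검출해줄래?"  # Korean for "Can you find the dogs with object detection and segmentation?"

# Keep the inputs on the same device as the quantized model.
inputs = processor(
    prompt, 
    raw_image, 
    return_tensors="pt"
).to(model.device)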
output = model.generate(**inputs, max_new_tokens=20)

print(processor.decode(output[0], skip_special_tokens=True)[len(prompt):])

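사진 속 개들은 오토바이, 트럭, 자동차, 헬멧을 (English, roughly: "The dogs in the photo ... motorcycle, truck, car, helmet"; the output is cut off at max_new_tokens=20)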
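
The free-form request above returns unrelated text rather than boxes or masks. To our understanding, the mix checkpoints respond to short task prefixes such as "detect dog" or "segment dog" and answer with location tokens of the form <locNNNN>. The sketch below assumes that output format: four location tokens per box in (y_min, x_min, y_max, x_max) order on a 0 to 1023 grid, followed by a label, with boxes separated by ";". The function name parse_detections and the sample string are hypothetical; check the format against real model output before relying on it.

```python
import re

# Assumed pattern: four <locNNNN> tokens, then a label that runs up to ';' or the next token.
_BOX = re.compile(r"((?:<loc\d{4}>){4})\s*([^;<]+)")


def parse_detections(text, width, height):
    """Convert assumed PaliGemma detection text into pixel boxes (x0, y0, x1, y1)."""
    boxes = []
    for locs, label in _BOX.findall(text):
        # Token order is assumed to be y_min, x_min, y_max, x_max on a 1024-bin grid.
        y0, x0, y1, x1 = (int(v) / 1024 for v in re.findall(r"<loc(\d{4})>", locs))
        boxes.append({
            "label": label.strip(),
            "box": (round(x0 * width), round(y0 * height),
                    round(x1 * width), round(y1 * height)),
        })
    return boxes


# Hand-written sample, not real model output.
sample = "<loc0256><loc0128><loc0768><loc0896> dog ; <loc0512><loc0000><loc0768><loc0512> dog"
for det in parse_detections(sample, width=640, height=480):
    print(det)
# {'label': 'dog', 'box': (80, 120, 560, 360)}
# {'label': 'dog', 'box': (0, 240, 320, 360)}
```

In the notebook this would be applied to the decoded text of a "detect dog" prompt, using raw_image.size for the width and height.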

Fine-tuning

Load Dataset

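In this example we fine-tune the model to answer questions about images, using the VQAv2 dataset. First we load the dataset. We only need the question, multiple_choice_answer, and image columns, so we remove the rest. We also split the data into training and validation sets.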

from datasets import load_dataset 

# The dataset repository ships a loading script, so `datasets` asks for an
# explicit opt-in via trust_remote_code=True.
ds = load_dataset(
    'HuggingFaceM4/VQAv2', 
    split="train",
    trust_remote_code=True
) 
# Everything except question, multiple_choice_answer, and image is dropped.
cols_remove = ["question_type", "answers", "answer_type", "image_id", "question_id"] 

ds = ds.remove_columns(cols_remove)
ds = ds.train_test_split(test_size=0.1)  # hold out 10% for validation

train_ds = ds["train"]
val_ds = ds["test"]

PaliGemmaProcessor

from transformers import PaliGemmaProcessor 

model_id = "google/paligemma-3b-pt-224"
processor = PaliGemmaProcessor.from_pretrained(model_id)

Tokenizer

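We build a prompt template that conditions PaliGemma to answer visual questions. Because the tokenizer pads the inputs, the padding positions in the labels, as well as the image-token positions, must be set to a value other than the tokenizer's pad token and the image token, so that they are ignored by the loss.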

import torch
device = "cuda"

image_token = processor.tokenizer.convert_tokens_to_ids("<image>")
def collate_fn(examples):
    texts = ["answer " + example["question"] for example in examples]  # prompt: task prefix + question
    labels = [example["multiple_choice_answer"] for example in examples]  # target: the multiple_choice_answer column
    images = [example["image"].convert("RGB") for example in examples]  # convert every image to RGB

    # Tokenize prompts, targets (suffix), and images in a single processor() call.
    tokens = processor(text=texts, images=images, suffix=labels,
                    return_tensors="pt", padding="longest")
    tokens = tokens.to(torch.bfloat16).to(device)  # cast float tensors to bfloat16, then move to the GPU
    return tokens
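
The label masking described before collate_fn can be sketched without any library. To our understanding, the processor already returns a labels tensor when suffix is passed, so this is only an illustration of the idea, not a replacement for it. The token ids, the build_labels name, and the prompt_len argument are hypothetical; -100 is the value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(input_ids, prompt_len, pad_id, image_id):
    """Copy input_ids into labels, masking everything the model should not be scored on."""
    labels = []
    for pos, tok in enumerate(input_ids):
        # Mask the prompt (image tokens + question) and any padding or image token.
        if pos < prompt_len or tok in (pad_id, image_id):
            labels.append(IGNORE_INDEX)
        else:
            labels.append(tok)
    return labels


# Made-up ids: 9 = image token, 0 = pad, 5/6/7 = question, 8 = answer, 1 = end of sequence.
input_ids = [9, 9, 5, 6, 7, 8, 1, 0, 0]
labels = build_labels(input_ids, prompt_len=5, pad_id=0, image_id=9)
print(labels)  # [-100, -100, -100, -100, -100, 8, 1, -100, -100]
```

Only the answer tokens and the end-of-sequence token keep their ids, so the loss is computed on the answer alone.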

Model Load

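You can load the model directly, or load a 4-bit model for QLoRA. After loading the model, we freeze the image encoder and the projector and fine-tune only the decoder.

If your images come from a specific domain that is not covered by the data the model was pre-trained on, you may want to skip freezing the image encoder.

Here we fine-tune with Option 1, the 16-bit model.

Option 1. 16-bit PaliGemma Load

Loading the 16-bit model: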

model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.bfloat16
).to(device)

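# Freeze the image encoder.
for param in model.vision_tower.parameters():
    param.requires_grad = False

# Freeze the projector as well, so that only the decoder is fine-tuned.
for param in model.multi_modal_projector.parameters():
    param.requires_grad = False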

Option 2. 4-bit QLoRA Model Load

Loading the 4-bit quantized model for QLoRA:

%pip install peft

from transformers import BitsAndBytesConfig
from peft import get_peft_model, LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    # The keyword is bnb_4bit_compute_dtype; bnb_4bit_compute_type is
    # reported as an unused kwarg and ignored.
    bnb_4bit_compute_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=8, 
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, 
    quantization_config=bnb_config, 
    device_map={"":0}
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()


trainable params: 11,298,816 || all params: 2,934,765,296 || trainable%: 0.3850
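
The trainable-parameter count printed above can be reproduced by hand. The layer shapes below are assumptions taken from our reading of the published Gemma-2B and SigLIP-So400m configurations, not from this notebook, and lora_params is a hypothetical helper. Under those assumptions the arithmetic suggests that the name-based target_modules also match the q_proj, k_proj, and v_proj layers inside the vision tower, not only the decoder.

```python
def lora_params(in_features, out_features, r):
    """LoRA adds A (r x in) and B (out x r) next to each wrapped linear layer."""
    return r * (in_features + out_features)


R = 8  # same rank as LoraConfig(r=8)

# Assumed Gemma-2B decoder: 18 layers, hidden size 2048, 8 query heads of size 256,
# one shared key/value head, MLP width 16384. Values are (in_features, out_features).
gemma_shapes = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 256),
    "v_proj": (2048, 256),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 16384),
    "up_proj": (2048, 16384),
    "down_proj": (16384, 2048),
}
gemma_total = 18 * sum(lora_params(i, o, R) for i, o in gemma_shapes.values())

# Assumed SigLIP-So400m vision tower: 27 layers, hidden size 1152, with attention
# projections that share the names q_proj, k_proj, and v_proj.
siglip_total = 27 * 3 * lora_params(1152, 1152, R)

total = gemma_total + siglip_total
print(gemma_total, siglip_total, total)  # 9805824 1492992 11298816
```

If only the decoder should receive adapters, the target_modules setting would need to be narrowed, for example with a pattern that excludes the vision tower.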

TrainingArguments Setup

Now we initialize the TrainingArguments for the Trainer. When fine-tuning with QLoRA, set the optimizer to paged_adamw_8bit.

from transformers import TrainingArguments, Trainer

args=TrainingArguments(
    num_train_epochs=2,
    remove_unused_columns=False,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    learning_rate=2e-5,
    weight_decay=1e-6,
    adam_beta2=0.999,
    logging_steps=100,
    # optim="paged_adamw_8bit",  # use "paged_adamw_8bit" for QLoRA fine-tuning
    optim="adamw_hf",  # use "adamw_hf" for regular fine-tuning
    save_strategy="steps",
    save_steps=1000,
    push_to_hub=True,
    save_total_limit=1,
    fp16=True,  # V100 GPUs do not support bf16
    # bf16=True,
    report_to=["tensorboard"],
    dataloader_pin_memory=False,
    output_dir='./paligemma'
)
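
To see how the batch settings above interact, the following sketch estimates the effective batch size and the number of optimizer steps. The dataset size of 64,000 is a made-up round number, and estimate_steps is a hypothetical helper. The Trainer's own count can differ by a step or so per epoch depending on rounding, so treat the result as an approximation.

```python
import math


def estimate_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Rough estimate of the effective batch size and total optimizer steps."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


# Values mirror the TrainingArguments above; the dataset size is hypothetical.
effective_batch, total_steps = estimate_steps(
    num_examples=64_000, per_device_batch=16, grad_accum=4, epochs=2
)
print(effective_batch, total_steps)  # 64 2000
```

With save_steps=1000 and logging_steps=100, a run of this length would write a checkpoint about twice and log about twenty times.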

Training

Initialize the Trainer with the model, the datasets, the data collator function, and the training arguments, then call train() to start training.

trainer = Trainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=collate_fn,
    args=args
)
trainer.train()