Converting gemma-2b to GGUF with llama.cpp
Quantizing an LLM to GGUF with llama.cpp
Most language models are too large to fine-tune on consumer hardware. Fine-tuning a 65-billion-parameter model, for example, requires more than 780 GB of GPU memory, the equivalent of ten A100 80GB GPUs.
Parameter-efficient techniques such as LoRA and QLoRA now make it far more practical to fine-tune models on consumer hardware.
LoRA adds a small number of trainable parameters, i.e. adapters, to each layer of the LLM and freezes all of the original parameters.
Because only the adapter weights need to be updated during fine-tuning, memory usage drops dramatically.
QLoRA goes three steps further, introducing 4-bit quantization, double quantization, and paging via NVIDIA unified memory:
4-bit NormalFloat quantization: guarantees an equal number of values in each quantization bin, avoiding computational problems and errors on outlier values.
Double quantization: quantizes the quantization constants themselves for additional memory savings.
Paging with unified memory: uses NVIDIA's unified memory feature to automatically handle page transfers between the CPU and GPU.
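The first two ideas map directly onto flags of bitsandbytes' BitsAndBytesConfig; here is a minimal sketch (the same config is constructed again in the fine-tuning code below):

import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4 bits
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat quantization
    bnb_4bit_use_double_quant=True,         # double-quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 while weights stay 4-bit
)

Paging is handled separately, by choosing a paged optimizer during training (optim="paged_adamw_8bit" in the trainer below).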
Basic steps involved in fine-tuning:
Load the base model
Train the base model
Save the LoRA adapter
Reload the base model in half/full precision
Merge the LoRA weights with the base model
Save the merged model and push it to the Hugging Face Hub
1. gemma-2b Fine-tuning
Setup Environment
%pip install -q -U bitsandbytes
%pip install -q -U peft
%pip install -q -U trl
%pip install -q -U accelerate
%pip install -q -U datasets
import os
os.environ["HF_TOKEN"] = 'Your_Huggingface_Key'
Import dependencies
To use the google/gemma model, you first need to request access by clicking Acknowledge License on the model's Hugging Face page; the model becomes available once the request is approved, which usually takes less than five minutes.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
# set the quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
#
#Load the model and Tokenizer
model_id = "google/gemma-2b"
#
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=bnb_config,
device_map={"":0}
)
tokenizer = AutoTokenizer.from_pretrained(
model_id,
add_eos_token=True
)
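Before moving on, it can be reassuring to confirm how much memory the 4-bit model actually occupies; get_memory_footprint is a standard transformers helper (a quick check, not part of the original flow):

# Expect roughly 2 GB for a 4-bit 2B-parameter model
print(f"{model.get_memory_footprint() / 1e9:.2f} GB")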
Load Dataset
We will fine-tune on medical-reasoning, a dataset of medical exam questions: https://huggingface.co/datasets/mamachang/medical-reasoning
from datasets import load_dataset
#
dataset = load_dataset("mamachang/medical-reasoning")
dataset
DatasetDict({
train: Dataset({
features: ['input', 'instruction', 'output'],
num_rows: 3702
})
})
The train split has input, instruction, and output columns. Let's convert it to a DataFrame and take a look.
df = dataset["train"].to_pandas()
df.head(10)
   input                                              instruction                                         output
0  Q:An 8-year-old boy is brought to the pediatri...  Please answer with one of the option in the br...  <analysis>\n\nThis is a clinical vignette desc...
1  Q:A 23-year-old man comes to the physician bec...  Please answer with one of the option in the br...  <analysis>\n\nThis is a clinical vignette desc...
2  Q:A 27-year-old man presents to the emergency ...  Please answer with one of the option in the br...  <analysis>\n\nThis is a question about a 27-ye...
3  Q:A 13-year-old girl presents with a 4-week hi...  Please answer with one of the option in the br...  <analysis>\n\nThis is a patient with signs and...
4  Q:A 53-year-old Asian woman comes to the physi...  Please answer with one of the option in the br...  <analysis>\n\nThis is a patient with symptoms ...
5  Q:A 7-year-old boy is brought to the physician...  Please answer with one of the option in the br...  <analysis>\n\nThis is a clinical vignette desc...
6  Q:A 21-year-old man comes to the military base...  Please answer with one of the option in the br...  <analysis>\n\nThis is a clinical case question...
7  Q:A 48-year-old woman presents to her primary ...  Please answer with one of the option in the br...  <analysis>\n\nThis is a question about determi...
8  Q:A 62-year-old man presents to the emergency ...  Please answer with one of the option in the br...  <analysis>\n\nThis is a patient with a history...
9  Q:A 34-year-old female presents to her primary...  Please answer with one of the option in the br...  <analysis>\n\nThis is a clinical vignette desc...
Generate prompt for training
def generate_prompt(data_point):
    """Generate input text based on a prompt, task instruction, (context info.), and answer.

    :param data_point: dict: data point
    :return: str: formatted prompt
    """
    # Generate prompt
    prefix_text = 'Below is an instruction that describes a task. Write a response that ' \
                  'appropriately completes the request.\n\n'
    # Samples with additional context info.
    if data_point['input']:
        text = f"""<start_of_turn>user {prefix_text} {data_point["instruction"]} here are the inputs {data_point["input"]} <end_of_turn>\n<start_of_turn>model{data_point["output"]} <end_of_turn>"""
    # Without context
    else:
        text = f"""<start_of_turn>user {prefix_text} {data_point["instruction"]} <end_of_turn>\n<start_of_turn>model{data_point["output"]} <end_of_turn>"""
    return text
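Before mapping the whole dataset, it helps to eyeball one rendered prompt; a quick inspection snippet, not part of the training pipeline itself:

# Preview the rendered template for the first training sample
print(generate_prompt(dataset["train"][0])[:400])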
# add the "prompt" column in the dataset
text_column = [generate_prompt(data_point) for data_point in dataset["train"]]
dataset = dataset["train"].add_column("prompt", text_column)
dataset
Dataset({
features: ['input', 'instruction', 'output', 'prompt'],
num_rows: 3702
})
Train/Test Split
dataset = dataset.shuffle(seed=1234) # Shuffle dataset here
dataset = dataset.map(
lambda samples: tokenizer(samples["prompt"]),
batched=True
)
dataset = dataset.train_test_split(test_size=0.1)
train_data = dataset["train"]
test_data = dataset["test"]
print(train_data)
print(test_data)
Dataset({
features: ['input', 'instruction', 'output', 'prompt', 'input_ids', 'attention_mask'],
num_rows: 3331
})
Dataset({
features: ['input', 'instruction', 'output', 'prompt', 'input_ids', 'attention_mask'],
num_rows: 371
})
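Since the trainer below caps sequences at max_seq_length=2500, it is worth checking the tokenized lengths first; a small inspection sketch:

# Sanity-check tokenized lengths against max_seq_length=2500
lengths = [len(ids) for ids in train_data["input_ids"]]
print(f"max: {max(lengths)}, mean: {sum(lengths) / len(lengths):.0f}")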
LoraConfig
We prepare the model for k-bit training with prepare_model_for_kbit_training, then attach LoRA adapters using PEFT's get_peft_model utility function.
import bitsandbytes as bnb

def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit  # if args.bits == 4 else (bnb.nn.Linear8bitLt if args.bits == 8 else torch.nn.Linear)
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if 'lm_head' in lora_module_names:  # needed for 16-bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)
modules = find_all_linear_names(model)
print(modules)
['k_proj', 'gate_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj', 'v_proj']
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
print(model)
lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    target_modules=modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x GemmaDecoderLayer(
        (self_attn): GemmaSdpaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (down_proj): Linear4bit(in_features=16384, out_features=2048, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): GemmaRMSNorm()
        (post_attention_layernorm): GemmaRMSNorm()
      )
    )
    (norm): GemmaRMSNorm()
  )
  (lm_head): Linear(in_features=2048, out_features=256000, bias=False)
)
trainable, total = model.get_nb_trainable_parameters()
print(f"Trainable: {trainable} | total: {total} | Percentage: {trainable/total*100:.4f}%")
Trainable: 78446592 | total: 2584619008 | Percentage: 3.0351%
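PEFT also ships a one-line helper that prints the same breakdown:

model.print_trainable_parameters()
# prints something like: trainable params: 78446592 || all params: 2584619008 || trainable%: 3.0351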
Training
Now let's train the model.
import transformers
from trl import SFTTrainer
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side='right'
torch.cuda.empty_cache()
trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=test_data,
    dataset_text_field="prompt",
    peft_config=lora_config,
    max_seq_length=2500,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=0.03,  # NOTE: warmup_steps expects an int; a fractional warmup is normally expressed as warmup_ratio=0.03
        max_steps=100,
        learning_rate=2e-4,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",  # paged optimizer: the "paging with unified memory" piece of QLoRA
        save_strategy="epoch",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
#
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
trainer.train()
[100/100 32:00, Epoch 0/1]

Step  Training Loss   Step  Training Loss   Step  Training Loss   Step  Training Loss
1     2.132100        26    1.090700        51    1.043800        76    1.030700
2     2.036100        27    1.101500        52    1.109800        77    0.983900
3     1.966600        28    1.024200        53    1.109200        78    1.056900
4     1.794400        29    1.115900        54    1.032400        79    1.015400
5     1.723900        30    1.055200        55    1.013100        80    1.035800
6     1.644200        31    1.031000        56    1.010800        81    0.983800
7     1.579900        32    1.038400        57    1.056000        82    0.996300
8     1.402000        33    1.071800        58    1.075000        83    1.069300
9     1.336500        34    1.060800        59    1.019000        84    1.058400
10    1.288700        35    1.073500        60    1.042600        85    1.031700
11    1.134200        36    1.013900        61    1.012100        86    1.039900
12    1.228800        37    1.053400        62    1.053700        87    1.086900
13    1.130100        38    1.062800        63    1.022000        88    1.067800
14    1.171200        39    1.060000        64    1.063300        89    1.021400
15    1.154600        40    1.067900        65    1.044900        90    1.022100
16    1.165900        41    1.004100        66    1.021100        91    0.983400
17    1.166300        42    1.036200        67    0.994300        92    1.072000
18    1.093700        43    1.118600        68    1.004900        93    1.030100
19    1.128000        44    1.054600        69    1.041000        94    1.041800
20    1.083000        45    1.040600        70    1.087700        95    0.944500
21    1.115100        46    0.987600        71    1.071200        96    1.009800
22    1.134300        47    1.075600        72    1.010600        97    1.016500
23    1.133400        48    1.050100        73    0.990200        98    1.043500
24    1.085800        49    1.108100        74    1.061600        99    1.043800
25    1.086600        50    1.057900        75    1.001700        100   1.011100

TrainOutput(global_step=100, training_loss=1.1205884909629822, metrics={'train_runtime': 1943.0659, 'train_samples_per_second': 0.823, 'train_steps_per_second': 0.051, 'total_flos': 1.1701757097148416e+16, 'train_loss': 1.1205884909629822, 'epoch': 0.4801920768307323})
Model Save & Hugging Face Push
Let's save the trained model and push it to a Hugging Face account.
#%pip install ipywidgets
from huggingface_hub import notebook_login
notebook_login()
new_model = "gemma-2b-loudai"
#
trainer.model.save_pretrained(new_model)
base_model = AutoModelForCausalLM.from_pretrained(
model_id,
low_cpu_mem_usage=True,
return_dict=True,
torch_dtype=torch.float16,
device_map={"": 0},
)
merged_model= PeftModel.from_pretrained(base_model, new_model)
merged_model= merged_model.merge_and_unload()
# Save the merged model
#save_adapter=True, save_config=True
merged_model.save_pretrained("merged_model",safe_serialization=True)
tokenizer.save_pretrained("merged_model")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
#
# Push the model and tokenizer to the Hugging Face Model Hub
merged_model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00, 1.83s/it]
model-00001-of-00002.safetensors: 100%|██████████| 4.95G/4.95G [02:55<00:00, 28.1MB/s]
model-00002-of-00002.safetensors: 100%|██████████| 67.1M/67.1M [00:05<00:00, 12.6MB/s]
Upload 2 LFS files: 100%|██████████| 2/2 [02:56<00:00, 88.12s/it]
tokenizer.json: 100%|██████████| 17.5M/17.5M [00:03<00:00, 5.28MB/s]
Test Fine-tuned model
def get_completion(query: str, model, tokenizer) -> str:
    device = "cuda:0"
    prompt_template = """
<start_of_turn>user
user
Below is an instruction that describes a task. Write a response that appropriately completes the request.
{query}
<end_of_turn>\n<start_of_turn>model
"""
    prompt = prompt_template.format(query=query)
    encodeds = tokenizer(
        prompt,
        return_tensors="pt",
        add_special_tokens=True
    )
    model_inputs = encodeds.to(device)
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=1000,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
    # decoded = tokenizer.batch_decode(generated_ids)
    decoded = tokenizer.decode(
        generated_ids[0],
        skip_special_tokens=True
    )
    return decoded
query = """\n\n 괄호 안의 옵션 중 하나를 선택하여 답하세요. 그 사이에 추론을 작성하세요.<analysis></analysis>. 중간에 답안 작성 <answer></answer>. 다음은 입력 내용입니다. Q: 8세 남아가 메스꺼움, 구토, 배뇨 횟수 감소 증상으로 어머니가 소아과 의사에게 데려왔습니다. 급성 림프모구 백혈병으로 5일 전에 1차 화학 요법을 받았습니다. 화학 요법을 시작하기 전 그의 백혈구 수는 60,000/mm3였습니다. 바이탈 사인은 맥박 110/분, 체온 37.0°C(98.6°F), 혈압 100/70mmHg입니다. 신체 검사 결과 양측 발바닥 부종이 있습니다. 다음 중 이 질환의 진단을 확인하는 데 도움이 되는 혈청 검사 및 소변 검사 결과는? ? \'A': '고칼륨혈증, 고인산혈증, 저칼슘혈증, 크레아틴키나아제(MM)가 매우 높음', 'B': '고칼륨혈증, 고인산혈증, 저칼슘혈증, 고요산혈증, 소변 상청색, 헴 양성', 'C': '소변 내 요산 결정, 고칼륨혈증, 고인산혈증, 유산증, 요산염 결정', 'D': '고요산혈증, 고칼륨혈증, 고인산혈증, 요산결정이 있음' 정답은? '고요산혈증, 고칼륨혈증, 고인산혈증 및 요로 단클론 스파이크', 'E': '고요산혈증, 고칼륨혈증, 고인산혈증, 젖산증 및 옥살산염 결정'.'}"""
result = get_completion(
query=query,
model=merged_model,
tokenizer=tokenizer
)
print(result)
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
<start_of_turn>user
user
Below is an instruction that describes a task. Write a response that appropriately completes the request.
Please answer with one of the option in the bracket. Write reasoning in between <analysis></analysis>. Write answer in between <answer></answer>. here are the inputs Q:An 8-year-old boy is brought to the pediatrician by his mother with nausea, vomiting, and decreased frequency of urination. He has acute lymphoblastic leukemia for which he received the 1st dose of chemotherapy 5 days ago. His leukocyte count was 60,000/mm3 before starting chemotherapy. The vital signs include: pulse 110/min, temperature 37.0°C (98.6°F), and blood pressure 100/70 mm Hg. The physical examination shows bilateral pedal edema. Which of the following serum studies and urinalysis findings will be helpful in confirming the diagnosis of this condition? ? {'A': 'Hyperkalemia, hyperphosphatemia, hypocalcemia, and extremely elevated creatine kinase (MM)', 'B': 'Hyperkalemia, hyperphosphatemia, hypocalcemia, hyperuricemia, urine supernatant pink, and positive for heme', 'C': 'Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and urate crystals in the urine', 'D': 'Hyperuricemia, hyperkalemia, hyperphosphatemia, and urinary monoclonal spike', 'E': 'Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and oxalate crystals'}
<end_of_turn>
<start_of_turn>model
<analysis>
<answer>
E: 'Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and oxalate crystals'.</answer>
<end_of_turn>
<start_of_turn>answer</start_of_turn>
<analysis>
<problem>
This is a question about a patient with clinical presentation consistent with sepsis, which would be characterized by:
<analysis></analysis>.
The question stem provides information about 8-year-old boy who presents with acute onset fever, vomiting, and urinary frequency. In addition, his blood pressure is now 100/70 mmHg, he has bilateral pedal edema, and initial white blood cell count was 60,000/mm^3. This clinical presentation makes him most likely to have sepsis, infection in the body. Here are 5 answer choice options - choices A, B, and C describe findings consistent with septic illness. Choice D indicates hypoglycemia. Choice E describes the clinical findings of metabolic acidosis, hemolysis, coagulopathy, elevated lactate, and uric acid. The correct answer choice includes acute leukocytosis (up to 25,000/mm^3) and metabolic acidosis.
<end_of_turn>
<end_of_turn>
<start_of_turn>model</start_of_turn>
<analysis>
<answer>
E: 'Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and oxalate crystals'.</answer>
<end_of_turn>
<start_of_turn>answer</start_of_turn>
<analysis>
<problem>
This is a question about a patient with clinical presentation consistent with sepsis, which would be characterized by:
<analysis></analysis>.
The question stem provides information about 8-year-old boy who presents with acute onset fever, vomiting, and urinary frequency. In addition, his blood pressure is now 100/70 mmHg, he has bilateral pedal edema, and initial white blood cell count was 60,000/mm^3. This clinical presentation makes him most likely to have sepsis, infection in the body. Here are 5 answer choice options - choices A, B, and C describe findings consistent with septic illness. Choice D indicates hypoglycemia. Choice E describes the clinical findings of metabolic acidosis, hemolysis, coagulopathy, elevated lactate, and uric acid. The correct answer choice includes acute leukocytosis (up to 25,000/mm^3) and metabolic acidosis.
<end_of_turn>
</analysis>
<end_of_turn>
<start_of_turn>model</start_of_turn>
<analysis>
<answer>
E: 'Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and oxalate crystals'.</answer>
<end_of_turn>
<start_of_turn>answer</start_of_turn>
<analysis>
<problem>
This is a question about a patient with clinical presentation consistent with sepsis, which would be characterized by:
<analysis></analysis>.
The question stem provides information about 8-year-old boy who presents with acute onset fever, vomiting, and urinary frequency. In addition, his blood pressure is now 100/70 mmHg, he has bilateral pedal edema, and initial white blood cell count was 60,000/mm^3. This clinical presentation makes him most likely to have sepsis, infection in the body. Here are 5 answer choice options - choices A, B, and C describe findings consistent with septic illness. Choice D indicates hypoglycemia. Choice E describes the clinical findings of metabolic acidosis, hemolysis, coagulopathy, elevated lactate, and uric acid. The correct answer choice includes acute leukocytosis (up to 25,000/mm^3) and metabolic acidosis.
<end_of_turn>
<end_of_turn>
<start_of_turn>model</start_of_turn>
<analysis>
<answer>
E: 'Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and oxalate crystals'.</answer>
<end_of_turn>
<start_of_turn>answer</start_of_turn>
<analysis>
<problem>
This is a question about a patient with clinical
print(f"Model Answer : \n {result.split('model')[-1]}")
Model Answer :
</start_of_turn>
<analysis>
<answer>
E: 'Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and oxalate crystals'.</answer>
<end_of_turn>
<start_of_turn>answer</start_of_turn>
<analysis>
<problem>
This is a question about a patient with clinical
query = """Please answer with one of the option in the bracket. Write reasoning in between <analysis></analysis>. Write answer in between <answer></answer>.here are the inputs:Q:A 34-year-old man presents to a clinic with complaints of abdominal discomfort and blood in the urine for 2 days. He has had similar abdominal discomfort during the past 5 years, although he does not remember passing blood in the urine. He has had hypertension for the past 2 years, for which he has been prescribed medication. There is no history of weight loss, skin rashes, joint pain, vomiting, change in bowel habits, and smoking. On physical examination, there are ballotable flank masses bilaterally. The bowel sounds are normal. Renal function tests are as follows:\nUrea 50 mg/dL\nCreatinine 1.4 mg/dL\nProtein Negative\nRBC Numerous\nThe patient underwent ultrasonography of the abdomen, which revealed enlarged kidneys and multiple anechoic cysts with well-defined walls. A CT scan confirmed the presence of multiple cysts in the kidneys. What is the most likely diagnosis?? \n{'A': 'Autosomal dominant polycystic kidney disease (ADPKD)', 'B': 'Autosomal recessive polycystic kidney disease (ARPKD)', 'C': 'Medullary cystic disease', 'D': 'Simple renal cysts', 'E': 'Acquired cystic kidney disease'}"""
result = get_completion(
query=query,
model=merged_model,
tokenizer=tokenizer
)
print(f"Model Answer : \n {result.split('model')[-1]}")
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
Model Answer :
<analysis>
This question describes a 34-year-old man with abdominal discomfort and blood in the urine, along with long standing hypertension. The examination shows renal masses and chronic renal failure on labs. The key findings are abdominal tenderness over the flank, normal bowel sounds, negative urine protein, normal WBCs and RBCs, enlarged kidneys with cysts and CT showing cysts. The cysts are homogenous and well-defined. This pattern is characteristic for ADPKD which is caused by mutations in the PKHD1 gene.
</analysis>
<answer>
A: Autosomal dominant polycystic kidney disease (ADPKD)
</answer>
<end_of_turn>
</start_of_turn>
</code>
and the expected answer is D: Simple renal cysts
this is an objective structured clinical examination (OSCE) that is testing the student on medical knowledge. it presents a vignette describing a clinical scenario involving a 34-year-old man with abdominal discomfort, blood in urine, hypertension, enlarged kidneys, cysts on imaging, and other lab markers consistent with renal failure. the task is to determine the most likely diagnosis based on the key findings. here, the answer is ADPKD because of the cysts confirmed on imaging. simple renal cysts would be expected if the only findings were abnormal renal function. ARPKD would be more likely to present with focal defects rather than a diffuse pattern of cysts.
please feel free to repost and improve with edits!
</code>
This is not an objective structured clinical examination (OSCE) and does not test student knowledge of differential diagnoses. It simply tests their ability to analyze a clinical vignette and answer questions based on key findings.
Based on the description in the vignette, the key findings are:
* 34 year old man with abdominal discomfort, blood in urine, hypertension, enlarged kidneys, cysts seen on imaging, and other lab abnormalities consistent with chronic kidney disease
* Chronic kidney disease with cysts
* No focal defects
* No focal renal diseases like ADPKD would present with focal defects on imaging
* ADPKD would confirm the diagnosis because cysts on imaging with associated genetic mutation
* Other answer choices like ARPKD that do not fit with chronic kidney disease findings are incorrect
This vignette is describing a 34 year-old man with abdominal pain, blood in urine, hypertension, enlarged kidneys, cysts, and other lab studies consistent with chronic kidney disease or end stage renal failure. Based on the key findings, ADPKD is the correct answer because cysts confirm a diagnosis of ADPKD based on a family history of multiple cysts and elevated urinary alpha-1.
Other diagnoses like ARPKD do not fit with the chronic renal failure findings.
Please feel free to edit and add to better explain the reasoning.
Thank you!
Inference
from peft import LoraConfig,PeftModel,AutoPeftModelForCausalLM
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
#set the LoRA configurations
peft_config = LoraConfig(
    r=64,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
#peft_model_id = "Plaban81/gemma-medical_qa-Finetune"
peft_model_id = "nowave/gemma-2b-loudai"
config = LoraConfig.from_pretrained(peft_model_id)  # from_pretrained is a classmethod; call it on the class rather than an instance
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    load_in_4bit=True,
    device_map="auto",
)
ptokenizer= AutoTokenizer.from_pretrained(peft_model_id)
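As an aside, AutoPeftModelForCausalLM (imported above but otherwise unused) can collapse the two-step load into a single call; this sketch assumes peft_model_id points at a repo that contains an adapter config rather than merged weights:

# One-step alternative: loads the base model and attaches the adapter automatically
model = AutoPeftModelForCausalLM.from_pretrained(
    peft_model_id,
    load_in_4bit=True,
    device_map="auto",
)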
def get_completion(query: str, model, tokenizer) -> str:
    device = "cuda:0"
    prompt_template = """
<start_of_turn>user
Below is an instruction that describes a task. Write a response that appropriately completes the request.
{query}
<end_of_turn>\n<start_of_turn>model
"""
    prompt = prompt_template.format(query=query)
    encodeds = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
    model_inputs = encodeds.to(device)
    generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
    # decoded = tokenizer.batch_decode(generated_ids)
    decoded = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    return decoded
query = """Please answer with one of the option in the bracket. Write reasoning in between <analysis></analysis>. Write answer in between <answer></answer>.here are the inputs:Q:A 34-year-old man presents to a clinic with complaints of abdominal discomfort and blood in the urine for 2 days. He has had similar abdominal discomfort during the past 5 years, although he does not remember passing blood in the urine. He has had hypertension for the past 2 years, for which he has been prescribed medication. There is no history of weight loss, skin rashes, joint pain, vomiting, change in bowel habits, and smoking. On physical examination, there are ballotable flank masses bilaterally. The bowel sounds are normal. Renal function tests are as follows:\nUrea 50 mg/dL\nCreatinine 1.4 mg/dL\nProtein Negative\nRBC Numerous\nThe patient underwent ultrasonography of the abdomen, which revealed enlarged kidneys and multiple anechoic cysts with well-defined walls. A CT scan confirmed the presence of multiple cysts in the kidneys. What is the most likely diagnosis?? \n{'A': 'Autosomal dominant polycystic kidney disease (ADPKD)', 'B': 'Autosomal recessive polycystic kidney disease (ARPKD)', 'C': 'Medullary cystic disease', 'D': 'Simple renal cysts', 'E': 'Acquired cystic kidney disease'}"""
result = get_completion(query=query, model=model, tokenizer=ptokenizer)
print(f"Model Answer : \n {result.split('model')[-1]}")
print(result)
> user
Below is an instruction that describes a task. Write a response that appropriately completes the request.
Please answer with one of the option in the bracket. Write reasoning in between <analysis></analysis>. Write answer in between <answer></answer>.here are the inputs:Q:A 34-year-old man presents to a clinic with complaints of abdominal discomfort and blood in the urine for 2 days. He has had similar abdominal discomfort during the past 5 years, although he does not remember passing blood in the urine. He has had hypertension for the past 2 years, for which he has been prescribed medication. There is no history of weight loss, skin rashes, joint pain, vomiting, change in bowel habits, and smoking. On physical examination, there are ballotable flank masses bilaterally. The bowel sounds are normal. Renal function tests are as follows:
Urea 50 mg/dL
Creatinine 1.4 mg/dL
Protein Negative
RBC Numerous
The patient underwent ultrasonography of the abdomen, which revealed enlarged kidneys and multiple anechoic cysts with well-defined walls. A CT scan confirmed the presence of multiple cysts in the kidneys. What is the most likely diagnosis??
{'A': 'Autosomal dominant polycystic kidney disease (ADPKD)', 'B': 'Autosomal recessive polycystic kidney disease (ARPKD)', 'C': 'Medullary cystic disease', 'D': 'Simple renal cysts', 'E': 'Acquired cystic kidney disease'}
> model
<Answer:A> The most likely diagnosis is **'Autosomal dominant polycystic kidney disease (ADPKD)'.**
<Analysis>:
In ADPKD, an abnormal gene mutation is responsible for the excessive growth of fluid-filled cysts in the kidneys. These cysts can be detected through various imaging techniques, including ultrasound, CT scan, and MRI. The presence of multiple renal cysts and enlarged kidneys is characteristic of ADPKD.
Now let's convert the model to 4-bit GGUF with llama.cpp and push it to the Hugging Face Hub.
2. Convert to GGUF format with llama.cpp
Setup Environment
import locale

# Work around a Colab/Jupyter locale quirk so the shell commands below decode as UTF-8
def getpreferredencoding(do_setlocale=True):
    return "UTF-8"

locale.getpreferredencoding = getpreferredencoding
!git clone https://github.com/ggerganov/llama.cpp
!mkdir ./quantized_model/
Model Download
from huggingface_hub import snapshot_download
# your huggingface hub model name
model_name = "nowave/gemma-2b-loudai"
methods = ['q4_k_m']
# original model path
base_model = "./original_model/"
# model save path
quantized_path = "./quantized_model/"
snapshot_download(repo_id=model_name, local_dir=base_model , local_dir_use_symlinks=False)
original_model = quantized_path+'/FP16.gguf'
Fetching 10 files: 100%|██████████| 10/10 [03:05<00:00, 18.60s/it]
Convert to GGUF
%pip install sentencepiece
!python llama.cpp/convert-hf-to-gguf.py ./original_model/ --outtype f16 --outfile ./quantized_model/FP16.gguf
Loading model: original_model
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
gguf: Setting special token type bos to 2
gguf: Setting special token type eos to 1
gguf: Setting special token type unk to 3
gguf: Setting special token type pad to 1
gguf: Setting add_bos_token to True
gguf: Setting add_eos_token to True
gguf: Setting chat_template to {{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + '
' + message['content'] | trim + '<end_of_turn>
' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model
'}}{% endif %}
Exporting model to 'quantized_model/FP16.gguf'
gguf: loading model part 'model-00001-of-00002.safetensors'
token_embd.weight, n_dims = 2, torch.float16 --> float32
blk.0.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.0.ffn_down.weight, n_dims = 2, torch.float16 --> float32
blk.0.ffn_gate.weight, n_dims = 2, torch.float16 --> float32
blk.0.ffn_up.weight, n_dims = 2, torch.float16 --> float32
blk.0.ffn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.0.attn_k.weight, n_dims = 2, torch.float16 --> float32
blk.0.attn_output.weight, n_dims = 2, torch.float16 --> float32
blk.0.attn_q.weight, n_dims = 2, torch.float16 --> float32
blk.0.attn_v.weight, n_dims = 2, torch.float16 --> float32
blk.1.attn_norm.weight, n_dims = 1, torch.float16 --> float32
...
(the same "torch.float16 --> float32" conversion lines repeat for every remaining tensor in blk.1 through blk.17)
gguf: loading model part 'model-00002-of-00002.safetensors'
blk.17.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.17.ffn_down.weight, n_dims = 2, torch.float16 --> float32
blk.17.ffn_norm.weight, n_dims = 1, torch.float16 --> float32
output_norm.weight, n_dims = 1, torch.float16 --> float32
Model successfully exported to 'quantized_model/FP16.gguf'
Quantize to 4-bit format
import os

for m in methods:
    qtype = f"{quantized_path}/{m.upper()}.gguf"
    os.system("./llama.cpp/quantize " + quantized_path + "/FP16.gguf " + qtype + " " + m)
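os.system silently ignores failures; a slightly more defensive variant with the same binary and arguments (an optional rewrite, not the original code):

import subprocess

for m in methods:
    qtype = f"{quantized_path}/{m.upper()}.gguf"
    subprocess.run(
        ["./llama.cpp/quantize", f"{quantized_path}/FP16.gguf", qtype, m],
        check=True,  # raise CalledProcessError if quantize exits non-zero
    )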
! ./llama.cpp/main -m ./quantized_model/Q4_K_M.gguf -n 90 --repeat_penalty 1.0 --color -i -r "User:" -f llama.cpp/prompts/chat-with-bob.txt
Log start
main: build = 2355 (e04e04f8)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1709783565
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA A100-SXM4-40GB, compute capability 8.0, VMM: yes
llama_model_loader: loaded meta data with 24 key-value pairs and 164 tensors from ./quantized_model/Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma
llama_model_loader: - kv 1: general.name str = original_model
llama_model_loader: - kv 2: gemma.context_length u32 = 8192
llama_model_loader: - kv 3: gemma.embedding_length u32 = 2048
llama_model_loader: - kv 4: gemma.block_count u32 = 18
llama_model_loader: - kv 5: gemma.feed_forward_length u32 = 16384
llama_model_loader: - kv 6: gemma.attention.head_count u32 = 8
llama_model_loader: - kv 7: gemma.attention.head_count_kv u32 = 1
llama_model_loader: - kv 8: gemma.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 9: gemma.attention.key_length u32 = 256
llama_model_loader: - kv 10: gemma.attention.value_length u32 = 256
llama_model_loader: - kv 11: general.file_type u32 = 15
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,256000] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = true
llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 37 tensors
llama_model_loader: - type q4_K: 108 tensors
llama_model_loader: - type q6_K: 19 tensors
llm_load_vocab: mismatch in special tokens definition ( 416/256000 vs 260/256000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = gemma
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 256000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 8
llm_load_print_meta: n_head_kv = 1
llm_load_print_meta: n_layer = 18
llm_load_print_meta: n_rot = 256
llm_load_print_meta: n_embd_head_k = 256
llm_load_print_meta: n_embd_head_v = 256
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 256
llm_load_print_meta: n_embd_v_gqa = 256
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 16384
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 2B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 2.51 B
llm_load_print_meta: model size = 1.51 GiB (5.18 BPW)
llm_load_print_meta: general.name = original_model
llm_load_print_meta: BOS token = 2 '<bos>'
llm_load_print_meta: EOS token = 1 '<eos>'
llm_load_print_meta: UNK token = 3 '<unk>'
llm_load_print_meta: PAD token = 1 '<eos>'
llm_load_print_meta: LF token = 227 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.06 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/19 layers to GPU
llm_load_tensors: CPU buffer size = 1548.98 MiB
........................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 9.00 MiB
llama_new_context_with_model: KV self size = 9.00 MiB, K (f16): 4.50 MiB, V (f16): 4.50 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 6.01 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 504.00 MiB
llama_new_context_with_model: graph splits (measure): 1
system_info: n_threads = 6 / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
main: interactive mode on.
Reverse prompt: 'User:'
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 512, n_predict = 90, n_keep = 1
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.
User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:How are you ?
Bob: I am doing well, thank you. And how may I assist you today?
llama_print_timings: load time = 414.63 ms
llama_print_timings: sample time = 5.44 ms / 19 runs ( 0.29 ms per token, 3490.72 tokens per second)
llama_print_timings: prompt eval time = 799.78 ms / 100 tokens ( 8.00 ms per token, 125.04 tokens per second)
llama_print_timings: eval time = 842.40 ms / 18 runs ( 46.80 ms per token, 21.37 tokens per second)
llama_print_timings: total time = 16391.73 ms / 118 tokens
Push Model to HF
from huggingface_hub import notebook_login
notebook_login()
from huggingface_hub import HfApi, HfFolder, create_repo, upload_file
model_path = "./quantized_model/Q4_K_M.gguf" # Your model's local path
repo_name = "gemma-2b-loudai-GGUF" # Desired HF Hub repository name
repo_url = create_repo(repo_name, private=False)
api = HfApi()
api.upload_file(
    path_or_fileobj=model_path,
    path_in_repo="Q4_K_M.gguf",
    repo_id="nowave/gemma-2b-loudai-GGUF",
    repo_type="model",
)
Q4_K_M.gguf: 100%|██████████| 1.63G/1.63G [01:15<00:00, 22.4MB/s]
CommitInfo(commit_url='https://huggingface.co/nowave/gemma-2b-loudai-GGUF/commit/811ba25102252c4ab1a5739ad5cc9d06a55a9b82', commit_message='Upload Q4_K_M.gguf with huggingface_hub', commit_description='', oid='811ba25102252c4ab1a5739ad5cc9d06a55a9b82', pr_url=None, pr_revision=None, pr_num=None)
Download the quantized model for inference
!wget "https://huggingface.co/nowave/gemma-2b-loudai-GGUF/resolve/main/Q4_K_M.gguf"
Install llama-cpp-python with GPU support
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
GGUF model inference with llama-cpp-python
from llama_cpp import Llama

# Set n_gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
llm = Llama(
    model_path="/content/Q4_K_M.gguf",  # Download the model file first
    n_ctx=32768,  # The max sequence length to use; note that gemma-2b was trained with an 8192-token context, and longer sequences require much more memory
    n_threads=1,  # The number of CPU threads to use, tailor to your system and the resulting performance
    n_gpu_layers=-1,  # The number of layers to offload to GPU, if you have GPU acceleration available (-1 = all layers)
)
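llama-cpp-python also exposes an OpenAI-style chat API that applies the chat template embedded in the GGUF automatically; a minimal sketch (the message content is only an example):

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize tumor lysis syndrome in two sentences."}],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])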
query = """Please answer with one of the option in the bracket. Write reasoning in between <analysis></analysis>. Write answer in between <answer></answer>. here are the inputs Q:An 8-year-old boy is brought to the pediatrician by his mother with nausea, vomiting, and decreased frequency of urination. He has acute lymphoblastic leukemia for which he received the 1st dose of chemotherapy 5 days ago. His leukocyte count was 60,000/mm3 before starting chemotherapy. The vital signs include: pulse 110/min, temperature 37.0°C (98.6°F), and blood pressure 100/70 mm Hg. The physical examination shows bilateral pedal edema. Which of the following serum studies and urinalysis findings will be helpful in confirming the diagnosis of this condition? ? \n{'A': 'Hyperkalemia, hyperphosphatemia, hypocalcemia, and extremely elevated creatine kinase (MM)', 'B': 'Hyperkalemia, hyperphosphatemia, hypocalcemia, hyperuricemia, urine supernatant pink, and positive for heme', 'C': 'Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and urate crystals in the urine', 'D': 'Hyperuricemia, hyperkalemia, hyperphosphatemia, and urinary monoclonal spike', 'E': 'Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and oxalate crystals'}"""
output = llm(
prompt=query,
max_tokens=512, # Generate up to 512 tokens
)
output
query = """\n\n Please answer with one of the option in the bracket. Write reasoning in between <analysis></analysis>. Write answer in between <answer></answer>. here are the inputs Q:An 8-year-old boy is brought to the pediatrician by his mother with nausea, vomiting, and decreased frequency of urination. He has acute lymphoblastic leukemia for which he received the 1st dose of chemotherapy 5 days ago. His leukocyte count was 60,000/mm3 before starting chemotherapy. The vital signs include: pulse 110/min, temperature 37.0°C (98.6°F), and blood pressure 100/70 mm Hg. The physical examination shows bilateral pedal edema. Which of the following serum studies and urinalysis findings will be helpful in confirming the diagnosis of this condition? ? \n{'A': 'Hyperkalemia, hyperphosphatemia, hypocalcemia, and extremely elevated creatine kinase (MM)', 'B': 'Hyperkalemia, hyperphosphatemia, hypocalcemia, hyperuricemia, urine supernatant pink, and positive for heme', 'C': 'Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and urate crystals in the urine', 'D': 'Hyperuricemia, hyperkalemia, hyperphosphatemia, and urinary monoclonal spike', 'E': 'Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and oxalate crystals'}"""
output = llm(
prompt=query,
max_tokens=512, # Generate up to 512 tokens
)
output
Extracting the answer
print(output["choices"][0]["text"].split("<end_of_turn>\n<end_of_turn>model")[-1])
<analysis>
This question asks us to diagnose a patient with acute lymphoblastic leukemia (ALL) based on the history and physical examination findings. The key findings are:
- An 8-year-old boy
- Diagnosed with acute lymphoblastic leukemia
- Received the 1st dose of chemotherapy 5 days ago
- Leukocyte count of 60,000/mm3
- Vital signs include tachycardia, edema, and hyperuricemia.
The differential diagnosis includes:
- Uric aciduria due to hyperuricemia and elevated creatine kinase (CK)
- Uric acid crystals in the urine due to hyperuricemia and elevated creatine kinase (CK)
- Malic aciduria due to hyperuricemia and elevated creatine kinase (CK)
The key studies are:
- Serum studies should include hyperkalemia, hyperphosphatemia, hypocalcemia, and elevated CK.
- Urinalysis should include a positive heme test.
Based on these studies, the most likely diagnosis is uric aciduria due to hyperuricemia together with elevated CK due to acute lymphoblastic leukemia. Uric acid crystals in the urine confirm the diagnosis.
</analysis>
<answer>
E: Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and oxalate crystals
</answer> <end_of_turn>
Reasoning:
This question asks for additional studies to confirm a diagnosis of uric aciduria due to hyperuricemia and elevated CK due to acute lymphoblastic leukemia. The requested studies are the ones best suited to confirm uric aciduria caused by hyperuricemia and elevated CK. A positive heme test and oxalate crystals in the urine confirm the diagnosis.
</analysis> <start_of_turn>
<answer>
E: Hyperuricemia, hyperkalemia, hyperphosphatemia, lactic acidosis, and oxalate crystals.
</answer> <end_of_turn>
reason:
This question asks for additional studies to confirm a diagnosis of uric aciduria due to hyperuricemia and elevated CK due to acute lymphoblastic leukemia. The requested studies are the ones best suited to confirm uric aciduria caused by hyperuricemia and elevated CK. A positive heme test and oxalate crystals in the urine confirm the diagnosis.
As a quick aside, recent llama.cpp releases can also serve a GGUF model straight from the Hugging Face Hub:
Step 1: brew install llama.cpp
Step 2: llama-server --hf-repo microsoft/Phi-3-mini-4k-instruct-gguf --hf-file Phi-3-mini-4k-instruct-q4.gguf
Step 3: curl http://localhost:8080/v1/chat/completions
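For completeness, a request-body sketch for the server's OpenAI-compatible endpoint (the message content is a placeholder):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}'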