Fine-tuning

LLM Fine-tuning 개요

LLM Fine-tuning은 데이터 세트에 대해 모델 자체를 업데이트하여 다양한 방식으로 모델을 개선하는 것을 의미합니다. 여기에는 출력 품질 개선, 오차 감소, 더 많은 데이터의 총체적 기억, 지연 시간/비용 감소 등이 포함될 수 있습니다.

Fine-tuning 핵심은 모델 자체를 학습시키지 않고 추론 모드에서 모델을 사용하는 상황에 맞는 learning/retrieval agmentation을 중심으로 이루어집니다.

Fine-tuning은 외부 데이터로 모델을 'agument'하는 데에도 사용할 수 있지만, 미세 조정은 다양한 방식으로 검색 증강을 보완할 수 있습니다:

Embedding Finetuning의 이점

임베딩 모델을 미세 조정하면 데이터의 학습 분포에 대해 보다 의미 있는 임베딩 표현을 할 수 있습니다 --> retrieval의 성능 향상으로 이어집니다.

LLM Finetuning의 이점

주어진 데이터 세트에 대한 스타일 학습 허용
학습 데이터에서 덜 표현될 수 있는 DSL(예: SQL)을 학습할 수 있습니다.
즉각적인 엔지니어링을 통해 수정하기 어려운 환각/오류를 수정할 수 있습니다.
더 나은 모델(예: GPT-4)을 더 간단하고 저렴한 모델(예: gpt-3.5, Llama 2)로 증류할 수 있도록 허용합니다.

LlamaIndex로 Fine-tuning 하는 방법

Retrieval 성능 향상을 위한 임베딩 미세 조정하기
더 나은 Text-SQL 변환을 위한 Llama2 미세 조정하기
gpt-3.5-turbo를 gpt-4 distill로 미세 조정하기

GPT-3.5-turbo를 GPT-4로 distill하는 미세조정과 Retrieval 성능 향상을 위한 Embedding Fine-tuning 2개의 예시를 실습해 보겠습니다.

사례 1. Fine-Tuning gpt-3.5-turbo distill GPT-4

Fine-tuning을 통해 gpt-3.5-turbo는 gpt-4 학습 데이터에 대한 fine-turned를 통해 더 나은 반응을 출력할 수 있습니다.

Fine-tuning 단계:

DatasetGenerator를 사용하여 평가 데이터세트와 학습 데이터세트 모두에 대한 데이터 생성 자동화.
1단계에서 생성된 평가 데이터 세트를 사용하여 fine-tuning 전에 기본 모델 gpt-3.5-turbo를 평가합니다.
벡터 인덱스 쿼리 엔진을 구성하고 gpt-4를 호출하여 학습 데이터 세트를 기반으로 학습 데이터를 수집합니다.
콜백 핸들러인 OpenAIFineTuningHandler는 gpt-4로 전송된 모든 메시지를 응답과 함께 수집하고, 이러한 메시지를 OpenAI API 엔드포인트에서 미세 조정을 위해 사용할 수 있는 .jsonl(JSON 라인) 포맷으로 저장합니다.
gpt-3.5-turbo와 4단계에서 생성된 jsonl 파일을 전달하여 OpenAIFinetuneEngine이 구성되고, OpenAI에 미세 조정 호출을 전송하여 OpenAI에 미세 조정 작업 요청을 시작합니다.
OpenAI는 요청에 따라 미세 조정된 gpt-3.5-turbo 모델을 생성합니다.
1단계에서 생성된 평가 데이터 세트를 사용하여 fine-tuned 모델을 평가합니다.

Setup Environments

%pip install -q llama-index-llms-openai
%pip install -q llama-index-finetuning
%pip install -q llama-index-finetuning-callbacks

%pip install -q llama_index pypdf sentence-transformers ragas

import os
import openai
import logging, sys, os
import nest_asyncio
from dotenv import load_dotenv  

nest_asyncio.apply()

!echo "OPENAI_API_KEY=<Your OpenAI Key>" >> .env
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")

예제 데이터셋을 다운로드 하겠습니다.

!mkdir fine-tune
!wget https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/4e9abe7b-fdc7-4cd2-8487-dc3a99f30e98.pdf -O ./fine-tune/nvidia-sec-10k-2022.pdf

Generate datasets

먼저 학습용 데이터 세트와 평가용 데이터 세트 두 개를 생성해 보겠습니다.

from llama_index.core import SimpleDirectoryReader
from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import DatasetGenerator

documents = SimpleDirectoryReader(
    input_files=["./fine-tune/nvidia-sec-10k-2022.pdf"]
).load_data()

# Shuffle the documents
import random

random.seed(42)
random.shuffle(documents)

gpt_35_llm = OpenAI(model="gpt-3.5-turbo", temperature=0.3)

Training dataset

question_gen_query = (
        "귀하는 교사/교수입니다. 귀하의 임무는"
        "퀴즈/시험을 설정하는 것입니다. NVIDIA SEC 10-K 서류에서 제공된 컨텍스트를 사용하여"
        "문맥에서 중요한 사실을 파악할 수 있는 하나의 질문을 구성하세요."
        "컨텍스트. 제공된 컨텍스트 정보로 질문을 제한합니다."
)

dataset_generator = DatasetGenerator.from_documents(
    documents[:50],
    question_gen_query=question_gen_query,
    llm=gpt_35_llm,
)

questions = dataset_generator.generate_questions_from_nodes(num=40)
print("Generated ", len(questions), " questions")

Generated  40  questions

with open("fine-tune/train_questions.txt", "w") as f:
    for question in questions:
        f.write(question + "\n")

생성한 질문을 저장한 train_questions.txt 파일을 확인해 보겠습니다.

!cat fine-tune/train_questions.txt

What factors could potentially impact NVIDIA's ability to accurately forecast demand for its products, and how might these factors affect the company's financial results?
어떤 조건에 따라 Restricted Stock Unit Award가 부여되는지 설명하시오.
어떤 조건을 충족해야만 Ten Percent Stockholder가 Incentive Stock Option을 받을 수 있는가요?
- What was the total comprehensive income for NVIDIA Corporation for the year ended January 29, 2023?
- What were the main components contributing to the changes in deferred revenue for NVIDIA during fiscal years 2023 and 2022?
어떤 조건들이 주어진 상황에서 Restricted Stock Units의 소유주가 주식을 판매하거나 이전해야 하는지 설명하십시오?
어떤 목적으로 데이터가 수집되고 처리되는지에 대한 정확한 이해가 필요한가요?
What factors, in addition to the Company's financial development, may impact the future value of NVIDIA's Common Stock according to the information provided in the SEC 10-K document?
어떻게 NVIDIA는 외환 환산을 다루고 있으며, 이로 인해 발생하는 손익이 현재까지 중요하지 않았다고 평가하고 있는가?
어떻게 중국의 데이터 처리 및 데이터 로컬라이제이션에 관한 법률이 NVIDIA의 비즈니스 활동에 어떤 영향을 미칠 수 있는가?
어떤 국가의 법률이 중요한 데이터의 이전을 규제하고 있으며, 이로 인해 NVIDIA의 비즁니스 운영에 부정적인 영향을 줄 수 있는 가능성이 있습니까?
- How much total cost has NVIDIA incurred for share repurchases since the inception of their program up to January 29, 2023?
어떤 국가로 이주할 경우, 회사가 법적 또는 행정적 이유로 해당 국가의 조건이 적용될 수 있다는 점을 이해하고 있습니까?
어떤 조치가 주주들에 의해 취해져야만 하며 서면 동의로 이루어질 수 없는 것인가요?
- What types of Awards are provided for in the NVIDIA Corporation Amended and Restated 2007 Equity Incentive Plan?
어떤 채널을 통해 NVIDIA가 주요 금융 정보를 투자자에게 공지하고 정보를 공개하는데 사용하는 것으로 언급되었습니까?
- How has NVIDIA supported its workforce during the COVID-19 pandemic in terms of health and safety measures?
어떤 요인들이 NVIDIA의 재무 상황에 영향을 미치고 있는 것으로 보이나요?
어떤 국가의 거주자는 Türkiye에서 일반 주식을 판매할 수 없으며, 이러한 규정을 준수해야 하는가?
What significant audit matter related to the valuation of inventories did PricewaterhouseCoopers LLP address in their report on NVIDIA's consolidated financial statements?
What was the total fair value of RSUs and PSUs during the fiscal year ended January 29, 2023, according to the NVIDIA SEC 10-K document?
어떤 조건하에 개인정보 수집 및 이용이 이루어지며, 이에 동의하지 않을 경우 어떤 결과가 발생하는가?
어떤 방법으로 NVIDIA가 Mellanox를 인수한 후에 발생한 비용을 처리하고 있는가?
What impact did inventory provisions have on NVIDIA's gross margin in fiscal year 2023 compared to fiscal year 2022?
What was the total amount returned to shareholders by NVIDIA in fiscal year 2023 through share repurchases and cash dividends?
What are the key components included in NVIDIA's Compute & Networking segment and Graphics segment as outlined in the segment information note?
어떤 날짜에 비상장주식을 보유한 비제휴자들에 의해 보유된 투표 주식의 시장 가치가 약 434.37억 달러로 추정되었습니까?
What was the aggregate market value of the voting stock held by non-affiliates of the registrant as of July 29, 2022, based on the closing sales price of the registrant's common stock reported by the Nasdaq Global Select Market?
어떤 요인들이 NVIDIA의 제품 수요에 영향을 미치고, 이로 인해 제품의 공급과 수요 사이에 불일치가 발생할 수 있는가?
- What factors contributed to the increase in research and development expenses for fiscal year 2023 according to the NVIDIA SEC 10-K document?
어떤 조치를 한국 거주자는 외화 계좌의 월간 잔고가 일정 금액을 초과할 때 취해야 하는가?
What factors contributed to the decrease in NVIDIA's effective tax rate in fiscal year 2023 compared to fiscal year 2022?
어떤 요인들이 NVIDIA의 수익률 및 재무 결과에 부정적인 영향을 미칠 수 있는가?
어떤 기준에 따라 Performance Goals가 설정되며, Performance Criteria는 어떻게 결정되는가?
어떤 조건이 "Performance Goals"을 결정하는 데 사용되는가?
어떤 조건하에 홍콩 거주자에게 부여된 Restricted Stock Units은 현금으로 지급되지 않으며, 주식으로만 지급되어야 하는가?
- What was the total aggregate principal amount of senior notes issued by NVIDIA in June 2021, March 2020, and September 2016, respectively?
어떤 항목에 따라서 관련 거래 및 이사 독립성에 대한 정보가 제공되며, 이 정보는 어디에서 찾을 수 있습니까?
어떤 조건 하에 참가자의 Option 또는 SAR가 즉시 종료되는가?
어떤 조건에서 참가자의 Continuous Service가 종료되면 Option 또는 SAR는 즉시 종료되며 참가자는 그 이후에 Option 또는 SAR를 행사할 수 없게 됩니까?

Eval Generation

이제 평가 데이터 세트를 만들기 위해 완전히 다른 문서 세트에 대한 질문을 생성해 보겠습니다.

ataset_generator = DatasetGenerator.from_documents(
    documents[
        50:
    ],  # since we generated ~1 question for 40 documents, we can skip the first 40
    question_gen_query=question_gen_query,
    llm=gpt_35_llm,
)

questions = dataset_generator.generate_questions_from_nodes(num=40)
print("Generated ", len(questions), " questions")

Generated  40  questions

with open("./fine-tune/eval_questions.txt", "w") as f:
    for question in questions:
        f.write(question + "\n")

!cat fine-tune/eval_questions.txt

어떻게 NVIDIA의 수요 예측이 잘못되었을 때 회사의 재무 결과에 부정적인 영향을 미칠 수 있는지 설명하십시오.
어떤 상황에서 Restricted Stock Unit Award가 부여되며, 해당 상황에서 어떤 조건이 적용되는가?
- What are the criteria for a Ten Percent Stockholder to be granted an Incentive Stock Option according to the provisions outlined in the NVIDIA SEC 10-K document?
What was the total comprehensive income for NVIDIA Corporation and its subsidiaries for the year ended January 29, 2023?
- What were the main components contributing to the changes in deferred revenue for NVIDIA during fiscal years 2023 and 2022?
어떤 조건 하에 Restricted Stock Units의 주식을 판매할 수 있는지에 대한 규정은 무엇인가요?
어떤 목적으로 회사, Schwab 및 기타 수신자들이 데이터를 전달, 보유, 사용 및 이전할 수 있는지에 대해 어떤 권한이 주어지나요?
What factors, other than the Company's financial development, may impact the future value of NVIDIA's Common Stock according to the information provided in the SEC 10-K document?
어떤 조건에서 NVIDIA는 잠재적인 손실에 대한 손실을 인식하고 어떻게 처리합니까?
어떻게 중국 법률이 데이터 처리 및 데이터 지역화에 관한 요구 사항을 규정하고 있으며, 이러한 요구 사항을 준수하는 데 어떤 영향을 미칠 수 있는가?
어떤 국가의 법률이 중요한 데이터의 이전을 규제하고 있으며, 이로 인해 NVIDIA의 비즈니스 운영에 부정적인 영향을 줄 수 있는 가능성이 있습니까?
What was the total cost of shares repurchased by NVIDIA during fiscal year 2023 and how many shares were repurchased in total?
어떤 국가로 이주할 경우, NVIDIA의 보상 조건은 어떻게 변화하게 되는가?
어떤 조치가 주주들에 의해 취해져야만 하며 서면 동의로 이루어질 수 없는 것인가요?
어떤 종류의 상장된 시장에서 주식 보상이 제공될 수 있습니까?
어떤 방법으로 NVIDIA는 주요 금융 정보를 투자자들에게 공지하고 있나요?
어떤 조치를 취하여 직원들이 재택근무를 할 때 지원받을 수 있었나요?
What impact did the termination of the Arm Share Purchase Agreement have on NVIDIA's financials in fiscal year 2023?
어떤 국가의 거주자는 Türkiye에서 일반 주식을 판매할 수 없으며, 이러한 규정을 준수해야 하는가요?
What significant judgments and considerations were involved in the valuation of inventories, specifically the provisions for excess or obsolete inventories and excess product purchase commitments, as outlined in the PricewaterhouseCoopers LLP report on NVIDIA's SEC 10-K document?
What is the total fair value of RSUs and PSUs as of their respective vesting dates for the years ended January 29, 2023, January 30, 2022, and January 31, 2021?
어떤 조건에서만 개인정보 처리가 이루어지고 있는가?
어떤 방법으로 NVIDIA가 Mellanox를 인수한 후에 발생한 비용을 처리하고 있는가?
What impact did inventory provisions have on NVIDIA's gross margin in fiscal year 2023 compared to fiscal year 2022?
What was the total amount returned to shareholders by NVIDIA in fiscal year 2023 through share repurchases and cash dividends?
- How does NVIDIA allocate costs or expenses between its Compute & Networking segment and Graphics segment?
What was the aggregate market value of the voting stock held by non-affiliates of NVIDIA Corporation as of July 29, 2022, based on the closing sales price of the common stock reported by the Nasdaq Global Select Market?
- What was the aggregate market value of the voting stock held by non-affiliates of the registrant as of July 29, 2022, based on the closing sales price of the registrant's common stock?
어떤 요소들이 NVIDIA의 제품 수요에 영향을 미치고, 이로 인해 제품의 공급과 수요 간 불일치가 발생할 수 있는가?
- What was the reason behind the significant increase in research and development expenses for fiscal year 2023 according to the information provided in the NVIDIA SEC 10-K document?
어떤 조치를 취해야 일본 거주자가 매년 3월 15일까지 외국 자산을 보고해야 하는지에 대해 설명해 주세요.
What factors contributed to the decrease in NVIDIA's effective tax rate in fiscal year 2023 compared to fiscal year 2022?
어떤 요인들이 NVIDIA의 수익률과 재무 결과에 부정적인 영향을 미칠 수 있는가?
어떤 기준에 따라 Performance Goals가 설정되며, Performance Criteria는 어떤 요소들을 기반으로 선택될 수 있는가?
What criteria does the Committee use to define Performance Goals for a Performance Period in the NVIDIA SEC 10-K document?
어떤 국가의 주요 규정이 주식 보상 계획에 영향을 미치는가요?
- What is the total aggregate principal amount of senior notes issued by NVIDIA in June 2021, March 2020, and September 2016, and what were the net proceeds from these offerings after deducting debt discount and issuance costs?
어떤 항목에서 관련 거래 및 이사 독립성에 대한 정보가 제공되며, 이 정보는 어디에서 찾을 수 있습니까?
어떤 조건 하에 참가자의 Option 또는 SAR이 종료되는지에 대한 규정은 무엇입니까?
어떤 조건에서 참가자의 Continuous Service가 끊겼을 때 Option 또는 SAR가 즉시 종료되며 참가자가 해당 Option 또는 SAR를 행사할 수 없게 되는지에 대한 규정은 무엇입니까?

Baseline eval for gpt-3.5-turbo

LLM Evaluation 라이브러리인 RAGAS와 Evaluate Module을 모두 사용하여 기본 모델을 평가해 보겠습니다.

Eval with `ragas`

다음 두 가지 지표를 사용할 것입니다:

answer_relevancy(답변 관련성) - 생성된 답변이 프롬프트와 얼마나 관련성이 있는지 측정합니다. 생성된 답변이 불완전하거나 중복 정보가 포함되어 있으면 점수가 낮아집니다. 이는 생성된 답변을 사용하여 LLM이 주어진 문제를 생성할 확률을 계산하여 정량화됩니다. 값 범위는 (0,1)이며, 높을수록 좋습니다.

faithfulness(충실도) - 주어진 문맥에 대해 생성된 답변의 사실적 일관성을 측정합니다. 이는 생성된 답변에서 문장을 생성한 다음 각 문장을 문맥과 대조하여 검증하는 다단계 패러다임을 사용하여 수행됩니다. 답변은 (0,1) 범위로 스케일링됩니다. 높을수록 좋습니다.

questions = []
with open("./fine-tune/eval_questions.txt", "r") as f:
    for line in f:
        questions.append(line.strip())

from llama_index.core import VectorStoreIndex

# limit the context window to 2048 tokens so that refine is used
from llama_index.core import Settings

Settings.context_window = 2048

index = VectorStoreIndex.from_documents(
    documents,
)

query_engine = index.as_query_engine(similarity_top_k=2, llm=gpt_35_llm)

contexts = []
answers = []

for question in questions:
    response = query_engine.query(question)
    contexts.append([x.node.get_content() for x in response.source_nodes])
    answers.append(str(response))

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

ds = Dataset.from_dict(
    {
        "question": questions,
        "answer": answers,
        "contexts": contexts,
    }
)

result = evaluate(ds, [answer_relevancy, faithfulness])
print(result)

Evaluating: 100%|██████████| 80/80 [00:40<00:00,  1.97it/s]


{'answer_relevancy': 0.8516, 'faithfulness': 0.8542}

pandas 데이터프레임으로 출력하여 각 질문에 대하여 확인해보겠습니다.

import pandas as pd

pd.set_option('display.max_colwidth', 200)
result.to_pandas().head()

GPT4 to collect training data

여기서는 GPT-4와 OpenAIFineTuningHandler를 사용하여 학습할 데이터를 수집합니다.

from llama_index.llms.openai import OpenAI
from llama_index.finetuning.callbacks import OpenAIFineTuningHandler
from llama_index.core.callbacks import CallbackManager

finetuning_handler = OpenAIFineTuningHandler()
callback_manager = CallbackManager([finetuning_handler])

llm = OpenAI(model="gpt-4-turbo", temperature=0.3)
llm.callback_manager = callback_manager

questions = []
with open("./fine-tune/train_questions.txt", "r") as f:
    for line in f:
        questions.append(line.strip())

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(
    documents,
)

query_engine = index.as_query_engine(similarity_top_k=2, llm=llm)

for question in questions:
    response = query_engine.query(question)

Create OpenAIFinetuneEngine

FinetuneEngine은 fine-tune 작업을 시작하고 나머지 LlamaIndex 워크플로에 직접 플러그인할 수 있는 LLM 모델을 반환하는 작업을 처리합니다.기본 생성자를 사용하지만 from_finetuning_handler 클래스 메서드를 사용하여 이 엔진에 finetuning_handler를 직접 전달할 수도 있습니다.

finetuning_handler.save_finetuning_events("./fine-tune/finetuning_events.jsonl")

Wrote 0 examples to ./fine-tune/finetuning_events.jsonl

from llama_index.finetuning import OpenAIFinetuneEngine

finetune_engine = OpenAIFinetuneEngine(
    "gpt-4-turbo",
    "./fine-tune/finetuning_events.jsonl",
    # start_job_id="<start-job-id>"  # if you have an existing job, can specify id here
)

finetune_engine.finetune()

finetune_engine.get_current_job()

OpenAI에서 미세 조정 작업이 성공적으로 완료될 때까지 여기서 몇 분 정도 기다리면 OpenAI에서 미세 조정된 모델을 사용할 준비가 되었음을 알리는 이메일이 발송됩니다.

ft_llm = finetune_engine.get_finetuned_model(temperature=0.3)

Evaluation for fine-tuned model

Eval with ragas

from llama_index.llm.openai import OpenAI
from llama_index.finetuning.callbacks import OpenAIFineTuningHandler
from llama_index.core.callbacks import CallbackManager

questions = []
with open("eval_questions.txt", "r") as f:
    for line in f:
        questions.append(line.strip())

index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(similarity_top_k=2, llm=ft_llm)

contexts = []
answers = []

for question in questions:
    response = query_engine.query(question)
    contexts.append([x.node.get_content() for x in response.source_nodes])
    answers.append(str(response))

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

ds = Dataset.from_dict(
    {
        "question": questions,
        "answer": answers,
        "contexts": contexts,
    }
)

result = evaluate(ds, [answer_relevancy, faithfulness])
print(result)

evaluating with [answer_relevancy] {'ragas_score': 0.8680, 'answer_relevancy': 0.9607, 'faithfulness': 0.7917}

import pandas as pd

pd.set_option('display.max_colwidth', 200)
result.to_pandas()

Eval with evaluation module

# eval for hallucination
evaluator = ResponseEvaluator(llm=llm_gpt4)
total_correct, all_results = evaluate_query_engine(evaluator, query_engine, questions)
print(f"Hallucination? Scored {total_correct} out of {len(questions)} questions correctly.")

# eval for answer quality
evaluator = QueryResponseEvaluator(llm=llm_gpt4)
total_correct, all_results = evaluate_query_engine(evaluator, query_engine, questions)
print(f"Response satisfies the query? Scored {total_correct} out of {len(questions)} questions correctly.")

Exploring difference

index = VectorStoreIndex.from_documents(documents)

questions = []
with open("eval_questions.txt", "r") as f:
    for line in f:
        questions.append(line.strip())

print(questions[0])

Baseline model

from llama_index.core.response.notebook_utils import display_response

query_engine = index.as_query_engine(llm=llm_gpt35)

response = query_engine.query(questions[0])

display_response(response)

Fine-tuned model

query_engine = index.as_query_engine(llm=ft_llm)

response = query_engine.query(questions[0])

display_response(response)

사례 2. Fine-tune Embedding Model

임베딩 모델을 미세 조정함으로써 가장 관련성이 높은 문서를 retrieval하는 시스템의 기능을 향상시켜 RAG 파이프라인이 최상의 성능을 발휘할 수 있도록 합니다.

세 가지 주요 섹션으로 구성됩니다:

데이터 준비하기(generate_qa_embedding_pairs 함수를 사용하면 이 작업을 쉽게 수행할 수 있습니다.)
모델 미세 조정(SentenceTransformersFinetuneEngine 사용)
Validation knowledge corpus에서 모델 평가하기

자세한 단계:

EmbeddingQAFinetuneDataset의 EmbeddingQA_embedding_pairs 함수를 호출하여 평가 및 훈련 데이터 세트에 대한 데이터 생성을 자동화합니다.
기본 모델과 훈련 데이터 세트를 전달하여 SentenceTransformersFinetuneEngine을 구축한 다음, 그 미세 조정 함수를 호출하여 기본 모델을 훈련합니다.
미세 조정된 모델을 생성합니다.
벡터 저장소 인덱스 검색기를 호출하여 관련 노드를 검색하고 기본 모델의 적중률을 평가합니다.
InformationRetrievalEvaluator를 호출하여 기본 모델을 평가합니다.
벡터 저장소 인덱스 검색기를 호출하여 관련 노드를 검색하고 미세 조정된 모델의 적중률을 평가합니다.
InformationRetrievalEvaluator를 호출하여 미세 조정된 모델을 평가합니다.

Installation and Configuration

#%pip install llama-index-llms-openai
#%pip install llama-index-embeddings-openai
#%pip install llama-index-finetuning

#%pip install -q llama_index pypdf sentence-transformers

import json
import openai
import os

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import MetadataMode
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    Response
)

import logging, sys, os
import nest_asyncio
from dotenv import load_dotenv  

nest_asyncio.apply()

!echo "OPENAI_API_KEY=<Your OpenAI Key>" >> .env
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")

!mkdir fine-tune
!wget https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/4e9abe7b-fdc7-4cd2-8487-dc3a99f30e98.pdf -O ./fine-tune/nvidia-sec-10k-2022.pdf

Generate dataset

Load corpus

먼저, LlamaIndex를 활용하여 일부 재무 PDF를 로드하고 일반 텍스트 청크로 구문 분석/청크화하여 텍스트 청크의 말뭉치를 생성합니다.

def load_corpus(docs, for_training=False, verbose=False):
    parser = SentenceSplitter()
    if for_training:
        nodes = parser.get_nodes_from_documents(docs[:90], show_progress=verbose)
    else:
        nodes = parser.get_nodes_from_documents(docs[91:], show_progress=verbose)

    if verbose:
        print(f'Parsed {len(nodes)} nodes')

    return nodes

SEC_FILE = ['./fine-tune/nvidia-sec-10k-2022.pdf']

print(f"Loading files {SEC_FILE}")

reader = SimpleDirectoryReader(input_files=SEC_FILE)
docs = reader.load_data()
print(f'Loaded {len(docs)} docs')

train_nodes = load_corpus(docs, for_training=True, verbose=True)
val_nodes = load_corpus(docs, for_training=False, verbose=True)

Loading files ['./fine-tune/nvidia-sec-10k-2022.pdf']
Loaded 169 docs


Parsing nodes: 100%|██████████| 90/90 [00:00<00:00, 838.29it/s]


Parsed 97 nodes


Parsing nodes: 100%|██████████| 78/78 [00:00<00:00, 962.97it/s]

Parsed 85 nodes

Generate synthetic queries

from llama_index.finetuning import generate_qa_embedding_pairs
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset
from llama_index.llms.openai import OpenAI

이제 LLM(gpt-3.5-turbo)을 사용하여 말뭉치의 각 텍스트 청크를 문맥으로 사용하여 질문을 생성합니다.각 쌍(생성된 질문, 문맥으로 사용된 텍스트 청크)은 (훈련 또는 평가용) 미세 조정 데이터 세트의 데이터 포인트가 됩니다.

llm=OpenAI(model="gpt-3.5-turbo-0613")

train_dataset = generate_qa_embedding_pairs(train_nodes, llm=llm)
val_dataset = generate_qa_embedding_pairs(val_nodes, llm=llm)

train_dataset.save_json("./fine-tune/train_dataset.json")
val_dataset.save_json("./fine-tune/val_dataset.json")

100%|██████████| 97/97 [02:16<00:00,  1.41s/it]
100%|██████████| 85/85 [01:56<00:00,  1.37s/it]

train_dataset = EmbeddingQAFinetuneDataset.from_json("./fine-tune/train_dataset.json")
val_dataset = EmbeddingQAFinetuneDataset.from_json("./fine-tune/val_dataset.json")

Fine-tune embedding model

from llama_index.finetuning import SentenceTransformersFinetuneEngine

finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-small-en",
    model_output_path="test_model",
    val_dataset=val_dataset,
)

modules.json: 100%|██████████| 349/349 [00:00<00:00, 740kB/s]
config_sentence_transformers.json: 100%|██████████| 124/124 [00:00<00:00, 241kB/s]
README.md: 100%|██████████| 90.8k/90.8k [00:00<00:00, 31.0MB/s]
sentence_bert_config.json: 100%|██████████| 52.0/52.0 [00:00<00:00, 121kB/s]
config.json: 100%|██████████| 684/684 [00:00<00:00, 1.94MB/s]
model.safetensors: 100%|██████████| 133M/133M [00:01<00:00, 109MB/s] 
tokenizer_config.json: 100%|██████████| 366/366 [00:00<00:00, 1.03MB/s]
vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 59.4MB/s]
tokenizer.json: 100%|██████████| 711k/711k [00:00<00:00, 1.91MB/s]
special_tokens_map.json: 100%|██████████| 125/125 [00:00<00:00, 277kB/s]
1_Pooling/config.json: 100%|██████████| 190/190 [00:00<00:00, 570kB/s]

finetune_engine.finetune()

Epoch: 100%|██████████| 2/2 [00:08<00:00,  4.44s/it]

embed_model = finetune_engine.get_finetuned_model()

embed_model

HuggingFaceEmbedding(model_name='test_model', embed_batch_size=10, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x7f12dfde5ed0>, max_length=512, normalize=True, query_instruction=None, text_instruction=None, cache_folder=None)

Evaluate fine-tuned model

이 섹션에서는 3가지 임베딩 모델을 평가합니다:

독점적인 OpenAI 임베딩,
오픈 소스 BAAI/bge-small-en, 그리고
미세 조정된 임베딩 모델.

두 가지 평가 접근 방식을 고려합니다:

간단한 사용자 지정 적중률 메트릭
간단한 사용자 정의 적중률 메트릭과 문장_변환기의 정보 검색 평가기 사용

합성(LLM 생성) 데이터 세트에 대한 미세 조정이 오픈소스 임베딩 모델을 크게 개선한다는 것을 보여줍니다.

from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode
from tqdm.notebook import tqdm
import pandas as pd

def evaluate(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    nodes = [TextNode(id_=id_, text=text) for id_, text in corpus.items()]
    index = VectorStoreIndex(nodes, embed_model=embed_model, show_progress=True)
    retriever = index.as_retriever(similarity_top_k=top_k)

    eval_results = []
    for query_id, query in tqdm(queries.items()):
        retrieved_nodes = retriever.retrieve(query)
        retrieved_ids = [node.node.node_id for node in retrieved_nodes]
        expected_id = relevant_docs[query_id][0]
        is_hit = expected_id in retrieved_ids  # assume 1 relevant doc

        eval_result = {
            "is_hit": is_hit,
            "retrieved": retrieved_ids,
            "expected": expected_id,
            "query": query_id,
        }
        eval_results.append(eval_result)
    return eval_results

from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers import SentenceTransformer
from pathlib import Path

def evaluate_st(
    dataset,
    model_id,
    name,
):
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name=name)

    model = SentenceTransformer(model_id)
    output_path = "results/"
    Path(output_path).mkdir(exist_ok=True, parents=True)
    return evaluator(model, output_path=output_path)

Run Evals

OpenAI

ada = OpenAIEmbedding()
ada_val_results = evaluate(val_dataset, ada)

Generating embeddings:   0%|          | 0/84 [00:00<?, ?it/s]

  0%|          | 0/168 [00:00<?, ?it/s]

df_ada = pd.DataFrame(ada_val_results)

hit_rate_ada = df_ada['is_hit'].mean()
hit_rate_ada

BAAI/bge-small-en

bge = "local:BAAI/bge-small-en"
bge_val_results = evaluate(val_dataset, bge)

df_bge = pd.DataFrame(bge_val_results)

hit_rate_bge = df_bge['is_hit'].mean()
hit_rate_bge

evaluate_st(val_dataset, "BAAI/bge-small-en", name='bge')

0.6355149436368012

이 단계에서 오류가 발생하면 프로젝트 루트에 "results" 폴더가 생성되었는지 확인하고, 생성되지 않은 경우 폴더를 생성한 후 이 단계를 다시 실행하세요.

Fine-tuned model

finetuned = "local:test_model"
val_results_finetuned = evaluate(val_dataset, finetuned)

Generating embeddings:   0%|          | 0/84 [00:00<?, ?it/s]

  0%|          | 0/168 [00:00<?, ?it/s]

df_finetuned = pd.DataFrame(val_results_finetuned)

hit_rate_finetuned = df_finetuned['is_hit'].mean()
hit_rate_finetuned

0.8511904761904762

evaluate_st(val_dataset, "test_model", name='finetuned')

0.6943243799411525

Summary of results

Hit rate

df_ada['model'] = 'ada'
df_bge['model'] = 'bge'
df_finetuned['model'] = 'fine_tuned'

작은 오픈소스 임베딩 모델을 미세 조정하면 검색 품질이 크게 향상되는 것을 볼 수 있습니다(심지어 독점적인 OpenAI 임베딩의 품질에 근접할 정도)!

df_all = pd.concat([df_ada, df_bge, df_finetuned])
df_all.groupby('model').mean('is_hit')

InformationRetrievalEvaluator

df_st_bge = pd.read_csv('results/Information-Retrieval_evaluation_bge_results.csv')
df_st_finetuned = pd.read_csv('results/Information-Retrieval_evaluation_finetuned_results.csv')

df_st_bge['model'] = 'bge'
df_st_finetuned['model'] = 'fine_tuned'
df_st_all = pd.concat([df_st_bge, df_st_finetuned])
df_st_all = df_st_all.set_index('model')
df_st_all

미세 조정을 포함하면 평가 지표 모음에서 일관되게 지표가 개선되는 것을 확인할 수 있습니다.

PreviousEvaluation-Driven Development NextPrompt Compression with LLMLingua

Last updated 1 year ago