RAG Evaluation with RAGAS

Last updated 1 year ago

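Building a basic LLM application can be simple, but maintaining it and improving it continuously is the real challenge. Ragas aims to drive the continuous improvement of LLM and RAG applications by embracing the ideas of Metrics-Driven Development (MDD).

MDD is a product development approach that relies on data to make well-informed decisions. It continuously monitors essential metrics over time, which provides valuable insight into an application's performance. The mission of Ragas is to establish an open-source standard for applying MDD to LLM and RAG applications.

Evaluation: lets you assess LLM applications and run experiments in a metrics-assisted way, ensuring high reliability and reproducibility.
Monitoring: lets you gain valuable, actionable insights from production data points so that you can continuously improve the quality of your LLM application.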

Metrics

The LLM's own generation is evaluated with faithfulness and answer relevancy, while RAG retrieval is evaluated with context precision and context recall.

Component-Wise Evaluation

As in any other machine learning system, the performance of the individual components within an LLM and RAG pipeline has a significant impact on the overall experience.

End-to-End Evaluation

Evaluating the end-to-end performance of the pipeline is also crucial, because it directly affects the user experience.

Faithfulness

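Faithfulness measures the factual consistency of the generated answer against the given context. It is computed from the answer and the retrieved context, and the score falls in the range (0, 1); the closer to 1, the better.

A generated answer is considered faithful if every claim it makes can be inferred from the given context.
To compute the score, a set of claims is first identified in the generated answer.
Each claim is then cross-checked against the given context to determine whether it can be inferred from that context.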

$\textrm{Faithfulness} = \frac{\textrm{|Number of claims in the generated answer that can be inferred from the given context|}}{\textrm{|Total number of claims in the generated answer|}}$

Question: Where and when was Einstein born?
Context: Albert Einstein (born 14 March 1879) was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time.

> High faithfulness answer: Einstein was born in Germany on 14th March 1879.
> Low faithfulness answer: Einstein was born in Germany on 20th March 1879.
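
The faithfulness ratio can be sketched in a few lines of Python. This is a minimal illustration and not the Ragas implementation: in Ragas an LLM extracts the claims and judges each one, whereas the claim lists and the supported/unsupported verdicts below are hand-written assumptions based on the Einstein example.

```python
def faithfulness_score(verdicts):
    """Fraction of claims judged inferable from the context.

    verdicts: list of booleans, one per claim in the generated answer
    (True means the claim can be inferred from the context).
    """
    if not verdicts:
        raise ValueError("the answer must contain at least one claim")
    return sum(verdicts) / len(verdicts)


# Hypothetical claim-level verdicts for the two example answers.
high_answer = {
    "Einstein was born in Germany": True,
    "Einstein was born on 14th March 1879": True,
}
low_answer = {
    "Einstein was born in Germany": True,
    "Einstein was born on 20th March 1879": False,  # date contradicts the context
}

high_score = faithfulness_score(list(high_answer.values()))
low_score = faithfulness_score(list(low_answer.values()))
print(high_score, low_score)
```

Under these assumed verdicts, the second answer loses half of its score because one of its two claims is not supported by the context.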

Example

from datasets import Dataset
from ragas.metrics.faithfulness import Faithfulness

faithfulness = Faithfulness(
    batch_size=10
)

# Expected dataset layout:
# Dataset({
#     features: ['question','contexts','answer'],
#     num_rows: 25
# })
dataset: Dataset  # placeholder: supply your own Dataset here

results = faithfulness.score(dataset)

Answer Relevance

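Answer relevance focuses on assessing how pertinent the generated answer is to the given prompt. Answers that are incomplete or contain redundant information receive lower scores. The metric is computed from the question and the answer as a value between 0 and 1, where values closer to 1 indicate better performance.

An answer is considered relevant when it addresses the original question directly and appropriately.
The underlying idea is that if the generated answer accurately addresses the original question, an LLM should be able to generate from that answer a question that matches the original one.
Importantly, factuality is not considered when assessing answer relevance; points are deducted only when the answer is incomplete or contains redundant details.
To compute the score, the LLM is prompted several times to generate a suitable question for the generated answer, and the mean cosine similarity between these generated questions and the original question is measured.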

Question: Where is France and what is its capital?

> Low relevance answer: France is in western Europe.
> High relevance answer: France is in western Europe and Paris is its capital.
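
The mean cosine similarity step can be illustrated with a small sketch. The 3-dimensional vectors below are made-up stand-ins for real sentence embeddings, and the function names are hypothetical; in practice the generated questions come from an LLM and the vectors from an embedding model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def answer_relevancy_score(question_vec, generated_question_vecs):
    """Mean cosine similarity between the original question and the
    questions regenerated from the answer."""
    sims = [cosine_similarity(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy embedding of the original question (location + capital).
original_question = [1.0, 1.0, 0.0]
# Questions regenerated from the complete answer stay close to the original.
from_high_answer = [[1.0, 1.0, 0.0], [0.9, 1.0, 0.1]]
# Questions regenerated from the incomplete answer only cover the location.
from_low_answer = [[1.0, 0.0, 0.0], [1.0, 0.1, 0.0]]

high_relevancy = answer_relevancy_score(original_question, from_high_answer)
low_relevancy = answer_relevancy_score(original_question, from_low_answer)
print(round(high_relevancy, 3), round(low_relevancy, 3))
```

With these toy vectors the complete answer scores higher, because the questions regenerated from it point in nearly the same direction as the original question.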

Example

from datasets import Dataset
from ragas.metrics import AnswerRelevancy
from langchain.embeddings import HuggingFaceEmbeddings

# model_name must be passed as a keyword argument
embeddings = HuggingFaceEmbeddings(model_name='BAAI/bge-base-en')
answer_relevancy = AnswerRelevancy(
    embeddings=embeddings
)

# init_model to load models used
answer_relevancy.init_model()

# Expected dataset layout:
# Dataset({
#     features: ['question','answer'],
#     num_rows: 25
# })
dataset: Dataset  # placeholder: supply your own Dataset here

results = answer_relevancy.score(dataset)

Context Precision

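Context precision evaluates whether all of the ground-truth relevant items present in the contexts are ranked near the top. Ideally, every relevant chunk appears at the highest ranks. The metric is computed from the question and the contexts as a value between 0 and 1, where scores closer to 1 indicate better precision.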

$\textrm{Context Precision@K} = \frac{\sum_{k=1}^{K} \left( \textrm{Precision@k} \times v_k \right)}{\textrm{Total number of relevant items in the top K results}}$

$\textrm{Precision@k} = \frac{\textrm{true positives@k}}{\textrm{(true positives@k + false positives@k)}}$

$K$ = the total number of chunks in contexts; $v_k \in \{0, 1\}$ indicates whether the chunk at rank $k$ is relevant.
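
A small sketch may help make the ranking effect concrete. It assumes that precision@k is summed only at ranks where the chunk is relevant, which keeps the score within 0 and 1; the relevance flags are hand-written here, whereas Ragas derives them with an LLM. The function name is hypothetical.

```python
def context_precision_at_k(relevance):
    """Rank-aware precision over retrieved chunks.

    relevance: list of 0/1 flags, one per retrieved chunk, in rank order
    (1 means the chunk is relevant to the question).
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    hits = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # precision@rank, counted only at ranks holding a relevant chunk
            weighted_sum += hits / rank
    return weighted_sum / total_relevant


# Same two relevant chunks, ranked first versus ranked last.
good_ranking = context_precision_at_k([1, 1, 0])
poor_ranking = context_precision_at_k([0, 1, 1])
print(good_ranking, round(poor_ranking, 4))
```

Both rankings retrieve the same two relevant chunks, but the score drops when an irrelevant chunk is ranked ahead of them.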

Example

from datasets import Dataset
from ragas.metrics import ContextPrecision

context_precision = ContextPrecision()

# Expected dataset layout:
# Dataset({
#     features: ['question','contexts'],
#     num_rows: 25
# })
dataset: Dataset  # placeholder: supply your own Dataset here

results = context_precision.score(dataset)

Context Relevancy

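Context relevancy measures how relevant the retrieved context is, computed from both the question and the contexts. Values fall in the range (0, 1), with higher values indicating better relevancy.

Ideally, the retrieved context should contain only the information essential to addressing the given query.
To compute the score, the sentences within the retrieved context that are relevant to answering the given question are first identified; their count is denoted $|S|$.
The final score is then given by the following formula:

$\textrm{context relevancy} = \frac{\textrm{|S|}}{\textrm{|Total number of sentences in retrieved context|}}$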


Question: What is the capital of France?

> High context relevancy: France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower.

> Low context relevancy: France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower. The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and the vast Palace of Versailles attest to its rich history.
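
The ratio of relevant sentences to total sentences can be sketched as follows. The sentence lists are pre-split by hand, and the choice of which sentence counts as relevant to "What is the capital of France?" is an assumption made for illustration; Ragas delegates that judgement to an LLM.

```python
def context_relevancy_score(sentences, relevant_sentences):
    """Share of context sentences that are relevant to the question."""
    if not sentences:
        raise ValueError("the retrieved context must contain at least one sentence")
    relevant = [s for s in sentences if s in relevant_sentences]
    return len(relevant) / len(sentences)


intro = ("France, in Western Europe, encompasses medieval cities, "
         "alpine villages and Mediterranean beaches.")
capital = ("Paris, its capital, is famed for its fashion houses, classical art "
           "museums including the Louvre and monuments like the Eiffel Tower.")
cuisine = "The country is also renowned for its wines and sophisticated cuisine."
history = ("Lascaux's ancient cave drawings, Lyon's Roman theater and the vast "
           "Palace of Versailles attest to its rich history.")

# Assumption: only the sentence naming the capital answers the question.
relevant = {capital}

short_context = context_relevancy_score([intro, capital], relevant)
long_context = context_relevancy_score([intro, capital, cuisine, history], relevant)
print(short_context, long_context)
```

Adding sentences that do not help answer the question lowers the score even though the answer is still present in the context.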

Example

from datasets import Dataset
from ragas.metrics import ContextRelevance

context_relevancy = ContextRelevance()

# Expected dataset layout:
# Dataset({
#     features: ['question','contexts'],
#     num_rows: 25
# })
dataset: Dataset  # placeholder: supply your own Dataset here

results = context_relevancy.score(dataset)

Context Recall

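Context recall measures how well the retrieved context aligns with the annotated answer, which is treated as the ground truth. It is computed from the ground truth and the retrieved context, and values range from 0 to 1, with higher values indicating better performance.

To estimate context recall from the ground truth answer, each sentence in the ground truth answer is analyzed to determine whether it can be attributed to the retrieved context.
In the ideal scenario, every sentence in the ground truth answer can be attributed to the retrieved context.
Context recall is computed as follows:

$\textrm{context recall} = \frac{\textrm{|Ground truth sentences that can be attributed to context|}}{\textrm{|Number of sentences in ground truth|}}$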


Question: Where is France and what is its capital?

> Ground truth: France is in Western Europe and its capital is Paris.

> High context recall: France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower.

> Low context recall: France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and the vast Palace of Versailles attest to its rich history.
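
The attribution step can be approximated with a deliberately naive sketch. Two things here are assumptions made for illustration: the ground truth is split by hand into two statements, and a statement counts as attributable when all of its hand-picked key terms occur in the context. Ragas uses an LLM for this judgement, so keyword matching is only a stand-in.

```python
def is_attributable(key_terms, context):
    """Naive stand-in for the LLM judgement: every key term must appear."""
    lowered = context.lower()
    return all(term in lowered for term in key_terms)


def context_recall_score(statements, context):
    """Share of ground truth statements attributable to the context."""
    attributed = [is_attributable(terms, context) for terms in statements.values()]
    return sum(attributed) / len(attributed)


# Ground truth split into statements, each with hypothetical key terms.
ground_truth = {
    "France is in Western Europe": ["france", "western europe"],
    "Its capital is Paris": ["capital", "paris"],
}

high_context = (
    "France, in Western Europe, encompasses medieval cities, alpine villages "
    "and Mediterranean beaches. Paris, its capital, is famed for its fashion "
    "houses, classical art museums including the Louvre and monuments like "
    "the Eiffel Tower."
)
low_context = (
    "France, in Western Europe, encompasses medieval cities, alpine villages "
    "and Mediterranean beaches. The country is also renowned for its wines "
    "and sophisticated cuisine. Lascaux's ancient cave drawings, Lyon's Roman "
    "theater and the vast Palace of Versailles attest to its rich history."
)

high_recall = context_recall_score(ground_truth, high_context)
low_recall = context_recall_score(ground_truth, low_context)
print(high_recall, low_recall)
```

The second context never mentions Paris, so only one of the two ground truth statements can be attributed to it.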

Example

from datasets import Dataset
from ragas.metrics import ContextRecall

context_recall = ContextRecall(
    batch_size=10
)

# Expected dataset layout:
# Dataset({
#     features: ['contexts','ground_truths'],
#     num_rows: 25
# })
dataset: Dataset  # placeholder: supply your own Dataset here

results = context_recall.score(dataset)

Answer Semantic Similarity

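Answer semantic similarity assesses the semantic resemblance between the generated answer and the ground truth. It is based on the ground truth and the answer, with values in the range 0 to 1. A higher score means the generated answer aligns better with the ground truth.

Measuring the semantic similarity between the answers offers valuable insight into the quality of the generated response.
The semantic similarity score is computed with a cross-encoder model.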


> Ground truth: Albert Einstein’s theory of relativity revolutionized our understanding of the universe.

> High similarity answer: Einstein’s groundbreaking theory of relativity transformed our comprehension of the cosmos.

> Low similarity answer: Isaac Newton’s laws of motion greatly influenced classical physics.
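
To give a rough feel for how such a score separates the two answers, the sketch below swaps in a much cruder technique: cosine similarity over bag-of-words counts. This is not the cross-encoder the metric uses and captures only word overlap rather than meaning; it is included solely so the example runs without any model.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; punctuation is dropped."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


ground_truth = bag_of_words(
    "Albert Einstein's theory of relativity revolutionized our "
    "understanding of the universe."
)
high_answer = bag_of_words(
    "Einstein's groundbreaking theory of relativity transformed our "
    "comprehension of the cosmos."
)
low_answer = bag_of_words(
    "Isaac Newton's laws of motion greatly influenced classical physics."
)

high_similarity = cosine(ground_truth, high_answer)
low_similarity = cosine(ground_truth, low_answer)
print(round(high_similarity, 3), round(low_similarity, 3))
```

A real semantic model would also credit paraphrases such as "universe" and "cosmos", which this word-overlap stand-in treats as unrelated.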

Example

from datasets import Dataset
from ragas.metrics import AnswerSimilarity

answer_similarity = AnswerSimilarity()

# Expected dataset layout:
# Dataset({
#     features: ['answer','ground_truths'],
#     num_rows: 25
# })
dataset: Dataset  # placeholder: supply your own Dataset here

results = answer_similarity.score(dataset)

Answer Correctness

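Answer correctness measures the accuracy of the generated answer against the ground truth, with scores ranging from 0 to 1. A higher score means the generated answer is closer to the ground truth and therefore more correct.

The metric covers two important aspects: the semantic similarity and the factual similarity between the generated answer and the ground truth.
These aspects are combined using a weighting scheme to form the answer correctness score.
If desired, users can also apply a threshold to round the resulting score to a binary value.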


> Ground truth: Einstein was born in 1879 in Germany.

> High answer correctness: In 1879, in Germany, Einstein was born.

> Low answer correctness: In Spain, Einstein was born in 1879.
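
The weighting and thresholding described above can be sketched as a weighted average. The function name, the order of the weights (factual first, semantic second) and the component scores are all assumptions for illustration; check the Ragas documentation for the order your version expects.

```python
def answer_correctness_score(factual, semantic, weights=(0.5, 0.5), threshold=None):
    """Weighted average of factual and semantic similarity.

    weights: (factual_weight, semantic_weight); the order is an assumption.
    threshold: if given, the score is rounded to 1 or 0 around this value.
    """
    w_factual, w_semantic = weights
    score = (w_factual * factual + w_semantic * semantic) / (w_factual + w_semantic)
    if threshold is not None:
        return 1 if score >= threshold else 0
    return score


# Hypothetical component scores for the two example answers.
high = answer_correctness_score(factual=1.0, semantic=0.9, weights=(0.4, 0.6))
low = answer_correctness_score(factual=0.5, semantic=0.8, weights=(0.4, 0.6))
print(round(high, 2), round(low, 2))

# The same scores rounded to a binary verdict.
high_binary = answer_correctness_score(1.0, 0.9, weights=(0.4, 0.6), threshold=0.7)
low_binary = answer_correctness_score(0.5, 0.8, weights=(0.4, 0.6), threshold=0.7)
print(high_binary, low_binary)
```

Dividing by the sum of the weights keeps the result between 0 and 1 even when the weights do not add up to 1.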

Example

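from datasets import Dataset
from ragas.metrics import AnswerCorrectness

# Weights used to combine the two aspects of the score
answer_correctness = AnswerCorrectness(
    weights=[0.4, 0.6]
)

# Expected dataset layout:
# Dataset({
#     features: ['answer','ground_truths'],
#     num_rows: 25
# })
dataset: Dataset  # placeholder: supply your own Dataset here

results = answer_correctness.score(dataset)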

Aspect Critique

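Aspect critique is designed to evaluate submissions against predefined aspects such as harmlessness and correctness. Users can also flexibly define their own aspects to evaluate submissions against specific criteria. The output of an aspect critique is binary, indicating whether the submission aligns with the defined aspect, and the evaluation is performed using the 'answer' as input.

Critiques within the LLM evaluator assess submissions based on the provided aspect.
A variety of predefined aspects, such as correctness and harmfulness, are available, and custom aspects can be created to evaluate submissions against your own requirements.
The strictness parameter plays an important role in maintaining a certain level of self-consistency in predictions; the ideal range is typically 2 to 4.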
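
One common way to obtain self-consistency is to ask the evaluator several times and take the majority verdict. The sketch below assumes that strictness corresponds to the number of repeated judgements and that they are combined by majority vote; the source does not spell this out, so treat both the mechanism and the function name as assumptions.

```python
def majority_verdict(verdicts):
    """Combine repeated binary verdicts (1 = aligns with the aspect).

    Returns 1 only when strictly more than half of the verdicts are 1,
    so a tie is resolved conservatively as 0.
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    positives = sum(verdicts)
    return 1 if positives > len(verdicts) / 2 else 0


# Three hypothetical evaluator runs for the same submission.
mostly_positive = majority_verdict([1, 1, 0])
mostly_negative = majority_verdict([0, 1, 0])
tie = majority_verdict([1, 0])
print(mostly_positive, mostly_negative, tie)
```

An odd number of runs avoids ties altogether, which is one reason to be deliberate about the value chosen for strictness.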


from datasets import Dataset

# Predefined aspects; import them before building the list
from ragas.metrics.critique import (
    harmfulness, maliciousness, coherence, correctness, conciseness,
)

SUPPORTED_ASPECTS = [harmfulness, maliciousness, coherence, correctness, conciseness]

# Expected dataset layout:
# Dataset({
#     features: ['question','answer'],
#     num_rows: 25
# })
dataset: Dataset  # placeholder: supply your own Dataset here

# Define your own critique
from ragas.metrics.critique import AspectCritique
my_critique = AspectCritique(name="my-critique", definition="Is the submission safe to children?", strictness=2)

results = my_critique.score(dataset)
