Reference-Free Evaluation of RAG Systems¶
Learn how to evaluate the quality of retrieval systems without ground-truth annotations using LLM-based judges
This example shows how to assess the quality of a retrieval system without knowing the correct text passages in advance. Instead of requiring manually labeled correct chunks for every query, we use an LLM-based judge to rate the relevance and completeness of the retrieved chunks. This approach is particularly valuable when:
- You have no access to ground-truth annotations
- You want to quickly test different retrieval parameters
- You want to quickly evaluate retrieval quality in new domains
Example¶
This is an example from the document we will use throughout this guide.
For a search query such as
- "What are the methods for task decomposition in the Tree of Thoughts framework as proposed by Yao et al. 2023?"
a retrieval system with top_k set to four might return the following chunks:
- "Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote. Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs."
- "Fig. 1. Overview of a LLM-powered autonomous agent system. Component One: Planning# A complicated task usually involves many steps. An agent needs to know what they are and plan ahead. Task Decomposition# Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to "think step by step" to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model's thinking process."
- "The AI assistant can parse user input to several tasks: [{"task": task, "id", task_id, "dep": dependency_task_ids, "args": {"text": text, "image": URL, "audio": URL, "video": URL}}]. The "dep" field denotes the id of the previous task which generates a new resource that the current task relies on. A special tag "-task_id" refers to the generated text image, audio and video in the dependency task with id as task_id. The task MUST be selected from the following options: {{ Available Task List }}. There is a logical relationship between tasks, please note their order. If the user input can't be parsed, you need to reply empty JSON. Here are several cases for your reference: {{ Demonstrations }}. The chat history is recorded as {{ Chat History }}. From this chat history, you can find the path of the user-mentioned resources for your task planning."
- "Another quite distinct approach, LLM+P (Liu et al. 2023), involves relying on an external classical planner to do long-horizon planning. This approach utilizes the Planning Domain Definition Language (PDDL) as an intermediate interface to describe the planning problem. In this process, LLM (1) translates the problem into "Problem PDDL", then (2) requests a classical planner to generate a PDDL plan based on an existing "Domain PDDL", and finally (3) translates the PDDL plan back into natural language. Essentially, the planning step is outsourced to an external tool, assuming the availability of domain-specific PDDL and a suitable planner which is common in certain robotic setups but not in many other domains. Self-Reflection# Self-reflection is a vital aspect that allows autonomous agents to improve iteratively by refining past action decisions and correcting previous mistakes. It plays a crucial role in real-world tasks where trial and error are inevitable."
Without knowing in advance which chunks are useful, we will rate their relevance and completeness and assess how well they are ranked relative to one another.
Overview¶
The script implements a reference-free evaluation pipeline that:
- Loads and processes a document (in this case a blog post)
- Creates embeddings and builds a vector store for retrieval
- Runs retrieval for a set of test questions
- Uses an LLM to rate the quality of the retrieved contexts
- Computes evaluation metrics such as hit rate and MRR (Mean Reciprocal Rank)
We achieve this by creatively repurposing elluminate's abstractions for prompt templates, responses, and criteria. A future version will support this use case more natively.
You will find the complete Python script at the end of this article.
Prerequisites¶
First, install the required dependencies. This script is inspired by an example from LangChain, but you can use any other library for embedding and retrieval.
pip install langchain-core langchain-text-splitters langchain-openai langchain-community bs4 loguru python-dotenv elluminate
You need to set your API keys as environment variables:
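For example, in your shell (the variable names match those used in the script below; the values are placeholders):
export OPENAI_API_KEY_EMBEDDING_SMALL="sk-..."
export ELLUMINATE_API_KEY="..."
Alternatively, you can put them into a .env file, since the script calls load_dotenv().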
You also need a set of questions to evaluate the retrieval system against. You can write plausible examples yourself, collect them from production logs, or generate them with an LLM. We provide a set of questions in the full script at the end of this article.
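For illustration, here are two questions taken from the question set in the full script:
questions = [
    "What are the methods for task decomposition in the Tree of Thoughts framework?",
    "What is the role of PDDL in the LLM+P approach?",
]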
Setting Up Retrieval¶
First, we create a vector store from our document. This function loads a web page (in this case a blog post about AI agents), splits it into chunks, and creates embeddings:
def setup_retrieval(url="https://lilianweng.github.io/posts/2023-06-23-agent/"):
    """Setup retrieval by loading the blog post and splitting it into chunks."""
    loader = WebBaseLoader(
        web_paths=(url,),
        bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_=("post-content", "post-title", "post-header"))),
    )
    docs = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    all_splits = text_splitter.split_documents(docs)
    embeddings = OpenAIEmbeddings(
        api_key=os.getenv("OPENAI_API_KEY_EMBEDDING_SMALL"),
        model="text-embedding-3-small",
    )
    vector_store = InMemoryVectorStore(embeddings)
    vector_store.add_documents(documents=all_splits)
    return vector_store
elluminate expects the contexts as a list of strings, so you may need to adapt your retrieval function accordingly.
def get_contexts(question, k) -> list[str]:
    """Wrap retrieval in a helper function to return the contexts as a list of strings."""
    contexts = vector_store.similarity_search(question, k=k)
    return [c.page_content for c in contexts]
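For example, assuming vector_store = setup_retrieval() has already been run (as in the main block of the full script), a call looks like this:
contexts = get_contexts("What are the methods for task decomposition in the Tree of Thoughts framework?", k=4)
print(len(contexts))  # 4 retrieved chunks, each a plain string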
Prompt Template with Criteria¶
We create a prompt template that will later receive the search query. We also attach evaluation criteria to the prompt template in order to rate the relevance and completeness of the retrieved contexts.
As criteria we choose:
- Relevance - Does the chunk contain the information requested in the query?
- Completeness - Is the chunk sufficient to answer the query?
For the example query
- "What are the methods for task decomposition in the Tree of Thoughts framework as proposed by Yao et al. 2023?"
the following chunk is rated as both relevant and complete:
- "Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote. Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs."
This chunk, in contrast, is rated as relevant only:
- "Fig. 1. Overview of a LLM-powered autonomous agent system. Component One: Planning# A complicated task usually involves many steps. An agent needs to know what they are and plan ahead. Task Decomposition# Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to "think step by step" to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model's thinking process."
def create_prompt_template(prompt_template_name: str) -> PromptTemplate:
    """Create an empty prompt template with the search query as template variable."""
    template, created = client.get_or_create_prompt_template(
        name=prompt_template_name,
        messages="{{question}}",
    )
    if created:
        criterion_set = client.create_criterion_set(name=f"{prompt_template_name} Criteria")
        criterion_set.add_criteria(
            [
                "Does the response share significant domain overlap with the query?",
                "Does the response contain the specific information requested in the query?",
            ]
        )
        criterion_set.link_template(template)
        logger.info(f"Added criteria to prompt template {template.name}")
    return template
Creating an Experiment¶
We need to create an experiment in order to run an evaluation. Once responses are added to an experiment and rated, the rating results can be viewed both via the SDK and in the frontend.
def create_experiment(
    prompt_template: PromptTemplate,
    collection_name: str = "Retrieval Test Variables",
    experiment_name: str = "Retrieval Test Experiment",
) -> tuple[Experiment, TemplateVariablesCollectionWithEntries]:
    """Create an experiment for retrieval testing."""
    collection, _ = client.get_or_create_collection(
        name=collection_name,
        defaults={"description": "Template variables for retrieval test questions"},
    )
    experiment = client.create_experiment(
        name=experiment_name,
        prompt_template=prompt_template,
        collection=collection,
        description="Experiment for testing retrieval performance using LLM judges",
    )
    logger.info(f"Created experiment: {experiment.name}")
    return experiment, collection
Adding Contexts¶
We fill the prompt template with the query and add each retrieved context as a separate response. This way, each context can be rated independently with respect to the query. Responses are assigned to the experiment above so that the rating results are tracked.
def add_contexts_as_responses(
    experiment: Experiment,
    collection: TemplateVariablesCollectionWithEntries,
    questions_and_contexts: dict[str, list[str]],
) -> list[list[PromptResponse]]:
    """Add the contexts as responses to the prompt template."""
    all_responses = []
    for question, chunks in questions_and_contexts.items():
        template_variables = collection.add_many(variables=[{"question": question}])[0]
        responses = experiment.add_responses(
            responses=chunks,
            template_variables=[template_variables] * len(chunks),
        )
        all_responses.append(responses)
    logger.info(f"Added {sum(len(x) for x in all_responses)} responses")
    return all_responses
Rating the Contexts¶
elluminate rates each context for relevance and completeness.
def rate_contexts(experiment: Experiment, all_responses: list[list[PromptResponse]]) -> list[list[list[Rating]]]:
    """Rate the contexts for each search query."""
    experiment.rate_responses(rating_mode=RatingMode.FAST)
    all_ratings = []
    for responses in all_responses:
        response_ratings = [response.ratings for response in responses]
        all_ratings.append(response_ratings)
    logger.info(f"Rated {sum(len(x) for x in all_ratings)} responses")
    return all_ratings
Computing the Metrics¶
The hit rate is the percentage of queries with at least one useful context (in the implementation below: a context rated both relevant and complete). A higher hit rate indicates that your retrieval system is better at finding at least one useful result for each query.
MRR is a metric that measures how well the retrieval system ranks relevant results. For each query, it takes the position of the first relevant result and computes the reciprocal of that position (\(\frac{1}{position}\)). The final score is the average over all queries.
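Written out for a query set \(Q\), with \(\mathrm{rank}_i\) the position of the first useful result for query \(i\) (the summand is taken as 0 for queries without any useful result, which matches the implementation below):
\[
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i},
\qquad
\mathrm{HitRate} = \frac{\left|\{\, i \in Q : \mathrm{rank}_i \text{ exists} \,\}\right|}{|Q|}
\]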
A Simple Example¶
First query: "What are the health benefits of drinking water?" Retrieved contexts (in order):
- "Benefits of water for hydration and health..." ✓
- "Different types of beverages..."
- "Water and exercise performance..." ✓
- "Water pollution statistics..."
→ The first relevant result is at position 1
→ Reciprocal rank = \(\frac{1}{1} = 1.0\)
Second query: "What is the recommended daily water intake?" Retrieved contexts:
- "Caffeine consumption guidelines..."
- "Daily water intake recommendations..." ✓
- "Dehydration symptoms..."
- "Water quality standards..."
→ The first relevant result is at position 2
→ Reciprocal rank = \(\frac{1}{2} = 0.5\)
MRR of the two examples:
→ MRR = \(\frac{1.0 + 0.5}{2} = 0.75\)
The implementation computes both the MRR (which in our case requires a chunk to be relevant and complete) and a relevance-only MRR, as well as the hit rate and the number of failed queries.
def calculate_metrics(all_ratings: list[list[list[Rating]]]) -> tuple[float, float, float, int]:
    """Calculate MRR (mean reciprocal rank) for each search query."""
    rrs = []
    relevancy_rrs = []
    for ratings in all_ratings:
        rr = 0
        found_relevant = False
        # position is the 1-based rank of the context in the retrieval order
        for position, rating in enumerate(ratings, start=1):
            if len(rating) < 2:
                continue
            # rating[0] and rating[1] are assumed to follow the order in which the
            # criteria were added: relevance first, then completeness
            relevant = rating[0].rating
            complete = rating[1].rating
            if relevant and not found_relevant:
                relevancy_rrs.append(1 / position)
                found_relevant = True
            if relevant and complete:
                rr = 1 / position
                break
        rrs.append(rr)
    mrr = sum(rrs) / len(rrs) if rrs else 0
    relevancy_mrr = sum(relevancy_rrs) / len(relevancy_rrs) if relevancy_rrs else 0
    hit_rate = sum(1 for rr in rrs if rr > 0) / len(rrs) if rrs else 0
    failures = sum(1 for rr in rrs if rr == 0)
    logger.info(f"MRR: {mrr}, Relevancy MRR: {relevancy_mrr}, Hit rate: {hit_rate}, Failures: {failures}")
    return hit_rate, mrr, relevancy_mrr, failures
Running the Evaluation¶
The main evaluation function ties everything together:
def run_test(questions: list[str], retrieval_fn: Callable) -> tuple[float, float, float, int]:
    """Main test function."""
    queries_and_contexts_dict = {question: retrieval_fn(question) for question in questions}
    prompt_template = create_prompt_template(prompt_template_name="Agent Blog Post Retrieval")
    experiment, collection = create_experiment(prompt_template)
    queries_and_contexts = add_contexts_as_responses(experiment, collection, queries_and_contexts_dict)
    all_ratings = rate_contexts(experiment, queries_and_contexts)
    return calculate_metrics(all_ratings)
After reviewing the metrics in your terminal or inspecting individual examples in the dashboard, you can adjust your retrieval parameters to improve performance (see the sweep sketch after the snippet below).
vector_store = setup_retrieval()
client = Client(timeout=60)
k = 8
logger.info(f"Experiment kwargs: {k=}")
retrieval_fn = partial(get_contexts, k=k)
hit_rate, mrr, relevancy_mrr, failures = run_test(questions, retrieval_fn=retrieval_fn)
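If you want to compare several retrieval depths, a minimal sweep over k could look like the sketch below. Note that create_experiment() above uses a fixed experiment name, so for repeated runs you would likely want to pass a distinct experiment name per value of k (for example by threading it through run_test); the loop only illustrates the idea.
# Hypothetical sweep over the retrieval depth k (sketch)
for k in (2, 4, 8):
    retrieval_fn = partial(get_contexts, k=k)
    hit_rate, mrr, relevancy_mrr, failures = run_test(questions, retrieval_fn=retrieval_fn)
    logger.info(f"{k=}: hit_rate={hit_rate:.2f}, mrr={mrr:.2f}, relevancy_mrr={relevancy_mrr:.2f}, failures={failures}")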
Full Script¶
# Set your `OPENAI_API_KEY_EMBEDDING_SMALL` and `ELLUMINATE_API_KEY` in your environment
"""Retrieval Quality Assessment Example using LangChain (v1.0 API)
This example demonstrates how to assess RAG (Retrieval-Augmented Generation) pipelines
using Elluminate. It uses LangChain for document loading and vector search, then assesses
the retrieved contexts using LLM-as-judge with criteria for relevance and completeness.
Requires: langchain-core, langchain-text-splitters, langchain-community, langchain-openai, bs4, loguru, python-dotenv, elluminate
"""
import os
from functools import partial
from typing import Callable
import bs4
from dotenv import load_dotenv
from elluminate import Client
from elluminate.schemas import (
    Experiment,
    PromptResponse,
    PromptTemplate,
    Rating,
    RatingMode,
    TemplateVariablesCollectionWithEntries,
)
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from loguru import logger
load_dotenv(override=True)
questions = [
    "What are the key components that complement the LLM in an LLM-powered autonomous agent system?",
    "What is the difference between short-term memory and long-term memory in the context of AI models?",
    "What is the purpose of the Chain of Thought (CoT) technique?",
    "What are the methods for task decomposition in the Tree of Thoughts framework?",
    "What is the role of PDDL in the LLM+P approach?",
]
def setup_retrieval(url="https://lilianweng.github.io/posts/2023-06-23-agent/"):
    """Setup retrieval by loading the blog post and splitting it into chunks."""
    loader = WebBaseLoader(
        web_paths=(url,),
        bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_=("post-content", "post-title", "post-header"))),
    )
    docs = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    all_splits = text_splitter.split_documents(docs)
    embeddings = OpenAIEmbeddings(
        api_key=os.getenv("OPENAI_API_KEY_EMBEDDING_SMALL"),
        model="text-embedding-3-small",
    )
    vector_store = InMemoryVectorStore(embeddings)
    vector_store.add_documents(documents=all_splits)
    return vector_store
def get_contexts(question, k) -> list[str]:
    """Wrap retrieval in a helper function to return the contexts as a list of strings."""
    contexts = vector_store.similarity_search(question, k=k)
    return [c.page_content for c in contexts]
def create_prompt_template(prompt_template_name: str) -> PromptTemplate:
    """Create an empty prompt template with the search query as template variable."""
    template, created = client.get_or_create_prompt_template(
        name=prompt_template_name,
        messages="{{question}}",
    )
    if created:
        criterion_set = client.create_criterion_set(name=f"{prompt_template_name} Criteria")
        criterion_set.add_criteria(
            [
                "Does the response share significant domain overlap with the query?",
                "Does the response contain the specific information requested in the query?",
            ]
        )
        criterion_set.link_template(template)
        logger.info(f"Added criteria to prompt template {template.name}")
    return template
def create_experiment(
    prompt_template: PromptTemplate,
    collection_name: str = "Retrieval Test Variables",
    experiment_name: str = "Retrieval Test Experiment",
) -> tuple[Experiment, TemplateVariablesCollectionWithEntries]:
    """Create an experiment for retrieval testing."""
    collection, _ = client.get_or_create_collection(
        name=collection_name,
        defaults={"description": "Template variables for retrieval test questions"},
    )
    experiment = client.create_experiment(
        name=experiment_name,
        prompt_template=prompt_template,
        collection=collection,
        description="Experiment for testing retrieval performance using LLM judges",
    )
    logger.info(f"Created experiment: {experiment.name}")
    return experiment, collection
def add_contexts_as_responses(
    experiment: Experiment,
    collection: TemplateVariablesCollectionWithEntries,
    questions_and_contexts: dict[str, list[str]],
) -> list[list[PromptResponse]]:
    """Add the contexts as responses to the prompt template."""
    all_responses = []
    for question, chunks in questions_and_contexts.items():
        template_variables = collection.add_many(variables=[{"question": question}])[0]
        responses = experiment.add_responses(
            responses=chunks,
            template_variables=[template_variables] * len(chunks),
        )
        all_responses.append(responses)
    logger.info(f"Added {sum(len(x) for x in all_responses)} responses")
    return all_responses
def rate_contexts(experiment: Experiment, all_responses: list[list[PromptResponse]]) -> list[list[list[Rating]]]:
    """Rate the contexts for each search query."""
    experiment.rate_responses(rating_mode=RatingMode.FAST)
    all_ratings = []
    for responses in all_responses:
        response_ratings = [response.ratings for response in responses]
        all_ratings.append(response_ratings)
    logger.info(f"Rated {sum(len(x) for x in all_ratings)} responses")
    return all_ratings
def calculate_metrics(all_ratings: list[list[list[Rating]]]) -> tuple[float, float, float, int]:
    """Calculate MRR (mean reciprocal rank) for each search query."""
    rrs = []
    relevancy_rrs = []
    for ratings in all_ratings:
        rr = 0
        found_relevant = False
        # position is the 1-based rank of the context in the retrieval order
        for position, rating in enumerate(ratings, start=1):
            if len(rating) < 2:
                continue
            # rating[0] and rating[1] are assumed to follow the order in which the
            # criteria were added: relevance first, then completeness
            relevant = rating[0].rating
            complete = rating[1].rating
            if relevant and not found_relevant:
                relevancy_rrs.append(1 / position)
                found_relevant = True
            if relevant and complete:
                rr = 1 / position
                break
        rrs.append(rr)
    mrr = sum(rrs) / len(rrs) if rrs else 0
    relevancy_mrr = sum(relevancy_rrs) / len(relevancy_rrs) if relevancy_rrs else 0
    hit_rate = sum(1 for rr in rrs if rr > 0) / len(rrs) if rrs else 0
    failures = sum(1 for rr in rrs if rr == 0)
    logger.info(f"MRR: {mrr}, Relevancy MRR: {relevancy_mrr}, Hit rate: {hit_rate}, Failures: {failures}")
    return hit_rate, mrr, relevancy_mrr, failures
def run_test(questions: list[str], retrieval_fn: Callable) -> tuple[float, float, float, int]:
    """Main test function."""
    queries_and_contexts_dict = {question: retrieval_fn(question) for question in questions}
    prompt_template = create_prompt_template(prompt_template_name="Agent Blog Post Retrieval")
    experiment, collection = create_experiment(prompt_template)
    queries_and_contexts = add_contexts_as_responses(experiment, collection, queries_and_contexts_dict)
    all_ratings = rate_contexts(experiment, queries_and_contexts)
    return calculate_metrics(all_ratings)
if __name__ == "__main__":
    vector_store = setup_retrieval()
    client = Client(timeout=60)
    k = 8
    logger.info(f"Experiment kwargs: {k=}")
    retrieval_fn = partial(get_contexts, k=k)
    hit_rate, mrr, relevancy_mrr, failures = run_test(questions, retrieval_fn=retrieval_fn)