Reference-Free Evaluation of RAG Systems¶
Learn how to evaluate the quality of retrieval systems without ground-truth annotations using LLM-based judges
This example shows how to assess the quality of a retrieval system without knowing the correct text passages in advance. Instead of requiring manually labeled correct chunks for every query, we use an LLM-based judge to rate the relevance and completeness of the retrieved chunks. This approach is particularly valuable when:
- You have no access to ground-truth annotations
- You want to quickly test different retrieval parameters
- You want to quickly evaluate retrieval quality in new domains
Example¶
This is an example from the document we will use throughout this guide.
For a search query such as
- "What are the methods for task decomposition in the Tree of Thoughts framework as proposed by Yao et al. 2023?"
a retrieval system with top_k set to four might return the following chunks:
- "Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote. Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs."
- "Fig. 1. Overview of a LLM-powered autonomous agent system. Component One: Planning# A complicated task usually involves many steps. An agent needs to know what they are and plan ahead. Task Decomposition# Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to "think step by step" to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model's thinking process."
- "The AI assistant can parse user input to several tasks: [{"task": task, "id", task_id, "dep": dependency_task_ids, "args": {"text": text, "image": URL, "audio": URL, "video": URL}}]. The "dep" field denotes the id of the previous task which generates a new resource that the current task relies on. A special tag "-task_id" refers to the generated text image, audio and video in the dependency task with id as task_id. The task MUST be selected from the following options: {{ Available Task List }}. There is a logical relationship between tasks, please note their order. If the user input can't be parsed, you need to reply empty JSON. Here are several cases for your reference: {{ Demonstrations }}. The chat history is recorded as {{ Chat History }}. From this chat history, you can find the path of the user-mentioned resources for your task planning."
- "Another quite distinct approach, LLM+P (Liu et al. 2023), involves relying on an external classical planner to do long-horizon planning. This approach utilizes the Planning Domain Definition Language (PDDL) as an intermediate interface to describe the planning problem. In this process, LLM (1) translates the problem into "Problem PDDL", then (2) requests a classical planner to generate a PDDL plan based on an existing "Domain PDDL", and finally (3) translates the PDDL plan back into natural language. Essentially, the planning step is outsourced to an external tool, assuming the availability of domain-specific PDDL and a suitable planner which is common in certain robotic setups but not in many other domains. Self-Reflection# Self-reflection is a vital aspect that allows autonomous agents to improve iteratively by refining past action decisions and correcting previous mistakes. It plays a crucial role in real-world tasks where trial and error are inevitable."
Without knowing in advance which chunks are useful, we will rate their relevance and completeness and assess how well they are ranked relative to one another.
Overview¶
The script implements a reference-free evaluation pipeline that:
- Loads and processes a document (in this case a blog post)
- Creates embeddings and builds a vector store for retrieval
- Runs retrieval for a set of test questions
- Uses an LLM to rate the quality of the retrieved contexts
- Computes evaluation metrics such as hit rate and MRR (Mean Reciprocal Rank)
We achieve this by creatively repurposing elluminate's abstractions for prompt templates, responses, and criteria. A future version will support this use case more natively.
You will find the complete Python script at the end of this article.
Prerequisites¶
First, install the required dependencies. This script is inspired by an example from LangChain, but you can use any other library for embedding and retrieval.
pip install langchain-core langchain-text-splitters langchain-openai langchain-community bs4 loguru python-dotenv elluminate
You need to set your API keys as environment variables:
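For example, in your shell (the variable names match those used in the script below; the values are placeholders):
export OPENAI_API_KEY_EMBEDDING_SMALL="sk-..."
export ELLUMINATE_API_KEY="..."
Alternatively, you can put them into a .env file, since the script calls load_dotenv().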
You also need a set of questions to evaluate the retrieval system against. You can write plausible examples yourself, collect them from production logs, or generate them with an LLM. We provide a set of questions in the full script at the end of this article.
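For illustration, here are two questions taken from the question set in the full script:
questions = [
    "What are the methods for task decomposition in the Tree of Thoughts framework?",
    "What is the role of PDDL in the LLM+P approach?",
]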
Setting Up Retrieval¶
First, we create a vector store from our document. This function loads a web page (in this case a blog post about AI agents), splits it into chunks, and creates embeddings:
def setup_retrieval(url="https://lilianweng.github.io/posts/2023-06-23-agent/"):
    """Setup retrieval by loading the blog post and splitting it into chunks."""
    loader = WebBaseLoader(
        web_paths=(url,),
        bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_=("post-content", "post-title", "post-header"))),
    )
    docs = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    all_splits = text_splitter.split_documents(docs)
    embeddings = OpenAIEmbeddings(
        api_key=os.getenv("OPENAI_API_KEY_EMBEDDING_SMALL"),
        model="text-embedding-3-small",
    )
    vector_store = InMemoryVectorStore(embeddings)
    vector_store.add_documents(documents=all_splits)
    return vector_store
elluminate expects the contexts as a list of strings, so you may need to adapt your retrieval function accordingly.
def get_contexts(question, k) -> list[str]:
    """Wrap retrieval in a helper function to return the contexts as a list of strings."""
    contexts = vector_store.similarity_search(question, k=k)
    return [c.page_content for c in contexts]
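For example, assuming vector_store = setup_retrieval() has already been run (as in the main block of the full script), a call looks like this:
contexts = get_contexts("What are the methods for task decomposition in the Tree of Thoughts framework?", k=4)
print(len(contexts))  # 4 retrieved chunks, each a plain string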
Prompt Template with Criteria¶
We create a prompt template that will later receive the search query. We also attach evaluation criteria to the prompt template in order to rate the relevance and completeness of the retrieved contexts.
As criteria we choose:
- Relevance - Does the chunk contain the information requested in the query?
- Completeness - Is the chunk sufficient to answer the query?
For the example query
- "What are the methods for task decomposition in the Tree of Thoughts framework as proposed by Yao et al. 2023?"
the following chunk is rated as both relevant and complete:
- "Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote. Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs."
This chunk, in contrast, is rated as relevant only:
- "Fig. 1. Overview of a LLM-powered autonomous agent system. Component One: Planning# A complicated task usually involves many steps. An agent needs to know what they are and plan ahead. Task Decomposition# Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to "think step by step" to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model's thinking process."
def create_prompt_template(prompt_template_name: str) -> PromptTemplate:
    """Create an empty prompt template with the search query as template variable."""
    template, created = client.get_or_create_prompt_template(
        name=prompt_template_name,
        messages="{{question}}",
    )
    if created:
        criterion_set = client.create_criterion_set(name=f"{prompt_template_name} Criteria")
        criterion_set.add_criteria(
            [
                "Does the response share significant domain overlap with the query?",
                "Does the response contain the specific information requested in the query?",
            ]
        )
        criterion_set.link_template(template)
        logger.info(f"Added criteria to prompt template {template.name}")
    return template
Creating an Experiment¶
We need to create an experiment in order to run an evaluation. Once responses are added to an experiment and rated, the rating results can be viewed both via the SDK and in the frontend.
def create_experiment(
    prompt_template: PromptTemplate,
    collection_name: str = "Retrieval Test Variables",
    experiment_name: str = "Retrieval Test Experiment",
) -> tuple[Experiment, TemplateVariablesCollectionWithEntries]:
    """Create an experiment for retrieval testing."""
    collection, _ = client.get_or_create_collection(
        name=collection_name,
        defaults={"description": "Template variables for retrieval test questions"},
    )
    experiment = client.create_experiment(
        name=experiment_name,
        prompt_template=prompt_template,
        collection=collection,
        description="Experiment for testing retrieval performance using LLM judges",
    )
    logger.info(f"Created experiment: {experiment.name}")
    return experiment, collection
Adding Contexts¶
We fill the prompt template with the query and add each retrieved context as a separate response. This way, each context can be rated independently with respect to the query. Responses are assigned to the experiment above so that the rating results are tracked.
def add_contexts_as_responses(
    experiment: Experiment,
    collection: TemplateVariablesCollectionWithEntries,
    questions_and_contexts: dict[str, list[str]],
) -> list[list[PromptResponse]]:
    """Add the contexts as responses to the prompt template."""
    all_responses = []
    for question, chunks in questions_and_contexts.items():
        template_variables = collection.add_many(variables=[{"question": question}])[0]
        responses = experiment.add_responses(
            responses=chunks,
            template_variables=[template_variables] * len(chunks),
        )
        all_responses.append(responses)
    logger.info(f"Added {sum(len(x) for x in all_responses)} responses")
    return all_responses
Rating the Contexts¶
elluminate rates each context for relevance and completeness.
def rate_contexts(experiment: Experiment, all_responses: list[list[PromptResponse]]) -> list[list[list[Rating]]]:
    """Rate the contexts for each search query."""
    experiment.rate_responses(rating_mode=RatingMode.FAST)
    all_ratings = []
    for responses in all_responses:
        response_ratings = [response.ratings for response in responses]
        all_ratings.append(response_ratings)
    logger.info(f"Rated {sum(len(x) for x in all_ratings)} responses")
    return all_ratings
Computing the Metrics¶
The hit rate is the percentage of queries with at least one useful context (in the implementation below: a context rated both relevant and complete). A higher hit rate indicates that your retrieval system is better at finding at least one useful result for each query.
MRR is a metric that measures how well the retrieval system ranks relevant results. For each query, it takes the position of the first relevant result and computes the reciprocal of that position (\(\frac{1}{position}\)). The final score is the average over all queries.
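Written out for a query set \(Q\), with \(\mathrm{rank}_i\) the position of the first useful result for query \(i\) (the summand is taken as 0 for queries without any useful result, which matches the implementation below):
\[
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i},
\qquad
\mathrm{HitRate} = \frac{\left|\{\, i \in Q : \mathrm{rank}_i \text{ exists} \,\}\right|}{|Q|}
\]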
A Simple Example¶
First query: "What are the health benefits of drinking water?" Retrieved contexts (in order):
- "Benefits of water for hydration and health..." ✓
- "Different types of beverages..."
- "Water and exercise performance..." ✓
- "Water pollution statistics..."
→ The first relevant result is at position 1
→ Reciprocal rank = \(\frac{1}{1} = 1.0\)
Second query: "What is the recommended daily water intake?" Retrieved contexts:
- "Caffeine consumption guidelines..."
- "Daily water intake recommendations..." ✓
- "Dehydration symptoms..."
- "Water quality standards..."
→ The first relevant result is at position 2
→ Reciprocal rank = \(\frac{1}{2} = 0.5\)
MRR of the two examples:
→ MRR = \(\frac{1.0 + 0.5}{2} = 0.75\)
The implementation computes both the MRR (which in our case requires a chunk to be relevant and complete) and a relevance-only MRR, as well as the hit rate and the number of failed queries.
def calculate_metrics(all_ratings: list[list[list[Rating]]]) -> tuple[float, float, float, int]:
    """Calculate MRR (mean reciprocal rank) for each search query."""
    rrs = []
    relevancy_rrs = []
    for ratings in all_ratings:
        rr = 0
        found_relevant = False
        # position is the 1-based rank of the context in the retrieval order
        for position, rating in enumerate(ratings, start=1):
            if len(rating) < 2:
                continue
            # rating[0] and rating[1] are assumed to follow the order in which the
            # criteria were added: relevance first, then completeness
            relevant = rating[0].rating
            complete = rating[1].rating
            if relevant and not found_relevant:
                relevancy_rrs.append(1 / position)
                found_relevant = True
            if relevant and complete:
                rr = 1 / position
                break
        rrs.append(rr)
    mrr = sum(rrs) / len(rrs) if rrs else 0
    relevancy_mrr = sum(relevancy_rrs) / len(relevancy_rrs) if relevancy_rrs else 0
    hit_rate = sum(1 for rr in rrs if rr > 0) / len(rrs) if rrs else 0
    failures = sum(1 for rr in rrs if rr == 0)
    logger.info(f"MRR: {mrr}, Relevancy MRR: {relevancy_mrr}, Hit rate: {hit_rate}, Failures: {failures}")
    return hit_rate, mrr, relevancy_mrr, failures
Running the Evaluation¶
The main evaluation function ties everything together:
def run_test(questions: list[str], retrieval_fn: Callable) -> tuple[float, float, float, int]:
    """Main test function."""
    queries_and_contexts_dict = {question: retrieval_fn(question) for question in questions}
    prompt_template = create_prompt_template(prompt_template_name="Agent Blog Post Retrieval")
    experiment, collection = create_experiment(prompt_template)
    queries_and_contexts = add_contexts_as_responses(experiment, collection, queries_and_contexts_dict)
    all_ratings = rate_contexts(experiment, queries_and_contexts)
    return calculate_metrics(all_ratings)
After reviewing the metrics in your terminal or inspecting individual examples in the dashboard, you can adjust your retrieval parameters to improve performance (see the sweep sketch after the snippet below).
vector_store = setup_retrieval()
client = Client(timeout=60)
k = 8
logger.info(f"Experiment kwargs: {k=}")
retrieval_fn = partial(get_contexts, k=k)
hit_rate, mrr, relevancy_mrr, failures = run_test(questions, retrieval_fn=retrieval_fn)
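If you want to compare several retrieval depths, a minimal sweep over k could look like the sketch below. Note that create_experiment() above uses a fixed experiment name, so for repeated runs you would likely want to pass a distinct experiment name per value of k (for example by threading it through run_test); the loop only illustrates the idea.
# Hypothetical sweep over the retrieval depth k (sketch)
for k in (2, 4, 8):
    retrieval_fn = partial(get_contexts, k=k)
    hit_rate, mrr, relevancy_mrr, failures = run_test(questions, retrieval_fn=retrieval_fn)
    logger.info(f"{k=}: hit_rate={hit_rate:.2f}, mrr={mrr:.2f}, relevancy_mrr={relevancy_mrr:.2f}, failures={failures}")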
Full Script¶
# Set your `OPENAI_API_KEY_EMBEDDING_SMALL` and `ELLUMINATE_API_KEY` in your environment
"""Retrieval Quality Assessment Example using LangChain (v1.0 API)
This example demonstrates how to assess RAG (Retrieval-Augmented Generation) pipelines
using Elluminate. It uses LangChain for document loading and vector search, then assesses
the retrieved contexts using LLM-as-judge with criteria for relevance and completeness.
Requires: langchain-core, langchain-text-splitters, langchain-community, langchain-openai, bs4, loguru, python-dotenv, elluminate
"""
import os
from functools import partial
from typing import Callable
import bs4
from dotenv import load_dotenv
from elluminate import Client
from elluminate.schemas import (
    Experiment,
    PromptResponse,
    PromptTemplate,
    Rating,
    RatingMode,
    TemplateVariablesCollectionWithEntries,
)
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from loguru import logger
load_dotenv(override=True)
questions = [
    "What are the key components that complement the LLM in an LLM-powered autonomous agent system?",
    "What is the difference between short-term memory and long-term memory in the context of AI models?",
    "What is the purpose of the Chain of Thought (CoT) technique?",
    "What are the methods for task decomposition in the Tree of Thoughts framework?",
    "What is the role of PDDL in the LLM+P approach?",
]
def setup_retrieval(url="https://lilianweng.github.io/posts/2023-06-23-agent/"):
    """Setup retrieval by loading the blog post and splitting it into chunks."""
    loader = WebBaseLoader(
        web_paths=(url,),
        bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_=("post-content", "post-title", "post-header"))),
    )
    docs = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    all_splits = text_splitter.split_documents(docs)
    embeddings = OpenAIEmbeddings(
        api_key=os.getenv("OPENAI_API_KEY_EMBEDDING_SMALL"),
        model="text-embedding-3-small",
    )
    vector_store = InMemoryVectorStore(embeddings)
    vector_store.add_documents(documents=all_splits)
    return vector_store
def get_contexts(question, k) -> list[str]:
    """Wrap retrieval in a helper function to return the contexts as a list of strings."""
    contexts = vector_store.similarity_search(question, k=k)
    return [c.page_content for c in contexts]
def create_prompt_template(prompt_template_name: str) -> PromptTemplate:
    """Create an empty prompt template with the search query as template variable."""
    template, created = client.get_or_create_prompt_template(
        name=prompt_template_name,
        messages="{{question}}",
    )
    if created:
        criterion_set = client.create_criterion_set(name=f"{prompt_template_name} Criteria")
        criterion_set.add_criteria(
            [
                "Does the response share significant domain overlap with the query?",
                "Does the response contain the specific information requested in the query?",
            ]
        )
        criterion_set.link_template(template)
        logger.info(f"Added criteria to prompt template {template.name}")
    return template
def create_experiment(
    prompt_template: PromptTemplate,
    collection_name: str = "Retrieval Test Variables",
    experiment_name: str = "Retrieval Test Experiment",
) -> tuple[Experiment, TemplateVariablesCollectionWithEntries]:
    """Create an experiment for retrieval testing."""
    collection, _ = client.get_or_create_collection(
        name=collection_name,
        defaults={"description": "Template variables for retrieval test questions"},
    )
    experiment = client.create_experiment(
        name=experiment_name,
        prompt_template=prompt_template,
        collection=collection,
        description="Experiment for testing retrieval performance using LLM judges",
    )
    logger.info(f"Created experiment: {experiment.name}")
    return experiment, collection
def add_contexts_as_responses(
    experiment: Experiment,
    collection: TemplateVariablesCollectionWithEntries,
    questions_and_contexts: dict[str, list[str]],
) -> list[list[PromptResponse]]:
    """Add the contexts as responses to the prompt template."""
    all_responses = []
    for question, chunks in questions_and_contexts.items():
        template_variables = collection.add_many(variables=[{"question": question}])[0]
        responses = experiment.add_responses(
            responses=chunks,
            template_variables=[template_variables] * len(chunks),
        )
        all_responses.append(responses)
    logger.info(f"Added {sum(len(x) for x in all_responses)} responses")
    return all_responses
def rate_contexts(experiment: Experiment, all_responses: list[list[PromptResponse]]) -> list[list[list[Rating]]]:
    """Rate the contexts for each search query."""
    experiment.rate_responses(rating_mode=RatingMode.FAST)
    all_ratings = []
    for responses in all_responses:
        response_ratings = [response.ratings for response in responses]
        all_ratings.append(response_ratings)
    logger.info(f"Rated {sum(len(x) for x in all_ratings)} responses")
    return all_ratings
def calculate_metrics(all_ratings: list[list[list[Rating]]]) -> tuple[float, float, float, int]:
    """Calculate MRR (mean reciprocal rank) for each search query."""
    rrs = []
    relevancy_rrs = []
    for ratings in all_ratings:
        rr = 0
        found_relevant = False
        # position is the 1-based rank of the context in the retrieval order
        for position, rating in enumerate(ratings, start=1):
            if len(rating) < 2:
                continue
            # rating[0] and rating[1] are assumed to follow the order in which the
            # criteria were added: relevance first, then completeness
            relevant = rating[0].rating
            complete = rating[1].rating
            if relevant and not found_relevant:
                relevancy_rrs.append(1 / position)
                found_relevant = True
            if relevant and complete:
                rr = 1 / position
                break
        rrs.append(rr)
    mrr = sum(rrs) / len(rrs) if rrs else 0
    relevancy_mrr = sum(relevancy_rrs) / len(relevancy_rrs) if relevancy_rrs else 0
    hit_rate = sum(1 for rr in rrs if rr > 0) / len(rrs) if rrs else 0
    failures = sum(1 for rr in rrs if rr == 0)
    logger.info(f"MRR: {mrr}, Relevancy MRR: {relevancy_mrr}, Hit rate: {hit_rate}, Failures: {failures}")
    return hit_rate, mrr, relevancy_mrr, failures
def run_test(questions: list[str], retrieval_fn: Callable) -> tuple[float, float, float, int]:
    """Main test function."""
    queries_and_contexts_dict = {question: retrieval_fn(question) for question in questions}
    prompt_template = create_prompt_template(prompt_template_name="Agent Blog Post Retrieval")
    experiment, collection = create_experiment(prompt_template)
    queries_and_contexts = add_contexts_as_responses(experiment, collection, queries_and_contexts_dict)
    all_ratings = rate_contexts(experiment, queries_and_contexts)
    return calculate_metrics(all_ratings)
if __name__ == "__main__":
    vector_store = setup_retrieval()
    client = Client(timeout=60)
    k = 8
    logger.info(f"Experiment kwargs: {k=}")
    retrieval_fn = partial(get_contexts, k=k)
    hit_rate, mrr, relevancy_mrr, failures = run_test(questions, retrieval_fn=retrieval_fn)