Reference-free Evaluation of Retrieval

Learn to evaluate retrieval system quality without ground truth annotations using LLM-based judges

This example demonstrates how to evaluate the quality of a retrieval system without knowing correct chunks beforehand. Instead of requiring manually labeled correct chunks for each query, we use an LLM-based judge to rate the relevance and completeness of retrieved chunks. This approach is particularly valuable when:

  • You don't have access to ground truth annotations
  • You want to quickly iterate on retrieval parameters
  • You want to quickly evaluate retrieval quality on new domains

Example

This is an example from the document we will use in this guide.

For a search query like

  • "What are the methods for task decomposition in the Tree of Thoughts framework as proposed by Yao et al. 2023?"

a retrieval system with top_k set to 4 might return the following chunks:

  • "Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote. Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs."
  • "Fig. 1. Overview of a LLM-powered autonomous agent system. Component One: Planning# A complicated task usually involves many steps. An agent needs to know what they are and plan ahead. Task Decomposition# Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process."
  • "The AI assistant can parse user input to several tasks: [{"task": task, "id", task_id, "dep": dependency_task_ids, "args": {"text": text, "image": URL, "audio": URL, "video": URL}}]. The "dep" field denotes the id of the previous task which generates a new resource that the current task relies on. A special tag "-task_id" refers to the generated text image, audio and video in the dependency task with id as task_id. The task MUST be selected from the following options: {{ Available Task List }}. There is a logical relationship between tasks, please note their order. If the user input can't be parsed, you need to reply empty JSON. Here are several cases for your reference: {{ Demonstrations }}. The chat history is recorded as {{ Chat History }}. From this chat history, you can find the path of the user-mentioned resources for your task planning."
  • "Another quite distinct approach, LLM+P (Liu et al. 2023), involves relying on an external classical planner to do long-horizon planning. This approach utilizes the Planning Domain Definition Language (PDDL) as an intermediate interface to describe the planning problem. In this process, LLM (1) translates the problem into “Problem PDDL”, then (2) requests a classical planner to generate a PDDL plan based on an existing “Domain PDDL”, and finally (3) translates the PDDL plan back into natural language. Essentially, the planning step is outsourced to an external tool, assuming the availability of domain-specific PDDL and a suitable planner which is common in certain robotic setups but not in many other domains. Self-Reflection# Self-reflection is a vital aspect that allows autonomous agents to improve iteratively by refining past action decisions and correcting previous mistakes. It plays a crucial role in real-world tasks where trial and error are inevitable."

Without knowing a priori which chunks are useful, we will evaluate their relevance and completeness, and score how well they are ranked relative to each other.

Overview

The script implements a reference-free evaluation pipeline that:

  1. Loads and processes a document (in this case, a blog post)
  2. Creates embeddings and builds a vector store for retrieval
  3. Performs retrieval for a set of test questions
  4. Uses an LLM to judge the quality of retrieved contexts
  5. Calculates evaluation metrics like Hit Rate and MRR (Mean Reciprocal Rank)

We will achieve this by creatively repurposing elluminate's abstractions for prompt templates, responses, and criteria. A future version will support this use case more natively.

At the end of this article, you can find the full Python script.

Prerequisites

First, install the required dependencies. This script is inspired by an example from LangChain, but you can use any other library for embedding and retrieval.

pip install langchain-core langchain-text-splitters langchain-openai langchain-community bs4 loguru python-dotenv elluminate

You'll need to set up your API keys as environment variables. The script reads the OpenAI key for the embedding model from OPENAI_API_KEY_EMBEDDING_SMALL:

export OPENAI_API_KEY_EMBEDDING_SMALL="your-openai-key"
export ELLUMINATE_API_KEY="your-elluminate-key"

You'll also need a set of questions to evaluate the retrieval system. You can come up with plausible examples, collect them from production logs, or generate them with an LLM. We provide a set of questions in the full script at the end of this article.
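
For reference, here are a few of the questions used in the full script at the end of this article:

questions = [
    "What are the key components that complement the LLM in an LLM-powered autonomous agent system?",
    "What is the purpose of the Chain of Thought (CoT) technique?",
    "What are the methods for task decomposition in the Tree of Thoughts framework?",
]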

Setting up Retrieval

First, we create a vector store from our document. This function loads a web page (a blog post on AI agents in this case), splits it into chunks, and creates embeddings:

def setup_retrieval(url="https://lilianweng.github.io/posts/2023-06-23-agent/"):
    """Setup retrieval by loading the blog post and splitting it into chunks."""
    loader = WebBaseLoader(
        web_paths=(url,),
        bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_=("post-content", "post-title", "post-header"))),
    )
    docs = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    all_splits = text_splitter.split_documents(docs)

    embeddings = OpenAIEmbeddings(
        api_key=os.getenv("OPENAI_API_KEY_EMBEDDING_SMALL"),
        model="text-embedding-3-small",
    )
    vector_store = InMemoryVectorStore(embeddings)
    vector_store.add_documents(documents=all_splits)
    return vector_store

elluminate expects the contexts to be a list of strings, so you may need to wrap your retrieval function:

def get_contexts(question, k) -> list[str]:
    """Wrap retrieval in a helper function to return the contexts as a list of strings."""
    contexts = vector_store.similarity_search(question, k=k)
    return [c.page_content for c in contexts]
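
For example, retrieving the four chunks shown at the top of this guide might look like this (a minimal sketch, assuming the vector_store returned by setup_retrieval is in scope):

vector_store = setup_retrieval()
chunks = get_contexts(
    "What are the methods for task decomposition in the Tree of Thoughts framework as proposed by Yao et al. 2023?",
    k=4,
)
# chunks is a list of 4 strings, ready to be judged by elluminate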

Prompt Template with Criteria

We create a prompt template that will hold the search query later on. We also add evaluation criteria to the prompt template for rating the relevance and completeness of the retrieved contexts.

As criteria, we choose:

  • Relevance - Does the chunk cover the information requested in the query?
  • Completeness - Is the chunk sufficient to answer the query?

For the example query

  • "What are the methods for task decomposition in the Tree of Thoughts framework as proposed by Yao et al. 2023?"

the following chunk is deemed relevant and complete:

  • "Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote. Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs."

This chunk, however, is only deemed relevant:

  • "Fig. 1. Overview of a LLM-powered autonomous agent system. Component One: Planning# A complicated task usually involves many steps. An agent needs to know what they are and plan ahead. Task Decomposition# Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process."

The following function creates the prompt template and attaches the two criteria:

def create_prompt_template(prompt_template_name: str) -> PromptTemplate:
    """Create an empty prompt template with the search query as template variable."""
    template, created = client.get_or_create_prompt_template(
        name=prompt_template_name,
        messages="{{question}}",
    )
    if created:
        criterion_set = client.create_criterion_set(name=f"{prompt_template_name} Criteria")
        criterion_set.add_criteria(
            [
                "Does the response share significant domain overlap with the query?",
                "Does the response contain the specific information requested in the query?",
            ]
        )
        criterion_set.link_template(template)
        logger.info(f"Added criteria to prompt template {template.name}")
    return template

Creating an Experiment

We need to create an experiment in order to run an evaluation. When responses are added to an experiment and then rated, the rating results can be inspected either via the SDK or in the frontend.

def create_experiment(
    prompt_template: PromptTemplate,
    collection_name: str = "Retrieval Test Variables",
    experiment_name: str = "Retrieval Test Experiment",
) -> tuple[Experiment, TemplateVariablesCollectionWithEntries]:
    """Create an experiment for retrieval testing."""
    collection, _ = client.get_or_create_collection(
        name=collection_name,
        defaults={"description": "Template variables for retrieval test questions"},
    )

    experiment = client.create_experiment(
        name=experiment_name,
        prompt_template=prompt_template,
        collection=collection,
        description="Experiment for testing retrieval performance using LLM judges",
    )

    logger.info(f"Created experiment: {experiment.name}")
    return experiment, collection

Adding Contexts

We populate the prompt template with the query and add each retrieved context as its own distinct response. This way, each context can be rated independently with respect to the query. The responses are assigned to the experiment created above to keep track of the rating results.

def add_contexts_as_responses(
    experiment: Experiment,
    collection: TemplateVariablesCollectionWithEntries,
    questions_and_contexts: dict[str, list[str]],
) -> list[list[PromptResponse]]:
    """Add the contexts as responses to the prompt template."""
    all_responses = []
    for question, chunks in questions_and_contexts.items():
        template_variables = collection.add_many(variables=[{"question": question}])[0]

        responses = experiment.add_responses(
            responses=chunks,
            template_variables=[template_variables] * len(chunks),
        )
        all_responses.append(responses)
    logger.info(f"Added {sum(len(x) for x in all_responses)} responses")
    return all_responses

Rating the Contexts

elluminate rates each context for relevance and completeness.

def rate_contexts(experiment: Experiment, all_responses: list[list[PromptResponse]]) -> list[list[list[Rating]]]:
    """Rate the contexts for each search query."""
    experiment.rate_responses(rating_mode=RatingMode.FAST)

    all_ratings = []
    for responses in all_responses:
        response_ratings = [response.ratings for response in responses]
        all_ratings.append(response_ratings)

    logger.info(f"Rated {sum(len(x) for x in all_ratings)} responses")
    return all_ratings

Calculating Metrics

Hit rate is the percentage of queries for which at least one retrieved context counts as a hit (in our implementation, a chunk rated both relevant and complete). A higher hit rate indicates that your retrieval system is better at finding at least one useful result for each query.
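
In formula form, with \(|Q|\) the total number of queries and \(|Q_{\text{hit}}|\) the number of queries with at least one hit among their retrieved chunks:

\(\text{Hit Rate} = \frac{|Q_{\text{hit}}|}{|Q|}\)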

MRR is a metric that measures how well the retrieval system ranks relevant results. For each query, it looks at the position of the first relevant result and takes the reciprocal (i.e. inverse) of that position (\(\frac{1}{position}\)). The final score is the average across all queries.
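
In formula form, where \(\text{rank}_q\) is the position of the first relevant result for query \(q\) and the term is taken as 0 when no relevant result is retrieved:

\(\text{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\text{rank}_q}\)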

A Simple Example

First query: "What are the health benefits of drinking water?"
Retrieved contexts (in order):

  1. "Benefits of water for hydration and health..."
  2. "Different types of beverages..."
  3. "Water and exercise performance..."
  4. "Water pollution statistics..."

→ First relevant result is at position 1
→ Reciprocal rank = \(\frac{1}{1} = 1.0\)

Second query: "What is the recommended daily water intake?"
Retrieved contexts:

  1. "Caffeine consumption guidelines..."
  2. "Daily water intake recommendations..."
  3. "Dehydration symptoms..."
  4. "Water quality standards..."

→ First relevant result is at position 2
→ Reciprocal rank = \(\frac{1}{2} = 0.5\)

MRR of two examples

→ MRR = \(\frac{1.0 + 0.5}{2} = 0.75\)

The implementation calculates both MRR (requiring relevance and completeness in our case) and a relevancy-only MRR, as well as the hit rate and the number of failures. It assumes that each response's ratings follow the order of the criteria defined earlier: index 0 is relevance, index 1 is completeness.

def calculate_metrics(all_ratings: list[list[list[Rating]]]) -> tuple[float, float, float, int]:
    """Calculate hit rate, MRR (mean reciprocal rank), relevancy-only MRR, and the number of failed queries."""
    rrs = []
    relevancy_rrs = []
    for ratings in all_ratings:
        rr = 0
        found_relevant = False
        # Walk the chunks in retrieval order; position is 1-based for the reciprocal rank
        for position, rating in enumerate(ratings, start=1):
            if len(rating) < 2:
                continue
            relevant = rating[0].rating
            complete = rating[1].rating

            if relevant and not found_relevant:
                relevancy_rrs.append(1 / position)
                found_relevant = True

            if relevant and complete:
                rr = 1 / position
                break
        rrs.append(rr)

    mrr = sum(rrs) / len(rrs) if rrs else 0
    relevancy_mrr = sum(relevancy_rrs) / len(relevancy_rrs) if relevancy_rrs else 0
    hit_rate = sum(1 for rr in rrs if rr > 0) / len(rrs) if rrs else 0
    failures = sum(1 for rr in rrs if rr == 0)

    logger.info(f"MRR: {mrr}, Relevancy MRR: {relevancy_mrr}, Hit rate: {hit_rate}, Failures: {failures}")
    return hit_rate, mrr, relevancy_mrr, failures
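
To sanity-check the metric logic without calling any API, you can feed in stand-in objects that expose only a .rating attribute. This is a hypothetical sketch (SimpleNamespace is not part of the elluminate schemas); it reproduces the two-query example above:

from types import SimpleNamespace

def fake_ratings(relevant: bool, complete: bool):
    # Stand-ins for one chunk's two Rating objects: [relevance, completeness]
    return [SimpleNamespace(rating=relevant), SimpleNamespace(rating=complete)]

# Query 1: relevant and complete chunk at position 1; query 2: only at position 2
toy_ratings = [
    [fake_ratings(True, True), fake_ratings(False, False)],
    [fake_ratings(False, False), fake_ratings(True, True)],
]
hit_rate, mrr, relevancy_mrr, failures = calculate_metrics(toy_ratings)
# Expected: hit_rate = 1.0, mrr = (1.0 + 0.5) / 2 = 0.75, failures = 0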

Running the Evaluation

The main evaluation function ties everything together:

def run_test(questions: list[str], retrieval_fn: Callable) -> tuple[float, float, float, int]:
    """Main test function."""
    queries_and_contexts_dict = {question: retrieval_fn(question) for question in questions}
    prompt_template = create_prompt_template(prompt_template_name="Agent Blog Post Retrieval")
    experiment, collection = create_experiment(prompt_template)
    queries_and_contexts = add_contexts_as_responses(experiment, collection, queries_and_contexts_dict)
    all_ratings = rate_contexts(experiment, queries_and_contexts)
    return calculate_metrics(all_ratings)

After looking at the metrics in your terminal or inspecting single examples in the dashboard, you can tune your retrieval parameters to improve performance.

vector_store = setup_retrieval()
client = Client(timeout=60)
k = 8
logger.info(f"Experiment kwargs: {k=}")
retrieval_fn = partial(get_contexts, k=k)
hit_rate, mrr, relevancy_mrr, failures = run_test(questions, retrieval_fn=retrieval_fn)
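
To compare several retrieval settings, you can run the evaluation once per configuration. Below is a hypothetical sketch: as written, run_test always uses the default experiment name, so in practice you would thread a distinct experiment_name (e.g. f"Retrieval Test k={k}") through to create_experiment to keep the runs apart in the dashboard.

results = {}
for k in (2, 4, 8):
    retrieval_fn = partial(get_contexts, k=k)
    results[k] = run_test(questions, retrieval_fn=retrieval_fn)

for k, (hit_rate, mrr, relevancy_mrr, failures) in results.items():
    logger.info(f"{k=}: {hit_rate=:.2f}, {mrr=:.2f}, {relevancy_mrr=:.2f}, {failures=}")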

Full Script

# Set your `OPENAI_API_KEY_EMBEDDING_SMALL` and `ELLUMINATE_API_KEY` in your environment
"""Retrieval Quality Assessment Example using LangChain (v1.0 API)

This example demonstrates how to assess RAG (Retrieval-Augmented Generation) pipelines
using Elluminate. It uses LangChain for document loading and vector search, then assesses
the retrieved contexts using LLM-as-judge with criteria for relevance and completeness.

Requires: langchain-core, langchain-text-splitters, langchain-openai, langchain-community, bs4, loguru, python-dotenv, elluminate
"""

import os
from functools import partial
from typing import Callable

import bs4
from dotenv import load_dotenv
from elluminate import Client
from elluminate.schemas import (
    Experiment,
    PromptResponse,
    PromptTemplate,
    Rating,
    RatingMode,
    TemplateVariablesCollectionWithEntries,
)
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from loguru import logger

load_dotenv(override=True)

questions = [
    "What are the key components that complement the LLM in an LLM-powered autonomous agent system?",
    "What is the difference between short-term memory and long-term memory in the context of AI models?",
    "What is the purpose of the Chain of Thought (CoT) technique?",
    "What are the methods for task decomposition in the Tree of Thoughts framework?",
    "What is the role of PDDL in the LLM+P approach?",
]


def setup_retrieval(url="https://lilianweng.github.io/posts/2023-06-23-agent/"):
    """Setup retrieval by loading the blog post and splitting it into chunks."""
    loader = WebBaseLoader(
        web_paths=(url,),
        bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_=("post-content", "post-title", "post-header"))),
    )
    docs = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    all_splits = text_splitter.split_documents(docs)

    embeddings = OpenAIEmbeddings(
        api_key=os.getenv("OPENAI_API_KEY_EMBEDDING_SMALL"),
        model="text-embedding-3-small",
    )
    vector_store = InMemoryVectorStore(embeddings)
    vector_store.add_documents(documents=all_splits)
    return vector_store




def get_contexts(question, k) -> list[str]:
    """Wrap retrieval in a helper function to return the contexts as a list of strings."""
    contexts = vector_store.similarity_search(question, k=k)
    return [c.page_content for c in contexts]




def create_prompt_template(prompt_template_name: str) -> PromptTemplate:
    """Create an empty prompt template with the search query as template variable."""
    template, created = client.get_or_create_prompt_template(
        name=prompt_template_name,
        messages="{{question}}",
    )
    if created:
        criterion_set = client.create_criterion_set(name=f"{prompt_template_name} Criteria")
        criterion_set.add_criteria(
            [
                "Does the response share significant domain overlap with the query?",
                "Does the response contain the specific information requested in the query?",
            ]
        )
        criterion_set.link_template(template)
        logger.info(f"Added criteria to prompt template {template.name}")
    return template




def create_experiment(
    prompt_template: PromptTemplate,
    collection_name: str = "Retrieval Test Variables",
    experiment_name: str = "Retrieval Test Experiment",
) -> tuple[Experiment, TemplateVariablesCollectionWithEntries]:
    """Create an experiment for retrieval testing."""
    collection, _ = client.get_or_create_collection(
        name=collection_name,
        defaults={"description": "Template variables for retrieval test questions"},
    )

    experiment = client.create_experiment(
        name=experiment_name,
        prompt_template=prompt_template,
        collection=collection,
        description="Experiment for testing retrieval performance using LLM judges",
    )

    logger.info(f"Created experiment: {experiment.name}")
    return experiment, collection




def add_contexts_as_responses(
    experiment: Experiment,
    collection: TemplateVariablesCollectionWithEntries,
    questions_and_contexts: dict[str, list[str]],
) -> list[list[PromptResponse]]:
    """Add the contexts as responses to the prompt template."""
    all_responses = []
    for question, chunks in questions_and_contexts.items():
        template_variables = collection.add_many(variables=[{"question": question}])[0]

        responses = experiment.add_responses(
            responses=chunks,
            template_variables=[template_variables] * len(chunks),
        )
        all_responses.append(responses)
    logger.info(f"Added {sum(len(x) for x in all_responses)} responses")
    return all_responses




def rate_contexts(experiment: Experiment, all_responses: list[list[PromptResponse]]) -> list[list[list[Rating]]]:
    """Rate the contexts for each search query."""
    experiment.rate_responses(rating_mode=RatingMode.FAST)

    all_ratings = []
    for responses in all_responses:
        response_ratings = [response.ratings for response in responses]
        all_ratings.append(response_ratings)

    logger.info(f"Rated {sum(len(x) for x in all_ratings)} responses")
    return all_ratings




def calculate_metrics(all_ratings: list[list[list[Rating]]]) -> tuple[float, float, float, int]:
    """Calculate hit rate, MRR (mean reciprocal rank), relevancy-only MRR, and the number of failed queries."""
    rrs = []
    relevancy_rrs = []
    for ratings in all_ratings:
        rr = 0
        found_relevant = False
        # Walk the chunks in retrieval order; position is 1-based for the reciprocal rank
        for position, rating in enumerate(ratings, start=1):
            if len(rating) < 2:
                continue
            relevant = rating[0].rating
            complete = rating[1].rating

            if relevant and not found_relevant:
                relevancy_rrs.append(1 / position)
                found_relevant = True

            if relevant and complete:
                rr = 1 / position
                break
        rrs.append(rr)

    mrr = sum(rrs) / len(rrs) if rrs else 0
    relevancy_mrr = sum(relevancy_rrs) / len(relevancy_rrs) if relevancy_rrs else 0
    hit_rate = sum(1 for rr in rrs if rr > 0) / len(rrs) if rrs else 0
    failures = sum(1 for rr in rrs if rr == 0)

    logger.info(f"MRR: {mrr}, Relevancy MRR: {relevancy_mrr}, Hit rate: {hit_rate}, Failures: {failures}")
    return hit_rate, mrr, relevancy_mrr, failures




def run_test(questions: list[str], retrieval_fn: Callable) -> tuple[float, float, float, int]:
    """Main test function."""
    queries_and_contexts_dict = {question: retrieval_fn(question) for question in questions}
    prompt_template = create_prompt_template(prompt_template_name="Agent Blog Post Retrieval")
    experiment, collection = create_experiment(prompt_template)
    queries_and_contexts = add_contexts_as_responses(experiment, collection, queries_and_contexts_dict)
    all_ratings = rate_contexts(experiment, queries_and_contexts)
    return calculate_metrics(all_ratings)




if __name__ == "__main__":
    vector_store = setup_retrieval()
    client = Client(timeout=60)
    k = 8
    logger.info(f"Experiment kwargs: {k=}")
    retrieval_fn = partial(get_contexts, k=k)
    hit_rate, mrr, relevancy_mrr, failures = run_test(questions, retrieval_fn=retrieval_fn)