Reference-free Evaluation of Retrieval

Learn to evaluate retrieval system quality without ground truth annotations using LLM-based judges

This example demonstrates how to evaluate the quality of a retrieval system without knowing correct chunks beforehand. Instead of requiring manually labeled correct chunks for each query, we use an LLM-based judge to rate the relevance and completeness of retrieved chunks. This approach is particularly valuable when:

  • You don't have access to ground truth annotations
  • You want to quickly iterate on retrieval parameters
  • You want to quickly evaluate retrieval quality on new domains

Example

This is an example from the document we will use in this guide.

For a search query like

  • "What are the methods for task decomposition in the Tree of Thoughts framework as proposed by Yao et al. 2023?"

a retrieval system with top_k set to 4 might return the following chunks:

  • "Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote. Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs."
  • "Fig. 1. Overview of a LLM-powered autonomous agent system. Component One: Planning# A complicated task usually involves many steps. An agent needs to know what they are and plan ahead. Task Decomposition# Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process."
  • "The AI assistant can parse user input to several tasks: [{"task": task, "id", task_id, "dep": dependency_task_ids, "args": {"text": text, "image": URL, "audio": URL, "video": URL}}]. The "dep" field denotes the id of the previous task which generates a new resource that the current task relies on. A special tag "-task_id" refers to the generated text image, audio and video in the dependency task with id as task_id. The task MUST be selected from the following options: {{ Available Task List }}. There is a logical relationship between tasks, please note their order. If the user input can't be parsed, you need to reply empty JSON. Here are several cases for your reference: {{ Demonstrations }}. The chat history is recorded as {{ Chat History }}. From this chat history, you can find the path of the user-mentioned resources for your task planning."
  • "Another quite distinct approach, LLM+P (Liu et al. 2023), involves relying on an external classical planner to do long-horizon planning. This approach utilizes the Planning Domain Definition Language (PDDL) as an intermediate interface to describe the planning problem. In this process, LLM (1) translates the problem into “Problem PDDL”, then (2) requests a classical planner to generate a PDDL plan based on an existing “Domain PDDL”, and finally (3) translates the PDDL plan back into natural language. Essentially, the planning step is outsourced to an external tool, assuming the availability of domain-specific PDDL and a suitable planner which is common in certain robotic setups but not in many other domains. Self-Reflection# Self-reflection is a vital aspect that allows autonomous agents to improve iteratively by refining past action decisions and correcting previous mistakes. It plays a crucial role in real-world tasks where trial and error are inevitable."

Without knowing a priori which chunks are useful, we will evaluate their relevance and completeness, and score how well they are ranked relative to each other.

Overview

The script implements a reference-free evaluation pipeline that:

  1. Loads and processes a document (in this case, a blog post)
  2. Creates embeddings and builds a vector store for retrieval
  3. Performs retrieval for a set of test questions
  4. Uses an LLM to judge the quality of retrieved contexts
  5. Calculates evaluation metrics like Hit Rate and MRR (Mean Reciprocal Rank)

We will achieve this by creatively repurposing elluminate's abstractions for prompt templates, responses, and criteria. A future version will support this use case more natively.

At the end of this article, you can find the full Python script.

Prerequisites

First, install the required dependencies. This script is inspired by an example from LangChain, but you can use any other library for embedding and retrieval.

pip install langchain-core langchain-text-splitters langchain-openai langchain-community bs4 loguru python-dotenv elluminate

You'll need to set up your API keys as environment variables. The script reads the OpenAI key for the embedding model from OPENAI_API_KEY_EMBEDDING_SMALL:

export OPENAI_API_KEY_EMBEDDING_SMALL="your-openai-key"
export ELLUMINATE_API_KEY="your-elluminate-key"

You'll also need a set of questions to evaluate the retrieval system. You can come up with plausible examples, collect them from production logs, or generate them with an LLM. We provide a set of questions in the full script at the end of this article.
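
For reference, here are a few of the questions used in the full script at the end of this article:

questions = [
    "What are the key components that complement the LLM in an LLM-powered autonomous agent system?",
    "What is the purpose of the Chain of Thought (CoT) technique?",
    "What are the methods for task decomposition in the Tree of Thoughts framework?",
]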

Setting up Retrieval

First, we create a vector store from our document. This function loads a web page (a blog post on AI agents in this case), splits it into chunks, and creates embeddings:

def setup_retrieval(url="https://lilianweng.github.io/posts/2023-06-23-agent/"):
    """Setup retrieval by loading the blog post and splitting it into chunks."""
    loader = WebBaseLoader(
        web_paths=(url,),
        bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_=("post-content", "post-title", "post-header"))),
    )
    docs = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    all_splits = text_splitter.split_documents(docs)

    embeddings = OpenAIEmbeddings(
        api_key=os.getenv("OPENAI_API_KEY_EMBEDDING_SMALL"),
        model="text-embedding-3-small",
    )
    vector_store = InMemoryVectorStore(embeddings)
    vector_store.add_documents(documents=all_splits)
    return vector_store

elluminate expects the contexts to be a list of strings, so you may need to wrap your retrieval function:

def get_contexts(question, k) -> list[str]:
    """Wrap retrieval in a helper function to return the contexts as a list of strings."""
    contexts = vector_store.similarity_search(question, k=k)
    return [c.page_content for c in contexts]
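
For example, retrieving the four chunks shown at the top of this guide might look like this (a minimal sketch, assuming the vector_store returned by setup_retrieval is in scope):

vector_store = setup_retrieval()
chunks = get_contexts(
    "What are the methods for task decomposition in the Tree of Thoughts framework as proposed by Yao et al. 2023?",
    k=4,
)
# chunks is a list of 4 strings, ready to be judged by elluminate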

Prompt Template with Criteria

We create a prompt template that will hold the search query later on. We also add evaluation criteria to the prompt template for rating the relevance and completeness of the retrieved contexts.

As criteria, we choose:

  • Relevance - Does the chunk cover the information requested in the query?
  • Completeness - Is the chunk sufficient to answer the query?

For the example query

  • "What are the methods for task decomposition in the Tree of Thoughts framework as proposed by Yao et al. 2023?"

the following chunk is deemed relevant and complete:

  • "Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote. Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs."

This chunk, however, is only deemed relevant:

  • "Fig. 1. Overview of a LLM-powered autonomous agent system. Component One: Planning# A complicated task usually involves many steps. An agent needs to know what they are and plan ahead. Task Decomposition# Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process."

The following function creates the prompt template and attaches the two criteria:

def create_prompt_template(prompt_template_name: str) -> PromptTemplate:
    """Create an empty prompt template with the search query as template variable."""
    template, created = client.get_or_create_prompt_template(
        name=prompt_template_name,
        messages="{{question}}",
    )
    if created:
        criterion_set = client.create_criterion_set(name=f"{prompt_template_name} Criteria")
        criterion_set.add_criteria(
            [
                "Does the response share significant domain overlap with the query?",
                "Does the response contain the specific information requested in the query?",
            ]
        )
        criterion_set.link_template(template)
        logger.info(f"Added criteria to prompt template {template.name}")
    return template

Creating an Experiment

We need to create an experiment in order to run an evaluation. When responses are added to an experiment and then rated, the rating results can be inspected either via the SDK or in the frontend.

def create_experiment(
    prompt_template: PromptTemplate,
    collection_name: str = "Retrieval Test Variables",
    experiment_name: str = "Retrieval Test Experiment",
) -> tuple[Experiment, TemplateVariablesCollectionWithEntries]:
    """Create an experiment for retrieval testing."""
    collection, _ = client.get_or_create_collection(
        name=collection_name,
        defaults={"description": "Template variables for retrieval test questions"},
    )

    experiment = client.create_experiment(
        name=experiment_name,
        prompt_template=prompt_template,
        collection=collection,
        description="Experiment for testing retrieval performance using LLM judges",
    )

    logger.info(f"Created experiment: {experiment.name}")
    return experiment, collection

Adding Contexts

We populate the prompt template with the query and add each retrieved context as its own distinct response. This way, each context can be rated independently with respect to the query. The responses are assigned to the experiment created above to keep track of the rating results.

def add_contexts_as_responses(
    experiment: Experiment,
    collection: TemplateVariablesCollectionWithEntries,
    questions_and_contexts: dict[str, list[str]],
) -> list[list[PromptResponse]]:
    """Add the contexts as responses to the prompt template."""
    all_responses = []
    for question, chunks in questions_and_contexts.items():
        template_variables = collection.add_many(variables=[{"question": question}])[0]

        responses = experiment.add_responses(
            responses=chunks,
            template_variables=[template_variables] * len(chunks),
        )
        all_responses.append(responses)
    logger.info(f"Added {sum(len(x) for x in all_responses)} responses")
    return all_responses

Rating the Contexts

elluminate rates each context for relevance and completeness.

def rate_contexts(experiment: Experiment, all_responses: list[list[PromptResponse]]) -> list[list[list[Rating]]]:
    """Rate the contexts for each search query."""
    experiment.rate_responses(rating_mode=RatingMode.FAST)

    all_ratings = []
    for responses in all_responses:
        response_ratings = [response.ratings for response in responses]
        all_ratings.append(response_ratings)

    logger.info(f"Rated {sum(len(x) for x in all_ratings)} responses")
    return all_ratings

Calculating Metrics

Hit rate is the percentage of queries for which at least one retrieved context counts as a hit (in our implementation, a chunk rated both relevant and complete). A higher hit rate indicates that your retrieval system is better at finding at least one useful result for each query.
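
In formula form, with \(|Q|\) the total number of queries and \(|Q_{\text{hit}}|\) the number of queries with at least one hit among their retrieved chunks:

\(\text{Hit Rate} = \frac{|Q_{\text{hit}}|}{|Q|}\)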

MRR is a metric that measures how well the retrieval system ranks relevant results. For each query, it looks at the position of the first relevant result and takes the reciprocal (i.e. inverse) of that position (\(\frac{1}{position}\)). The final score is the average across all queries.
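
In formula form, where \(\text{rank}_q\) is the position of the first relevant result for query \(q\) and the term is taken as 0 when no relevant result is retrieved:

\(\text{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\text{rank}_q}\)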

A Simple Example

First query: "What are the health benefits of drinking water?"
Retrieved contexts (in order):

  1. "Benefits of water for hydration and health..."
  2. "Different types of beverages..."
  3. "Water and exercise performance..."
  4. "Water pollution statistics..."

→ First relevant result is at position 1
→ Reciprocal rank = \(\frac{1}{1} = 1.0\)

Second query: "What is the recommended daily water intake?"
Retrieved contexts:

  1. "Caffeine consumption guidelines..."
  2. "Daily water intake recommendations..."
  3. "Dehydration symptoms..."
  4. "Water quality standards..."

→ First relevant result is at position 2
→ Reciprocal rank = \(\frac{1}{2} = 0.5\)

MRR of two examples

→ MRR = \(\frac{1.0 + 0.5}{2} = 0.75\)

The implementation calculates both MRR (requiring relevance and completeness in our case) and a relevancy-only MRR, as well as the hit rate and the number of failures. It assumes that each response's ratings follow the order of the criteria defined earlier: index 0 is relevance, index 1 is completeness.

def calculate_metrics(all_ratings: list[list[list[Rating]]]) -> tuple[float, float, float, int]:
    """Calculate hit rate, MRR (mean reciprocal rank), relevancy-only MRR, and the number of failed queries."""
    rrs = []
    relevancy_rrs = []
    for ratings in all_ratings:
        rr = 0
        found_relevant = False
        # Walk the chunks in retrieval order; position is 1-based for the reciprocal rank
        for position, rating in enumerate(ratings, start=1):
            if len(rating) < 2:
                continue
            relevant = rating[0].rating
            complete = rating[1].rating

            if relevant and not found_relevant:
                relevancy_rrs.append(1 / position)
                found_relevant = True

            if relevant and complete:
                rr = 1 / position
                break
        rrs.append(rr)

    mrr = sum(rrs) / len(rrs) if rrs else 0
    relevancy_mrr = sum(relevancy_rrs) / len(relevancy_rrs) if relevancy_rrs else 0
    hit_rate = sum(1 for rr in rrs if rr > 0) / len(rrs) if rrs else 0
    failures = sum(1 for rr in rrs if rr == 0)

    logger.info(f"MRR: {mrr}, Relevancy MRR: {relevancy_mrr}, Hit rate: {hit_rate}, Failures: {failures}")
    return hit_rate, mrr, relevancy_mrr, failures
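
To sanity-check the metric logic without calling any API, you can feed in stand-in objects that expose only a .rating attribute. This is a hypothetical sketch (SimpleNamespace is not part of the elluminate schemas); it reproduces the two-query example above:

from types import SimpleNamespace

def fake_ratings(relevant: bool, complete: bool):
    # Stand-ins for one chunk's two Rating objects: [relevance, completeness]
    return [SimpleNamespace(rating=relevant), SimpleNamespace(rating=complete)]

# Query 1: relevant and complete chunk at position 1; query 2: only at position 2
toy_ratings = [
    [fake_ratings(True, True), fake_ratings(False, False)],
    [fake_ratings(False, False), fake_ratings(True, True)],
]
hit_rate, mrr, relevancy_mrr, failures = calculate_metrics(toy_ratings)
# Expected: hit_rate = 1.0, mrr = (1.0 + 0.5) / 2 = 0.75, failures = 0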

Running the Evaluation

The main evaluation function ties everything together:

def run_test(questions: list[str], retrieval_fn: Callable) -> tuple[float, float, float, int]:
    """Main test function."""
    queries_and_contexts_dict = {question: retrieval_fn(question) for question in questions}
    prompt_template = create_prompt_template(prompt_template_name="Agent Blog Post Retrieval")
    experiment, collection = create_experiment(prompt_template)
    queries_and_contexts = add_contexts_as_responses(experiment, collection, queries_and_contexts_dict)
    all_ratings = rate_contexts(experiment, queries_and_contexts)
    return calculate_metrics(all_ratings)

After looking at the metrics in your terminal or inspecting single examples in the dashboard, you can tune your retrieval parameters to improve performance.

vector_store = setup_retrieval()
client = Client(timeout=60)
k = 8
logger.info(f"Experiment kwargs: {k=}")
retrieval_fn = partial(get_contexts, k=k)
hit_rate, mrr, relevancy_mrr, failures = run_test(questions, retrieval_fn=retrieval_fn)
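
To compare several retrieval settings, you can run the evaluation once per configuration. Below is a hypothetical sketch: as written, run_test always uses the default experiment name, so in practice you would thread a distinct experiment_name (e.g. f"Retrieval Test k={k}") through to create_experiment to keep the runs apart in the dashboard.

results = {}
for k in (2, 4, 8):
    retrieval_fn = partial(get_contexts, k=k)
    results[k] = run_test(questions, retrieval_fn=retrieval_fn)

for k, (hit_rate, mrr, relevancy_mrr, failures) in results.items():
    logger.info(f"{k=}: {hit_rate=:.2f}, {mrr=:.2f}, {relevancy_mrr=:.2f}, {failures=}")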

Full Script

# Set your `OPENAI_API_KEY_EMBEDDING_SMALL` and `ELLUMINATE_API_KEY` in your environment
"""Retrieval Quality Assessment Example using LangChain (v1.0 API)

This example demonstrates how to assess RAG (Retrieval-Augmented Generation) pipelines
using Elluminate. It uses LangChain for document loading and vector search, then assesses
the retrieved contexts using LLM-as-judge with criteria for relevance and completeness.

Requires: langchain-core, langchain-text-splitters, langchain-openai, langchain-community, bs4, loguru, python-dotenv, elluminate
"""

import os
from functools import partial
from typing import Callable

import bs4
from dotenv import load_dotenv
from elluminate import Client
from elluminate.schemas import (
    Experiment,
    PromptResponse,
    PromptTemplate,
    Rating,
    RatingMode,
    TemplateVariablesCollectionWithEntries,
)
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from loguru import logger

load_dotenv(override=True)

questions = [
    "What are the key components that complement the LLM in an LLM-powered autonomous agent system?",
    "What is the difference between short-term memory and long-term memory in the context of AI models?",
    "What is the purpose of the Chain of Thought (CoT) technique?",
    "What are the methods for task decomposition in the Tree of Thoughts framework?",
    "What is the role of PDDL in the LLM+P approach?",
]


def setup_retrieval(url="https://lilianweng.github.io/posts/2023-06-23-agent/"):
    """Setup retrieval by loading the blog post and splitting it into chunks."""
    loader = WebBaseLoader(
        web_paths=(url,),
        bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_=("post-content", "post-title", "post-header"))),
    )
    docs = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    all_splits = text_splitter.split_documents(docs)

    embeddings = OpenAIEmbeddings(
        api_key=os.getenv("OPENAI_API_KEY_EMBEDDING_SMALL"),
        model="text-embedding-3-small",
    )
    vector_store = InMemoryVectorStore(embeddings)
    vector_store.add_documents(documents=all_splits)
    return vector_store




def get_contexts(question, k) -> list[str]:
    """Wrap retrieval in a helper function to return the contexts as a list of strings."""
    contexts = vector_store.similarity_search(question, k=k)
    return [c.page_content for c in contexts]




def create_prompt_template(prompt_template_name: str) -> PromptTemplate:
    """Create an empty prompt template with the search query as template variable."""
    template, created = client.get_or_create_prompt_template(
        name=prompt_template_name,
        messages="{{question}}",
    )
    if created:
        criterion_set = client.create_criterion_set(name=f"{prompt_template_name} Criteria")
        criterion_set.add_criteria(
            [
                "Does the response share significant domain overlap with the query?",
                "Does the response contain the specific information requested in the query?",
            ]
        )
        criterion_set.link_template(template)
        logger.info(f"Added criteria to prompt template {template.name}")
    return template




def create_experiment(
    prompt_template: PromptTemplate,
    collection_name: str = "Retrieval Test Variables",
    experiment_name: str = "Retrieval Test Experiment",
) -> tuple[Experiment, TemplateVariablesCollectionWithEntries]:
    """Create an experiment for retrieval testing."""
    collection, _ = client.get_or_create_collection(
        name=collection_name,
        defaults={"description": "Template variables for retrieval test questions"},
    )

    experiment = client.create_experiment(
        name=experiment_name,
        prompt_template=prompt_template,
        collection=collection,
        description="Experiment for testing retrieval performance using LLM judges",
    )

    logger.info(f"Created experiment: {experiment.name}")
    return experiment, collection




def add_contexts_as_responses(
    experiment: Experiment,
    collection: TemplateVariablesCollectionWithEntries,
    questions_and_contexts: dict[str, list[str]],
) -> list[list[PromptResponse]]:
    """Add the contexts as responses to the prompt template."""
    all_responses = []
    for question, chunks in questions_and_contexts.items():
        template_variables = collection.add_many(variables=[{"question": question}])[0]

        responses = experiment.add_responses(
            responses=chunks,
            template_variables=[template_variables] * len(chunks),
        )
        all_responses.append(responses)
    logger.info(f"Added {sum(len(x) for x in all_responses)} responses")
    return all_responses




def rate_contexts(experiment: Experiment, all_responses: list[list[PromptResponse]]) -> list[list[list[Rating]]]:
    """Rate the contexts for each search query."""
    experiment.rate_responses(rating_mode=RatingMode.FAST)

    all_ratings = []
    for responses in all_responses:
        response_ratings = [response.ratings for response in responses]
        all_ratings.append(response_ratings)

    logger.info(f"Rated {sum(len(x) for x in all_ratings)} responses")
    return all_ratings




def calculate_metrics(all_ratings: list[list[list[Rating]]]) -> tuple[float, float, float, int]:
    """Calculate hit rate, MRR (mean reciprocal rank), relevancy-only MRR, and the number of failed queries."""
    rrs = []
    relevancy_rrs = []
    for ratings in all_ratings:
        rr = 0
        found_relevant = False
        # Walk the chunks in retrieval order; position is 1-based for the reciprocal rank
        for position, rating in enumerate(ratings, start=1):
            if len(rating) < 2:
                continue
            relevant = rating[0].rating
            complete = rating[1].rating

            if relevant and not found_relevant:
                relevancy_rrs.append(1 / position)
                found_relevant = True

            if relevant and complete:
                rr = 1 / position
                break
        rrs.append(rr)

    mrr = sum(rrs) / len(rrs) if rrs else 0
    relevancy_mrr = sum(relevancy_rrs) / len(relevancy_rrs) if relevancy_rrs else 0
    hit_rate = sum(1 for rr in rrs if rr > 0) / len(rrs) if rrs else 0
    failures = sum(1 for rr in rrs if rr == 0)

    logger.info(f"MRR: {mrr}, Relevancy MRR: {relevancy_mrr}, Hit rate: {hit_rate}, Failures: {failures}")
    return hit_rate, mrr, relevancy_mrr, failures




def run_test(questions: list[str], retrieval_fn: Callable) -> tuple[float, float, float, int]:
    """Main test function."""
    queries_and_contexts_dict = {question: retrieval_fn(question) for question in questions}
    prompt_template = create_prompt_template(prompt_template_name="Agent Blog Post Retrieval")
    experiment, collection = create_experiment(prompt_template)
    queries_and_contexts = add_contexts_as_responses(experiment, collection, queries_and_contexts_dict)
    all_ratings = rate_contexts(experiment, queries_and_contexts)
    return calculate_metrics(all_ratings)




if __name__ == "__main__":
    vector_store = setup_retrieval()
    client = Client(timeout=60)
    k = 8
    logger.info(f"Experiment kwargs: {k=}")
    retrieval_fn = partial(get_contexts, k=k)
    hit_rate, mrr, relevancy_mrr, failures = run_test(questions, retrieval_fn=retrieval_fn)