Skip to content

Test Set Generation

Elluminate provides functionality to generate synthetic test data based on existing template variables. This allows you to automatically expand your test sets with semantically similar examples.

Usage in the SDK

The following example demonstrates how to generate synthetic test data:

from dotenv import load_dotenv
from elluminate import Client

load_dotenv(override=True)

client = Client()

prompt_template, _ = client.prompt_templates.get_or_create(
    "List the most impactful Nobel Prize winners from {{university}} in {{state}} "
    "and their breakthrough discoveries.",
    name="University Nobel Laureates (Generated)",
)  # (1)!

collection, _ = client.collections.get_or_create(
    name="Top Universities - Generated", description="A collection of prestigious US universities (Generated)"
)  # (2)!

client.template_variables.add_to_collection(
    {"university": "MIT", "state": "Massachusetts"}, collection=collection
)  # (3)!

generated_values = []
for _ in range(2):  # (4)!
    generated_values.append(client.template_variables.generate(collection, prompt_template))

for template_variables in generated_values:
    print(template_variables.input_values)

# {'state': 'Alaska', 'university': 'University of Alaska Fairbanks'}
# {"state": "Tennessee", "university": "Vanderbilt University"}
1. Create a prompt template that defines the structure of your prompts. The template variables will be used to generate the test data.

2. Create a collection to store your template variables and add the base examples.

3. Define your base template variables that will serve as examples for test data generation.

4. Generate test data using `generate_entry`. The generated data will maintain the same structure as your base data while providing semantic variations.

Usage in the Frontend

Open the Template Variable Collection and click on the small Magic Generate Button (with a star) at the bottom. You may click multiple times to generate more examples.

Testset Generation

Best Practices

  1. Quality Base Data: Start with high-quality, representative example data for better synthetic generations.
  2. Validation: Always review generated synthetic data before using them in production.
  3. Diversity: Include diverse base examples to get more varied synthetic data.
  4. Iterative Refinement: Use generated examples to identify potential edge cases and improve your prompt templates.