Welcome to our comprehensive guide on AI evaluation methods!

As artificial intelligence becomes increasingly integrated into business processes, choosing the right evaluation approach is crucial for ensuring your AI solutions perform reliably and effectively. This guide will walk you through various evaluation strategies, helping you select the most appropriate methods based on your specific use case.

Whether you're working on classification systems, text generation, or complex AI agents, we'll cover everything you need to know - from evaluation methodologies and key metrics to recommended tools and practical implementation tips.

Our goal is to help you build more robust and reliable AI systems through proper testing and validation.

</aside>

<aside>

🎯 Use Case: Classification or Routing

Examples: Tagging emails, routing queries to the right department, assigning categories
Goal: Output is deterministic and must match an exact value
Best Eval Type: Ground Truth Comparison
How to Run:
1. Create a dataset of test cases. You can ask ChatGPT or Claude to create it for you in CSV format.
2. Add a column “Ideal_output” and manually enter the exact output expected for each test case.
3. Run your prompt on your dataset and compare each result with the ideal output. Use a dedicated tool like Basalt (free tier) or write a simple Python script to do this.
Tools: Any eval engine (e.g. Basalt’s free tool or a custom Python script), plus your favorite AI to generate the dataset.

[Image: CleanShot 2025-05-31 at 06.55.34.gif]

In this example we used Basalt to run our evaluations. You can create an account for free and run up to 1k test cases per month.

Discover Basalt, your command center for LLM quality:

🛠 Trace model behavior across runs

✅ Test & Evaluate LLM outputs with structured grading and real-world data.

🚀 Run evals at scale (ground truth, LLM-as-a-judge, regression, and more)

📊 Monitor changes, regressions, and unexpected behaviors

💡 Collaborate with your team for better AI-driven products.

Ship faster, with fewer surprises—and keep your AI performance loop running on autopilot.

</aside>

N.B.: This can also be done with a very simple Python script that you run on your dataset in a Jupyter Notebook.
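As a rough illustration of such a script, here is a minimal sketch of a ground truth comparison. The `input` column name, the sample rows, and `classify_stub` are assumptions made for this example; only the “Ideal_output” column comes from the steps above. In practice you would replace `classify_stub` with the call that runs your real prompt.

```python
import csv
import io


def normalize(label: str) -> str:
    """Trim whitespace and lowercase so trivial formatting differences do not count as failures."""
    return label.strip().lower()


def run_ground_truth_eval(rows, classify):
    """Compare classify(row["input"]) with row["Ideal_output"] for every row.

    Returns (accuracy, failures), where failures lists the mismatched cases.
    """
    failures = []
    total = 0
    for row in rows:
        total += 1
        predicted = classify(row["input"])
        if normalize(predicted) != normalize(row["Ideal_output"]):
            failures.append(
                {"input": row["input"], "expected": row["Ideal_output"], "got": predicted}
            )
    accuracy = (total - len(failures)) / total if total else 0.0
    return accuracy, failures


def classify_stub(text: str) -> str:
    """Hypothetical stand-in for your real prompt call."""
    return "billing" if "invoice" in text.lower() else "support"


# Inline sample data; in a notebook you would open your own CSV file instead.
SAMPLE_CSV = """input,Ideal_output
Where is my invoice for March?,billing
The app crashes on login,support
I was charged twice,billing
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
accuracy, failures = run_ground_truth_eval(rows, classify_stub)
print(f"accuracy: {accuracy:.0%}")
for failure in failures:
    print(failure)
```

Printing the failures next to the accuracy makes it easier to see which kinds of inputs the prompt gets wrong, not just how often.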

</aside>

<aside>

💬 Use Case: Text Generation (Replies, Emails, Summaries)

Examples: Customer support replies, internal summaries, cold outreach
Goal: Output should sound human, be on-topic, respect brand voice and avoid hallucinations.
Best Eval Type: LLM-as-a-Judge
How to Run:
1. Create a dataset of test cases, including both common cases and edge cases. Ask your favorite AI to generate the dataset for you. Aim for at least 50 test cases.
2. Run your prompt on the full dataset and collect outputs.
3. Ask a separate LLM to rate outputs on defined criteria (e.g. tone, clarity, helpfulness).
4. Pro mode: give each LLM judge examples of correct and incorrect outputs, so that its ratings align more closely with your own judgment.
Metrics: pass/fail for each evaluator, giving you an overall score for your prompt
Tools: Basalt free tier, OpenAI or Claude to generate dataset
Tips:
- Keep eval questions specific (e.g. "Is the tone friendly and professional?") and add examples of what a friendly and professional tone looks like.
- Run multiple AI evaluators, each with a single purpose, instead of one big AI evaluator.
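The steps and tips above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the prompt wording, the `evaluators` dictionary layout, and `fake_judge` are all hypothetical, and `fake_judge` is a keyword-based stand-in so the example runs without any API key. In a real setup, `call_judge` would send the prompt to the separate LLM you use as a judge.

```python
from typing import Callable, Dict, List


def build_judge_prompt(criterion: str, good_example: str, bad_example: str, output: str) -> str:
    """Assemble a single-criterion judge prompt with one passing and one failing example."""
    return (
        f"Criterion: {criterion}\n"
        f"Example that passes:\n{good_example}\n"
        f"Example that fails:\n{bad_example}\n"
        f"Output to evaluate:\n{output}\n"
        "Answer with exactly one word: PASS or FAIL."
    )


def parse_verdict(reply: str) -> bool:
    """Treat a reply starting with PASS as a pass; anything else counts as a fail."""
    return reply.strip().upper().startswith("PASS")


def run_judges(
    outputs: List[str],
    evaluators: Dict[str, Dict[str, str]],
    call_judge: Callable[[str], str],
) -> Dict[str, float]:
    """Run every evaluator on every output and return the pass rate per evaluator."""
    rates = {}
    for name, spec in evaluators.items():
        passed = 0
        for out in outputs:
            prompt = build_judge_prompt(spec["criterion"], spec["good"], spec["bad"], out)
            passed += parse_verdict(call_judge(prompt))
        rates[name] = passed / len(outputs) if outputs else 0.0
    return rates


def fake_judge(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call: passes outputs that contain 'thank'."""
    evaluated = prompt.split("Output to evaluate:\n", 1)[1].rsplit("\nAnswer with", 1)[0]
    return "PASS" if "thank" in evaluated.lower() else "FAIL"


# One narrow evaluator per criterion, each with its own examples.
evaluators = {
    "tone": {
        "criterion": "Is the tone friendly and professional?",
        "good": "Thanks for reaching out! Happy to help.",
        "bad": "Read the docs.",
    },
}
outputs = [
    "Thank you for your patience, your refund is on its way.",
    "Not our problem.",
]
rates = run_judges(outputs, evaluators, fake_judge)
print(rates)
```

Keeping one entry per criterion in `evaluators` mirrors the tip about running several focused evaluators, and it gives you a separate pass rate for each one.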

[Image: CleanShot 2025-05-31 at 07.33.37@2x.png]

In this example (document summarization) we used Basalt to create one LLM-as-a-judge evaluator for hallucinations, another for professional tone, and a custom script for output length.
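A custom script evaluator for output length can be very small. The sketch below is an assumption about what such a check might look like, not the script used in the example; the 120-word budget and the sample summaries are made up for illustration.

```python
def length_evaluator(output: str, max_words: int = 120) -> bool:
    """Pass when the output stays within the word budget (max_words is a hypothetical default)."""
    return len(output.split()) <= max_words


summaries = [
    "The report covers Q1 revenue and highlights two delayed projects.",
    "word " * 200,  # deliberately too long
]
results = [length_evaluator(s) for s in summaries]
pass_rate = sum(results) / len(results)
print(results, pass_rate)
```

Deterministic checks like this one are cheaper and more repeatable than an LLM judge, so they are a reasonable choice whenever the criterion can be measured directly.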

You can create an account for free and run up to 1k test cases per month.

</aside>

<aside>

🤖 Use Case: Agents & Multi-Step Behavior (advanced level)

Examples: Prompts chained together, agents calling tools, multistep workflows
Goal: The agent should reason well and follow correct sequences
Best Eval Types:
- Trace Evaluation (step-by-step)
- LLM-as-a-judge for steps that include text generation
- Custom script evaluators for steps that include API calls
How to Run:
- Log every step of the agent's process. Monitoring and eval tools typically offer an API or an SDK that your engineering team can easily integrate.
- Score each step based on correctness, tool usage, memory handling, etc.
Metrics: Step correctness %, number of retries, loop detection
Tools: Basalt SDK
Tips:
- Visualize the full trace to identify where things went wrong
- Evaluate intent and execution separately
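To make the metrics above concrete, here is a minimal sketch that computes step correctness, retry count, and a simple loop check from a logged trace. The `Step` structure, its field names, and the "same call three times in a row" loop rule are assumptions for illustration only; they are not the Basalt SDK's data model, and your logging tool will have its own trace format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    name: str               # the tool called or the prompt run at this step
    args: str               # serialized arguments, used for loop detection
    correct: bool           # verdict from a per-step evaluator (script or LLM judge)
    is_retry: bool = False  # True when this step repeats a failed attempt


def step_correctness(trace: List[Step]) -> float:
    """Fraction of steps that the per-step evaluators marked as correct."""
    return sum(s.correct for s in trace) / len(trace) if trace else 0.0


def retry_count(trace: List[Step]) -> int:
    """Number of steps flagged as retries."""
    return sum(s.is_retry for s in trace)


def has_loop(trace: List[Step], max_repeats: int = 3) -> bool:
    """Flag a loop when the same (name, args) call occurs max_repeats or more times in a row."""
    streak = 1
    for prev, cur in zip(trace, trace[1:]):
        same_call = (prev.name, prev.args) == (cur.name, cur.args)
        streak = streak + 1 if same_call else 1
        if streak >= max_repeats:
            return True
    return False


# Hypothetical trace: the agent needs three attempts at the same lookup before replying.
trace = [
    Step("search_orders", "id=42", correct=False),
    Step("search_orders", "id=42", correct=False, is_retry=True),
    Step("search_orders", "id=42", correct=True, is_retry=True),
    Step("draft_reply", "", correct=True),
]
print(f"step correctness: {step_correctness(trace):.0%}")
print(f"retries: {retry_count(trace)}")
print(f"loop detected: {has_loop(trace)}")
```

Scoring each step separately is what lets you tell a wrong decision (intent) apart from a correct decision that was carried out badly (execution).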


</aside>

<aside>

</aside>

⚡ What is Basalt?

</aside>

Free Resources to launch AI features in 2025 🚀

</aside>