Vellum <> LlamaIndex integration

About us

The central mission of LlamaIndex is to provide an interface between Large Language Models (LLM’s), and your private, external data. Over the past few months, it has become one of the most popular open-source frameworks for LLM data augmentation (context-augmented generation), for a variety of use cases: question-answering, summarization, structured queries, and more.

Vellum is a developer platform to build high quality LLM applications. The platform provides best-in-class tooling for prompt engineering, unit testing, regression testing, monitoring & versioning of in-production traffic and model fine tuning. Vellum’s platform helps companies save countless engineering hours to build internal tooling and instead use that time to build end user facing applications.

Why we partnered on this integration

Until recently, LlamaIndex users did not have a way to do prompt engineering and unit testing pre-production and versioning/monitoring the prompts post production. Prompt engineering and unit testing is key to ensure that your LLM feature is producing reliable results in production. Here’s an example of simple prompt that produces vastly different results between GPT-3, GPT-3.5 and GPT-4:

Screen Shot 2023-05-29 at 3.58.18 PM.png

Unit testing your prompts

Creating a unit test bank is a proactive approach to ensure prompt reliability — it’s best practice to run 50-100 test cases before putting prompts in production. The test bank should comprise scenarios & edge cases anticipated in production, think of this as QAing your feature before it goes to production. The prompts should "pass" these test cases based on your evaluation criteria. Use Vellum Test Suites to upload test cases in bulk via CSV upload.

Regression testing in production

Despite how well you test before sending a prompt in production, edge cases can appear when in production. This is expected, so no stress! Through the Vellum integration, LlamaIndex users can change prompts and get prompt versioning without making any code changes. While doing that, however, it’s best practice to run historical inputs that were sent to the prompt in production to the new prompt and confirm it doesn’t break any existing behavior. LLMs are sometimes unpredictable, even changing the word “good” to “great” in a prompt can result in differing outputs!

Untitled

Best practices to leverage the integration

How to access the integration

This demo notebook goes into detail on how you can use Vellum to manage prompts within LlamaIndex.