Criteria
Bad: The model should classify sentiments well.
Good: Our sentiment analysis model should achieve an F1 score of at least 0.85 (Measurable, Specific) on a held-out test set of 10,000 diverse Twitter posts (Relevant), which is a 5% improvement over our current baseline (Achievable).
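The F1 criterion above can be checked mechanically. Below is a minimal sketch in plain Python; the labels, predictions, and helper names (`f1_score`, `truth`, `preds`) are illustrative stand-ins, not part of any particular library, and you would substitute your real held-out test set.

```python
def f1_score(y_true, y_pred, positive="pos"):
    """Compute F1 for the positive class from parallel label lists."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy held-out labels vs. model predictions (illustrative only).
truth = ["pos", "pos", "neg", "neg", "pos"]
preds = ["pos", "neg", "neg", "neg", "pos"]

score = f1_score(truth, preds)
meets_criterion = score >= 0.85  # the 0.85 target from the criterion above
```

In practice you would also compare `score` against the baseline model's F1 on the same test set to confirm the 5% improvement.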

Additional Considerations:

  1. How well does the model need to perform on the task?
  2. Consistency
  3. Tone/Style
  4. Privacy Preservation
  5. Context Utilization
  6. Response Latency
  7. Price
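Considerations like response latency are easy to measure directly. Here is a hedged sketch: `call_model` is a hypothetical placeholder for your actual API client, and the sleep merely simulates a network round trip.

```python
import statistics
import time

def call_model(prompt):
    """Hypothetical stand-in for a real model API call."""
    time.sleep(0.01)  # simulate request latency
    return "positive"

def measure_latency(prompt, runs=5):
    """Return the median wall-clock latency in seconds over several runs."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call_model(prompt)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

median_latency = measure_latency("Classify: 'Great product!'")
```

Median is usually a better latency summary than mean, since a single slow outlier won't dominate it.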

Prompt Engineering Cycle

Develop test cases → Engineer preliminary prompt → Test prompt against cases & refine → Test against evals → Ship polished prompt
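The test-and-refine step of this cycle can be sketched as a simple loop over test cases. Everything here is illustrative: `run_prompt` is a hypothetical placeholder for your model call, and the heuristic inside it only exists so the sketch runs.

```python
# A handful of labeled test cases, developed before prompt engineering begins.
test_cases = [
    {"input": "I love this!", "expected": "positive"},
    {"input": "Terrible service.", "expected": "negative"},
]

def run_prompt(prompt_template, text):
    """Hypothetical placeholder: a real version would call the model."""
    return "positive" if "love" in text else "negative"

def pass_rate(prompt_template, cases):
    """Fraction of test cases where the output matches the expectation."""
    hits = sum(
        run_prompt(prompt_template, c["input"]) == c["expected"]
        for c in cases
    )
    return hits / len(cases)

rate = pass_rate("Classify the sentiment of: {text}", test_cases)
# Refine the prompt and re-run until the rate meets your bar, then
# confirm against a held-out eval set before shipping.
```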

Comparing Prompts

  1. Side-by-side comparison: Compare the outputs of two or more prompts to quickly see the impact of your changes.
  2. Quality grading: Grade response quality on a 5-point scale to track improvement across prompt versions.
  3. Prompt versioning: Create new versions of your prompt and re-run the test suite to quickly iterate and improve results.
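A side-by-side comparison with 5-point grading can be sketched as follows. The `grade` heuristic below is a trivial illustrative stand-in; in practice the grading would come from human raters or a model-based grader, and the prompt-version outputs shown are made up.

```python
def grade(response):
    """Assign a 1-5 quality score (illustrative heuristic, not a real rubric)."""
    score = 1
    if response:                          # non-empty output
        score += 1
    if len(response.split()) >= 3:        # more than a bare label
        score += 1
    if response.endswith("."):            # complete sentence
        score += 1
    if "sentiment" in response.lower():   # names the task explicitly
        score += 1
    return score

# Hypothetical outputs from two versions of the same prompt.
versions = {
    "v1": "Positive",
    "v2": "The sentiment of this post is positive.",
}

grades = {name: grade(resp) for name, resp in versions.items()}
```

Laying the per-version grades side by side like this makes it easy to see, at a glance, whether a prompt revision actually moved quality in the right direction.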