Criteria | Example |
---|---|
Bad | The model should classify sentiments well |
Good | Our sentiment analysis model should achieve an F1 score of at least 0.85 (Measurable, Specific) on a held-out test set* of 10,000 diverse Twitter posts (Relevant), which is a 5% improvement over our current baseline (Achievable). |
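
As a rough illustration of how the "Good" criterion could be checked, here is a minimal sketch that computes a macro-averaged F1 score and compares it with the 0.85 target. The choice of macro averaging, the function names `macro_f1` and `meets_criterion`, and the label strings are assumptions made for this example; the original criterion does not say which F1 variant is intended.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def meets_criterion(y_true, y_pred, target=0.85):
    """True when the score reaches the target (0.85 is taken from the criterion above)."""
    return macro_f1(y_true, y_pred) >= target
```

In practice `y_true` would hold the labels of the held-out test set and `y_pred` the model's outputs for the same posts, in the same order.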
Additional Considerations:
Prompt Engineering Cycle
Develop test cases → Engineer preliminary prompt response → Test prompt against cases & refine → Test against evals → Ship polished prompt
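
The "test prompt against cases" step of this cycle can be sketched as a small harness. Everything named here is hypothetical: `TEST_CASES`, `pass_rate`, and `fake_model` are illustrative, and `fake_model` is only a stand-in for a real model call so that the example runs on its own.

```python
from typing import Callable

# Each test case pairs an input with the label we expect back (made-up examples).
TEST_CASES = [
    {"text": "I love this phone", "expected": "positive"},
    {"text": "Worst service ever", "expected": "negative"},
]


def pass_rate(prompt_template: str,
              model_fn: Callable[[str], str],
              cases=TEST_CASES) -> float:
    """Fill the template for each case, call the model, and return the fraction passed."""
    passed = 0
    for case in cases:
        prompt = prompt_template.format(text=case["text"])
        output = model_fn(prompt).strip().lower()
        passed += output == case["expected"]
    return passed / len(cases)


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption): keys off a single word."""
    return "positive" if "love" in prompt.lower() else "negative"
```

Calling `pass_rate("Classify the sentiment of: {text}", fake_model)` scores one candidate prompt. Running the same function over several templates gives a simple basis for refining a prompt before moving on to the larger eval set.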
Comparing Prompts