Criteria
Bad: The model should classify sentiments well.
Good: Our sentiment analysis model should achieve an F1 score of at least 0.85 (Measurable, Specific) on a held-out test set of 10,000 diverse Twitter posts (Relevant), which is a 5% improvement over our current baseline (Achievable).
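The F1 criterion above can be checked mechanically. Below is a minimal sketch in plain Python; the labels, predictions, and helper names (`f1_score`, `truth`, `preds`) are illustrative stand-ins, not part of any particular library, and you would substitute your real held-out test set.

```python
def f1_score(y_true, y_pred, positive="pos"):
    """Compute F1 for the positive class from parallel label lists."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy held-out labels vs. model predictions (illustrative only).
truth = ["pos", "pos", "neg", "neg", "pos"]
preds = ["pos", "neg", "neg", "neg", "pos"]

score = f1_score(truth, preds)
meets_criterion = score >= 0.85  # the 0.85 target from the criterion above
```

In practice you would also compare `score` against the baseline model's F1 on the same test set to confirm the 5% improvement.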

Additional Considerations:

  1. How well does the model need to perform on the task?
  2. Consistency
  3. Tone/Style
  4. Privacy Preservation
  5. Context Utilization
  6. Response Latency
  7. Price
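Considerations like response latency are easy to measure directly. Here is a hedged sketch: `call_model` is a hypothetical placeholder for your actual API client, and the sleep merely simulates a network round trip.

```python
import statistics
import time

def call_model(prompt):
    """Hypothetical stand-in for a real model API call."""
    time.sleep(0.01)  # simulate request latency
    return "positive"

def measure_latency(prompt, runs=5):
    """Return the median wall-clock latency in seconds over several runs."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call_model(prompt)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

median_latency = measure_latency("Classify: 'Great product!'")
```

Median is usually a better latency summary than mean, since a single slow outlier won't dominate it.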

Prompt Engineering Cycle

Develop test cases → Engineer preliminary prompt → Test prompt against cases & refine → Test against evals → Ship polished prompt
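The test-and-refine step of this cycle can be sketched as a simple loop over test cases. Everything here is illustrative: `run_prompt` is a hypothetical placeholder for your model call, and the heuristic inside it only exists so the sketch runs.

```python
# A handful of labeled test cases, developed before prompt engineering begins.
test_cases = [
    {"input": "I love this!", "expected": "positive"},
    {"input": "Terrible service.", "expected": "negative"},
]

def run_prompt(prompt_template, text):
    """Hypothetical placeholder: a real version would call the model."""
    return "positive" if "love" in text else "negative"

def pass_rate(prompt_template, cases):
    """Fraction of test cases where the output matches the expectation."""
    hits = sum(
        run_prompt(prompt_template, c["input"]) == c["expected"]
        for c in cases
    )
    return hits / len(cases)

rate = pass_rate("Classify the sentiment of: {text}", test_cases)
# Refine the prompt and re-run until the rate meets your bar, then
# confirm against a held-out eval set before shipping.
```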

Comparing Prompts

  1. Side-by-side comparison: Compare the outputs of two or more prompts to quickly see the impact of your changes.
  2. Quality grading: Grade response quality on a 5-point scale to track improvement across prompt versions.
  3. Prompt versioning: Create new versions of your prompt and re-run the test suite to quickly iterate and improve results.
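A side-by-side comparison with 5-point grading can be sketched as follows. The `grade` heuristic below is a trivial illustrative stand-in; in practice the grading would come from human raters or a model-based grader, and the prompt-version outputs shown are made up.

```python
def grade(response):
    """Assign a 1-5 quality score (illustrative heuristic, not a real rubric)."""
    score = 1
    if response:                          # non-empty output
        score += 1
    if len(response.split()) >= 3:        # more than a bare label
        score += 1
    if response.endswith("."):            # complete sentence
        score += 1
    if "sentiment" in response.lower():   # names the task explicitly
        score += 1
    return score

# Hypothetical outputs from two versions of the same prompt.
versions = {
    "v1": "Positive",
    "v2": "The sentiment of this post is positive.",
}

grades = {name: grade(resp) for name, resp in versions.items()}
```

Laying the per-version grades side by side like this makes it easy to see, at a glance, whether a prompt revision actually moved quality in the right direction.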