Evaluation Framework

Evaluation Criteria

Evaluation Set

Data Generation & Benchmarks

LLM Response Sample Set

Results

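| Trait | GPT-4o | Claude 3.5 Sonnet | Gemini 1.5 Flash | Inflection-2.5 | Grok-2 |
|---|---|---|---|---|---|
| Formal | 12 | 6 | 12 | 11 | 16 |
| Informal | 12 | 8 | 10 | 15 | 6 |
| Negative | 3 | 1 | 2 | 1 | 3 |
| Positive | 12 | 14 | 10 | 10 | 6 |
| Confident | 8 | 4 | 9 | 3 | 9 |
| Tentative | 11 | 14 | 11 | 9 | 13 |
| LowEmpathy | 2 | 1 | 4 | 4 | 1 |
| HighEmpathy | 11 | 15 | 14 | 9 | 10 |
| Inquisitive | 7 | 17 | 6 | 5 | 6 |
| Declarative | 8 | 2 | 7 | 3 | 4 |
| Complex | 14 | 13 | 13 | 6 | 16 |
| Simple | 5 | 6 | 7 | 9 | 4 |
| Persuasive | 6 | 2 | 7 | 4 | 5 |
| Suggestive | 12 | 13 | 14 | 4 | 8 |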

Key Observations

<aside> 🗒️

Key observations from the trait distribution across models:

</aside>

  1. Most Distinctive Model Patterns:
    1. GPT-4o: Most balanced distribution across most trait pairs
    2. Claude 3.5 Sonnet: The most strongly contrasting pairs of any model, namely Inquisitive (17) vs Declarative (2), Confident (4) vs Tentative (14), Persuasive (2) vs Suggestive (13), and LowEmpathy (1) vs HighEmpathy (15)
    3. Gemini 1.5 Flash: Highest Suggestive count of any model (14) and strong HighEmpathy (14)
    4. Inflection-2.5: Highest in Informal (15), lowest in Complex (6) among all models
    5. Grok-2: Strongly favors Formal (16) and Complex (16), while being low in Informal (6)
  2. Common Patterns Across All Models:
    1. HighEmpathy consistently outweighs LowEmpathy by a large margin
    2. The Negative trait is rare (1-3 occurrences) in every model
    3. Complex appears more frequently than Simple in four of the five models; Inflection-2.5 is the exception (6 vs 9)
  3. Contrasting Pairs:
    1. Formal vs Informal: Models vary in their balance, with Grok-2 being more Formal (16 vs 6) while Inflection-2.5 is more Informal (11 vs 15)
    2. Complex vs Simple: Four of the five models lean Complex, with particularly strong skews in Grok-2 (16 vs 4) and GPT-4o (14 vs 5); Inflection-2.5 alone leans Simple (6 vs 9)
    3. Inquisitive vs Declarative: Claude 3.5 Sonnet stands out with high Inquisitive (17) and low Declarative (2)
  4. Most Balanced Model:
    1. GPT-4o shows near-even splits on several trait pairs, although it still skews toward Complex (14 vs 5):
      1. Formal/Informal (12/12)
      2. Confident/Tentative (8/11)
      3. Inquisitive/Declarative (7/8)
  5. Strongest Individual Trait Showings:
    1. Highest: Claude 3.5 Sonnet's Inquisitive (17)
    2. Tied for second: Grok-2's Formal and Complex (16 each)
    3. Tied for third: Inflection-2.5's Informal and Claude 3.5 Sonnet's HighEmpathy (15 each)
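
The comparisons above are simple arithmetic over the Results table, so they can be rechecked mechanically. The following is a minimal sketch in Python, assuming the counts are transcribed by hand from that table. The names `COUNTS`, `PAIRS`, `pair_gaps`, `models_favoring`, and `top_cells` are illustrative and are not part of the original evaluation framework.

```python
# Counts transcribed from the Results table; list order follows MODELS.
MODELS = ["GPT-4o", "Claude 3.5 Sonnet", "Gemini 1.5 Flash",
          "Inflection-2.5", "Grok-2"]

COUNTS = {
    "Formal":      [12, 6, 12, 11, 16],
    "Informal":    [12, 8, 10, 15, 6],
    "Negative":    [3, 1, 2, 1, 3],
    "Positive":    [12, 14, 10, 10, 6],
    "Confident":   [8, 4, 9, 3, 9],
    "Tentative":   [11, 14, 11, 9, 13],
    "LowEmpathy":  [2, 1, 4, 4, 1],
    "HighEmpathy": [11, 15, 14, 9, 10],
    "Inquisitive": [7, 17, 6, 5, 6],
    "Declarative": [8, 2, 7, 3, 4],
    "Complex":     [14, 13, 13, 6, 16],
    "Simple":      [5, 6, 7, 9, 4],
    "Persuasive":  [6, 2, 7, 4, 5],
    "Suggestive":  [12, 13, 14, 4, 8],
}

# Each tuple holds the two opposing poles of one tonal dimension.
PAIRS = [
    ("Formal", "Informal"),
    ("Negative", "Positive"),
    ("Confident", "Tentative"),
    ("LowEmpathy", "HighEmpathy"),
    ("Inquisitive", "Declarative"),
    ("Complex", "Simple"),
    ("Persuasive", "Suggestive"),
]


def pair_gaps(model):
    """Return {(a, b): count_a - count_b} for every trait pair of one model."""
    i = MODELS.index(model)
    return {(a, b): COUNTS[a][i] - COUNTS[b][i] for a, b in PAIRS}


def models_favoring(trait_a, trait_b):
    """List the models where trait_a strictly outnumbers trait_b."""
    return [m for i, m in enumerate(MODELS)
            if COUNTS[trait_a][i] > COUNTS[trait_b][i]]


def top_cells(n):
    """Return the n largest (count, model, trait) cells, largest first."""
    cells = [(COUNTS[t][i], m, t)
             for t in COUNTS for i, m in enumerate(MODELS)]
    # sorted() is stable, so ties keep table order.
    return sorted(cells, key=lambda c: -c[0])[:n]


if __name__ == "__main__":
    print(pair_gaps("Claude 3.5 Sonnet"))
    print(models_favoring("Simple", "Complex"))
    for count, model, trait in top_cells(5):
        print(f"{count:>3}  {model}: {trait}")
```

Running the script prints the per-pair gaps for one model, the models where Simple outnumbers Complex, and the five largest cells in the table. Signed gaps are used rather than ratios so that a pair such as 12/12 reads as exactly zero.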

Tonal Trait Deconstruction

<aside> 🗒️

This section deconstructs each tonal trait through specific examples from the sample set and identifies the types of responses likely to induce it.

</aside>