Evaluation Framework
Evaluation Criteria
Evaluation Set
Data Generation & Benchmarks
LLM Response Sample Set
Results
| Trait | GPT-4o | Claude 3.5 Sonnet | Gemini 1.5 Flash | Inflection-2.5 | Grok-2 |
| --- | --- | --- | --- | --- | --- |
| Formal | 12 | 6 | 12 | 11 | 16 |
| Informal | 12 | 8 | 10 | 15 | 6 |
| Negative | 3 | 1 | 2 | 1 | 3 |
| Positive | 12 | 14 | 10 | 10 | 6 |
| Confident | 8 | 4 | 9 | 3 | 9 |
| Tentative | 11 | 14 | 11 | 9 | 13 |
| LowEmpathy | 2 | 1 | 4 | 4 | 1 |
| HighEmpathy | 11 | 15 | 14 | 9 | 10 |
| Inquisitive | 7 | 17 | 6 | 5 | 6 |
| Declarative | 8 | 2 | 7 | 3 | 4 |
| Complex | 14 | 13 | 13 | 6 | 16 |
| Simple | 5 | 6 | 7 | 9 | 4 |
| Persuasive | 6 | 2 | 7 | 4 | 5 |
| Suggestive | 12 | 13 | 14 | 4 | 8 |
Key Observations
<aside>
🗒️
Key observations from the trait distribution across models:
</aside>
- Most Distinctive Model Patterns:
    - GPT-4o: Most balanced distribution across most trait pairs
    - Claude 3.5 Sonnet: Sharpest contrasts within pairs: Inquisitive (17) vs Declarative (2), Confident (4) vs Tentative (14), Persuasive (2) vs Suggestive (13), and LowEmpathy (1) vs HighEmpathy (15)
    - Gemini 1.5 Flash: Strong in Suggestive (14) and HighEmpathy (14)
    - Inflection-2.5: Highest in Informal (15) and lowest in Complex (6) among all models
    - Grok-2: Strongly favors Formal (16) and Complex (16), while being low in Informal (6)
- Common Patterns Across All Models:
    - HighEmpathy consistently outweighs LowEmpathy by a large margin
    - Negative is rare (1-3 occurrences) across all models
    - Complex tends to appear more frequently than Simple; Inflection-2.5 is the only exception (6 vs 9)
- Contrasting Pairs:
    - Formal vs Informal: Models vary in their balance, with Grok-2 leaning Formal (16 vs 6) and Inflection-2.5 leaning Informal (11 Formal vs 15 Informal)
    - Complex vs Simple: Four of the five models lean Complex, with particularly strong skews in Grok-2 (16 vs 4) and GPT-4o (14 vs 5); Inflection-2.5 leans Simple (6 vs 9)
    - Inquisitive vs Declarative: Claude 3.5 Sonnet stands out with high Inquisitive (17) and low Declarative (2)
- Most Balanced Model:
    - GPT-4o shows the most even distribution across trait pairs:
        - Formal/Informal (12/12)
        - Confident/Tentative (8/11)
        - Inquisitive/Declarative (7/8)
- Strongest Individual Trait Showings:
    - Highest: Claude 3.5 Sonnet's Inquisitive (17)
    - Tied for second: Grok-2's Formal and Complex (16 each)
    - Tied for third: Inflection-2.5's Informal and Claude 3.5 Sonnet's HighEmpathy (15 each)
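
The pair contrasts and top counts cited above can be rechecked directly from the results table. The following is a minimal sketch, assuming the table is transcribed by hand into a dictionary; the names `MODELS`, `COUNTS`, `PAIRS`, `pair_gaps`, and `top_cells` are illustrative and not part of the original analysis.

```python
# Trait counts transcribed from the results table.
# Each list follows the column order in MODELS.
MODELS = ["GPT-4o", "Claude 3.5 Sonnet", "Gemini 1.5 Flash", "Inflection-2.5", "Grok-2"]

COUNTS = {
    "Formal":      [12, 6, 12, 11, 16],
    "Informal":    [12, 8, 10, 15, 6],
    "Negative":    [3, 1, 2, 1, 3],
    "Positive":    [12, 14, 10, 10, 6],
    "Confident":   [8, 4, 9, 3, 9],
    "Tentative":   [11, 14, 11, 9, 13],
    "LowEmpathy":  [2, 1, 4, 4, 1],
    "HighEmpathy": [11, 15, 14, 9, 10],
    "Inquisitive": [7, 17, 6, 5, 6],
    "Declarative": [8, 2, 7, 3, 4],
    "Complex":     [14, 13, 13, 6, 16],
    "Simple":      [5, 6, 7, 9, 4],
    "Persuasive":  [6, 2, 7, 4, 5],
    "Suggestive":  [12, 13, 14, 4, 8],
}

# Contrasting trait pairs, in the order they appear in the table.
PAIRS = [
    ("Formal", "Informal"),
    ("Negative", "Positive"),
    ("Confident", "Tentative"),
    ("LowEmpathy", "HighEmpathy"),
    ("Inquisitive", "Declarative"),
    ("Complex", "Simple"),
    ("Persuasive", "Suggestive"),
]


def pair_gaps(model):
    """Return first-minus-second count for each trait pair of one model."""
    i = MODELS.index(model)
    return {f"{a}/{b}": COUNTS[a][i] - COUNTS[b][i] for a, b in PAIRS}


def top_cells(n):
    """Return the n largest (count, model, trait) cells, highest count first."""
    cells = [
        (count, model, trait)
        for trait, row in COUNTS.items()
        for model, count in zip(MODELS, row)
    ]
    return sorted(cells, key=lambda cell: -cell[0])[:n]


if __name__ == "__main__":
    for model in MODELS:
        print(model, pair_gaps(model))
    for count, model, trait in top_cells(5):
        print(count, model, trait)
```

A positive gap means the first trait of the pair dominates, a negative gap the second. Raw gaps are not normalized by how often a pair was tagged overall, so they are best read alongside the underlying counts.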
Tonal Trait Deconstruction
<aside>
🗒️
This section deconstructs each tonal trait using specific examples from the sample set, along with the types of responses likely to induce each trait.
</aside>