LLM Tone Analysis

Evaluation Framework

Data Generation & Benchmarks

Results

output (1).png

Trait	GPT-4o	Claude 3.5 Sonnet	Gemini 1.5 Flash	Inflection-2.5	Grok-2
Formal	12	6	12	11	16
Informal	12	8	10	15	6
Negative	3	1	2	1	3
Positive	12	14	10	10	6
Confident	8	4	9	3	9
Tentative	11	14	11	9	13
LowEmpathy	2	1	4	4	1
HighEmpathy	11	15	14	9	10
Inquisitive	7	17	6	5	6
Declarative	8	2	7	3	4
Complex	14	13	13	6	16
Simple	5	6	7	9	4
Persuasive	6	2	7	4	5
Suggestive	12	13	14	4	8

output (3).png

output (4).png

Key Observations

<aside> 🗒️

Key observations from the trait distribution across models:

</aside>

Most Distinctive Model Patterns:
1. GPT-4o: Among the most balanced, with near-even counts on Formal/Informal (12/12), Inquisitive/Declarative (7/8), and Confident/Tentative (8/11)
2. Claude 3.5 Sonnet: Most sharply contrasting pairs: Inquisitive (17) vs Declarative (2), Confident (4) vs Tentative (14), Persuasive (2) vs Suggestive (13), and LowEmpathy (1) vs HighEmpathy (15)
3. Gemini 1.5 Flash: Highest Suggestive count of any model (14) and strong HighEmpathy (14)
4. Inflection-2.5: Highest in Informal (15), lowest in Complex (6) among all models
5. Grok-2: Strongly favors Formal (16) and Complex (16), while being low in Informal (6)
Common Patterns Across All Models:
1. HighEmpathy outweighs LowEmpathy in every model by more than 2 to 1 (narrowest gap: Inflection-2.5, 9 vs 4)
2. Negative traits are rare (1-3 occurrences) across all models
3. Complex appears more often than Simple in four of the five models; Inflection-2.5 is the exception (6 vs 9)
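
The three patterns above can be checked directly against the counts in the Results table. The following is a minimal sketch, not part of the original analysis; the names `MODELS`, `COUNTS`, and `models_where_first_leads` are illustrative, and only the rows needed for these checks are copied in.

```python
# Illustrative names; counts copied from the Results table (column order = MODELS).
MODELS = ["GPT-4o", "Claude 3.5 Sonnet", "Gemini 1.5 Flash", "Inflection-2.5", "Grok-2"]
COUNTS = {
    "Negative":    [3, 1, 2, 1, 3],
    "LowEmpathy":  [2, 1, 4, 4, 1],
    "HighEmpathy": [11, 15, 14, 9, 10],
    "Complex":     [14, 13, 13, 6, 16],
    "Simple":      [5, 6, 7, 9, 4],
}


def models_where_first_leads(first, second):
    """Return the models whose count for `first` is strictly greater than for `second`."""
    return [m for m, a, b in zip(MODELS, COUNTS[first], COUNTS[second]) if a > b]


# Pattern 1: HighEmpathy vs LowEmpathy, plus the ratio per model.
empathy_leaders = models_where_first_leads("HighEmpathy", "LowEmpathy")
empathy_ratios = {
    m: high / low
    for m, high, low in zip(MODELS, COUNTS["HighEmpathy"], COUNTS["LowEmpathy"])
}

# Pattern 2: range of Negative counts across models.
negative_range = (min(COUNTS["Negative"]), max(COUNTS["Negative"]))

# Pattern 3: Complex vs Simple.
complex_leaders = models_where_first_leads("Complex", "Simple")

print("HighEmpathy > LowEmpathy:", empathy_leaders)
print("Smallest empathy ratio:", min(empathy_ratios.values()))
print("Negative range:", negative_range)
print("Complex > Simple:", complex_leaders)
```

Run on the table as given, the empathy and Negative checks hold for all five models, while the Complex check lists every model except Inflection-2.5.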
Contrasting Pairs:
1. Formal vs Informal: Models vary in their balance, with Grok-2 being more Formal (16 vs 6) while Inflection-2.5 is more Informal (11 vs 15)
2. Complex vs Simple: Every model except Inflection-2.5 (6 vs 9) leans Complex, with particularly strong skews in Grok-2 (16 vs 4) and GPT-4o (14 vs 5)
3. Inquisitive vs Declarative: Claude 3.5 Sonnet stands out with high Inquisitive (17) and low Declarative (2)
Most Balanced Model:
1. GPT-4o shows near-even counts on several trait pairs:
  1. Formal/Informal (12/12)
  2. Inquisitive/Declarative (7/8)
  3. Confident/Tentative (8/11)
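
"Balance" is not defined numerically in these notes, so the sketch below picks one possible measure as an assumption: the sum of absolute gaps across all seven trait pairs, where a lower total means a more even split. The names `PAIRS`, `total_gap`, and `gaps` are illustrative. The result is sensitive to the metric: under this one, Inflection-2.5 (29) and Gemini 1.5 Flash (36) score slightly lower total gaps than GPT-4o (37), although Inflection-2.5 also has the lowest raw counts overall.

```python
# Illustrative names; counts copied from the Results table (column order = MODELS).
MODELS = ["GPT-4o", "Claude 3.5 Sonnet", "Gemini 1.5 Flash", "Inflection-2.5", "Grok-2"]

# Each entry maps a trait pair to (first-trait row, second-trait row).
PAIRS = {
    "Formal/Informal":         ([12, 6, 12, 11, 16], [12, 8, 10, 15, 6]),
    "Negative/Positive":       ([3, 1, 2, 1, 3],     [12, 14, 10, 10, 6]),
    "Confident/Tentative":     ([8, 4, 9, 3, 9],     [11, 14, 11, 9, 13]),
    "LowEmpathy/HighEmpathy":  ([2, 1, 4, 4, 1],     [11, 15, 14, 9, 10]),
    "Inquisitive/Declarative": ([7, 17, 6, 5, 6],    [8, 2, 7, 3, 4]),
    "Complex/Simple":          ([14, 13, 13, 6, 16], [5, 6, 7, 9, 4]),
    "Persuasive/Suggestive":   ([6, 2, 7, 4, 5],     [12, 13, 14, 4, 8]),
}


def total_gap(model_index):
    """Sum of |first - second| over all trait pairs for one model (assumed metric)."""
    return sum(abs(a[model_index] - b[model_index]) for a, b in PAIRS.values())


gaps = {model: total_gap(i) for i, model in enumerate(MODELS)}

# Print models from smallest to largest total gap.
for model, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{model}: {gap}")
```

A normalized variant, such as dividing each gap by the pair's combined count, would weight low-count pairs differently and may reorder the models again.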
Strongest Individual Trait Showings:
1. Highest: Claude's Inquisitive (17)
2. Tied for second: Grok-2's Formal and Complex (16 each)
3. Tied for third: Inflection-2.5's Informal and Claude's HighEmpathy (15 each)
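
The ranking above can be reproduced by flattening the Results table into (count, model, trait) cells and sorting by count. This is a small sketch with illustrative names (`TABLE`, `cells`, `top`); it is mainly useful for spotting ties, since more than one cell can share the same count.

```python
# Illustrative names; counts copied from the Results table (column order = MODELS).
MODELS = ["GPT-4o", "Claude 3.5 Sonnet", "Gemini 1.5 Flash", "Inflection-2.5", "Grok-2"]
TABLE = {
    "Formal":      [12, 6, 12, 11, 16],
    "Informal":    [12, 8, 10, 15, 6],
    "Negative":    [3, 1, 2, 1, 3],
    "Positive":    [12, 14, 10, 10, 6],
    "Confident":   [8, 4, 9, 3, 9],
    "Tentative":   [11, 14, 11, 9, 13],
    "LowEmpathy":  [2, 1, 4, 4, 1],
    "HighEmpathy": [11, 15, 14, 9, 10],
    "Inquisitive": [7, 17, 6, 5, 6],
    "Declarative": [8, 2, 7, 3, 4],
    "Complex":     [14, 13, 13, 6, 16],
    "Simple":      [5, 6, 7, 9, 4],
    "Persuasive":  [6, 2, 7, 4, 5],
    "Suggestive":  [12, 13, 14, 4, 8],
}

# Flatten to one (count, model, trait) tuple per table cell.
cells = [
    (count, model, trait)
    for trait, row in TABLE.items()
    for model, count in zip(MODELS, row)
]

# Sort by count, highest first; the sort is stable, so ties keep table order.
cells.sort(key=lambda cell: cell[0], reverse=True)

top = cells[:5]
for count, model, trait in top:
    print(f"{count:>2}  {model}: {trait}")
```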

Tonal Trait Deconstruction

<aside> 🗒️

This section deconstructs each tonal trait using specific examples from the sample set, along with the types of responses likely to induce each trait.

</aside>