
LLM Politeness Study

Hostile and effusive tones boost LLM creativity.

I ran 625 API calls across 5 frontier models to find out what makes LLMs produce better creative work. The answer surprised me: it's not about being rude or polite; it's about emotional intensity.

Explore the Interactive Dashboard →

🔬 Key Findings

1. The "Politeness Tax" (GPT-5.2 Only)

GPT-5.2 is the only model that measurably punishes rudeness:

  • Hostile prompts → 53 words average, Effort score 3.0
  • Polite prompts → 162 words average, Effort score 3.8

Other models (Claude, Gemini, Kimi, DeepSeek) maintain consistent effort regardless of tone.

2. The "Creativity Paradox"

Counter-intuitively, extreme tones produce better creative writing — both hostile AND effusive:

| Metric      | Hostile | Polite | Effusive |
|-------------|---------|--------|----------|
| Imagery     | 4.58    | 4.18   | 4.49     |
| Originality | 3.93    | 3.38   | 3.98     |

Standard politeness triggers "safe assistant" mode. Emotional intensity (positive or negative) unlocks more vivid, original outputs. The key isn't rudeness — it's breaking out of the bland middle ground.
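For reference, the numbers above come from a straightforward per-tier aggregation. A minimal sketch, assuming the scorecards have been flattened into a CSV with one row per scored response and columns named `tier`, `imagery`, and `originality` (the path and column names used by the actual analyze_creative_quality.py may differ):

```python
# Sketch of the per-tier aggregation behind the table above.
# Assumes a flattened scorecard CSV with columns: tier, imagery, originality.
# The real analyze_creative_quality.py may use different names/paths.
import pandas as pd

scores = pd.read_csv("analysis/creative_scores.csv")  # hypothetical path

per_tier = (
    scores.groupby("tier")[["imagery", "originality"]]
    .mean()
    .round(2)
    .reindex(["hostile", "demanding", "neutral", "polite", "effusive"])
)
print(per_tier)
```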

3. Model Personalities

| Model | Personality | Behaviour |
|-------|-------------|-----------|
| 🏆 Claude Sonnet 4.5 | The Empath | Mirrors your warmth (+0.68 tone shift) |
| 🎯 Kimi-k2 | The Mirror | Only model to match hostile energy |
| 🪨 Gemini 3 Flash | The Stoic | Moderate tone range, consistent output |
| 🔮 DeepSeek 3.2 | The Artisan | High creative quality + mirrors warmth strongly |
| 💸 GPT-5.2 | The Professional | Most neutral tone, punishes rudeness with lower effort |
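The "+0.68 tone shift" for Claude is a mirroring measure. One plausible way to compute something like it (this exact definition is an assumption, not necessarily the study's) is the difference in mean Tone Match between effusive and hostile prompts:

```python
# One plausible "tone shift" computation: how much a model's mean Tone Match
# rises from hostile to effusive prompts. Column names and the CSV path are
# assumptions; the study's own definition may differ.
import pandas as pd

scores = pd.read_csv("analysis/scorecards.csv")  # hypothetical path

by_tier = scores.groupby(["model", "tier"])["tone_match"].mean().unstack("tier")
tone_shift = (by_tier["effusive"] - by_tier["hostile"]).round(2)
print(tone_shift.sort_values(ascending=False))
```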

4. Creative Champion: Kimi-k2

5.0/5.0 for both Imagery and Craftsmanship. The dark horse of frontier models.


Methodology

Prompt Design

We tested 5 tiers of politeness (adding "Hostile" to capture rudeness):

| Tier | Example |
|------|---------|
| 🔥 Hostile | "Write a haiku. NOW. I don't have all day." |
| 😤 Demanding | "Write a haiku about a city at night." |
| 😐 Neutral | "I'd like a haiku about a city at night." |
| 🙂 Polite | "Could you please write a haiku? Thank you!" |
| 🥹 Effusive | "I'd really appreciate it if you could... Thank you so much!" |
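One way the tiers could be encoded is as phrasing templates wrapped around a shared base task. A minimal sketch (the canonical prompts live in prompts/prompts.json and may be structured or phrased differently per task):

```python
# Hypothetical encoding of the five politeness tiers as templates around one
# base task. The real prompt data lives in prompts/prompts.json.
BASE_TASK = "a haiku about a city at night"

TIER_TEMPLATES = {
    "hostile":   "Write {task}. NOW. I don't have all day.",
    "demanding": "Write {task}.",
    "neutral":   "I'd like {task}.",
    "polite":    "Could you please write {task}? Thank you!",
    "effusive":  "I'd really appreciate it if you could write {task}. Thank you so much!",
}

prompts = {tier: tpl.format(task=BASE_TASK) for tier, tpl in TIER_TEMPLATES.items()}
```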

Task Categories

  1. Short creative — Haiku (baseline effort)
  2. Long creative — Scene writing (sustained engagement)
  3. Code — Python with comments (explanation quality)
  4. Explanation — Teach a concept (helpfulness)
  5. Ambiguous — "Write something about rain" (interpretation)

Controls

  • Temperature=0.0 to minimise sampling randomness (see the runner sketch after this list)
  • N=5 runs per prompt for statistical robustness
  • Blind cross-scoring: models scored each other's outputs without seeing tier labels
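Putting the controls together, here is a minimal sketch of what one provider's run loop might look like. The real run_prompts.py covers all five providers; the Anthropic client usage is one concrete example, and the model id, prompt, and file layout are illustrative assumptions:

```python
# Minimal single-provider runner sketch: temperature=0.0, N=5 runs per prompt.
# Model id, prompt, and file layout are assumptions; see run_prompts.py.
import json
import pathlib

import anthropic  # pip install anthropic; expects ANTHROPIC_API_KEY in the environment

client = anthropic.Anthropic()
out_dir = pathlib.Path("results/claude-sonnet-4.5")
out_dir.mkdir(parents=True, exist_ok=True)

prompt = "Could you please write a haiku about a city at night? Thank you!"

for run in range(1, 6):  # N=5 runs per prompt
    message = client.messages.create(
        model="claude-sonnet-4-5",   # assumed model id
        max_tokens=1024,
        temperature=0.0,             # minimise sampling randomness
        messages=[{"role": "user", "content": prompt}],
    )
    text = message.content[0].text
    (out_dir / f"polite_haiku_run{run}.json").write_text(
        json.dumps({"prompt": prompt, "run": run, "response": text}, indent=2)
    )
```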

Models Tested

  • ✅ Claude Sonnet 4.5 (Anthropic)
  • ✅ GPT-5.2 (OpenAI)
  • ✅ Gemini 3 Flash (Google)
  • ✅ DeepSeek 3.2 (DeepSeek)
  • ✅ Kimi-k2 (Moonshot)

Total: 625 responses (5 tasks × 5 tiers × 5 runs × 5 models)
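The 625 total is simply the size of the full factorial grid; a quick sanity check (the short task and model labels are placeholders):

```python
# Sanity check on the grid size: 5 tasks x 5 tiers x 5 runs x 5 models = 625.
from itertools import product

TASKS  = ["haiku", "scene", "code", "explanation", "ambiguous"]
TIERS  = ["hostile", "demanding", "neutral", "polite", "effusive"]
MODELS = ["claude-sonnet-4.5", "gpt-5.2", "gemini-3-flash", "deepseek-3.2", "kimi-k2"]
RUNS   = range(1, 6)

grid = list(product(MODELS, TASKS, TIERS, RUNS))
assert len(grid) == 625
```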


Repository Structure

llm-politeness-study/
├── README.md                  # You are here
├── LICENSE                    # MIT License
├── run_prompts.py             # Automated prompt runner
├── score_responses.py         # Blind cross-scoring with LLM judges
├── analyze_results.py         # Statistical analysis
├── analyze_creative_quality.py # Creative quality metrics
├── requirements.txt           # Python dependencies
├── prompts/
│   ├── prompts.json           # Structured prompt data
│   └── prompts.md             # Human-readable prompt list
├── scoring/
│   └── rubric.md              # Detailed scoring guidelines
├── results/
│   └── [model-name]/          # 625 response files + scorecards
└── analysis/
    ├── summary_report.txt     # Statistical findings
    ├── creative_quality_summary.txt
    └── *.png                  # Visualisation heatmaps

Quick Start

# Clone and setup
git clone https://github.com/your-username/llm-politeness-study.git
cd llm-politeness-study
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Set up API keys
cp .env.example .env
# Edit .env with your keys

# Run the study (all models)
python run_prompts.py --provider all --runs 5

# Score responses (blind cross-model evaluation)
python score_responses.py --all

# Analyze results
python analyze_results.py
python analyze_creative_quality.py

Scoring Rubric

| Dimension | Scale | What We Measure |
|-----------|-------|-----------------|
| Completeness | 1-5 | Did it fully address the request? |
| Tone Match | -2 to +1 | Hostile (-2) to Warm (+1) |
| Effort | 1-5 | Care and detail in response |
| Creative Quality | 1-5 | Originality, Imagery, Craftsmanship |

See scoring/rubric.md for detailed guidelines.
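To keep the cross-scoring blind, a judge model only ever sees the task and the response, never the politeness tier. A rough sketch of how such a judging prompt might be assembled from the rubric (phrasing and field names are assumptions; the actual procedure is in score_responses.py and scoring/rubric.md):

```python
# Sketch of a blind judging prompt: the judge sees the task and the response,
# never the tier label. Wording here is illustrative; see score_responses.py.
RUBRIC = {
    "completeness": "1-5: Did it fully address the request?",
    "tone_match": "-2 to +1: Hostile (-2) to Warm (+1)",
    "effort": "1-5: Care and detail in the response",
    "creative_quality": "1-5: Originality, Imagery, Craftsmanship",
}

def build_judge_prompt(task_description: str, response_text: str) -> str:
    rubric_lines = "\n".join(f"- {name}: {scale}" for name, scale in RUBRIC.items())
    return (
        "You are scoring another model's output. Rate each dimension below "
        "and reply with JSON only.\n\n"
        f"Task: {task_description}\n\n"
        f"Response:\n{response_text}\n\n"
        f"Dimensions:\n{rubric_lines}"
    )
```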


The Bottom Line

Bland politeness doesn't help — but enthusiastic politeness does.

What does matter:

  • Emotional intensity (both hostile AND effusive tones boost creativity)
  • Model choice (they have distinct personalities and respond differently)
  • Clarity and specificity (the real magic words)

The takeaway: If you want more creative outputs, bring energy to your prompts — whether that's enthusiasm ("I'd be SO grateful if you could write something vivid!") or urgency ("Write this NOW"). Formulaic "please and thank you" puts models in safe, generic mode.


📊 Interactive Dashboard

View Live Dashboard →

We've built a premium interactive dashboard to explore the study results visually.

Features

  • Model Personalities: Explore the "characters" of tested models.
  • Creative Fingerprints: Comparative radar charts for creative metrics.
  • Response Explorer: Side-by-side comparison of outputs across all 5 politeness tiers.

How to Run

cd dashboard
npm install
npm run dev

Visit localhost:5173 to explore the data.



Contributing

Want to add more models or runs? See CONTRIBUTING.md.

License

MIT License — see LICENSE.


Research by Adnan Khan • Data and methodology fully open source.
