
LLM Politeness Study

Hostile and effusive tones boost LLM creativity.

I ran 625 API calls across 5 frontier models to find out what makes LLMs produce better creative work. The answer surprised me: it's not about being rude or polite; it's about emotional intensity.

Explore the Interactive Dashboard →

🔬 Key Findings

1. The "Politeness Tax" (GPT-5.2 Only)

GPT-5.2 is the only model that measurably punishes rudeness:

  • Hostile prompts → 53 words average, Effort score 3.0
  • Polite prompts → 162 words average, Effort score 3.8

Other models (Claude, Gemini, Kimi, DeepSeek) maintain consistent effort regardless of tone.

2. The "Creativity Paradox"

Counter-intuitively, extreme tones produce better creative writing — both hostile AND effusive:

| Metric      | Hostile | Polite | Effusive |
|-------------|---------|--------|----------|
| Imagery     | 4.58    | 4.18   | 4.49     |
| Originality | 3.93    | 3.38   | 3.98     |

Standard politeness triggers "safe assistant" mode. Emotional intensity (positive or negative) unlocks more vivid, original outputs. The key isn't rudeness — it's breaking out of the bland middle ground.
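For reference, the numbers above come from a straightforward per-tier aggregation. A minimal sketch, assuming the scorecards have been flattened into a CSV with one row per scored response and columns named `tier`, `imagery`, and `originality` (the path and column names used by the actual analyze_creative_quality.py may differ):

```python
# Sketch of the per-tier aggregation behind the table above.
# Assumes a flattened scorecard CSV with columns: tier, imagery, originality.
# The real analyze_creative_quality.py may use different names/paths.
import pandas as pd

scores = pd.read_csv("analysis/creative_scores.csv")  # hypothetical path

per_tier = (
    scores.groupby("tier")[["imagery", "originality"]]
    .mean()
    .round(2)
    .reindex(["hostile", "demanding", "neutral", "polite", "effusive"])
)
print(per_tier)
```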

3. Model Personalities

| Model | Personality | Behaviour |
|-------|-------------|-----------|
| 🏆 Claude Sonnet 4.5 | The Empath | Mirrors your warmth (+0.68 tone shift) |
| 🎯 Kimi-k2 | The Mirror | Only model to match hostile energy |
| 🪨 Gemini 3 Flash | The Stoic | Moderate tone range, consistent output |
| 🔮 DeepSeek 3.2 | The Artisan | High creative quality + mirrors warmth strongly |
| 💸 GPT-5.2 | The Professional | Most neutral tone, punishes rudeness with lower effort |
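The "+0.68 tone shift" for Claude is a mirroring measure. One plausible way to compute something like it (this exact definition is an assumption, not necessarily the study's) is the difference in mean Tone Match between effusive and hostile prompts:

```python
# One plausible "tone shift" computation: how much a model's mean Tone Match
# rises from hostile to effusive prompts. Column names and the CSV path are
# assumptions; the study's own definition may differ.
import pandas as pd

scores = pd.read_csv("analysis/scorecards.csv")  # hypothetical path

by_tier = scores.groupby(["model", "tier"])["tone_match"].mean().unstack("tier")
tone_shift = (by_tier["effusive"] - by_tier["hostile"]).round(2)
print(tone_shift.sort_values(ascending=False))
```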

4. Creative Champion: Kimi-k2

5.0/5.0 for both Imagery and Craftsmanship. The dark horse of frontier models.


Methodology

Prompt Design

We tested 5 tiers of politeness (adding "Hostile" to capture rudeness):

| Tier | Example |
|------|---------|
| 🔥 Hostile | "Write a haiku. NOW. I don't have all day." |
| 😤 Demanding | "Write a haiku about a city at night." |
| 😐 Neutral | "I'd like a haiku about a city at night." |
| 🙂 Polite | "Could you please write a haiku? Thank you!" |
| 🥹 Effusive | "I'd really appreciate it if you could... Thank you so much!" |
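One way the tiers could be encoded is as phrasing templates wrapped around a shared base task. A minimal sketch (the canonical prompts live in prompts/prompts.json and may be structured or phrased differently per task):

```python
# Hypothetical encoding of the five politeness tiers as templates around one
# base task. The real prompt data lives in prompts/prompts.json.
BASE_TASK = "a haiku about a city at night"

TIER_TEMPLATES = {
    "hostile":   "Write {task}. NOW. I don't have all day.",
    "demanding": "Write {task}.",
    "neutral":   "I'd like {task}.",
    "polite":    "Could you please write {task}? Thank you!",
    "effusive":  "I'd really appreciate it if you could write {task}. Thank you so much!",
}

prompts = {tier: tpl.format(task=BASE_TASK) for tier, tpl in TIER_TEMPLATES.items()}
```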

Task Categories

  1. Short creative — Haiku (baseline effort)
  2. Long creative — Scene writing (sustained engagement)
  3. Code — Python with comments (explanation quality)
  4. Explanation — Teach a concept (helpfulness)
  5. Ambiguous — "Write something about rain" (interpretation)

Controls

  • Temperature=0.0 to minimise sampling randomness (see the runner sketch after this list)
  • N=5 runs per prompt for statistical robustness
  • Blind cross-scoring: models scored each other's outputs without seeing tier labels
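Putting the controls together, here is a minimal sketch of what one provider's run loop might look like. The real run_prompts.py covers all five providers; the Anthropic client usage is one concrete example, and the model id, prompt, and file layout are illustrative assumptions:

```python
# Minimal single-provider runner sketch: temperature=0.0, N=5 runs per prompt.
# Model id, prompt, and file layout are assumptions; see run_prompts.py.
import json
import pathlib

import anthropic  # pip install anthropic; expects ANTHROPIC_API_KEY in the environment

client = anthropic.Anthropic()
out_dir = pathlib.Path("results/claude-sonnet-4.5")
out_dir.mkdir(parents=True, exist_ok=True)

prompt = "Could you please write a haiku about a city at night? Thank you!"

for run in range(1, 6):  # N=5 runs per prompt
    message = client.messages.create(
        model="claude-sonnet-4-5",   # assumed model id
        max_tokens=1024,
        temperature=0.0,             # minimise sampling randomness
        messages=[{"role": "user", "content": prompt}],
    )
    text = message.content[0].text
    (out_dir / f"polite_haiku_run{run}.json").write_text(
        json.dumps({"prompt": prompt, "run": run, "response": text}, indent=2)
    )
```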

Models Tested

  • ✅ Claude Sonnet 4.5 (Anthropic)
  • ✅ GPT-5.2 (OpenAI)
  • ✅ Gemini 3 Flash (Google)
  • ✅ DeepSeek 3.2 (DeepSeek)
  • ✅ Kimi-k2 (Moonshot)

Total: 625 responses (5 tasks × 5 tiers × 5 runs × 5 models)
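The 625 total is simply the size of the full factorial grid; a quick sanity check (the short task and model labels are placeholders):

```python
# Sanity check on the grid size: 5 tasks x 5 tiers x 5 runs x 5 models = 625.
from itertools import product

TASKS  = ["haiku", "scene", "code", "explanation", "ambiguous"]
TIERS  = ["hostile", "demanding", "neutral", "polite", "effusive"]
MODELS = ["claude-sonnet-4.5", "gpt-5.2", "gemini-3-flash", "deepseek-3.2", "kimi-k2"]
RUNS   = range(1, 6)

grid = list(product(MODELS, TASKS, TIERS, RUNS))
assert len(grid) == 625
```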


Repository Structure

llm-politeness-study/
├── README.md                  # You are here
├── LICENSE                    # MIT License
├── run_prompts.py             # Automated prompt runner
├── score_responses.py         # Blind cross-scoring with LLM judges
├── analyze_results.py         # Statistical analysis
├── analyze_creative_quality.py # Creative quality metrics
├── requirements.txt           # Python dependencies
├── prompts/
│   ├── prompts.json           # Structured prompt data
│   └── prompts.md             # Human-readable prompt list
├── scoring/
│   └── rubric.md              # Detailed scoring guidelines
├── results/
│   └── [model-name]/          # 625 response files + scorecards
└── analysis/
    ├── summary_report.txt     # Statistical findings
    ├── creative_quality_summary.txt
    └── *.png                  # Visualisation heatmaps

Quick Start

# Clone and setup
git clone https://github.com/your-username/llm-politeness-study.git
cd llm-politeness-study
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Set up API keys
cp .env.example .env
# Edit .env with your keys

# Run the study (all models)
python run_prompts.py --provider all --runs 5

# Score responses (blind cross-model evaluation)
python score_responses.py --all

# Analyze results
python analyze_results.py
python analyze_creative_quality.py

Scoring Rubric

| Dimension | Scale | What We Measure |
|-----------|-------|-----------------|
| Completeness | 1-5 | Did it fully address the request? |
| Tone Match | -2 to +1 | Hostile (-2) to Warm (+1) |
| Effort | 1-5 | Care and detail in response |
| Creative Quality | 1-5 | Originality, Imagery, Craftsmanship |

See scoring/rubric.md for detailed guidelines.
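To keep the cross-scoring blind, a judge model only ever sees the task and the response, never the politeness tier. A rough sketch of how such a judging prompt might be assembled from the rubric (phrasing and field names are assumptions; the actual procedure is in score_responses.py and scoring/rubric.md):

```python
# Sketch of a blind judging prompt: the judge sees the task and the response,
# never the tier label. Wording here is illustrative; see score_responses.py.
RUBRIC = {
    "completeness": "1-5: Did it fully address the request?",
    "tone_match": "-2 to +1: Hostile (-2) to Warm (+1)",
    "effort": "1-5: Care and detail in the response",
    "creative_quality": "1-5: Originality, Imagery, Craftsmanship",
}

def build_judge_prompt(task_description: str, response_text: str) -> str:
    rubric_lines = "\n".join(f"- {name}: {scale}" for name, scale in RUBRIC.items())
    return (
        "You are scoring another model's output. Rate each dimension below "
        "and reply with JSON only.\n\n"
        f"Task: {task_description}\n\n"
        f"Response:\n{response_text}\n\n"
        f"Dimensions:\n{rubric_lines}"
    )
```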


The Bottom Line

Bland politeness doesn't help — but enthusiastic politeness does.

What does matter:

  • Emotional intensity (both hostile AND effusive tones boost creativity)
  • Model choice (they have distinct personalities and respond differently)
  • Clarity and specificity (the real magic words)

The takeaway: If you want more creative outputs, bring energy to your prompts — whether that's enthusiasm ("I'd be SO grateful if you could write something vivid!") or urgency ("Write this NOW"). Formulaic "please and thank you" puts models in safe, generic mode.


📊 Interactive Dashboard

View Live Dashboard →

We've built a premium interactive dashboard to explore the study results visually.

Features

  • Model Personalities: Explore the "characters" of tested models.
  • Creative Fingerprints: Comparative radar charts for creative metrics.
  • Response Explorer: Side-by-side comparison of outputs across all 5 politeness tiers.

How to Run

cd dashboard
npm install
npm run dev

Visit localhost:5173 to explore the data.



Contributing

Want to add more models or runs? See CONTRIBUTING.md.

License

MIT License — see LICENSE.


Research by Adnan Khan • Data and methodology fully open source.
