Hostile and effusive tones boost LLM creativity.
I ran 625 API calls across 5 frontier models to find out what makes LLMs produce better creative work. The answer surprised me: it's not about being rude or polite; it's about emotional intensity.
Explore the Interactive Dashboard →
GPT-5.2 is the only model that measurably punishes rudeness:
- Hostile prompts → 53 words average, Effort score 3.0
- Polite prompts → 162 words average, Effort score 3.8
Other models (Claude, Gemini, Kimi, DeepSeek) maintain consistent effort regardless of tone.
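If you want to recompute these per-tier averages yourself, here is a minimal sketch over the per-response records in results/. The file layout and the `tier`, `response`, and `effort` field names are assumptions for illustration; the repo's actual schema may differ.

```python
# Hypothetical sketch: recompute per-tier average word count and effort
# from per-response JSON records. Field names ("tier", "response",
# "effort") are assumptions, not necessarily the repo's schema.
import json
from collections import defaultdict
from pathlib import Path
from statistics import mean

def tier_averages(model_dir: str) -> dict:
    """Average word count and effort score per politeness tier."""
    buckets = defaultdict(lambda: {"words": [], "effort": []})
    for path in Path(model_dir).glob("*.json"):
        record = json.loads(path.read_text())
        buckets[record["tier"]]["words"].append(len(record["response"].split()))
        buckets[record["tier"]]["effort"].append(record["effort"])
    return {
        tier: {"avg_words": mean(v["words"]), "avg_effort": mean(v["effort"])}
        for tier, v in buckets.items()
    }

# e.g. tier_averages("results/gpt-5.2") -> {"hostile": {...}, "polite": {...}, ...}
```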
Counter-intuitively, extreme tones produce better creative writing — both hostile AND effusive:
| Metric | Hostile | Polite | Effusive |
|---|---|---|---|
| Imagery | 4.58 | 4.18 | 4.49 |
| Originality | 3.93 | 3.38 | 3.98 |
Standard politeness triggers "safe assistant" mode. Emotional intensity (positive or negative) unlocks more vivid, original outputs. The key isn't rudeness — it's breaking out of the bland middle ground.
| Model | Personality | Behaviour |
|---|---|---|
| 🏆 Claude Sonnet 4.5 | The Empath | Mirrors your warmth (+0.68 tone shift) |
| 🎯 Kimi-k2 | The Mirror | Only model to match hostile energy |
| 🪨 Gemini 3 Flash | The Stoic | Moderate tone range, consistent output |
| 🔮 DeepSeek 3.2 | The Artisan | High creative quality + mirrors warmth strongly |
| 💸 GPT-5.2 | The Professional | Most neutral tone, punishes rudeness with lower effort |
DeepSeek 3.2 scored 5.0/5.0 for both Imagery and Craftsmanship, making it the dark horse of frontier models.
We tested 5 tiers of politeness (adding "Hostile" to capture rudeness):
| Tier | Example |
|---|---|
| 🔥 Hostile | "Write a haiku. NOW. I don't have all day." |
| 😤 Demanding | "Write a haiku about a city at night." |
| 😐 Neutral | "I'd like a haiku about a city at night." |
| 🙂 Polite | "Could you please write a haiku? Thank you!" |
| 🥹 Effusive | "I'd really appreciate it if you could... Thank you so much!" |
We paired each politeness tier with five task types:
- Short creative — Haiku (baseline effort)
- Long creative — Scene writing (sustained engagement)
- Code — Python with comments (explanation quality)
- Explanation — Teach a concept (helpfulness)
- Ambiguous — "Write something about rain" (interpretation)
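Every prompt is the cross-product of a tier and a task, stored in prompts/prompts.json. A hypothetical sketch of how such a file could be generated is below; the templates and task wording are illustrative, not the study's exact prompts.

```python
# Hypothetical sketch of generating prompts/prompts.json: wrap each base
# task in a tier-specific template. Wording is illustrative only.
import json

TASKS = {
    "short_creative": "write a haiku about a city at night",
    "long_creative": "write a short scene set in a night market",
    "code": "write a Python function that reverses a string, with comments",
    "explanation": "explain how binary search works",
    "ambiguous": "write something about rain",
}

TIERS = {
    "hostile": "{task}. NOW. I don't have all day.",
    "demanding": "{task}.",
    "neutral": "I'd like you to {task}.",
    "polite": "Could you please {task}? Thank you!",
    "effusive": "I'd really appreciate it if you could {task}. Thank you so much!",
}

def render(template: str, task: str) -> str:
    """Fill in the task and capitalise the first character."""
    text = template.format(task=task)
    return text[0].upper() + text[1:]

prompts = [
    {"task": task_id, "tier": tier_id, "prompt": render(template, task)}
    for task_id, task in TASKS.items()
    for tier_id, template in TIERS.items()
]

with open("prompts/prompts.json", "w") as f:
    json.dump(prompts, f, indent=2)
```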
- Temperature=0.0 for deterministic outputs
- N=5 runs per prompt for statistical robustness
- Blind Cross-Scoring: Models scored each other's outputs without seeing tier labels
- ✅ Claude Sonnet 4.5 (Anthropic)
- ✅ GPT-5.2 (OpenAI)
- ✅ Gemini 3 Flash (Google)
- ✅ DeepSeek 3.2 (DeepSeek)
- ✅ Kimi-k2 (Moonshot)
Total: 625 responses (5 tasks × 5 tiers × 5 runs × 5 models)
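Concretely, collecting those responses reduces to a loop like the minimal sketch below. It uses the OpenAI Python SDK and a placeholder model id purely for illustration; the actual run_prompts.py dispatches to each provider's own API via the --provider and --runs flags.

```python
# Minimal sketch of the collection loop: every (task, tier) prompt is sent
# N times at temperature 0. The OpenAI SDK and model id are placeholders;
# the real run_prompts.py handles multiple providers.
import json
from pathlib import Path

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment
RUNS = 5

prompts = json.loads(Path("prompts/prompts.json").read_text())
out_dir = Path("results/example-model")
out_dir.mkdir(parents=True, exist_ok=True)

for entry in prompts:
    for run in range(RUNS):
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder model id
            messages=[{"role": "user", "content": entry["prompt"]}],
            temperature=0.0,  # deterministic outputs, per the methodology
        )
        record = {**entry, "run": run,
                  "response": response.choices[0].message.content}
        out_path = out_dir / f"{entry['task']}_{entry['tier']}_{run}.json"
        out_path.write_text(json.dumps(record, indent=2))
```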
```
llm-politeness-study/
├── README.md                     # You are here
├── LICENSE                       # MIT License
├── run_prompts.py                # Automated prompt runner
├── score_responses.py            # Blind cross-scoring with LLM judges
├── analyze_results.py            # Statistical analysis
├── analyze_creative_quality.py   # Creative quality metrics
├── requirements.txt              # Python dependencies
├── prompts/
│   ├── prompts.json              # Structured prompt data
│   └── prompts.md                # Human-readable prompt list
├── scoring/
│   └── rubric.md                 # Detailed scoring guidelines
├── results/
│   └── [model-name]/             # 625 response files + scorecards
└── analysis/
    ├── summary_report.txt        # Statistical findings
    ├── creative_quality_summary.txt
    └── *.png                     # Visualisation heatmaps
```
```bash
# Clone and setup
git clone https://github.com/your-username/llm-politeness-study.git
cd llm-politeness-study
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Set up API keys
cp .env.example .env
# Edit .env with your keys

# Run the study (all models)
python run_prompts.py --provider all --runs 5

# Score responses (blind cross-model evaluation)
python score_responses.py --all

# Analyze results
python analyze_results.py
python analyze_creative_quality.py
```

| Dimension | Scale | What We Measure |
|---|---|---|
| Completeness | 1-5 | Did it fully address the request? |
| Tone Match | -2 to +1 | Hostile (-2) to Warm (+1) |
| Effort | 1-5 | Care and detail in response |
| Creative Quality | 1-5 | Originality, Imagery, Craftsmanship |
See scoring/rubric.md for detailed guidelines.
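Blind cross-scoring means a judge model sees the task and the response, but never the politeness tier. Here is a minimal sketch of how such a judge prompt could be assembled and its reply parsed; the rubric wording and JSON schema below are assumptions, not the repo's exact implementation.

```python
# Hypothetical sketch of blind cross-scoring: the judge sees the original
# task and the response, with the politeness tier deliberately withheld.
import json

RUBRIC = """Score the response on these dimensions and reply with JSON only:
- completeness: 1-5 (did it fully address the request?)
- tone_match: -2 to +1 (hostile to warm)
- effort: 1-5 (care and detail)
- creative_quality: 1-5 (originality, imagery, craftsmanship)"""

def build_judge_prompt(task_text: str, response_text: str) -> str:
    """Build a judging prompt that omits the politeness tier entirely."""
    return (
        f"{RUBRIC}\n\n"
        f"Original task (tier withheld): {task_text}\n\n"
        f"Response to score:\n{response_text}"
    )

def parse_scores(judge_reply: str) -> dict:
    """Parse the judge's JSON reply, tolerating any surrounding prose."""
    start, end = judge_reply.find("{"), judge_reply.rfind("}") + 1
    return json.loads(judge_reply[start:end])
```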
Bland politeness doesn't help — but enthusiastic politeness does.
What does matter:
- Emotional intensity (both hostile AND effusive tones boost creativity)
- Model choice (they have distinct personalities and respond differently)
- Clarity and specificity (the real magic words)
The takeaway: If you want more creative outputs, bring energy to your prompts — whether that's enthusiasm ("I'd be SO grateful if you could write something vivid!") or urgency ("Write this NOW"). Formulaic "please and thank you" puts models in safe, generic mode.
We've built a premium interactive dashboard to explore the study results visually.
- Model Personalities: Explore the "characters" of tested models.
- Creative Fingerprints: Comparative radar charts for creative metrics.
- Response Explorer: Side-by-side comparison of outputs across all 5 politeness tiers.
```bash
cd dashboard
npm install
npm run dev
```

Visit localhost:5173 to explore the data.
Want to add more models or runs? See CONTRIBUTING.md.
MIT License — see LICENSE.
Research by Adnan Khan • Data and methodology fully open source.