Cost & Pricing
OpenAI model pricing and budget optimisation strategies
All prices shown are per 1 million tokens, Standard Tier
| Model | Input | Cached Input* | Output | Best For |
|---|---|---|---|---|
| gpt-5.2 | $1.75 | $0.175 | $14.00 | Latest flagship, best reasoning |
| gpt-5.1 | $1.25 | $0.125 | $10.00 | Excellent balance, cost-effective |
| gpt-5 | $1.25 | $0.125 | $10.00 | Previous flagship, proven reliable |
| gpt-4.1 | $2.00 | $0.50 | $8.00 | Balance option |
| gpt-4o | $2.50 | $1.25 | $10.00 | Multimodal, widely used |
| gpt-5-mini | $0.25 | $0.025 | $2.00 | Budget testing |
| gpt-4o-mini | $0.15 | $0.075 | $0.60 | Cheapest option |
| o4-mini | $1.10 | $0.275 | $4.40 | Enhanced reasoning |
*Cached input = prompt caching for repeated system prompts (automatic)
Cost = (Input Tokens / 1,000,000 × Input Price) + (Output Tokens / 1,000,000 × Output Price)
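As a quick sanity check, the formula drops straight into a few lines of Python (the estimate_cost helper below is illustrative, not part of the toolkit):

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price: float, output_price: float) -> float:
    """Estimate USD cost; prices are per 1 million tokens (Standard Tier)."""
    return (input_tokens / 1_000_000 * input_price
            + output_tokens / 1_000_000 * output_price)

# Reproduces the gpt-4o-mini full-pipeline example below (~$0.15 total):
tutor = estimate_cost(140_000, 70_000, 0.15, 0.60)   # ≈ $0.063
judge = estimate_cost(252_000, 84_000, 0.15, 0.60)   # ≈ $0.088
print(f"${tutor + judge:.2f}")                        # $0.15
```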
For AI Tutor Testing:
| Component | Typical Tokens |
|---|---|
| System prompt | 800-1,200 |
| Student prompt | 50-200 |
| Tutor response | 300-800 |
| Total per interaction | ~1,150-2,200 |
For Evaluation:
| Component | Typical Tokens |
|---|---|
| Evaluation prompt | 1,500-2,000 |
| Judge reasoning | 400-800 |
| Total per evaluation | ~1,900-2,800 |
Example: gpt-4o-mini (140 prompts, full pipeline)
Tutor Responses:
Input: 140 prompts × 1,000 tokens = 140,000 tokens
Output: 140 prompts × 500 tokens = 70,000 tokens
Cost = (140,000 / 1,000,000 × $0.15) + (70,000 / 1,000,000 × $0.60)
= $0.021 + $0.042
= $0.063 (~$0.06)
Evaluation:
Input: 140 × 1,800 = 252,000 tokens
Output: 140 × 600 = 84,000 tokens
Cost = (252,000 / 1,000,000 × $0.15) + (84,000 / 1,000,000 × $0.60)
= $0.038 + $0.050
= $0.088 (~$0.09)
Total Pipeline: ~$0.15
Example: gpt-5.1 (140 prompts, full pipeline)
Tutor Responses:
Input: 140,000 tokens
Output: 70,000 tokens
Cost = (140,000 / 1,000,000 × $1.25) + (70,000 / 1,000,000 × $10.00)
= $0.175 + $0.700
= $0.875 (~$0.88)
Evaluation (gpt-5.1 as judge):
Input: 252,000 tokens
Output: 84,000 tokens
Cost = (252,000 / 1,000,000 × $1.25) + (84,000 / 1,000,000 × $10.00)
= $0.315 + $0.840
= $1.155 (~$1.16)
Total Pipeline: ~$2.04
Example: gpt-5.2 (140 prompts, full pipeline)
Tutor Responses:
Cost = (140,000 / 1,000,000 × $1.75) + (70,000 / 1,000,000 × $14.00)
= $0.245 + $0.980
= $1.225 (~$1.23)
Evaluation (gpt-5.2 as judge):
Cost = (252,000 / 1,000,000 × $1.75) + (84,000 / 1,000,000 × $14.00)
= $0.441 + $1.176
= $1.617 (~$1.62)
Total Pipeline: ~$2.85
Most cost-effective approach:
Phase 1: Initial Testing (gpt-4o-mini)
- 20-50 test prompts
- Quick iteration on system prompt
- Cost: $0.02-0.05
Phase 2: Expanded Testing (gpt-4o-mini)
- 100-140 prompts
- Comprehensive coverage
- Cost: $0.10-0.15
Phase 3: Validation (gpt-5.1)
- Same 140 prompts
- Final confirmation before deployment
- Cost: ~$2.00
Total: ~$2.15-2.20
vs. using gpt-5.2 for all three phases (roughly three full 140-prompt runs at ~$2.85 each): ~$8.55
Savings: 74%
Instead of evaluating all responses, evaluate strategically:
# Evaluate only failed/borderline cases
# Filter responses.csv first
# Or evaluate random sample
python3 sample_responses.py --input responses.csv --sample 50 --output sample.csv
python3 llm_evaluator.py --input sample.csv ...
Cost reduction: 50-75%
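If you want to adapt the sampling step, a minimal stand-in for sample_responses.py looks like this (a sketch assuming the input is a CSV with a header row; it is not the toolkit's actual implementation):

```python
import argparse, csv, random

# Randomly sample N data rows from a responses CSV, preserving the header.
parser = argparse.ArgumentParser()
parser.add_argument("--input", required=True)
parser.add_argument("--sample", type=int, default=50)
parser.add_argument("--output", required=True)
parser.add_argument("--seed", type=int, default=0)  # fixed seed = reproducible sample
args = parser.parse_args()

with open(args.input, newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header, rows = next(reader), list(reader)

random.seed(args.seed)
sampled = random.sample(rows, min(args.sample, len(rows)))

with open(args.output, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(sampled)
```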
OpenAI automatically caches repeated prompt prefixes of roughly 1,024 tokens and up, which covers a stable system prompt. To maximise caching:
- Keep the system prompt stable during a test session
- Run all tests with the same prompt before changing it
- Reruns then benefit from the cached system prompt
Cached input pricing:
- gpt-5.1: $0.125 vs $1.25 (90% off)
- gpt-4o: $1.25 vs $2.50 (50% off)
Example benefit (gpt-5.1 pipeline):
Without caching: ~$2.04
With caching (80% of input cached at the 90% discount): ~$1.68
Savings: ~17%
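To confirm caching is actually happening, check the usage block the API returns with each response; OpenAI reports a cached_tokens count there (the prompt file and user message below are placeholders):

```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        # The long, stable system prompt is the part that gets cached.
        {"role": "system", "content": open("system_prompt.txt").read()},
        {"role": "user", "content": "Explain osmosis at an introductory level."},
    ],
)

usage = response.usage
# cached_tokens is 0 on the first call; later calls sharing a ~1,024+ token
# prefix should show a non-zero count.
print(f"prompt={usage.prompt_tokens} "
      f"cached={usage.prompt_tokens_details.cached_tokens} "
      f"output={usage.completion_tokens}")
```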
For non-urgent testing, use OpenAI's Batch API:
Pricing (50% off Standard):
- gpt-5.1: $0.625 input, $5.00 output
- gpt-4o: $1.25 input, $5.00 output
Trade-off: results can take up to 24 hours
When to use: Large-scale testing (500+ prompts) that isn't urgent
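Submitting a batch is a two-step flow in the OpenAI SDK: upload a JSONL file of requests, then create the batch. A minimal sketch (the file name, prompts, and custom_id scheme are illustrative):

```python
import json
from openai import OpenAI

client = OpenAI()

# One request per line; custom_id lets you match results back to prompts.
with open("batch_requests.jsonl", "w", encoding="utf-8") as f:
    for i, prompt in enumerate(["What is osmosis?", "Define mitosis."]):
        f.write(json.dumps({
            "custom_id": f"prompt-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-5.1",
                "messages": [{"role": "user", "content": prompt}],
            },
        }) + "\n")

batch_file = client.files.create(file=open("batch_requests.jsonl", "rb"),
                                 purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # the 24-hour trade-off noted above
)
print(batch.id, batch.status)
```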
For development and iteration:
Use: gpt-4o-mini
✅ Very cheap
✅ Fast responses
✅ Good enough for iteration
❌ Less capable reasoning
❌ More prone to mistakes
Cost: ~$0.15 for 140 prompts (full pipeline)
For standard validation:
Use: gpt-5.1
✅ Excellent capabilities
✅ Reliable performance
✅ Good cost-performance ratio
✅ Adequate for most use cases
❌ Not the absolute best
Cost: ~$2.00 for 140 prompts (full pipeline)
For final, high-stakes validation:
Use: gpt-5.2
✅ Best available reasoning
✅ Most accurate
✅ Latest features
❌ Most expensive
Cost: ~$2.85 for 140 prompts (full pipeline)
For a budget compromise:
Use: gpt-4o-mini for tutor, gpt-4o for evaluation
✅ Cheap tutor testing
✅ Better evaluation quality
✅ Reasonable compromise
Cost: ~$1.55 for 140 prompts (full pipeline)
140 prompts, full pipeline (tutor + evaluation)
| Configuration | Tutor Model | Judge Model | Total Cost | Use Case |
|---|---|---|---|---|
| Ultra Budget | gpt-4o-mini | gpt-4o-mini | ~$0.15 | Early iteration |
| Budget | gpt-4o-mini | gpt-4o | ~$1.55 | Testing phase |
| Balanced | gpt-5.1 | gpt-5.1 | ~$2.00 | Standard validation |
| Premium | gpt-5.2 | gpt-5.2 | ~$2.85 | Final validation |
| Mixed | gpt-4o-mini | gpt-5.1 | ~$1.20 | Cost-conscious validation |
Goal: Test AI tutor for BIO301, 50 students
Approach:
Week 1: Develop system prompt
- 20 test prompts with gpt-4o-mini
- Iterate 3-4 times
- Cost: $0.10
Week 2: Expand testing
- 100 comprehensive prompts with gpt-4o-mini
- Manual review of responses
- Cost: $0.12
Week 3: Final validation
- Same 100 prompts with gpt-5.1
- Full evaluation with gpt-5.1
- Cost: $1.50
Total: $1.72
Goal: Deploy AI tutors for 5 courses, 500 students
Approach:
Phase 1: Template development
- Create base prompt
- Test with 140 prompts × gpt-4o-mini
- Cost per course: $0.15
- Total: $0.75
Phase 2: Course customisation
- Adapt for each course
- Test customisations × gpt-4o-mini
- Cost per course: $0.10
- Total: $0.50
Phase 3: Production validation
- All courses × 140 prompts × gpt-5.1
- Cost per course: $2.00
- Total: $10.00
Total: $11.25 for 5 complete AI tutors
Goal: Compare different prompting strategies
Approach:
Test 5 different system prompts
- Each with 140 test prompts
- Use gpt-5.1 for consistency
- Evaluate all responses
Cost: 5 × $2.00 = $10.00
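One way to script the comparison is a small driver around the batch processor shown below (the prompts/variant_*.txt layout is an assumption; the flags match the Batch Processor example):

```python
import pathlib, subprocess

# Run the same 140-prompt suite against each system prompt variant.
for prompt_file in sorted(pathlib.Path("prompts").glob("variant_*.txt")):
    subprocess.run([
        "python3", "llm_batch_processor.py",
        "--input", "prompts.csv",
        "--output", f"responses_{prompt_file.stem}.csv",
        "--system", str(prompt_file),
        "--model", "gpt-5.1",
        "--price-input-per-1k", "0.00125",
        "--price-output-per-1k", "0.01",
    ], check=True)
```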
Batch Processor:
python3 llm_batch_processor.py \
--input prompts.csv \
--output responses.csv \
--system system_prompt.txt \
--model gpt-5.1 \
--price-input-per-1k 0.00125 \
--price-output-per-1k 0.01
Output includes:
=== SUMMARY ===
Measured total tokens: 102,755
Estimated total cost (USD): $0.87536
Average cost per row: $0.00625
Create a spreadsheet:
| Date | Phase | Model | Prompts | Cost | Notes |
|---|---|---|---|---|---|
| Jan 15 | Initial | gpt-4o-mini | 20 | $0.02 | First iteration |
| Jan 16 | Testing | gpt-4o-mini | 100 | $0.12 | Full test suite |
| Jan 17 | Validation | gpt-5.1 | 100 | $1.45 | Final check |
| Total | | | 220 | $1.59 | |
OpenAI offers $5 free credit for new accounts:
- Covers ~25 full test cycles with gpt-4o-mini
- Or ~2-3 validation runs with gpt-5.1
- Perfect for initial exploration
Create once: comprehensive_test_suite.csv
Reuse for:
- Different system prompts
- Different models
- Different courses (adapted)
Saves: Development time and ensures consistency
Batch processor → Manual spot check → Evaluation (if needed)
If spot check reveals major issues:
- Fix system prompt
- Rerun batch processor (cheap)
- Skip evaluation until fixed
Saves: $1-2 per iteration
# Inefficient (3 separate runs)
--input test_set1.csv --model gpt-5.1
--input test_set2.csv --model gpt-5.1
--input test_set3.csv --model gpt-5.1
# Efficient (1 combined run, benefits from caching)
# Keep the header row from the first file only (plain cat would duplicate headers)
head -1 test_set1.csv > combined.csv
tail -q -n +2 test_set1.csv test_set2.csv test_set3.csv >> combined.csv
--input combined.csv --model gpt-5.1
Saves: 10-20% via better prompt caching
Don't evaluate everything:
# Manual review first
# Identify problem areas
# Evaluate only those specific cases
python3 llm_evaluator.py \
--input responses.csv \
--output evaluated.csv \
--mode rerun_ids \
--ids "45,67,89,120"
Saves: 50-90% of evaluation costs
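Rather than typing the ID list by hand, you can generate it from the evaluated CSV. This sketch assumes an id column and a numeric score column on a 1-5 scale; adjust the names and threshold to your actual schema:

```python
import csv

# Collect IDs of low-scoring rows to re-run or inspect manually.
with open("evaluated.csv", newline="", encoding="utf-8") as f:
    low_ids = [row["id"] for row in csv.DictReader(f)
               if float(row["score"]) < 3.0]  # assumed 1-5 scale

print(",".join(low_ids))  # paste into --ids "..."
```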
Manual review of 140 responses:
- ~3-5 minutes per response
- Total: 7-12 hours human time
Automated evaluation:
- Setup: 5 minutes
- Execution: 20-30 minutes
- Review: 1-2 hours
- Total: ~2 hours, most of it spent on review
Time saved: 5-10 hours
If your time is worth $50/hour:
- Manual review: $350-600
- Automated: $100-150 (time) + $2 (API) = $102-152
Savings: $248-448
ROI: 163-295%
On the tightest budget, use gpt-4o-mini for everything:
Testing: gpt-4o-mini (~$0.06)
Evaluation: Manual review (free)
Total: $0.06 + your time
Trade-off: Your time vs. money
Instead of 140 prompts:
- 50 prompts with gpt-5.1 ≈ $0.31 (responses only, since evaluation is manual)
Instead of full evaluation:
- Manual review of 50 = ~2 hours
Works well for: Smaller courses, tight budgets
Use the provided prompts.csv:
- Already designed and tested
- Covers common scenarios
- Free to use
Cost: Only API usage, no development time
Important: OpenAI updates pricing periodically.
Always check current prices:
- Visit OpenAI Pricing Page
- Update your `--price-input-per-1k` and `--price-output-per-1k` arguments
- Re-calculate budget estimates
Historical trend: Prices generally decrease over time as models improve.
- Getting Started → Begin testing with budget models
- Batch Processing → Set up cost tracking
- Automated Evaluation → Optimise evaluation costs
Quick-reference pricing flags:

gpt-4o-mini:
--model gpt-4o-mini \
--price-input-per-1k 0.00015 \
--price-output-per-1k 0.0006

gpt-5.1:
--model gpt-5.1 \
--price-input-per-1k 0.00125 \
--price-output-per-1k 0.01

Recommended workflow:
1. Develop with gpt-4o-mini ($0.10-0.20)
2. Validate with gpt-5.1 ($1.50-2.50)
3. Total: ~$2.00 for complete pipeline