
Cost & Pricing Guide

OpenAI model pricing and budget optimisation strategies


Current OpenAI Pricing (December 2025)

All prices shown are per 1 million tokens, Standard Tier

Recommended Models

Model        Input   Cached Input*   Output   Best For
gpt-5.2      $1.75   $0.175          $14.00   Latest flagship, best reasoning
gpt-5.1      $1.25   $0.125          $10.00   Excellent balance, cost-effective
gpt-5        $1.25   $0.125          $10.00   Previous flagship, proven reliable
gpt-4.1      $2.00   $0.50           $8.00    Balanced option
gpt-4o       $2.50   $1.25           $10.00   Multimodal, widely used
gpt-5-mini   $0.25   $0.025          $2.00    Budget testing
gpt-4o-mini  $0.15   $0.075          $0.60    Cheapest option
o4-mini      $1.10   $0.275          $4.40    Enhanced reasoning

*Cached input = prompt caching for repeated system prompts (automatic)


Cost Calculator

Formula

Cost = (Input Tokens / 1,000,000 × Input Price) + (Output Tokens / 1,000,000 × Output Price)
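
A minimal Python sketch of this formula, using Standard Tier prices from the table above (verify them against OpenAI's pricing page before relying on the output):

# Per-1M-token Standard Tier prices (USD) from the table above.
PRICING = {
    "gpt-5.1":     {"input": 1.25, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Apply the formula: token counts scaled by per-1M prices."""
    p = PRICING[model]
    return (input_tokens / 1_000_000 * p["input"]
            + output_tokens / 1_000_000 * p["output"])

# Matches the gpt-4o-mini example below (140 prompts, ~1,000 in / ~500 out each):
print(f"${estimate_cost('gpt-4o-mini', 140_000, 70_000):.3f}")  # $0.063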

Typical Token Counts

For AI Tutor Testing:

Component              Typical Tokens
System prompt          800-1,200
Student prompt         50-200
Tutor response         300-800
Total per interaction  ~1,150-2,200

For Evaluation:

Component             Typical Tokens
Evaluation prompt     1,500-2,000
Judge reasoning       400-800
Total per evaluation  ~1,900-2,800
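
These are estimates; to measure your own prompts, you can count tokens locally with OpenAI's tiktoken library. A sketch (tiktoken's registry may not include the newest model names, hence the fallback):

import tiktoken

def count_tokens(text: str, model: str = "gpt-4o-mini") -> int:
    """Count tokens the way the model's tokenizer would."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Fall back to the encoding used by the gpt-4o family.
        enc = tiktoken.get_encoding("o200k_base")
    return len(enc.encode(text))

with open("system_prompt.txt", encoding="utf-8") as f:
    print(count_tokens(f.read()))  # compare against the 800-1,200 estimate above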

Example Calculations

140 Prompts, gpt-4o-mini (Budget Testing)

Tutor Responses:

Input:  140 × 1,000 = 140,000 tokens
Output: 140 × 500  = 70,000 tokens

Cost = (140,000 / 1,000,000 × $0.15) + (70,000 / 1,000,000 × $0.60)
     = $0.021 + $0.042
     = $0.063 (~$0.06)

Evaluation:

Input:  140 × 1,800 = 252,000 tokens
Output: 140 × 600   = 84,000 tokens

Cost = (252,000 / 1,000,000 × $0.15) + (84,000 / 1,000,000 × $0.60)
     = $0.038 + $0.050
     = $0.088 (~$0.09)

Total Pipeline: ~$0.15


140 Prompts, gpt-5.1 (Production Validation)

Tutor Responses:

Input:  140,000 tokens
Output: 70,000 tokens

Cost = (140,000 / 1,000,000 × $1.25) + (70,000 / 1,000,000 × $10.00)
     = $0.175 + $0.700
     = $0.875 (~$0.88)

Evaluation (gpt-5.1 as judge):

Input:  252,000 tokens
Output: 84,000 tokens

Cost = (252,000 / 1,000,000 × $1.25) + (84,000 / 1,000,000 × $10.00)
     = $0.315 + $0.840
     = $1.155 (~$1.16)

Total Pipeline: ~$2.03


140 Prompts, gpt-5.2 (Premium Validation)
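
Using the same token counts as the gpt-5.1 example:

Tutor Responses:

Cost = (140,000 / 1,000,000 × $1.75) + (70,000 / 1,000,000 × $14.00)
     = $0.245 + $0.980
     = $1.225 (~$1.23)

Evaluation (gpt-5.2 as judge):

Cost = (252,000 / 1,000,000 × $1.75) + (84,000 / 1,000,000 × $14.00)
     = $0.441 + $1.176
     = $1.617 (~$1.62)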

Total Pipeline: ~$2.85


Budget Optimisation Strategies

Strategy 1: Tiered Testing

Most cost-effective approach:

Phase 1: Initial Testing (gpt-4o-mini)
- 20-50 test prompts
- Quick iteration on system prompt
- Cost: $0.02-0.05

Phase 2: Expanded Testing (gpt-4o-mini)
- 100-140 prompts
- Comprehensive coverage
- Cost: $0.10-0.15

Phase 3: Validation (gpt-5.1)
- Same 140 prompts
- Final confirmation before deployment
- Cost: ~$2.00

Total: ~$2.15-2.20

vs. Using gpt-5.2 for everything: ~$8.55

Savings: 74%

Strategy 2: Selective Evaluation

Instead of evaluating all responses, evaluate strategically:

# Evaluate only failed/borderline cases
# Filter responses.csv first

# Or evaluate random sample
python3 sample_responses.py --input responses.csv --sample 50 --output sample.csv
python3 llm_evaluator.py --input sample.csv ...

Cost reduction: 50-75%
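
A sketch of the "filter first" step, assuming responses.csv carries a column you can filter on (the review_flag column name here is hypothetical; adjust it to your actual CSV schema):

import csv

# Keep only rows flagged as borderline in a prior manual pass.
with open("responses.csv", newline="", encoding="utf-8") as src, \
     open("borderline.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    writer.writerows(row for row in reader if row.get("review_flag") == "borderline")

Then run llm_evaluator.py on borderline.csv instead of the full set.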

Strategy 3: Prompt Caching

OpenAI automatically caches the repeated prefix of long prompts (roughly 1,024 tokens or more), which covers a stable system prompt. To maximise caching:

  1. Keep system prompt stable during a test session
  2. Run all tests with same prompt before changing it
  3. Rerun tests promptly; reruns within the same session benefit from the cached system prompt (caches expire after a few minutes of inactivity)

Cached input pricing:

  • gpt-5.1: $0.125 vs $1.25 (90% off)
  • gpt-4o: $1.25 vs $2.50 (50% off)

Example benefit (gpt-5.1 full pipeline; the 90% discount applies only to the input share, $0.49 of the ~$2.03 total):

Without caching: ~$2.03
With caching (80% of input cached): ~$1.68
Savings: ~17%
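
A sketch of that estimate, as a function of the cached fraction and the cached-input discount:

def cost_with_caching(input_cost: float, output_cost: float,
                      cached_fraction: float, discount: float) -> float:
    """Total cost when `cached_fraction` of input tokens get `discount` off."""
    effective_input = input_cost * (1 - cached_fraction * discount)
    return effective_input + output_cost

# gpt-5.1 pipeline: $0.49 input, $1.54 output; 80% cached at the 90% discount
print(f"${cost_with_caching(0.49, 1.54, 0.80, 0.90):.2f}")  # ~$1.68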

Strategy 4: Batch API (Advanced)

For non-urgent testing, use OpenAI's Batch API:

Pricing (50% off Standard):

  • gpt-5.1: $0.625 input, $5.00 output
  • gpt-4o: $1.25 input, $5.00 output

Trade-off: asynchronous processing; results arrive within a 24-hour window

When to use: Large-scale testing (500+ prompts) that isn't urgent
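
A minimal sketch of submitting a batch with the official openai Python SDK (preparation of the JSONL request file, one chat-completions request per line with a unique custom_id, is omitted; see OpenAI's Batch API docs for the exact format):

from openai import OpenAI

client = OpenAI()

# Upload the prepared requests file, then create the batch job.
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)  # poll later with client.batches.retrieve(batch.id)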


Model Selection Guide

For Initial Development

Use: gpt-4o-mini

✅ Very cheap
✅ Fast responses
✅ Good enough for iteration
❌ Less capable reasoning
❌ More prone to mistakes

Cost: ~$0.15 for 140 prompts (full pipeline)

For Production Validation

Use: gpt-5.1

✅ Excellent capabilities
✅ Reliable performance
✅ Good cost-performance ratio
✅ Adequate for most use cases
❌ Not the absolute best

Cost: ~$2.00 for 140 prompts (full pipeline)

For Critical Applications

Use: gpt-5.2

✅ Best available reasoning
✅ Most accurate
✅ Latest features
❌ Most expensive

Cost: ~$2.85 for 140 prompts (full pipeline)

For Budget-Conscious Testing

Use: gpt-4o-mini for tutor, gpt-4o for evaluation

✅ Cheap tutor testing
✅ Better evaluation quality
✅ Reasonable compromise

Cost: ~$1.55 for 140 prompts (full pipeline: ~$0.06 tutor + ~$1.47 gpt-4o evaluation)


Cost Comparison Table

140 prompts, full pipeline (tutor + evaluation)

Configuration  Tutor Model  Judge Model  Total Cost  Use Case
Ultra Budget   gpt-4o-mini  gpt-4o-mini  ~$0.15      Early iteration
Budget         gpt-4o-mini  gpt-4o       ~$1.55      Testing phase
Balanced       gpt-5.1      gpt-5.1      ~$2.00      Standard validation
Premium        gpt-5.2      gpt-5.2      ~$2.85      Final validation
Mixed          gpt-4o-mini  gpt-5.1      ~$1.20      Cost-conscious validation

Real-World Testing Scenarios

Scenario 1: Course Coordinator, First Time

Goal: Test AI tutor for BIO301, 50 students

Approach:

Week 1: Develop system prompt
- 20 test prompts with gpt-4o-mini
- Iterate 3-4 times
- Cost: $0.10

Week 2: Expand testing
- 100 comprehensive prompts with gpt-4o-mini
- Manual review of responses
- Cost: $0.12

Week 3: Final validation
- Same 100 prompts with gpt-5.1
- Full evaluation with gpt-5.1
- Cost: $1.50

Total: $1.72

Scenario 2: Large Department, Multiple Courses

Goal: Deploy AI tutors for 5 courses, 500 students

Approach:

Phase 1: Template development
- Create base prompt
- Test with 140 prompts × gpt-4o-mini
- Cost per course: $0.15
- Total: $0.75

Phase 2: Course customisation
- Adapt for each course
- Test customisations × gpt-4o-mini
- Cost per course: $0.10
- Total: $0.50

Phase 3: Production validation
- All courses × 140 prompts × gpt-5.1
- Cost per course: $2.00
- Total: $10.00

Total: $11.25 for 5 complete AI tutors

Scenario 3: Research Study

Goal: Compare different prompting strategies

Approach:

Test 5 different system prompts
- Each with 140 test prompts
- Use gpt-5.1 for consistency
- Evaluate all responses

Cost: 5 × $2.00 = $10.00

Cost Tracking

Using the Scripts

Batch Processor:

python3 llm_batch_processor.py \
    --input prompts.csv \
    --output responses.csv \
    --system system_prompt.txt \
    --model gpt-5.1 \
    --price-input-per-1k 0.00125 \
    --price-output-per-1k 0.01

Output includes:

=== SUMMARY ===
Measured total tokens: 102,755
Estimated total cost (USD): $0.87536
Average cost per row: $0.00625

Manual Tracking

Create a spreadsheet:

Date    Phase       Model        Prompts  Cost   Notes
Jan 15  Initial     gpt-4o-mini  20       $0.02  First iteration
Jan 16  Testing     gpt-4o-mini  100      $0.12  Full test suite
Jan 17  Validation  gpt-5.1      100      $1.45  Final check
Total                            220      $1.59
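
Or log runs programmatically; a sketch that appends one row per run to a costs.csv (the filename and columns are just suggestions):

import csv, datetime, os

def log_run(phase: str, model: str, prompts: int, cost_usd: float,
            notes: str = "", path: str = "costs.csv") -> None:
    """Append one cost-tracking row, writing the header on first use."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["date", "phase", "model", "prompts", "cost_usd", "notes"])
        writer.writerow([datetime.date.today().isoformat(), phase, model,
                         prompts, f"{cost_usd:.2f}", notes])

log_run("Validation", "gpt-5.1", 100, 1.45, "Final check")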

Money-Saving Tips

💰 Tip 1: Start with Free Tier

OpenAI offers $5 free credit for new accounts:

  • Covers ~25 full test cycles with gpt-4o-mini
  • Or ~2-3 validation runs with gpt-5.1
  • Perfect for initial exploration

💰 Tip 2: Reuse Test Prompts

Create once: comprehensive_test_suite.csv
Reuse for:
- Different system prompts
- Different models
- Different courses (adapted)

Saves: Development time, and keeps results comparable across runs

💰 Tip 3: Manual Review Before Evaluation

Batch processor → Manual spot check → Evaluation (if needed)

If spot check reveals major issues:

  • Fix system prompt
  • Rerun batch processor (cheap)
  • Skip evaluation until fixed

Saves: $1-2 per iteration

💰 Tip 4: Batch Similar Requests

# Inefficient (3 separate runs)
--input test_set1.csv --model gpt-5.1
--input test_set2.csv --model gpt-5.1
--input test_set3.csv --model gpt-5.1

# Efficient (1 combined run, benefits from caching)
# Keep the header row from the first file only, then append the data rows
head -n 1 test_set1.csv > combined.csv
tail -q -n +2 test_set1.csv test_set2.csv test_set3.csv >> combined.csv
--input combined.csv --model gpt-5.1

Saves: 10-20% via better prompt caching

💰 Tip 5: Use Evaluation Selectively

Don't evaluate everything:

# Manual review first
# Identify problem areas
# Evaluate only those specific cases

python3 llm_evaluator.py \
    --input responses.csv \
    --output evaluated.csv \
    --mode rerun_ids \
    --ids "45,67,89,120"

Saves: 50-90% of evaluation costs


ROI Considerations

Time Saved

Manual review of 140 responses:

  • ~3-5 minutes per response
  • Total: 7-12 hours human time

Automated evaluation:

  • Setup: 5 minutes
  • Execution: 20-30 minutes (unattended)
  • Review: 1-2 hours
  • Total: ~2-3 hours of your time, mostly review

Time saved: 5-10 hours

Cost of Time vs. API Costs

If your time is worth $50/hour:
- Manual review: $350-600
- Automated: $100-150 (time) + $2 (API) = $102-152

Savings: $248-448
ROI: 163-295%

Free/Low-Cost Alternatives

Option 1: Use GPT-4o-mini Only

For everything:

Testing: gpt-4o-mini (~$0.06)
Evaluation: Manual review (free)

Total: $0.06 + your time

Trade-off: Your time vs. money

Option 2: Smaller Test Suites

Instead of 140 prompts:
- 50 prompts with gpt-5.1 = $0.70

Instead of full evaluation:
- Manual review of 50 = ~2 hours

Works well for: Smaller courses, tight budgets

Option 3: Community Test Prompts

Use the provided prompts.csv:

  • Already designed and tested
  • Covers common scenarios
  • Free to use

Cost: Only API usage, no development time


Pricing Changes

Important: OpenAI updates pricing periodically.

Always check current prices:

  1. Visit OpenAI Pricing Page
  2. Update your --price-input-per-1k and --price-output-per-1k arguments
  3. Re-calculate budget estimates

Historical trend: Prices generally decrease over time as models improve.



Quick Reference

Budget Testing Setup

--model gpt-4o-mini \
--price-input-per-1k 0.00015 \
--price-output-per-1k 0.0006

Production Validation Setup

--model gpt-5.1 \
--price-input-per-1k 0.00125 \
--price-output-per-1k 0.01

Cost-Effective Workflow

1. Develop with gpt-4o-mini ($0.10-0.20)
2. Validate with gpt-5.1 ($1.50-2.50)
3. Total: ~$2.00 for complete pipeline
