
Cost & Pricing Guide

OpenAI model pricing and budget optimisation strategies


Current OpenAI Pricing (December 2025)

All prices shown are per 1 million tokens, Standard Tier

Recommended Models

Model        Input   Cached Input*   Output   Best For
gpt-5.2      $1.75   $0.175          $14.00   Latest flagship, best reasoning
gpt-5.1      $1.25   $0.125          $10.00   Excellent balance, cost-effective
gpt-5        $1.25   $0.125          $10.00   Previous flagship, proven reliable
gpt-4.1      $2.00   $0.50           $8.00    Balanced option
gpt-4o       $2.50   $1.25           $10.00   Multimodal, widely used
gpt-5-mini   $0.25   $0.025          $2.00    Budget testing
gpt-4o-mini  $0.15   $0.075          $0.60    Cheapest option
o4-mini      $1.10   $0.275          $4.40    Enhanced reasoning

*Cached input = prompt caching for repeated system prompts (automatic)


Cost Calculator

Formula

Cost = (Input Tokens / 1,000,000 × Input Price) + (Output Tokens / 1,000,000 × Output Price)
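
A minimal Python sketch of this formula, using Standard Tier prices from the table above (verify them against OpenAI's pricing page before relying on the output):

# Per-1M-token Standard Tier prices (USD) from the table above.
PRICING = {
    "gpt-5.1":     {"input": 1.25, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Apply the formula: token counts scaled by per-1M prices."""
    p = PRICING[model]
    return (input_tokens / 1_000_000 * p["input"]
            + output_tokens / 1_000_000 * p["output"])

# Matches the gpt-4o-mini example below (140 prompts, ~1,000 in / ~500 out each):
print(f"${estimate_cost('gpt-4o-mini', 140_000, 70_000):.3f}")  # $0.063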

Typical Token Counts

For AI Tutor Testing:

Component              Typical Tokens
System prompt          800-1,200
Student prompt         50-200
Tutor response         300-800
Total per interaction  ~1,150-2,200

For Evaluation:

Component             Typical Tokens
Evaluation prompt     1,500-2,000
Judge reasoning       400-800
Total per evaluation  ~1,900-2,800
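
These are estimates; to measure your own prompts, you can count tokens locally with OpenAI's tiktoken library. A sketch (tiktoken's registry may not include the newest model names, hence the fallback):

import tiktoken

def count_tokens(text: str, model: str = "gpt-4o-mini") -> int:
    """Count tokens the way the model's tokenizer would."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Fall back to the encoding used by the gpt-4o family.
        enc = tiktoken.get_encoding("o200k_base")
    return len(enc.encode(text))

with open("system_prompt.txt", encoding="utf-8") as f:
    print(count_tokens(f.read()))  # compare against the 800-1,200 estimate above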

Example Calculations

140 Prompts, gpt-4o-mini (Budget Testing)

Tutor Responses:

Input:  140 × 1,000 = 140,000 tokens
Output: 140 × 500  = 70,000 tokens

Cost = (140,000 / 1,000,000 × $0.15) + (70,000 / 1,000,000 × $0.60)
     = $0.021 + $0.042
     = $0.063 (~$0.06)

Evaluation:

Input:  140 × 1,800 = 252,000 tokens
Output: 140 × 600   = 84,000 tokens

Cost = (252,000 / 1,000,000 × $0.15) + (84,000 / 1,000,000 × $0.60)
     = $0.038 + $0.050
     = $0.088 (~$0.09)

Total Pipeline: ~$0.15


140 Prompts, gpt-5.1 (Production Validation)

Tutor Responses:

Input:  140,000 tokens
Output: 70,000 tokens

Cost = (140,000 / 1,000,000 × $1.25) + (70,000 / 1,000,000 × $10.00)
     = $0.175 + $0.700
     = $0.875 (~$0.88)

Evaluation (gpt-5.1 as judge):

Input:  252,000 tokens
Output: 84,000 tokens

Cost = (252,000 / 1,000,000 × $1.25) + (84,000 / 1,000,000 × $10.00)
     = $0.315 + $0.840
     = $1.155 (~$1.16)

Total Pipeline: ~$2.03


140 Prompts, gpt-5.2 (Premium Validation)
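
Using the same token counts as the gpt-5.1 example:

Tutor Responses:

Cost = (140,000 / 1,000,000 × $1.75) + (70,000 / 1,000,000 × $14.00)
     = $0.245 + $0.980
     = $1.225 (~$1.23)

Evaluation (gpt-5.2 as judge):

Cost = (252,000 / 1,000,000 × $1.75) + (84,000 / 1,000,000 × $14.00)
     = $0.441 + $1.176
     = $1.617 (~$1.62)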

Total Pipeline: ~$2.85


Budget Optimisation Strategies

Strategy 1: Tiered Testing

Most cost-effective approach:

Phase 1: Initial Testing (gpt-4o-mini)
- 20-50 test prompts
- Quick iteration on system prompt
- Cost: $0.02-0.05

Phase 2: Expanded Testing (gpt-4o-mini)
- 100-140 prompts
- Comprehensive coverage
- Cost: $0.10-0.15

Phase 3: Validation (gpt-5.1)
- Same 140 prompts
- Final confirmation before deployment
- Cost: ~$2.00

Total: ~$2.15-2.20

vs. Using gpt-5.2 for everything: ~$8.55

Savings: 74%

Strategy 2: Selective Evaluation

Instead of evaluating all responses, evaluate strategically:

# Evaluate only failed/borderline cases
# Filter responses.csv first

# Or evaluate random sample
python3 sample_responses.py --input responses.csv --sample 50 --output sample.csv
python3 llm_evaluator.py --input sample.csv ...

Cost reduction: 50-75%
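
A sketch of the "filter first" step, assuming responses.csv carries a column you can filter on (the review_flag column name here is hypothetical; adjust it to your actual CSV schema):

import csv

# Keep only rows flagged as borderline in a prior manual pass.
with open("responses.csv", newline="", encoding="utf-8") as src, \
     open("borderline.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    writer.writerows(row for row in reader if row.get("review_flag") == "borderline")

Then run llm_evaluator.py on borderline.csv instead of the full set.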

Strategy 3: Prompt Caching

OpenAI automatically caches the repeated prefix of long prompts (roughly 1,024 tokens or more), which covers a stable system prompt. To maximise caching:

  1. Keep system prompt stable during a test session
  2. Run all tests with same prompt before changing it
  3. Rerun tests promptly; reruns within the same session benefit from the cached system prompt (caches expire after a few minutes of inactivity)

Cached input pricing:

  • gpt-5.1: $0.125 vs $1.25 (90% off)
  • gpt-4o: $1.25 vs $2.50 (50% off)

Example benefit (gpt-5.1 full pipeline; the 90% discount applies only to the input share, $0.49 of the ~$2.03 total):

Without caching: ~$2.03
With caching (80% of input cached): ~$1.68
Savings: ~17%
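
A sketch of that estimate, as a function of the cached fraction and the cached-input discount:

def cost_with_caching(input_cost: float, output_cost: float,
                      cached_fraction: float, discount: float) -> float:
    """Total cost when `cached_fraction` of input tokens get `discount` off."""
    effective_input = input_cost * (1 - cached_fraction * discount)
    return effective_input + output_cost

# gpt-5.1 pipeline: $0.49 input, $1.54 output; 80% cached at the 90% discount
print(f"${cost_with_caching(0.49, 1.54, 0.80, 0.90):.2f}")  # ~$1.68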

Strategy 4: Batch API (Advanced)

For non-urgent testing, use OpenAI's Batch API:

Pricing (50% off Standard):

  • gpt-5.1: $0.625 input, $5.00 output
  • gpt-4o: $1.25 input, $5.00 output

Trade-off: asynchronous processing; results arrive within a 24-hour window

When to use: Large-scale testing (500+ prompts) that isn't urgent
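
A minimal sketch of submitting a batch with the official openai Python SDK (preparation of the JSONL request file, one chat-completions request per line with a unique custom_id, is omitted; see OpenAI's Batch API docs for the exact format):

from openai import OpenAI

client = OpenAI()

# Upload the prepared requests file, then create the batch job.
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)  # poll later with client.batches.retrieve(batch.id)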


Model Selection Guide

For Initial Development

Use: gpt-4o-mini

✅ Very cheap
✅ Fast responses
✅ Good enough for iteration
❌ Less capable reasoning
❌ More prone to mistakes

Cost: ~$0.15 for 140 prompts (full pipeline)

For Production Validation

Use: gpt-5.1

✅ Excellent capabilities
✅ Reliable performance
✅ Good cost-performance ratio
✅ Adequate for most use cases
❌ Not the absolute best

Cost: ~$2.00 for 140 prompts (full pipeline)

For Critical Applications

Use: gpt-5.2

✅ Best available reasoning
✅ Most accurate
✅ Latest features
❌ Most expensive

Cost: ~$2.85 for 140 prompts (full pipeline)

For Budget-Conscious Testing

Use: gpt-4o-mini for tutor, gpt-4o for evaluation

✅ Cheap tutor testing
✅ Better evaluation quality
✅ Reasonable compromise

Cost: ~$1.55 for 140 prompts (full pipeline: ~$0.06 tutor + ~$1.47 gpt-4o evaluation)


Cost Comparison Table

140 prompts, full pipeline (tutor + evaluation)

Configuration  Tutor Model  Judge Model  Total Cost  Use Case
Ultra Budget   gpt-4o-mini  gpt-4o-mini  ~$0.15      Early iteration
Budget         gpt-4o-mini  gpt-4o       ~$1.55      Testing phase
Balanced       gpt-5.1      gpt-5.1      ~$2.00      Standard validation
Premium        gpt-5.2      gpt-5.2      ~$2.85      Final validation
Mixed          gpt-4o-mini  gpt-5.1      ~$1.20      Cost-conscious validation

Real-World Testing Scenarios

Scenario 1: Course Coordinator, First Time

Goal: Test AI tutor for BIO301, 50 students

Approach:

Week 1: Develop system prompt
- 20 test prompts with gpt-4o-mini
- Iterate 3-4 times
- Cost: $0.10

Week 2: Expand testing
- 100 comprehensive prompts with gpt-4o-mini
- Manual review of responses
- Cost: $0.12

Week 3: Final validation
- Same 100 prompts with gpt-5.1
- Full evaluation with gpt-5.1
- Cost: $1.50

Total: $1.72

Scenario 2: Large Department, Multiple Courses

Goal: Deploy AI tutors for 5 courses, 500 students

Approach:

Phase 1: Template development
- Create base prompt
- Test with 140 prompts × gpt-4o-mini
- Cost per course: $0.15
- Total: $0.75

Phase 2: Course customisation
- Adapt for each course
- Test customisations × gpt-4o-mini
- Cost per course: $0.10
- Total: $0.50

Phase 3: Production validation
- All courses × 140 prompts × gpt-5.1
- Cost per course: $2.00
- Total: $10.00

Total: $11.25 for 5 complete AI tutors

Scenario 3: Research Study

Goal: Compare different prompting strategies

Approach:

Test 5 different system prompts
- Each with 140 test prompts
- Use gpt-5.1 for consistency
- Evaluate all responses

Cost: 5 × $2.00 = $10.00

Cost Tracking

Using the Scripts

Batch Processor:

python3 llm_batch_processor.py \
    --input prompts.csv \
    --output responses.csv \
    --system system_prompt.txt \
    --model gpt-5.1 \
    --price-input-per-1k 0.00125 \
    --price-output-per-1k 0.01

Output includes:

=== SUMMARY ===
Measured total tokens: 102,755
Estimated total cost (USD): $0.87536
Average cost per row: $0.00625

Manual Tracking

Create a spreadsheet:

Date    Phase       Model        Prompts  Cost   Notes
Jan 15  Initial     gpt-4o-mini  20       $0.02  First iteration
Jan 16  Testing     gpt-4o-mini  100      $0.12  Full test suite
Jan 17  Validation  gpt-5.1      100      $1.45  Final check
Total                            220      $1.59
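
Or log runs programmatically; a sketch that appends one row per run to a costs.csv (the filename and columns are just suggestions):

import csv, datetime, os

def log_run(phase: str, model: str, prompts: int, cost_usd: float,
            notes: str = "", path: str = "costs.csv") -> None:
    """Append one cost-tracking row, writing the header on first use."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["date", "phase", "model", "prompts", "cost_usd", "notes"])
        writer.writerow([datetime.date.today().isoformat(), phase, model,
                         prompts, f"{cost_usd:.2f}", notes])

log_run("Validation", "gpt-5.1", 100, 1.45, "Final check")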

Money-Saving Tips

💰 Tip 1: Start with Free Tier

OpenAI offers $5 free credit for new accounts:

  • Covers ~25 full test cycles with gpt-4o-mini
  • Or ~2-3 validation runs with gpt-5.1
  • Perfect for initial exploration

💰 Tip 2: Reuse Test Prompts

Create once: comprehensive_test_suite.csv
Reuse for:
- Different system prompts
- Different models
- Different courses (adapted)

Saves: Development time, and keeps results comparable across runs

💰 Tip 3: Manual Review Before Evaluation

Batch processor → Manual spot check → Evaluation (if needed)

If spot check reveals major issues:

  • Fix system prompt
  • Rerun batch processor (cheap)
  • Skip evaluation until fixed

Saves: $1-2 per iteration

💰 Tip 4: Batch Similar Requests

# Inefficient (3 separate runs)
--input test_set1.csv --model gpt-5.1
--input test_set2.csv --model gpt-5.1
--input test_set3.csv --model gpt-5.1

# Efficient (1 combined run, benefits from caching)
# Keep the header row from the first file only, then append the data rows
head -n 1 test_set1.csv > combined.csv
tail -q -n +2 test_set1.csv test_set2.csv test_set3.csv >> combined.csv
--input combined.csv --model gpt-5.1

Saves: 10-20% via better prompt caching

💰 Tip 5: Use Evaluation Selectively

Don't evaluate everything:

# Manual review first
# Identify problem areas
# Evaluate only those specific cases

python3 llm_evaluator.py \
    --input responses.csv \
    --output evaluated.csv \
    --mode rerun_ids \
    --ids "45,67,89,120"

Saves: 50-90% of evaluation costs


ROI Considerations

Time Saved

Manual review of 140 responses:

  • ~3-5 minutes per response
  • Total: 7-12 hours human time

Automated evaluation:

  • Setup: 5 minutes
  • Execution: 20-30 minutes (unattended)
  • Review: 1-2 hours
  • Total: ~2-3 hours of your time, mostly review

Time saved: 5-10 hours

Cost of Time vs. API Costs

If your time is worth $50/hour:
- Manual review: $350-600
- Automated: $100-150 (time) + $2 (API) = $102-152

Savings: $248-448
ROI: 163-295%

Free/Low-Cost Alternatives

Option 1: Use GPT-4o-mini Only

For everything:

Testing: gpt-4o-mini (~$0.06)
Evaluation: Manual review (free)

Total: $0.06 + your time

Trade-off: Your time vs. money

Option 2: Smaller Test Suites

Instead of 140 prompts:
- 50 prompts with gpt-5.1 = $0.70

Instead of full evaluation:
- Manual review of 50 = ~2 hours

Works well for: Smaller courses, tight budgets

Option 3: Community Test Prompts

Use the provided prompts.csv:

  • Already designed and tested
  • Covers common scenarios
  • Free to use

Cost: Only API usage, no development time


Pricing Changes

Important: OpenAI updates pricing periodically.

Always check current prices:

  1. Visit OpenAI Pricing Page
  2. Update your --price-input-per-1k and --price-output-per-1k arguments
  3. Re-calculate budget estimates

Historical trend: Prices generally decrease over time as models improve.



Quick Reference

Budget Testing Setup

--model gpt-4o-mini \
--price-input-per-1k 0.00015 \
--price-output-per-1k 0.0006

Production Validation Setup

--model gpt-5.1 \
--price-input-per-1k 0.00125 \
--price-output-per-1k 0.01

Cost-Effective Workflow

1. Develop with gpt-4o-mini ($0.10-0.20)
2. Validate with gpt-5.1 ($1.50-2.50)
3. Total: ~$2.00 for complete pipeline
