This interactive wizard guides teams through selecting the right evaluation approach for their AI project. Answer the questions below to get a customized evaluation recommendation with specific metrics, tools, and implementation guidance.
What type of AI system are you building?
A) Question & Answer System
- Customer support chatbot
- Knowledge base search
- FAQ automation
→ Go to Section A: Q&A Systems
B) Content Generation System
- Blog posts, marketing copy
- Creative writing, stories
- Product descriptions
→ Go to Section B: Content Generation
C) Document Analysis System (RAG)
- Document Q&A
- Research assistance
- Information extraction
→ Go to Section C: RAG Systems
D) Code Assistant
- Code completion
- Code generation
- Code explanation
→ Go to Section D: Code Systems
E) Specialized Task
- Translation, summarization
- Data analysis, classification
- Multi-modal tasks
→ Go to Section E: Specialized Tasks
How accurate do your answers need to be?
- Critical (Medical/Legal/Financial): >98% accuracy required
- High (Business/Professional): >90% accuracy required
- Medium (General Information): >80% accuracy acceptable
- Low (Entertainment/Creative): Accuracy less critical
What's your expected usage and budget?
- High Volume (>10K queries/day): Focus on automated metrics
- Medium Volume (1K-10K queries/day): Hybrid approach
- Low Volume (<1K queries/day): Can afford more human evaluation
How specialized is your domain?
- Highly Specialized: Requires domain expert evaluation
- Moderately Specialized: General experts sufficient
- General Domain: Standard evaluation approaches work
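The context questions above reduce to a lookup from (accuracy tier, volume tier) to a recommended approach. A minimal sketch, assuming hypothetical tier names that mirror the recommendation blocks later in this wizard (nothing here is a fixed API):

```python
# Hypothetical sketch: map wizard answers to a recommended evaluation
# approach. Tier names and approach strings are illustrative only.
APPROACHES = {
    ("critical", "high"): "Multi-layered Automated + Expert Review",
    ("critical", "low"): "Human-First Expert Review",
    ("medium", "high"): "LLM-as-Judge",
    ("medium", "medium"): "LLM-as-Judge + User Feedback",
    ("low", "high"): "Automated Metrics",
    ("low", "low"): "Basic Automated Checks",
}

def recommend(accuracy_tier: str, volume_tier: str) -> str:
    """Return a recommended approach; unmapped combinations fall back to a hybrid."""
    return APPROACHES.get((accuracy_tier, volume_tier), "Hybrid Approach")
```

Encoding the mapping as data keeps it easy to review and extend as your constraints change.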
For Critical Accuracy + High Volume:
```yaml
recommended_approach: "Multi-layered Automated + Expert Review"
primary_metrics:
  - factual_accuracy: 0.30      # Use fact-checking APIs
  - answer_completeness: 0.25   # LLM-as-judge evaluation
  - source_attribution: 0.20    # Citation verification
  - user_satisfaction: 0.15     # Direct user feedback
  - safety_compliance: 0.10     # Automated safety checks
implementation_plan:
  phase_1: "Automated fact-checking + safety"
  phase_2: "LLM-as-judge for completeness"
  phase_3: "Expert review sampling (5-10%)"
monthly_budget: "$500-2000"
setup_time: "2-4 weeks"
```
For Medium Accuracy + Medium Volume:
```yaml
recommended_approach: "LLM-as-Judge + User Feedback"
primary_metrics:
  - answer_relevance: 0.25        # Semantic similarity
  - factual_consistency: 0.25     # LLM verification
  - user_helpfulness: 0.20        # User ratings
  - response_completeness: 0.15   # Coverage analysis
  - response_time: 0.15           # Performance metrics
implementation_plan:
  phase_1: "Basic relevance + user feedback"
  phase_2: "LLM-as-judge evaluation"
  phase_3: "Optimization based on patterns"
monthly_budget: "$200-800"
setup_time: "1-2 weeks"
```
What type of content are you generating?
- Marketing/Business: Brand consistency, persuasiveness
- Creative/Entertainment: Originality, engagement
- Educational: Accuracy, clarity, appropriateness
- Technical: Precision, completeness, accuracy
How sensitive is your brand to content issues?
- High Sensitivity: Financial, healthcare, children's content
- Medium Sensitivity: Professional services, B2B
- Low Sensitivity: Entertainment, casual content
Do you have content experts available?
- Yes, Dedicated Team: Can do regular human review
- Yes, Part-time: Limited expert review capacity
- No: Must rely on automated evaluation
For High Brand Sensitivity + Expert Team:
```yaml
recommended_approach: "Human-First with AI Support"
primary_metrics:
  - brand_alignment: 0.25            # Human expert review
  - content_quality: 0.25            # Editorial assessment
  - safety_compliance: 0.20          # Automated + human review
  - audience_appropriateness: 0.15   # Target audience fit
  - engagement_potential: 0.15       # Predicted engagement
implementation_plan:
  phase_1: "Human review + basic safety"
  phase_2: "LLM-as-judge for efficiency"
  phase_3: "Predictive quality models"
human_review_rate: "50-100%"
monthly_budget: "$1000-5000"
setup_time: "2-3 weeks"
```
For Creative Content + Limited Oversight:
```yaml
recommended_approach: "AI-Enhanced with Sampling"
primary_metrics:
  - creativity_score: 0.30    # Novelty and originality
  - coherence_quality: 0.25   # Logical flow and structure
  - style_consistency: 0.20   # Voice and tone matching
  - user_engagement: 0.15     # Predicted engagement
  - safety_check: 0.10        # Basic safety filtering
implementation_plan:
  phase_1: "Automated creativity + safety"
  phase_2: "User feedback integration"
  phase_3: "Style optimization"
human_review_rate: "10-20%"
monthly_budget: "$300-1000"
setup_time: "1-2 weeks"
```
What kind of documents are you working with?
- Structured (Databases, APIs): High accuracy requirements
- Semi-structured (Reports, Articles): Medium accuracy needs
- Unstructured (Notes, Conversations): Flexibility required
How important is answer verification?
- Critical: Every answer must be traceable to source
- Important: Most answers should be verifiable
- Moderate: General accuracy is sufficient
How often does your knowledge base change?
- Real-time: Constantly updating information
- Regular: Weekly/monthly updates
- Static: Infrequent changes
For Critical Verification + Structured Sources:
```yaml
recommended_approach: "Faithfulness-First with Citation Tracking"
primary_metrics:
  - answer_faithfulness: 0.35   # RAGAS faithfulness metric
  - citation_accuracy: 0.25     # Source attribution verification
  - context_relevance: 0.20     # Retrieved content quality
  - answer_completeness: 0.15   # Coverage of user query
  - retrieval_precision: 0.05   # Search quality metrics
implementation_plan:
  phase_1: "RAGAS evaluation + basic citations"
  phase_2: "Enhanced citation verification"
  phase_3: "Context optimization"
tools_needed: ["RAGAS", "Citation verification", "Semantic search"]
monthly_budget: "$400-1500"
setup_time: "2-3 weeks"
tool_selection: "See Tool Comparison Matrix for detailed vendor selection"
```
For Moderate Verification + Dynamic Content:
```yaml
recommended_approach: "Balanced RAG with Monitoring"
primary_metrics:
  - answer_relevance: 0.25      # Query-answer alignment
  - context_utilization: 0.25   # Effective use of retrieved info
  - factual_consistency: 0.20   # LLM-based fact checking
  - user_satisfaction: 0.15     # Direct user feedback
  - system_performance: 0.15    # Speed and reliability
implementation_plan:
  phase_1: "Basic RAG metrics + user feedback"
  phase_2: "LLM-as-judge evaluation"
  phase_3: "Performance optimization"
tools_needed: ["RAGAS", "LLM-as-judge", "User feedback"]
monthly_budget: "$200-800"
setup_time: "1-2 weeks"
```
How critical is the generated code?
- Production Systems: Must be secure, efficient, correct
- Development Tools: Helpful but can have minor issues
- Learning/Examples: Educational value most important
What programming contexts are you supporting?
- Enterprise Languages (Java, C#, etc.): High standards
- Scripting (Python, JavaScript): Moderate standards
- Multiple Languages: Varied requirements
What are your security needs?
- High Security: Financial, healthcare, enterprise
- Medium Security: Standard business applications
- Low Security: Internal tools, prototypes
For Production + High Security:
```yaml
recommended_approach: "Multi-Stage Verification"
primary_metrics:
  - functional_correctness: 0.30   # Automated testing
  - security_compliance: 0.25      # Security scanning
  - code_quality: 0.20             # Style and best practices
  - performance_efficiency: 0.15   # Benchmarking
  - documentation_quality: 0.10    # Comments and docs
implementation_plan:
  phase_1: "Automated testing + security scanning"
  phase_2: "Code quality analysis"
  phase_3: "Performance optimization"
tools_needed: ["Testing frameworks", "Security scanners", "Code analyzers"]
monthly_budget: "$300-1200"
setup_time: "2-4 weeks"
```
For Development Tools + Medium Security:
```yaml
recommended_approach: "Balanced Code Evaluation"
primary_metrics:
  - syntactic_correctness: 0.25   # Basic syntax validation
  - logical_correctness: 0.25     # Functional testing
  - helpfulness: 0.20             # User feedback
  - code_style: 0.15              # Style guide compliance
  - security_basics: 0.15         # Basic security checks
implementation_plan:
  phase_1: "Syntax + basic testing"
  phase_2: "User feedback integration"
  phase_3: "Style and security enhancement"
tools_needed: ["Linters", "Test runners", "User feedback"]
monthly_budget: "$150-600"
setup_time: "1-2 weeks"
```
How complex is your specialized task?
- High Complexity: Multi-step reasoning, domain expertise
- Medium Complexity: Structured but nuanced tasks
- Low Complexity: Well-defined, repeatable tasks
How variable are acceptable outputs?
- Low Variability: Specific format/content required
- Medium Variability: Some flexibility acceptable
- High Variability: Creative interpretation encouraged
Do you have task-specific evaluation expertise?
- Yes, Internal Experts: Domain specialists available
- Yes, External Access: Can consult experts
- No, Limited Expertise: Must rely on general approaches
For High Complexity + Expert Available:
```yaml
recommended_approach: "Expert-Guided Evaluation"
primary_metrics:
  - task_accuracy: 0.30             # Expert assessment
  - methodology_correctness: 0.25   # Approach validation
  - completeness: 0.20              # Coverage assessment
  - presentation_quality: 0.15      # Output format and clarity
  - efficiency: 0.10                # Resource utilization
implementation_plan:
  phase_1: "Expert evaluation setup"
  phase_2: "Pattern identification"
  phase_3: "Automated screening development"
expert_involvement: "High (30-50% review rate)"
monthly_budget: "$800-3000"
setup_time: "3-6 weeks"
```
For Low Complexity + Limited Expertise:
```yaml
recommended_approach: "Template-Based Evaluation"
primary_metrics:
  - format_compliance: 0.30       # Template adherence
  - basic_accuracy: 0.25          # Factual correctness
  - completeness_check: 0.20      # Required elements present
  - user_satisfaction: 0.15       # Feedback collection
  - processing_efficiency: 0.10   # Performance metrics
implementation_plan:
  phase_1: "Template validation + basic checks"
  phase_2: "User feedback integration"
  phase_3: "Quality optimization"
automation_rate: "High (80-90%)"
monthly_budget: "$100-500"
setup_time: "1-2 weeks"
```
Use this matrix for rapid evaluation approach selection:
| Your Situation | Recommended Approach | Primary Tool | Monthly Budget |
|---|---|---|---|
| High Stakes + High Volume | Automated + Expert Sample | LLM-judge + Human | $1000-3000 |
| High Stakes + Low Volume | Human-First | Expert Review | $2000-5000 |
| Medium Stakes + High Volume | LLM-as-Judge | GPT-4/Claude | $300-1000 |
| Medium Stakes + Low Volume | Hybrid Approach | Mixed Tools | $200-800 |
| Low Stakes + High Volume | Automated | Open Source | $50-300 |
| Low Stakes + Low Volume | Basic Automated | Simple Tools | $25-150 |
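Whichever row you land on, the weighted metrics in each recommendation block combine into a single quality score by weighted average. A minimal sketch using the weights from the Critical Accuracy + High Volume example (the per-metric scores dict is hypothetical input, e.g. from your own evaluators):

```python
# Weights taken from the "Critical Accuracy + High Volume" recommendation above.
WEIGHTS = {
    "factual_accuracy": 0.30,
    "answer_completeness": 0.25,
    "source_attribution": 0.20,
    "user_satisfaction": 0.15,
    "safety_compliance": 0.10,
}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted average of per-metric scores, each expected in [0, 1]."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "weights must sum to 1"
    # Missing metrics count as 0.0, so gaps in evaluation lower the score.
    return sum(w * scores.get(metric, 0.0) for metric, w in WEIGHTS.items())
```

Treating an unevaluated metric as 0.0 is a deliberate conservative choice; you could instead renormalize over the metrics you did measure.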
💡 Need specific metrics? See the Metrics Selection Guide for detailed metric catalogs and implementation resources.
Path A: Safety-First Approach (Recommended for most)
- Implement toxicity detection (2 hours)
- Add basic relevance checking (2 hours)
- Set up monitoring dashboard (4 hours)
- Configure alerts (1 hour)
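The first two Path A steps can start as crude heuristics while you evaluate dedicated moderation APIs. A placeholder sketch, where the blocklist and word-overlap relevance are stand-ins, not production toxicity or similarity models:

```python
# Stand-in blocklist; replace with a real moderation model or API in production.
BLOCKLIST = {"idiot", "hate"}

def basic_safety_check(text: str) -> bool:
    """True if no blocklisted term appears (crude placeholder for toxicity detection)."""
    return set(text.lower().split()).isdisjoint(BLOCKLIST)

def basic_relevance(question: str, answer: str) -> float:
    """Word-overlap relevance in [0, 1]; a baseline before embedding-based similarity."""
    q_words = set(question.lower().split())
    a_words = set(answer.lower().split())
    return len(q_words & a_words) / len(q_words) if q_words else 0.0
```

Even this baseline gives you a dashboard signal on day one; swap in stronger models later without changing the calling code.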
Path B: Quality-First Approach (For content-critical applications)
- Set up LLM-as-judge evaluation (4 hours)
- Create quality rubrics (3 hours)
- Implement human review sampling (3 hours)
- Build feedback loops (2 hours)
Path C: User-First Approach (For customer-facing applications)
- Implement user feedback collection (3 hours)
- Set up task success tracking (3 hours)
- Add experience metrics (2 hours)
- Create improvement workflows (4 hours)
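Task success tracking amounts to logging a success flag per interaction and aggregating a rate; a minimal sketch with hypothetical field names:

```python
from dataclasses import dataclass, field

@dataclass
class FeedbackLog:
    """In-memory stand-in for a feedback store; persist to a database in practice."""
    events: list[tuple[str, bool]] = field(default_factory=list)  # (query_id, success)

    def record(self, query_id: str, success: bool) -> None:
        self.events.append((query_id, success))

    def success_rate(self) -> float:
        """Fraction of interactions users marked successful; 0.0 when empty."""
        if not self.events:
            return 0.0
        return sum(ok for _, ok in self.events) / len(self.events)
```

Keeping the query_id alongside each flag lets you later join feedback against logged prompts and responses for failure analysis.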
Before finalizing your evaluation approach, verify:
- Business Alignment: Metrics directly support business goals
- Resource Fit: Implementation within available time/budget
- Technical Feasibility: Required tools and expertise available
- Scalability: Approach works at expected usage levels
- Improvement Path: Clear plan for enhancing evaluation over time
If you're still unsure:
- Start with the Safety-First approach (lowest risk)
- Implement basic user feedback collection
- Run for 2-4 weeks to gather data
- Use results to refine your evaluation strategy
Common Decision Points:
- Budget Limited: Focus on automated metrics first
- Quality Critical: Invest in human evaluation early
- Scale Uncertain: Start simple, plan for growth
- Domain Complex: Engage experts from the beginning
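When you do invest in human evaluation, the review-sampling rates mentioned in the recommendations (e.g. 5-10%) can be implemented deterministically, so a given query is always in or out of the sample regardless of when it is checked. A sketch under that assumption:

```python
import hashlib

def in_review_sample(query_id: str, rate: float = 0.10) -> bool:
    """Deterministically route ~rate of traffic to human review via a hash bucket."""
    bucket = int(hashlib.sha256(query_id.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000
```

Hash-based sampling avoids the reproducibility problems of `random.random()` and makes audits of the sampled set straightforward.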
Ready to implement? Choose your next step:
- Starter Evaluation Toolkit: Get hands-on with day 1 implementation
- Master Implementation Roadmap: Plan your long-term evaluation strategy
- Cost-Benefit Calculator: Calculate ROI and justify your evaluation investment
Related resources:
- Starter Evaluation Toolkit: Day 1 implementation with code examples
- Implementation Guides: Detailed technical setup instructions
- Automation Templates: Production-ready deployment configurations
- Tool Comparison Matrix: Comprehensive vendor selection and platform guidance (Definitive Source)
- Cost-Benefit Calculator: ROI analysis and budget optimization
- Master Implementation Roadmap: Long-term planning with specialized templates
- Decision Trees: Task-specific metric selection guidance (Primary Authority)
- Quality Dimensions: Comprehensive quality framework
- Quick Assessment Tool: 2-minute evaluation readiness check
- Human Evaluation Guidelines: Standardized annotation workflows
This wizard provides a systematic way to select evaluation approaches that match your specific context and constraints while providing a clear path forward.