Implement test suite with proper Claude API test separation #15

@durapensa

Test Suite with Smart Claude API Testing Strategy

Overview

Implement a comprehensive test suite that properly separates tests by Claude API dependency, ensuring fast CI/CD execution while maintaining thorough coverage of AI-powered analysis tools.

Core Challenge: Claude API Testing

Problems with Naive Approach

  • GitHub Actions: No ANTHROPIC_API_KEY is available unless one is configured as a repository secret
  • Token costs: Real Claude calls are expensive (~$0.01-0.10 per test)
  • Speed: Each Claude API call adds 5-30s to a test
  • Non-determinism: API responses vary, causing flaky tests
  • Rate limits: Frequent API calls may hit quotas

Smart Testing Architecture Solution

Test Categories by Claude Dependency

1. Fast Tests (No Claude API) - tests/fast/

# Run on every commit, GitHub Actions friendly
tests/fast/
├── unit/
│   ├── test_ks_env_functions.sh      # ks_validate_days, ks_collect_files
│   ├── test_input_validation.sh      # Parameter sanitization  
│   └── test_file_processing.sh       # JSONL parsing, file collection
├── integration/
│   ├── test_capture_tools.sh         # events, query (no analysis)
│   ├── test_process_tools.sh         # rotate-logs, validate-jsonl
│   └── test_error_handling.sh        # Malformed data, permissions
└── security/
    ├── test_injection_prevention.sh  # Input sanitization
    └── test_path_validation.sh       # File path security

# Execution time: <30 seconds total
# Coverage: ~70% of codebase (all non-AI functionality)
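
A fast test in this category needs no network access and no fixtures. As a minimal sketch, assuming ks_validate_days takes one argument and accepts only positive integers (the real function in .ks-env may behave differently, so `validate_days_sketch` and the two `expect_*` helpers below are hypothetical stand-ins):

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for ks_validate_days; the real function lives in
# .ks-env and its exact contract is an assumption here.
validate_days_sketch() {
    local days="${1:-}"
    case "$days" in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit character
        0*)          return 1 ;;   # zero, or a number with leading zeros
        *)           return 0 ;;
    esac
}

# Tiny assertion helpers: report on stderr and return non-zero on a mismatch.
expect_ok()   { "$@" || { echo "FAIL (expected success): $*" >&2; return 1; }; }
expect_fail() { ! "$@" || { echo "FAIL (expected failure): $*" >&2; return 1; }; }

run_validation_checks() {
    expect_ok   validate_days_sketch 7 &&
    expect_ok   validate_days_sketch 365 &&
    expect_fail validate_days_sketch 0 &&
    expect_fail validate_days_sketch -1 &&
    expect_fail validate_days_sketch "7; rm -rf /" &&
    expect_fail validate_days_sketch ""
}
```

Calling `run_validation_checks` returns 0 only if every check passes. In the real suite the same cases would be written as bats `@test` blocks against the actual function.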

2. Mocked Tests (Fake Claude API) - tests/mocked/

# GitHub Actions friendly, consistent results
tests/mocked/
├── fixtures/
│   ├── claude_responses/
│   │   ├── themes_response_sample1.json
│   │   ├── connections_response_sample1.json
│   │   └── malformed_response.json
│   └── test_events/
│       ├── minimal_dataset.jsonl     # 5 events, predictable
│       ├── theme_dataset.jsonl       # Events → known themes  
│       └── connection_dataset.jsonl  # Events → known connections
├── test_extract_themes_mocked.sh     # Mock ks_claude() function
├── test_find_connections_mocked.sh   # Predictable responses
└── test_error_scenarios_mocked.sh    # API failures, timeouts

# Mock Implementation:
ks_claude() {
    # Override in test environment
    local prompt="$*"
    case "$prompt" in
        *"extract themes"*) cat tests/mocked/fixtures/claude_responses/themes_response_sample1.json ;;
        *"find connections"*) cat tests/mocked/fixtures/claude_responses/connections_response_sample1.json ;;
        *) echo '{"error": "unmocked prompt"}' ;;
    esac
}

# Execution time: <60 seconds total  
# Coverage: 95% of analysis tool functionality with consistent results

3. Real API Tests (Actual Claude) - tests/e2e/

# Local development or secret-backed nightly builds only, requires ANTHROPIC_API_KEY
tests/e2e/
├── test_analysis_integration.sh      # Real Claude API calls
├── test_large_dataset_analysis.sh    # Performance with real data
└── test_api_error_handling.sh        # Real API failure scenarios

# Smart optimizations:
# - Minimal datasets (5-10 events max)
# - Cached results to avoid repeated calls
# - Optional --use-cached flag for development

# Execution time: 2-5 minutes (limited Claude calls)
# Coverage: End-to-end validation with real AI
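
Since these tests must never run without a key, each E2E script could start with a guard. The sketch below is one possible shape, not the project's actual runner: `require_api_key` and `E2E_SKIP_CODE` are hypothetical names, and exit code 77 is borrowed from the Automake convention for skipped tests.

```shell
#!/usr/bin/env bash
# Hypothetical guard for E2E scripts: succeed only when a key is present.
# 77 is the "skipped" exit code used by Automake test harnesses; this
# repository may prefer a different convention.
E2E_SKIP_CODE=77

require_api_key() {
    if [ -z "${ANTHROPIC_API_KEY:-}" ]; then
        echo "SKIP: ANTHROPIC_API_KEY is not set; skipping real Claude API tests" >&2
        return "$E2E_SKIP_CODE"
    fi
    return 0
}
```

A script would call `require_api_key || exit $?` before its first Claude call, so a missing key shows up as a skip and not as a failure.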

Optimized Test Data Strategy

Minimal, Predictable Datasets

# tests/fixtures/minimal_theme_dataset.jsonl (5 events)
{"ts":"2025-01-01T10:00:00Z","type":"thought","topic":"memory","content":"Human memory is associative, not indexed"}
{"ts":"2025-01-01T10:01:00Z","type":"thought","topic":"memory","content":"Computer memory is linear and addressable"}  
{"ts":"2025-01-01T10:02:00Z","type":"connection","topic":"memory-systems","content":"Biological vs digital memory architectures"}
{"ts":"2025-01-01T10:03:00Z","type":"insight","topic":"knowledge-systems","content":"Event sourcing mirrors episodic memory"}
{"ts":"2025-01-01T10:04:00Z","type":"thought","topic":"temporal-meaning","content":"Time shapes knowledge, not just stores it"}

# Expected themes: Memory Systems, Knowledge Architecture, Temporal Meaning
# Designed to produce predictable, testable analysis results

Pre-Generated Claude Responses

// tests/fixtures/claude_responses/themes_minimal_dataset.json
{
  "themes": [
    {
      "name": "Memory System Architecture", 
      "description": "Comparison of biological vs computational memory models",
      "frequency": 3,
      "supporting_quotes": ["Human memory is associative", "Computer memory is linear"]
    },
    {
      "name": "Temporal Knowledge Dynamics",
      "description": "Time as constitutive element of knowledge formation", 
      "frequency": 2,
      "supporting_quotes": ["Event sourcing mirrors episodic memory", "Time shapes knowledge"]
    }
  ]
}

CI/CD Integration Strategy

GitHub Actions Workflow

# .github/workflows/test.yml
name: Knowledge System Tests
on: [push, pull_request]

jobs:
  fast-tests:
    name: Fast Tests (No Claude API)
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup system
        run: ./setup.sh
      - name: Run fast test suite
        run: ./tests/run_fast_tests.sh
        
  mocked-tests:
    name: Mocked Tests (Fake Claude API) 
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup system  
        run: ./setup.sh
      - name: Run mocked analysis tests
        run: ./tests/run_mocked_tests.sh
        
  # Note: No real Claude API tests in CI
  # Those run manually or in nightly builds with secrets

Local Development Workflow

# Quick development cycle
./tests/run_fast_tests.sh              # 30s, no API calls

# Full validation (requires API key)
export ANTHROPIC_API_KEY="your-key"
./tests/run_all_tests.sh               # 5m, includes real Claude calls

# CI simulation (what GitHub Actions runs)
./tests/run_ci_tests.sh                # 90s, fast + mocked only
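
The three runner scripts differ only in which suites they include. A minimal sketch of that selection logic, assuming a hypothetical helper named `select_suites` and suite names that mirror the tests/ subdirectories (the actual runner scripts are not shown in this issue):

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the suites a runner should execute for a mode.
select_suites() {
    local mode="${1:-ci}"
    case "$mode" in
        fast) echo "fast" ;;
        ci)   echo "fast mocked" ;;
        all)
            if [ -n "${ANTHROPIC_API_KEY:-}" ]; then
                echo "fast mocked e2e"
            else
                # Without a key, degrade to the CI set instead of failing.
                echo "warning: ANTHROPIC_API_KEY not set, omitting e2e" >&2
                echo "fast mocked"
            fi
            ;;
        *) echo "unknown mode: $mode" >&2; return 2 ;;
    esac
}
```

Degrading `all` to the CI set when no key is present is a design choice of this sketch. Failing loudly would be equally defensible.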

Test Framework: bats-core with Smart Mocking

Mock Function Override

#!/usr/bin/env bats

# tests/mocked/test_extract_themes_mocked.sh

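setup() {
    TEST_KS_ROOT=$(mktemp -d)
    export TEST_KS_ROOT
    export KS_HOT_LOG="$TEST_KS_ROOT/hot.jsonl"

    # Copy test data
    cp "$BATS_TEST_DIRNAME/../fixtures/minimal_theme_dataset.jsonl" "$KS_HOT_LOG"

    source "$BATS_TEST_DIRNAME/../../.ks-env"

    # Override Claude function with mock. Define it after sourcing .ks-env so the
    # real definition does not replace it, and export it so the tool (a child
    # bash process started by `run`) inherits it. This assumes the tool does not
    # redefine ks_claude itself. KS_MOCK_RESPONSE is a helper variable introduced
    # here so the child process can resolve the fixture path.
    export KS_MOCK_RESPONSE="$BATS_TEST_DIRNAME/../fixtures/claude_responses/themes_minimal_dataset.json"
    ks_claude() {
        cat "$KS_MOCK_RESPONSE"
    }
    export -f ks_claude
}

teardown() {
    rm -rf "$TEST_KS_ROOT"
}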

@test "extract-themes produces expected theme count with mocked Claude" {
    run ./tools/analyze/extract-themes --days 1 --format json
    [ "$status" -eq 0 ]
    
    # Parse and validate expected themes
    theme_count=$(echo "$output" | jq '.themes | length')
    [ "$theme_count" -eq 2 ]
    
    # Validate specific theme names
    [[ "$output" == *"Memory System Architecture"* ]]
    [[ "$output" == *"Temporal Knowledge Dynamics"* ]]
}

@test "extract-themes handles mocked API errors gracefully" {
    # Override with error response; export it so the tool's process sees it
    ks_claude() {
        echo '{"error": "API temporarily unavailable"}'
        return 1
    }
    export -f ks_claude
    
    run ./tools/analyze/extract-themes --days 1
    [ "$status" -ne 0 ]
    [[ "$output" == *"Error"* ]]
}
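
Because bats `run` starts the tool as a separate process, a mock defined as a shell function reaches it only if the function is exported with `export -f` and the tool runs under bash. The sketch below shows that mechanism in isolation. `call_from_child` is a hypothetical stand-in for a tool script, and the real tools may behave differently if they source .ks-env and redefine ks_claude.

```shell
#!/usr/bin/env bash
# Define a mock and export it so child bash processes inherit it.
ks_claude() {
    echo '{"themes":[]}'
}
export -f ks_claude

# Stand-in for a tool script: a child bash process that calls ks_claude.
# Without the `export -f` above, this child would fail with "command not found".
call_from_child() {
    bash -c 'ks_claude "extract themes"'
}
```

If the tools turn out to redefine ks_claude on startup, an alternative is to have .ks-env skip its own definition when a test-only variable is set. That would be a change to the code under test and is not something this issue specifies.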

Performance Testing with Mocked Claude

Benchmark Infrastructure

# tests/performance/benchmark_with_mocks.sh

# Test jq optimization performance without Claude API overhead
benchmark_file_processing() {
    local event_count="$1"
    
    # Generate test dataset
    generate_test_events "$event_count" > "$TEST_KS_ROOT/large.jsonl"
    
    # Mock Claude to return instantly; export it so the tool's process inherits it
    ks_claude() { echo '{"themes":[]}'; }
    export -f ks_claude
    
    # Measure pure file processing performance
    time ./tools/analyze/extract-themes --days 1 --format json
}

# Results show actual optimization impact without API latency
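
The benchmark calls generate_test_events, which is not defined anywhere in this issue. One possible implementation, offered as an assumption and not as the project's generator, emits synthetic JSONL events with the same fields as the minimal fixture:

```shell
#!/usr/bin/env bash
# Hypothetical generate_test_events: prints N JSONL events on stdout.
# Timestamps advance one minute per event and wrap after 24 hours, which is
# acceptable for throughput measurements but not for date-filter tests.
generate_test_events() {
    local count="${1:-10}"
    local i=0 hour minute
    while [ "$i" -lt "$count" ]; do
        hour=$(( (i / 60) % 24 ))
        minute=$(( i % 60 ))
        printf '{"ts":"2025-01-01T%02d:%02d:00Z","type":"thought","topic":"topic-%d","content":"Synthetic event %d"}\n' \
            "$hour" "$minute" "$(( i % 5 ))" "$i"
        i=$(( i + 1 ))
    done
}
```

For example, `generate_test_events 1000 > "$TEST_KS_ROOT/large.jsonl"` writes a 1000-line file. Note that the fixed 2025-01-01 date may fall outside a `--days 1` window, so a benchmark that relies on date filtering would need current timestamps.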

Token Cost Optimization

Smart E2E Testing

# tests/e2e/test_with_minimal_claude_usage.sh

# Cache Claude responses to avoid repeated calls
CLAUDE_CACHE_DIR="$HOME/.ks-test-cache"

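cached_claude() {
    local cache_key cache_file response
    cache_key=$(echo "$*" | sha256sum | cut -d' ' -f1)
    cache_file="$CLAUDE_CACHE_DIR/$cache_key"
    
    if [ -f "$cache_file" ]; then
        cat "$cache_file"
    else
        # Real Claude call - cache the result only if the call succeeds,
        # so a failed or partial response is never served from the cache
        mkdir -p "$CLAUDE_CACHE_DIR"
        response=$(claude "$@") || return 1
        printf '%s\n' "$response" | tee "$cache_file"
    fi
}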

# Development workflow:
# 1. First run uses real Claude API (builds cache)
# 2. Subsequent runs use cached responses (free + fast)
# 3. --refresh-cache flag forces real API calls when needed
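
Step 3 of this workflow needs a way to bypass the cache. A minimal sketch under stated assumptions: the switch is modelled as an environment variable `KS_REFRESH_CACHE` (a runner's --refresh-cache flag could set it), the backend command is injectable through `CLAUDE_BACKEND` so the wrapper can be tested without the API, and `cksum` replaces sha256sum purely for portability. All three names and the function name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of a cache wrapper with a refresh switch.
# cksum (CRC) is used for portability only; a cryptographic hash such as
# sha256sum is the safer choice where it is available.
CLAUDE_CACHE_DIR="${CLAUDE_CACHE_DIR:-$HOME/.ks-test-cache}"
CLAUDE_BACKEND="${CLAUDE_BACKEND:-claude}"

cached_claude_refreshable() {
    local cache_key cache_file tmp_file
    cache_key=$(printf '%s' "$*" | cksum | tr ' ' '-')
    cache_file="$CLAUDE_CACHE_DIR/$cache_key"

    # Serve from cache unless a refresh was requested or the entry is empty.
    if [ "${KS_REFRESH_CACHE:-0}" != "1" ] && [ -s "$cache_file" ]; then
        cat "$cache_file"
        return 0
    fi

    mkdir -p "$CLAUDE_CACHE_DIR" || return 1
    tmp_file=$(mktemp "$CLAUDE_CACHE_DIR/tmp.XXXXXX") || return 1
    # Cache only successful calls so a failed request is retried next time.
    if "$CLAUDE_BACKEND" "$@" > "$tmp_file"; then
        mv "$tmp_file" "$cache_file"
        cat "$cache_file"
    else
        rm -f "$tmp_file"
        return 1
    fi
}
```

Writing to a temporary file and renaming it on success means an interrupted call cannot leave a truncated cache entry behind.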

Implementation Phases

Phase 1: Fast Test Foundation (1 day)

  • Set up bats-core testing framework
  • Implement all fast tests (no Claude API)
  • GitHub Actions integration for fast tests
  • Test data fixtures and generators

Phase 2: Mocked Analysis Tests (1 day)

  • Create Claude response fixtures
  • Implement ks_claude() mocking system
  • Mocked tests for extract-themes, find-connections
  • Error scenario testing with mocked failures

Phase 3: Smart E2E Testing (1 day)

  • Caching system for real Claude responses
  • Minimal dataset E2E tests
  • Local-only test runner with API key checks
  • Performance testing with cache optimization

Success Criteria

Coverage Targets

  • Fast tests: 70% coverage, <30s execution, GitHub Actions ready
  • Mocked tests: 95% analysis functionality, <60s execution
  • E2E tests: Real Claude validation, <5 API calls total

Cost Management

  • Zero tokens spent in CI/CD (fast + mocked tests only)
  • <$0.50 per full E2E test run (minimal, cached Claude usage)
  • Cached responses prevent repeated token costs in development
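
The cost ceiling follows from the estimates given earlier in this issue. As a worked check in integer cents, assuming one API call per test and taking the upper end of the per-test estimate (the variable and function names are illustrative, and the figures are estimates, not measured billing):

```shell
#!/usr/bin/env bash
# Worst-case E2E cost in integer cents (shell arithmetic has no floats).
max_cost_cents_per_call=10   # upper end of the ~$0.01-0.10 estimate
max_calls=4                  # "<5 API calls total"
budget_cents=50              # "<$0.50 per full E2E test run"

e2e_worst_case_cents() {
    echo $(( max_cost_cents_per_call * max_calls ))
}

within_budget() {
    [ "$(e2e_worst_case_cents)" -lt "$budget_cents" ]
}
```

With at most four calls at up to 10 cents each, the worst case is 40 cents, which stays under the 50-cent target even before caching removes repeat calls.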

Developer Experience

  • Fast feedback loop: 30s for quick validation
  • Complete validation: 5m for full test suite with real Claude
  • CI/CD friendly: No secrets required for most tests

This architecture keeps test coverage thorough while spending no tokens in CI, capping the cost of full E2E runs, and keeping the development feedback loop fast.

Metadata

Assignees

No one assigned

    Labels

    priority: high (Critical for project progress)
    type: testing (Test frameworks and automation)
