Status: Open
Labels: priority: high (Critical for project progress); type: testing (Test frameworks and automation)
Test Suite with Smart Claude API Testing Strategy
Overview
Implement a comprehensive test suite that properly separates tests by Claude API dependency, ensuring fast CI/CD execution while maintaining thorough coverage of AI-powered analysis tools.
Core Challenge: Claude API Testing
Problems with Naive Approach
- ❌ GitHub Actions: No access to ANTHROPIC_API_KEY
- ❌ Token costs: Real Claude calls expensive (~$0.01-0.10 per test)
- ❌ Speed: Claude API adds 5-30s per test call
- ❌ Non-determinism: API responses vary, causing flaky tests
- ❌ Rate limits: Frequent API calls may hit quotas
Smart Testing Architecture Solution
Test Categories by Claude Dependency
1. Fast Tests (No Claude API) - tests/fast/
# Run on every commit, GitHub Actions friendly
tests/fast/
├── unit/
│   ├── test_ks_env_functions.sh      # ks_validate_days, ks_collect_files
│   ├── test_input_validation.sh      # Parameter sanitization
│   └── test_file_processing.sh       # JSONL parsing, file collection
├── integration/
│   ├── test_capture_tools.sh         # events, query (no analysis)
│   ├── test_process_tools.sh         # rotate-logs, validate-jsonl
│   └── test_error_handling.sh        # Malformed data, permissions
└── security/
    ├── test_injection_prevention.sh  # Input sanitization
    └── test_path_validation.sh       # File path security
# Execution time: <30 seconds total
# Coverage: ~70% of codebase (all non-AI functionality)
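A test in this tier needs nothing beyond bash itself. The sketch below shows the table-driven style such a unit test might take. `validate_days_sketch` is a hypothetical stand-in written for illustration; the real `ks_validate_days` in `.ks-env` may accept different inputs or report errors differently.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for ks_validate_days: accepts positive integers only.
# The real function lives in .ks-env and may behave differently.
validate_days_sketch() {
    local days="$1"
    [[ "$days" =~ ^[1-9][0-9]*$ ]]
}

# Table-driven check: each case is "input:expected_exit_status".
run_validation_cases() {
    local failures=0 case_spec input expected actual
    for case_spec in "7:0" "365:0" "0:1" "-3:1" "abc:1" "7; rm -rf /:1" ":1"; do
        input="${case_spec%:*}"
        expected="${case_spec##*:}"
        validate_days_sketch "$input" && actual=0 || actual=1
        if [ "$actual" -ne "$expected" ]; then
            echo "FAIL: input '$input' expected $expected, got $actual"
            failures=$((failures + 1))
        fi
    done
    echo "failures=$failures"
    [ "$failures" -eq 0 ]
}

run_validation_cases
```

Keeping the cases in one list makes it cheap to add a new malformed input whenever a sanitization bug is found.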
2. Mocked Tests (Fake Claude API) - tests/mocked/
# GitHub Actions friendly, consistent results
tests/mocked/
├── fixtures/
│   ├── claude_responses/
│   │   ├── themes_response_sample1.json
│   │   ├── connections_response_sample1.json
│   │   └── malformed_response.json
│   └── test_events/
│       ├── minimal_dataset.jsonl        # 5 events, predictable
│       ├── theme_dataset.jsonl          # Events → known themes
│       └── connection_dataset.jsonl     # Events → known connections
├── test_extract_themes_mocked.sh        # Mock ks_claude() function
├── test_find_connections_mocked.sh      # Predictable responses
└── test_error_scenarios_mocked.sh       # API failures, timeouts
# Mock Implementation:
ks_claude() {
    # Override in test environment
    local prompt="$*"
    case "$prompt" in
        *"extract themes"*)   cat tests/mocked/fixtures/claude_responses/themes_response_sample1.json ;;
        *"find connections"*) cat tests/mocked/fixtures/claude_responses/connections_response_sample1.json ;;
        *) echo '{"error": "unmocked prompt"}' ;;
    esac
}
# Execution time: <60 seconds total
# Coverage: 95% of analysis tool functionality with consistent results
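One detail the mock depends on: bash functions are not inherited by child processes unless they are exported. If an analysis tool runs as a separate bash script, the mock is only visible to it after `export -f`, and only if the tool does not define `ks_claude` again itself. The sketch below demonstrates this with temporary fixtures. `MOCK_FIXTURE_DIR` and the fixture contents are assumptions made for illustration. Unlike the mock above, it also returns non-zero for an unmocked prompt so that a missing fixture surfaces as a failure.

```shell
#!/usr/bin/env bash
# Sketch: a fixture-backed ks_claude mock that child bash processes can see.
# MOCK_FIXTURE_DIR and the fixture contents are illustrative assumptions.
MOCK_FIXTURE_DIR="$(mktemp -d)"
export MOCK_FIXTURE_DIR
echo '{"themes":[{"name":"Memory System Architecture"}]}' > "$MOCK_FIXTURE_DIR/themes.json"
echo '{"connections":[]}' > "$MOCK_FIXTURE_DIR/connections.json"

ks_claude() {
    local prompt="$*"
    case "$prompt" in
        *"extract themes"*)   cat "$MOCK_FIXTURE_DIR/themes.json" ;;
        *"find connections"*) cat "$MOCK_FIXTURE_DIR/connections.json" ;;
        *) echo '{"error": "unmocked prompt"}'; return 1 ;;
    esac
}

# Without this, a tool started as a separate bash process would not see the mock.
export -f ks_claude

# A child bash process (standing in for an analysis tool) now resolves the mock.
child_output="$(bash -c 'ks_claude "please extract themes from these events"')"
echo "$child_output"
```

`export -f` is bash-specific, so this approach assumes the tools under test are bash scripts.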
3. Real API Tests (Actual Claude) - tests/e2e/
# Local development only, requires ANTHROPIC_API_KEY
tests/e2e/
├── test_analysis_integration.sh # Real Claude API calls
├── test_large_dataset_analysis.sh # Performance with real data
└── test_api_error_handling.sh # Real API failure scenarios
# Smart optimizations:
# - Minimal datasets (5-10 events max)
# - Cached results to avoid repeated calls
# - Optional --use-cached flag for development
# Execution time: 2-5 minutes (limited Claude calls)
# Coverage: End-to-end validation with real AI
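Because these tests depend on `ANTHROPIC_API_KEY`, the runner can check for the key up front rather than letting each test fail on its first API call. A minimal sketch follows; `require_api_key` and `run_e2e_if_possible` are hypothetical helper names, and skipping with a zero exit status is one possible design choice, not a project requirement.

```shell
#!/usr/bin/env bash
# Sketch: refuse to start real-API tests unless a key is present.
# require_api_key is a hypothetical helper name, not part of the project.
require_api_key() {
    if [ -z "${ANTHROPIC_API_KEY:-}" ]; then
        echo "ANTHROPIC_API_KEY is not set; skipping E2E tests" >&2
        return 1
    fi
    return 0
}

# Typical use at the top of an E2E runner: skip (status 0) rather than fail,
# so running the full suite without a key does not report a false failure.
run_e2e_if_possible() {
    require_api_key || { echo "skipped"; return 0; }
    echo "running"
    # real E2E test invocations would go here
}
```

Treating a missing key as a skip keeps a contributor without API access from seeing red on tests they cannot run.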
Optimized Test Data Strategy
Minimal, Predictable Datasets
# tests/fixtures/minimal_theme_dataset.jsonl (5 events)
{"ts":"2025-01-01T10:00:00Z","type":"thought","topic":"memory","content":"Human memory is associative, not indexed"}
{"ts":"2025-01-01T10:01:00Z","type":"thought","topic":"memory","content":"Computer memory is linear and addressable"}
{"ts":"2025-01-01T10:02:00Z","type":"connection","topic":"memory-systems","content":"Biological vs digital memory architectures"}
{"ts":"2025-01-01T10:03:00Z","type":"insight","topic":"knowledge-systems","content":"Event sourcing mirrors episodic memory"}
{"ts":"2025-01-01T10:04:00Z","type":"thought","topic":"temporal-meaning","content":"Time shapes knowledge, not just stores it"}
# Expected theme areas: memory systems, knowledge architecture, temporal meaning
# (the pre-generated response below groups these into two named themes)
# Designed to produce predictable, testable analysis results
Pre-Generated Claude Responses
// tests/fixtures/claude_responses/themes_minimal_dataset.json
{
  "themes": [
    {
      "name": "Memory System Architecture",
      "description": "Comparison of biological vs computational memory models",
      "frequency": 3,
      "supporting_quotes": ["Human memory is associative", "Computer memory is linear"]
    },
    {
      "name": "Temporal Knowledge Dynamics",
      "description": "Time as constitutive element of knowledge formation",
      "frequency": 2,
      "supporting_quotes": ["Event sourcing mirrors episodic memory", "Time shapes knowledge"]
    }
  ]
}
CI/CD Integration Strategy
GitHub Actions Workflow
# .github/workflows/test.yml
name: Knowledge System Tests
on: [push, pull_request]
jobs:
  fast-tests:
    name: Fast Tests (No Claude API)
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup system
        run: ./setup.sh
      - name: Run fast test suite
        run: ./tests/run_fast_tests.sh
  mocked-tests:
    name: Mocked Tests (Fake Claude API)
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup system
        run: ./setup.sh
      - name: Run mocked analysis tests
        run: ./tests/run_mocked_tests.sh
# Note: No real Claude API tests run on push or pull_request
# Those run manually, or in a separate nightly build that has access to secrets
Local Development Workflow
# Quick development cycle
./tests/run_fast_tests.sh # 30s, no API calls
# Full validation (requires API key)
export ANTHROPIC_API_KEY="your-key"
./tests/run_all_tests.sh # 5m, includes real Claude calls
# CI simulation (what GitHub Actions runs)
./tests/run_ci_tests.sh # 90s, fast + mocked only
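The runner scripts themselves are not shown in this issue. As a rough sketch of what `run_ci_tests.sh` might do, the function below runs each suite in turn, keeps going after a failure, and returns non-zero if any suite failed. `run_suites` and `TEST_RUNNER` are hypothetical names, and the sketch assumes the runner command accepts a suite path as its argument.

```shell
#!/usr/bin/env bash
# Sketch of a CI-style runner: run several suites, keep going after a failure,
# and return non-zero if any suite failed. TEST_RUNNER defaults to bats here;
# both the variable name and the function name are assumptions.
run_suites() {
    local runner="${TEST_RUNNER:-bats}"
    local failed=0 suite
    for suite in "$@"; do
        echo "== $suite"
        # $runner is left unquoted on purpose so it may carry extra options.
        if ! $runner "$suite"; then
            echo "suite failed: $suite" >&2
            failed=1
        fi
    done
    return "$failed"
}

# What run_ci_tests.sh might call (fast + mocked only, no API key needed):
#   run_suites tests/fast tests/mocked
```

Running every suite before reporting gives one complete failure list per CI run instead of stopping at the first broken suite.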
Test Framework: bats-core with Smart Mocking
Mock Function Override
#!/usr/bin/env bats
# tests/mocked/test_extract_themes_mocked.sh
setup() {
    TEST_KS_ROOT="$(mktemp -d)"
    export TEST_KS_ROOT
    export KS_HOT_LOG="$TEST_KS_ROOT/hot.jsonl"
    # Copy test data
    cp "$BATS_TEST_DIRNAME/../fixtures/minimal_theme_dataset.jsonl" "$KS_HOT_LOG"
    source "$BATS_TEST_DIRNAME/../../.ks-env"
    # Override Claude function with mock. Defined after sourcing .ks-env so the
    # real definition cannot replace it, and exported so the tool's child bash
    # process can see it (assumes the tool does not redefine ks_claude itself).
    export MOCK_THEMES_FIXTURE="$BATS_TEST_DIRNAME/../fixtures/claude_responses/themes_minimal_dataset.json"
    ks_claude() {
        cat "$MOCK_THEMES_FIXTURE"
    }
    export -f ks_claude
}
teardown() {
    # Remove the per-test temporary directory
    rm -rf "$TEST_KS_ROOT"
}
@test "extract-themes produces expected theme count with mocked Claude" {
    run ./tools/analyze/extract-themes --days 1 --format json
    [ "$status" -eq 0 ]
    # Parse and validate expected themes
    theme_count=$(echo "$output" | jq '.themes | length')
    [ "$theme_count" -eq 2 ]
    # Validate specific theme names
    [[ "$output" == *"Memory System Architecture"* ]]
    [[ "$output" == *"Temporal Knowledge Dynamics"* ]]
}
@test "extract-themes handles mocked API errors gracefully" {
    # Override with error response
    ks_claude() {
        echo '{"error": "API temporarily unavailable"}'
        return 1
    }
    export -f ks_claude
    run ./tools/analyze/extract-themes --days 1
    [ "$status" -ne 0 ]
    [[ "$output" == *"Error"* ]]
}
Performance Testing with Mocked Claude
Benchmark Infrastructure
# tests/performance/benchmark_with_mocks.sh
# Test jq optimization performance without Claude API overhead
benchmark_file_processing() {
    local event_count="$1"
    # Generate test dataset
    generate_test_events "$event_count" > "$TEST_KS_ROOT/large.jsonl"
    # Mock Claude to return instantly; exported so the tool's child process sees it
    ks_claude() { echo '{"themes":[]}'; }
    export -f ks_claude
    # Measure pure file processing performance
    time ./tools/analyze/extract-themes --days 1 --format json
}
# Results show actual optimization impact without API latency
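The benchmark calls `generate_test_events`, which is not defined in this issue. A minimal sketch of such a generator follows, assuming the same field layout as the sample dataset. The topic rotation and the content text are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch of a generate_test_events helper for the benchmark.
# The field layout follows the sample dataset; the topic rotation and the
# content text are made up for illustration.
generate_test_events() {
    local count="$1" i topic
    local topics=(memory knowledge-systems temporal-meaning)
    for ((i = 0; i < count; i++)); do
        topic="${topics[i % ${#topics[@]}]}"
        # One event per second starting 2025-01-01T00:00:00Z; the day field
        # stays within January for counts up to 2,678,400 (31 days).
        printf '{"ts":"2025-01-%02dT%02d:%02d:%02dZ","type":"thought","topic":"%s","content":"Synthetic event %d"}\n' \
            $((1 + i / 86400)) $((i / 3600 % 24)) $((i / 60 % 60)) $((i % 60)) "$topic" "$i"
    done
}
```

For example, `generate_test_events 1000 > "$TEST_KS_ROOT/large.jsonl"` would write 1000 lines of JSONL. Using only shell arithmetic and `printf` avoids spawning a process per event, which matters when the generator itself runs inside a benchmark.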
Token Cost Optimization
Smart E2E Testing
# tests/e2e/test_with_minimal_claude_usage.sh
# Cache Claude responses to avoid repeated calls
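CLAUDE_CACHE_DIR="$HOME/.ks-test-cache"
cached_claude() {
    local cache_key cache_file tmp_file
    cache_key=$(printf '%s' "$*" | sha256sum | cut -d' ' -f1)
    cache_file="$CLAUDE_CACHE_DIR/$cache_key"
    if [ -f "$cache_file" ]; then
        cat "$cache_file"
    else
        # Real Claude call - cache the result only if the call succeeds,
        # so a failed or partial response is never replayed from cache
        mkdir -p "$CLAUDE_CACHE_DIR"
        tmp_file="$cache_file.tmp"
        if claude "$@" > "$tmp_file"; then
            mv "$tmp_file" "$cache_file"
            cat "$cache_file"
        else
            rm -f "$tmp_file"
            return 1
        fi
    fi
}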
# Development workflow:
# 1. First run uses real Claude API (builds cache)
# 2. Subsequent runs use cached responses (free + fast)
# 3. --refresh-cache flag forces real API calls when needed
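The workflow mentions a `--refresh-cache` flag, but the caching function shown here has no refresh path. One way it might be wired in is sketched below, under stated assumptions: `CACHE_DIR`, `REFRESH_CACHE`, and `BACKEND_CMD` are hypothetical names, and `BACKEND_CMD` stands in for the real `claude` CLI so the sketch can run without an API key.

```shell
#!/usr/bin/env bash
# Sketch: response cache with a refresh switch and no caching of failed calls.
# CACHE_DIR, REFRESH_CACHE, and BACKEND_CMD are hypothetical names; BACKEND_CMD
# stands in for the real claude CLI so the sketch runs without an API key.
CACHE_DIR="${CACHE_DIR:-$(mktemp -d)}"

cached_call() {
    local key cache_file tmp_file
    key="$(printf '%s' "$*" | sha256sum | cut -d' ' -f1)"
    cache_file="$CACHE_DIR/$key"

    # Serve from cache unless a refresh was requested.
    if [ -f "$cache_file" ] && [ "${REFRESH_CACHE:-0}" != "1" ]; then
        cat "$cache_file"
        return 0
    fi

    # Write to a temp file first so a failed call never becomes a cache entry.
    mkdir -p "$CACHE_DIR"
    tmp_file="$(mktemp "$CACHE_DIR/tmp.XXXXXX")"
    if "${BACKEND_CMD:-claude}" "$@" > "$tmp_file"; then
        mv "$tmp_file" "$cache_file"
        cat "$cache_file"
    else
        rm -f "$tmp_file"
        return 1
    fi
}
```

A test runner could set `REFRESH_CACHE=1` when it sees `--refresh-cache` on its command line. Keying the cache on the full argument list means any change to the prompt produces a fresh call automatically.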
Implementation Phases
Phase 1: Fast Test Foundation (1 day)
- Set up bats-core testing framework
- Implement all fast tests (no Claude API)
- GitHub Actions integration for fast tests
- Test data fixtures and generators
Phase 2: Mocked Analysis Tests (1 day)
- Create Claude response fixtures
- Implement ks_claude() mocking system
- Mocked tests for extract-themes, find-connections
- Error scenario testing with mocked failures
Phase 3: Smart E2E Testing (1 day)
- Caching system for real Claude responses
- Minimal dataset E2E tests
- Local-only test runner with API key checks
- Performance testing with cache optimization
Success Criteria
Coverage Targets
- Fast tests: 70% coverage, <30s execution, GitHub Actions ready
- Mocked tests: 95% analysis functionality, <60s execution
- E2E tests: Real Claude validation, <5 API calls total
Cost Management
- Zero tokens spent in CI/CD (fast + mocked tests only)
- <$0.50 per full E2E test run (minimal, cached Claude usage)
- Cached responses prevent repeated token costs in development
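The call limit and the cost ceiling are consistent with each other: fewer than 5 calls at the upper per-call estimate of about $0.10 stays under $0.50. The sketch below expresses that check in whole cents. The helper names are hypothetical, and the per-call figure is the estimate quoted in this issue, not a measured price.

```shell
#!/usr/bin/env bash
# Sketch: worst-case cost of an E2E run, in whole cents to avoid floating point.
# The per-call figure is the upper estimate quoted in this issue (~$0.10), not
# a measured price; both helper names are hypothetical.
max_run_cost_cents() {
    local calls="$1" cents_per_call="${2:-10}"
    echo $((calls * cents_per_call))
}

# Succeeds only if the planned run stays strictly under the budget (default 50 cents).
check_budget() {
    local calls="$1" budget_cents="${2:-50}"
    [ "$(max_run_cost_cents "$calls")" -lt "$budget_cents" ]
}

max_run_cost_cents 4    # 4 calls at 10 cents each -> prints 40
```

An E2E runner could call `check_budget` with its planned call count before making any real request.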
Developer Experience
- Fast feedback loop: 30s for quick validation
- Complete validation: 5m for full test suite with real Claude
- CI/CD friendly: No secrets required for most tests
This architecture ensures robust testing without breaking the bank on API costs or slowing down development velocity.