NornicDB Eval Harness

Search Quality Evaluation & Validation

Overview

The eval harness provides automated testing and validation of NornicDB's search quality. It computes standard Information Retrieval (IR) metrics and reports pass/fail based on configurable thresholds.

Quick Start

# Run the built-in tests against a running server
cd nornicdb
go run ./cmd/eval

# Run with custom test suite
go run ./cmd/eval -suite path/to/tests.json

# Output JSON for CI/CD
go run ./cmd/eval -output json -save results.json

Metrics Computed

Metric	Description	Range
Precision@K	Fraction of top-K results that are relevant	0-1
Recall@K	Fraction of all relevant docs in top-K	0-1
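MRR	Mean Reciprocal Rank - average over queries of 1/rank of the first relevant result	0-1
NDCG@K	Normalized Discounted Cumulative Gain - ranking quality, weighted by graded relevance	0-1
MAP	Mean Average Precision - precision measured at each relevant result's rank, averaged per query and then across queries	0-1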
Hit Rate	Fraction of queries with at least one relevant result	0-1
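
As a rough illustration of the definitions in the table, the sketch below computes Precision@K, Recall@K, and the reciprocal rank for a single query. The function names are hypothetical and are not taken from pkg/eval. It assumes binary relevance (an ID is relevant if it appears in the expected list) and divides precision by K rather than by the number of results returned, which is consistent with the P@10 of 0.10 shown in the Detailed output sample.

```go
package main

import "fmt"

// relevantSet builds a lookup set from the expected node IDs.
func relevantSet(expected []string) map[string]bool {
	set := make(map[string]bool, len(expected))
	for _, id := range expected {
		set[id] = true
	}
	return set
}

// hitsAtK counts how many of the first k results are in the expected set.
func hitsAtK(results, expected []string, k int) int {
	rel := relevantSet(expected)
	hits := 0
	for i, id := range results {
		if i >= k {
			break
		}
		if rel[id] {
			hits++
		}
	}
	return hits
}

// PrecisionAtK divides hits in the top k by k itself (an assumption),
// so a query with one expected ID can score at most 1/k.
func PrecisionAtK(results, expected []string, k int) float64 {
	if k <= 0 {
		return 0
	}
	return float64(hitsAtK(results, expected, k)) / float64(k)
}

// RecallAtK divides hits in the top k by the number of expected IDs.
func RecallAtK(results, expected []string, k int) float64 {
	if len(expected) == 0 {
		return 0
	}
	return float64(hitsAtK(results, expected, k)) / float64(len(expected))
}

// ReciprocalRank returns 1/rank of the first relevant result, or 0 if none.
func ReciprocalRank(results, expected []string) float64 {
	rel := relevantSet(expected)
	for i, id := range results {
		if rel[id] {
			return 1 / float64(i+1)
		}
	}
	return 0
}

func main() {
	results := []string{"n7", "n1", "n9", "n2"} // ranked search output
	expected := []string{"n1", "n2"}            // ground-truth node IDs
	fmt.Printf("P@10=%.2f R@10=%.2f RR=%.2f\n",
		PrecisionAtK(results, expected, 10),
		RecallAtK(results, expected, 10),
		ReciprocalRank(results, expected))
}
```

Under these assumptions, MRR is the mean of ReciprocalRank across all queries, and Hit Rate is the fraction of queries whose reciprocal rank is non-zero.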

ELI12 (Explain Like I'm 12)

Think of it like grading a spelling bee:

Precision: "How many of your first 10 guesses were correct?"
Recall: "Of all the correct answers, how many did you find?"
MRR: "How quickly did you get your first right answer?"
Hit Rate: "Did you get at least one right?"

Command-Line Options

go run ./cmd/eval [flags]

Flags:
  -url string
        NornicDB server URL (default "http://localhost:7474")
  -suite string
        Path to test suite JSON file
  -output string
        Output format: summary, detailed, json, compact (default "summary")
  -save string
        Save results to JSON file
  -threshold string
        Override thresholds (format: p10=0.5,mrr=0.5,hit=0.8)
  -create-sample
        Create sample test data in the database
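
To make the -threshold format concrete, here is a minimal sketch of how a string such as p10=0.5,mrr=0.5,hit=0.8 could be parsed. ParseThresholds is a hypothetical helper, not the harness's actual parser, and the 0-1 range check is an assumption based on the metric ranges listed earlier. The only keys this document confirms are p10, mrr, and hit, so the sketch keeps keys generic rather than guessing at others.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ParseThresholds turns "p10=0.5,mrr=0.5,hit=0.8" into a key/value map.
// It rejects entries without "=", non-numeric values, and values outside 0-1.
func ParseThresholds(spec string) (map[string]float64, error) {
	out := make(map[string]float64)
	if strings.TrimSpace(spec) == "" {
		return out, nil // no overrides requested
	}
	for _, pair := range strings.Split(spec, ",") {
		key, raw, ok := strings.Cut(strings.TrimSpace(pair), "=")
		if !ok || key == "" {
			return nil, fmt.Errorf("invalid threshold %q: want key=value", pair)
		}
		val, err := strconv.ParseFloat(strings.TrimSpace(raw), 64)
		if err != nil {
			return nil, fmt.Errorf("invalid value for %q: %w", key, err)
		}
		if !(val >= 0 && val <= 1) { // also rejects NaN
			return nil, fmt.Errorf("threshold %q=%v is outside 0-1", key, val)
		}
		out[key] = val
	}
	return out, nil
}

func main() {
	t, err := ParseThresholds("p10=0.5,mrr=0.5,hit=0.8")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(t["p10"], t["mrr"], t["hit"])
}
```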

Test Suite Format

{
  "name": "my-test-suite",
  "description": "Search quality tests",
  "version": "1.0.0",
  "test_cases": [
    {
      "name": "ML Concept Search",
      "query": "machine learning neural networks",
      "expected": ["node-id-1", "node-id-2"],
      "tags": ["ml", "concepts"]
    },
    {
      "name": "Graded Relevance Test",
      "query": "database architecture",
      "expected": ["db-1", "db-2", "db-3"],
      "relevance_grades": {
        "db-1": 3,
        "db-2": 2,
        "db-3": 1
      },
      "tags": ["database"]
    }
  ]
}

Test Case Fields

Field	Type	Description
`name`	string	Human-readable test name
`query`	string	Search query text
`expected`	[]string	Node IDs that should be returned
`relevance_grades`	map[string]int	Optional graded relevance (0-3) for NDCG
`tags`	[]string	Optional tags for filtering
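
For readers who want to validate a suite file before running it, the sketch below decodes the JSON format above into Go structs. The type names Suite and Case and the ParseSuite helper are hypothetical; only the JSON keys and field types come from the table. The validation rules (non-empty query, grades within 0-3) are assumptions about what a well-formed suite looks like.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Suite mirrors the top-level keys of the test suite file.
type Suite struct {
	Name        string `json:"name"`
	Description string `json:"description"`
	Version     string `json:"version"`
	TestCases   []Case `json:"test_cases"`
}

// Case mirrors one entry of "test_cases".
type Case struct {
	Name            string         `json:"name"`
	Query           string         `json:"query"`
	Expected        []string       `json:"expected"`
	RelevanceGrades map[string]int `json:"relevance_grades,omitempty"`
	Tags            []string       `json:"tags,omitempty"`
}

// ParseSuite decodes a suite and rejects cases with an empty query
// or a relevance grade outside the documented 0-3 range.
func ParseSuite(data []byte) (*Suite, error) {
	var s Suite
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, fmt.Errorf("decode suite: %w", err)
	}
	for i, c := range s.TestCases {
		if strings.TrimSpace(c.Query) == "" {
			return nil, fmt.Errorf("test case %d (%q): empty query", i, c.Name)
		}
		for id, g := range c.RelevanceGrades {
			if g < 0 || g > 3 {
				return nil, fmt.Errorf("test case %d (%q): grade %d for %q is outside 0-3", i, c.Name, g, id)
			}
		}
	}
	return &s, nil
}

func main() {
	raw := []byte(`{
	  "name": "my-test-suite",
	  "test_cases": [
	    {"name": "ML Concept Search", "query": "machine learning neural networks",
	     "expected": ["node-id-1", "node-id-2"], "tags": ["ml", "concepts"]}
	  ]
	}`)
	suite, err := ParseSuite(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s: %d test case(s)\n", suite.Name, len(suite.TestCases))
}
```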

Default Thresholds

Thresholds{
    Precision10: 0.5,  // At least 50% of top-10 relevant
    Recall10:    0.3,  // At least 30% of relevant in top-10
    MRR:         0.5,  // Mean reciprocal rank of 0.5 (e.g., first relevant result at rank 2)
    NDCG10:      0.5,  // Reasonable ranking quality
    HitRate:     0.8,  // 80% of queries have at least one hit
}
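
One way to read these thresholds is as lower bounds on the aggregate metrics. The sketch below compares a set of suite-level averages against them and lists every metric that falls short. The Aggregate type and Failures function are hypothetical; the Thresholds field names come from the block above. It assumes a metric passes when it is greater than or equal to its threshold, and it does not claim to reproduce which metrics the harness actually uses to decide pass/fail.

```go
package main

import "fmt"

// Thresholds uses the field names from the defaults shown above.
type Thresholds struct {
	Precision10, Recall10, MRR, NDCG10, HitRate float64
}

// Aggregate is a hypothetical container for suite-level metric averages.
type Aggregate struct {
	Precision10, Recall10, MRR, NDCG10, HitRate float64
}

// Failures lists every metric that is strictly below its threshold.
// A metric equal to its threshold counts as passing (an assumption).
func Failures(got Aggregate, limits Thresholds) []string {
	checks := []struct {
		name      string
		got, want float64
	}{
		{"P@10", got.Precision10, limits.Precision10},
		{"R@10", got.Recall10, limits.Recall10},
		{"MRR", got.MRR, limits.MRR},
		{"NDCG@10", got.NDCG10, limits.NDCG10},
		{"HitRate", got.HitRate, limits.HitRate},
	}
	var failed []string
	for _, c := range checks {
		if c.got < c.want {
			failed = append(failed, fmt.Sprintf("%s %.2f < %.2f", c.name, c.got, c.want))
		}
	}
	return failed
}

func main() {
	defaults := Thresholds{Precision10: 0.5, Recall10: 0.3, MRR: 0.5, NDCG10: 0.5, HitRate: 0.8}
	got := Aggregate{Precision10: 0.6, Recall10: 0.25, MRR: 0.5, NDCG10: 0.7, HitRate: 0.9}
	for _, f := range Failures(got, defaults) {
		fmt.Println("below target:", f)
	}
}
```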

Output Formats

Summary (default)

╔════════════════════════════════════════════════════════════════╗
║           NornicDB Search Evaluation Results                   ║
╚════════════════════════════════════════════════════════════════╝

📊 Suite: my-tests
📅 Time:  2025-12-01T09:10:12-07:00
⏱️  Duration: 125ms

✅ Tests: 5/5 passed (100.0%)

┌─────────────────────────────────────────────────────────────────┐
│                     Aggregate Metrics                           │
├─────────────────────────────────────────────────────────────────┤
│ ✓ MRR            [████████████████████] 1.000 (target: 0.50)
│ ✓ Recall@10      [████████████████████] 1.000 (target: 0.30)
│ ✓ Hit Rate       [████████████████████] 1.000 (target: 0.80)
└─────────────────────────────────────────────────────────────────┘

Detailed

Includes per-test breakdown:

✅ Test 1: ML Search
   Query: "machine learning"
   Method: http | Duration: 1.552ms
   P@10: 0.10 | R@10: 1.00 | MRR: 1.00 | NDCG@10: 1.00
   Expected: 1 | Returned: 1 | Hits: 1

Compact

One-line summary for CI logs:

[PASS] 5/5 tests | P@10=0.10 R@10=1.00 MRR=1.00 NDCG=1.00 HitRate=1.00 | 8ms

JSON

Full structured output for programmatic processing.

CI/CD Integration

GitHub Actions Example

- name: Run Search Quality Tests
  run: |
    cd nornicdb
    go run ./cmd/eval \
      -suite tests/search_quality.json \
      -output json \
      -save eval-results.json \
      -threshold="hit=0.9,mrr=0.7"
    
- name: Upload Results
  if: always()  # upload results even when the eval step exits with code 1
  uses: actions/upload-artifact@v4
  with:
    name: eval-results
    path: nornicdb/eval-results.json

Exit Codes

0: All tests passed
1: One or more tests failed

Programmatic Usage

import (
    "context"
    "log"
    "os"

    "github.com/orneryd/nornicdb/pkg/eval"
    "github.com/orneryd/nornicdb/pkg/search"
)

// searchService is an already-initialized search service from pkg/search.
ctx := context.Background()

// Create harness
harness := eval.NewHarness(searchService)

// Add test cases
harness.AddTestCase(eval.TestCase{
    Name:     "ML concepts",
    Query:    "machine learning",
    Expected: []string{"node-1", "node-2"},
})

// Set custom thresholds
harness.SetThresholds(eval.Thresholds{
    MRR:     0.7,
    HitRate: 0.9,
})

// Run evaluation and stop on error before touching the result
result, err := harness.Run(ctx)
if err != nil {
    log.Fatalf("evaluation failed: %v", err)
}

// Output results
reporter := eval.NewReporter(os.Stdout)
reporter.PrintSummary(result)

Best Practices

1. Use Real Node IDs

Test cases should use actual storage node IDs, not user-defined id properties:

// ✅ Good - uses actual storage IDs
"expected": ["n1", "node-abc123"]

// ❌ Bad - uses property values
"expected": ["my-custom-id"]

2. Use Graded Relevance for NDCG

For meaningful NDCG scores, provide relevance grades:

"relevance_grades": {
    "highly-relevant-doc": 3,
    "relevant-doc": 2,
    "marginal-doc": 1,
    "irrelevant-doc": 0
}
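
To show how grades feed into the score, here is a minimal NDCG@K sketch. NDCGAtK is a hypothetical function, not the pkg/eval implementation. It assumes linear gain (the grade itself divided by log2(rank+1)); some implementations use exponential gain (2^grade - 1) instead, which this document does not specify. IDs without a grade are treated as grade 0.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// dcg sums grade/log2(rank+1) over the first k grades, with 1-based ranks.
func dcg(grades []int, k int) float64 {
	sum := 0.0
	for i, g := range grades {
		if i >= k {
			break
		}
		sum += float64(g) / math.Log2(float64(i+2))
	}
	return sum
}

// NDCGAtK divides the DCG of the returned ranking by the DCG of the
// ideal ranking (all graded IDs sorted by descending grade).
func NDCGAtK(results []string, grades map[string]int, k int) float64 {
	got := make([]int, len(results))
	for i, id := range results {
		got[i] = grades[id] // missing IDs yield the zero value, grade 0
	}
	ideal := make([]int, 0, len(grades))
	for _, g := range grades {
		ideal = append(ideal, g)
	}
	sort.Sort(sort.Reverse(sort.IntSlice(ideal)))
	idcg := dcg(ideal, k)
	if idcg == 0 {
		return 0 // no relevant documents, so the score is undefined; report 0
	}
	return dcg(got, k) / idcg
}

func main() {
	grades := map[string]int{"db-1": 3, "db-2": 2, "db-3": 1}
	best := NDCGAtK([]string{"db-1", "db-2", "db-3"}, grades, 10)
	worst := NDCGAtK([]string{"db-3", "db-2", "db-1"}, grades, 10)
	fmt.Printf("ideal order: %.3f, reversed order: %.3f\n", best, worst)
}
```

With binary relevance every relevant doc contributes the same gain, so NDCG cannot distinguish a ranking that puts the best doc first from one that puts a marginal doc first. Grades are what make that difference visible.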

3. Set Realistic Thresholds

Start with lenient thresholds and tighten as search quality improves:

# Development
-threshold="hit=0.5,mrr=0.3"

# Production
-threshold="hit=0.9,mrr=0.7,p10=0.5"

4. Tag Tests for Filtering

Use tags to organize and filter tests:

"tags": ["semantic", "ml", "critical"]

Troubleshooting

Low Precision but High Recall

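The search finds the expected docs, but they are scattered among many irrelevant results.

Fix: Improve the ranking algorithm or add Maximal Marginal Relevance (MMR) diversification.
Check: How many expected IDs does each test list? With fewer than K expected IDs, Precision@K cannot reach 1.0 (the Detailed output example scores a P@10 of 0.10 with one expected ID and one hit).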

Zero Hit Rate

No expected docs found in any results.

Check: Are expected IDs correct (storage IDs, not properties)?
Check: Has the search index been rebuilt?

curl -X POST http://localhost:7474/nornicdb/search/rebuild

Slow Evaluation

Reduce limit in search options
Use fewer test cases for quick iteration
Run full suite only in CI
