Skip to content

Latest commit

 

History

History
416 lines (335 loc) · 10.3 KB

File metadata and controls

416 lines (335 loc) · 10.3 KB

Semantic Search Architecture - DAML Reason Tool

Branch: semantic-search
Date: November 17, 2025

Overview

The DAML Reason tool has been overhauled to use ChromaDB-powered semantic search instead of rule-based pattern matching. This provides intelligent, context-aware recommendations using vector embeddings.

The New Flow (Clean & Simple)

Raw Files from Repos → ChromaDB → DAML Reason finds similar files → LLM reasons → Return

No enrichment. No caching complexity. No preprocessing.

Just: Raw files → Vector search → Smart recommendations

Architecture

1. Raw File Loading

canonical_docs_path = Path(os.environ.get("CANONICAL_DOCS_PATH"))
loader = DirectFileResourceLoader(canonical_docs_path)
raw_resources = loader.scan_repositories(force_refresh=False)
  • Loads 14,707+ files directly from cloned repos (daml, canton, daml-finance)
  • No enrichment, no synthetic metadata
  • Uses disk cache for fast startup: ~/.canton-mcp/resource-cache-*.json

2. ChromaDB Indexing

search_engine = create_semantic_search(
    raw_resources=all_resources,
    force_reindex=False  # Persists across restarts
)
  • Indexes 11,742 resources (.daml, .md, .scala files)
  • Uses first 2000 chars of each file for embedding
  • Persists to .chroma_db/ directory
  • Uses default sentence-transformers embeddings
  • One-time indexing (~2-3 minutes first run, instant after)

3. Semantic Search

similar_files = semantic_search.search_similar_files(
    code=user_code_or_intent,
    top_k=5,
    raw_resources=all_resources
)
  • Finds similar files in ~50ms using vector similarity
  • Returns top-k most similar files with scores
  • No LLM calls for search (fast & cheap)

4. LLM Reasoning (Future Enhancement)

  • ChromaDB finds similar files
  • LLM reasons about relevance in context
  • Returns curated recommendations

Key Components

src/canton_mcp_server/core/semantic_search.py

Main Classes:

  • DAMLSemanticSearch: ChromaDB wrapper for semantic similarity
  • create_semantic_search(): Factory function with auto-initialization

Key Methods:

  • index_resources(): Index raw files into ChromaDB
  • search_similar_files(): Find k-most-similar files
  • get_stats(): Get indexing statistics

Features:

  • Persistent storage (.chroma_db/)
  • Smart ID generation (prevents collisions)
  • Comprehensive diagnostics & logging
  • Graceful degradation (works without ChromaDB)

src/canton_mcp_server/tools/daml_reason_tool.py

Updated Flow:

Case 1: No code provided

# Search for similar patterns using business intent
similar_files = semantic_search.search_similar_files(
    code=business_intent,  # Uses intent as query
    top_k=5
)

Example: "Create a simple IOU contract" → Finds Iou.daml files

Case 2: Code validation passed

# Return approval, no recommendations needed
yield ctx.structured(DamlReasonResult(
    action="approved",
    valid=True,
    confidence=0.8
))

Case 3: Code validation failed

# Search for similar code patterns
similar_files = semantic_search.search_similar_files(
    code=daml_code,  # Uses actual code as query
    top_k=5
)

Example: Bad authorization code → Finds similar authorization patterns

What Was Removed

Removed Components:

  • StructuredIngestionEngine - No longer needed
  • CanonicalResourceRecommender - Replaced by semantic search
  • LLMEnrichmentEngine - Not needed for search
  • RecommendationRequest - Simplified to direct search
  • _infer_use_case() - No longer needed (semantic search handles it)
  • _normalize_use_case() - No longer needed

What Remains:

  • DirectFileResourceLoader - Still loads raw files
  • SafetyChecker - Still validates DAML code
  • DAMLSemanticSearch - New! Finds similar files

Performance

First Run (Indexing):

📚 Loaded 14,707 raw resources (10ms)
🔄 Indexing 11,742 resources into ChromaDB... (162s)
✅ Indexed 11,742 resources

~2-3 minutes to index all DAML resources (one-time)

Subsequent Runs:

✅ ChromaDB collection initialized (11742 items) (50ms)
✅ Index up-to-date (instant)
🔍 Searching ChromaDB... (50ms)

Instant startup - indexes persist in .chroma_db/

Search Performance:

  • Query time: ~50ms per search
  • Memory: ~500MB for ChromaDB index
  • Disk: ~200MB for persisted index

Example Queries

1. Find IOU Patterns

Business Intent: "Create a simple IOU contract for tracking debt"
Results:
  - daml-iou (score: 0.064)
  - canton-iou (score: 0.064)
  - daml-iou12 (score: 0.010)

2. Find Similar Code

DAML Code: "template AssetContract with owner, issuer..."
Results:
  - daml-asset-transfer (score: 0.187)
  - canton-asset-holding (score: 0.143)
  - daml-ownership-pattern (score: 0.092)

3. Find Authorization Patterns

DAML Code: "choice Approve with approver controller..."
Results:
  - daml-multi-party-approval (score: 0.243)
  - canton-authorization-example (score: 0.158)
  - daml-choice-authority (score: 0.054)

Configuration

Environment Variables:

# Required: Path to cloned canonical repos
CANONICAL_DOCS_PATH=/path/to/canonical-daml-docs

# Optional: Force re-indexing on startup
FORCE_REINDEX=false

# Optional: ChromaDB persist directory (default: .chroma_db)
CHROMA_DB_PATH=.chroma_db

ChromaDB Settings:

# Defaults (can be customized)
collection_name = "daml_resources"
persist_directory = ".chroma_db"
embedding_function = None  # Uses sentence-transformers default

Dependencies

New Dependency:

dependencies = [
    # ... existing deps ...
    "chromadb>=0.4.0",
]

Sub-dependencies (auto-installed):

  • sentence-transformers - Embeddings model
  • onnxruntime - Model inference
  • tokenizers - Text tokenization
  • huggingface-hub - Model downloads

Debugging

Enable Debug Logging:

import logging
logging.getLogger("canton_mcp_server.core.semantic_search").setLevel(logging.DEBUG)

Log Output:

🔍 Searching ChromaDB with code (hash: 63e215f7, preview: 'Create a simple IOU...')
🔍 Found 5 semantically similar files via ChromaDB:
  1. daml-iou                similarity=0.064 distance=0.936
  2. canton-iou              similarity=0.064 distance=0.936
  3. daml-iou12              similarity=0.010 distance=0.990

Check Index Status:

stats = semantic_search.get_stats()
print(f"Indexed: {stats['indexed_count']} resources")
print(f"Available: {stats['available']}")

Testing

Test Pattern Recommendations:

curl -X POST http://localhost:7284/mcp \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
      "name": "daml_reason",
      "arguments": {
        "businessIntent": "Create a simple IOU contract"
      }
    }
  }'

Test Code Validation:

curl -X POST http://localhost:7284/mcp \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
      "name": "daml_reason",
      "arguments": {
        "businessIntent": "Track asset ownership",
        "damlCode": "template AssetContract..."
      }
    }
  }'

Migration Notes

From Old System:

# OLD (rule-based)
use_case = _infer_use_case(business_intent)
recommendations = recommender.recommend_resources(
    RecommendationRequest(use_case=use_case, ...)
)

# NEW (semantic search)
similar_files = semantic_search.search_similar_files(
    code=business_intent,
    top_k=5
)

Key Differences:

Old System New System
Rule-based keyword matching Vector similarity search
Requires enrichment cache Uses raw files
Pre-defined use cases Context-aware matching
Static recommendations Dynamic similarity scoring
~100ms search ~50ms search

Advantages

Simplicity:

  • No enrichment pipeline
  • No cached metadata
  • No complex preprocessing
  • Direct file-to-embedding

Performance:

  • Persistent indexes (instant startup)
  • 50ms queries (fast)
  • Scales to 100k+ files

Accuracy:

  • Vector embeddings capture semantic meaning
  • Context-aware recommendations
  • Learns from actual code structure

Maintainability:

  • Clean architecture
  • Single source of truth (raw files)
  • Easy to debug and test

Future Enhancements

Phase 2: LLM Reasoning Layer

# ChromaDB finds similar files (fast)
similar_files = semantic_search.search_similar_files(code, top_k=10)

# LLM reasons about relevance (smart)
relevant_patterns = llm.filter_and_explain(
    similar_files=similar_files,
    user_context=business_intent,
    user_code=daml_code
)

Benefits:

  • ChromaDB: Speed (50ms, cheap)
  • LLM: Intelligence (contextual reasoning)
  • Best of both worlds

Phase 3: Feedback Loop

# Track which recommendations were helpful
semantic_search.record_feedback(
    query=user_code,
    selected_file=chosen_pattern,
    helpful=True
)

# Fine-tune embeddings over time

Troubleshooting

Issue: "ChromaDB not available"

Solution: Install ChromaDB:

uv sync  # or pip install chromadb

Issue: Slow indexing (>5 minutes)

Solution: Check resource count:

# Should be ~11,742 files
# If much higher, check file filtering in semantic_search.py

Issue: Poor recommendations

Solution: Increase top_k:

similar_files = semantic_search.search_similar_files(
    code=user_code,
    top_k=10  # Increased from 5
)

Issue: Missing canonical repos

Solution: Verify path:

ls $CANONICAL_DOCS_PATH
# Should show: daml/ canton/ daml-finance/

Summary

The semantic search overhaul makes the DAML Reason tool:

  • Simpler: No enrichment, no caching complexity
  • Faster: 50ms queries, instant startup (after first index)
  • Smarter: Vector embeddings understand code semantics
  • Cleaner: Raw files → ChromaDB → Recommendations

Core Principle: Let ChromaDB handle similarity search (fast), let LLM handle reasoning (smart).


Status: ✅ Implemented and tested
Performance: 🚀 11,742 resources indexed in 162s, 50ms queries
Accuracy: 📊 Finding relevant IOUs, assets, authorization patterns
Next: Add LLM reasoning layer for deeper context understanding