Branch: semantic-search
Date: November 17, 2025
The DAML Reason tool has been overhauled to use ChromaDB-powered semantic search instead of rule-based pattern matching. This provides intelligent, context-aware recommendations using vector embeddings.
Raw Files from Repos → ChromaDB → DAML Reason finds similar files → LLM reasons → Return
No enrichment. No caching complexity. No preprocessing.
Just: Raw files → Vector search → Smart recommendations
canonical_docs_path = Path(os.environ.get("CANONICAL_DOCS_PATH"))
loader = DirectFileResourceLoader(canonical_docs_path)
raw_resources = loader.scan_repositories(force_refresh=False)- Loads 14,707+ files directly from cloned repos (daml, canton, daml-finance)
- No enrichment, no synthetic metadata
- Uses disk cache for fast startup:
~/.canton-mcp/resource-cache-*.json
search_engine = create_semantic_search(
raw_resources=all_resources,
force_reindex=False # Persists across restarts
)- Indexes 11,742 resources (.daml, .md, .scala files)
- Uses first 2000 chars of each file for embedding
- Persists to
.chroma_db/directory - Uses default sentence-transformers embeddings
- One-time indexing (~2-3 minutes first run, instant after)
similar_files = semantic_search.search_similar_files(
code=user_code_or_intent,
top_k=5,
raw_resources=all_resources
)- Finds similar files in ~50ms using vector similarity
- Returns top-k most similar files with scores
- No LLM calls for search (fast & cheap)
- ChromaDB finds similar files
- LLM reasons about relevance in context
- Returns curated recommendations
Main Classes:
DAMLSemanticSearch: ChromaDB wrapper for semantic similaritycreate_semantic_search(): Factory function with auto-initialization
Key Methods:
index_resources(): Index raw files into ChromaDBsearch_similar_files(): Find k-most-similar filesget_stats(): Get indexing statistics
Features:
- Persistent storage (
.chroma_db/) - Smart ID generation (prevents collisions)
- Comprehensive diagnostics & logging
- Graceful degradation (works without ChromaDB)
Updated Flow:
# Search for similar patterns using business intent
similar_files = semantic_search.search_similar_files(
code=business_intent, # Uses intent as query
top_k=5
)Example: "Create a simple IOU contract" → Finds Iou.daml files
# Return approval, no recommendations needed
yield ctx.structured(DamlReasonResult(
action="approved",
valid=True,
confidence=0.8
))# Search for similar code patterns
similar_files = semantic_search.search_similar_files(
code=daml_code, # Uses actual code as query
top_k=5
)Example: Bad authorization code → Finds similar authorization patterns
StructuredIngestionEngine- No longer neededCanonicalResourceRecommender- Replaced by semantic searchLLMEnrichmentEngine- Not needed for searchRecommendationRequest- Simplified to direct search_infer_use_case()- No longer needed (semantic search handles it)_normalize_use_case()- No longer needed
DirectFileResourceLoader- Still loads raw filesSafetyChecker- Still validates DAML codeDAMLSemanticSearch- New! Finds similar files
📚 Loaded 14,707 raw resources (10ms)
🔄 Indexing 11,742 resources into ChromaDB... (162s)
✅ Indexed 11,742 resources
~2-3 minutes to index all DAML resources (one-time)
✅ ChromaDB collection initialized (11742 items) (50ms)
✅ Index up-to-date (instant)
🔍 Searching ChromaDB... (50ms)
Instant startup - indexes persist in .chroma_db/
- Query time: ~50ms per search
- Memory: ~500MB for ChromaDB index
- Disk: ~200MB for persisted index
Business Intent: "Create a simple IOU contract for tracking debt"
Results:
- daml-iou (score: 0.064)
- canton-iou (score: 0.064)
- daml-iou12 (score: 0.010)DAML Code: "template AssetContract with owner, issuer..."
Results:
- daml-asset-transfer (score: 0.187)
- canton-asset-holding (score: 0.143)
- daml-ownership-pattern (score: 0.092)DAML Code: "choice Approve with approver controller..."
Results:
- daml-multi-party-approval (score: 0.243)
- canton-authorization-example (score: 0.158)
- daml-choice-authority (score: 0.054)# Required: Path to cloned canonical repos
CANONICAL_DOCS_PATH=/path/to/canonical-daml-docs
# Optional: Force re-indexing on startup
FORCE_REINDEX=false
# Optional: ChromaDB persist directory (default: .chroma_db)
CHROMA_DB_PATH=.chroma_db# Defaults (can be customized)
collection_name = "daml_resources"
persist_directory = ".chroma_db"
embedding_function = None # Uses sentence-transformers defaultdependencies = [
# ... existing deps ...
"chromadb>=0.4.0",
]sentence-transformers- Embeddings modelonnxruntime- Model inferencetokenizers- Text tokenizationhuggingface-hub- Model downloads
import logging
logging.getLogger("canton_mcp_server.core.semantic_search").setLevel(logging.DEBUG)🔍 Searching ChromaDB with code (hash: 63e215f7, preview: 'Create a simple IOU...')
🔍 Found 5 semantically similar files via ChromaDB:
1. daml-iou similarity=0.064 distance=0.936
2. canton-iou similarity=0.064 distance=0.936
3. daml-iou12 similarity=0.010 distance=0.990
stats = semantic_search.get_stats()
print(f"Indexed: {stats['indexed_count']} resources")
print(f"Available: {stats['available']}")curl -X POST http://localhost:7284/mcp \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"id": 1,
"method": "tools/call",
"params": {
"name": "daml_reason",
"arguments": {
"businessIntent": "Create a simple IOU contract"
}
}
}'curl -X POST http://localhost:7284/mcp \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"id": 2,
"method": "tools/call",
"params": {
"name": "daml_reason",
"arguments": {
"businessIntent": "Track asset ownership",
"damlCode": "template AssetContract..."
}
}
}'# OLD (rule-based)
use_case = _infer_use_case(business_intent)
recommendations = recommender.recommend_resources(
RecommendationRequest(use_case=use_case, ...)
)
# NEW (semantic search)
similar_files = semantic_search.search_similar_files(
code=business_intent,
top_k=5
)| Old System | New System |
|---|---|
| Rule-based keyword matching | Vector similarity search |
| Requires enrichment cache | Uses raw files |
| Pre-defined use cases | Context-aware matching |
| Static recommendations | Dynamic similarity scoring |
| ~100ms search | ~50ms search |
- No enrichment pipeline
- No cached metadata
- No complex preprocessing
- Direct file-to-embedding
- Persistent indexes (instant startup)
- 50ms queries (fast)
- Scales to 100k+ files
- Vector embeddings capture semantic meaning
- Context-aware recommendations
- Learns from actual code structure
- Clean architecture
- Single source of truth (raw files)
- Easy to debug and test
# ChromaDB finds similar files (fast)
similar_files = semantic_search.search_similar_files(code, top_k=10)
# LLM reasons about relevance (smart)
relevant_patterns = llm.filter_and_explain(
similar_files=similar_files,
user_context=business_intent,
user_code=daml_code
)Benefits:
- ChromaDB: Speed (50ms, cheap)
- LLM: Intelligence (contextual reasoning)
- Best of both worlds
# Track which recommendations were helpful
semantic_search.record_feedback(
query=user_code,
selected_file=chosen_pattern,
helpful=True
)
# Fine-tune embeddings over timeSolution: Install ChromaDB:
uv sync # or pip install chromadbSolution: Check resource count:
# Should be ~11,742 files
# If much higher, check file filtering in semantic_search.pySolution: Increase top_k:
similar_files = semantic_search.search_similar_files(
code=user_code,
top_k=10 # Increased from 5
)Solution: Verify path:
ls $CANONICAL_DOCS_PATH
# Should show: daml/ canton/ daml-finance/The semantic search overhaul makes the DAML Reason tool:
- Simpler: No enrichment, no caching complexity
- Faster: 50ms queries, instant startup (after first index)
- Smarter: Vector embeddings understand code semantics
- Cleaner: Raw files → ChromaDB → Recommendations
Core Principle: Let ChromaDB handle similarity search (fast), let LLM handle reasoning (smart).
Status: ✅ Implemented and tested
Performance: 🚀 11,742 resources indexed in 162s, 50ms queries
Accuracy: 📊 Finding relevant IOUs, assets, authorization patterns
Next: Add LLM reasoning layer for deeper context understanding