colliery-io
diff --git a/‎.metis/backlog/features/GQLITE-T-0093.md‎
Lines changed: 246 additions & 0 deletions b/‎.metis/backlog/features/GQLITE-T-0093.md‎
Lines changed: 246 additions & 0 deletions
diff --git a/‎.metis/backlog/tech-debt/GQLITE-T-0094.md‎
Lines changed: 132 additions & 0 deletions b/‎.metis/backlog/tech-debt/GQLITE-T-0094.md‎
Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,246 @@
+---
+id: bulk-insert-operations-for-nodes
+level: task
+title: "Bulk Insert Operations for Nodes and Edges"
+short_code: "GQLITE-T-0093"
+created_at: 2026-01-10T04:16:05.119817+00:00
+updated_at: 2026-01-10T04:16:05.119817+00:00
+parent: 
+blocked_by: []
+archived: false
+
+tags:
+  - "#task"
+  - "#phase/backlog"
+  - "#feature"
+
+
+exit_criteria_met: false
+strategy_id: NULL
+initiative_id: NULL
+---
+
+# Bulk Insert Operations for Nodes and Edges
+
+Add true bulk insert methods to graphqlite that bypass individual Cypher query overhead, enabling high-performance graph construction from external data sources.
+
+## Objective
+
+Enable efficient bulk insertion of nodes and edges by providing native bulk insert APIs that bypass per-insert Cypher parsing overhead, reducing graph construction time by 30-100x.
+
+## Problem
+
+When building graphs from parsed source code (or any external data), we need to insert thousands of nodes and edges efficiently. The current approach has significant overhead:
+
+**Current Node Insertion:**
+```rust
+// upsert_nodes_batch is just a loop calling upsert_node individually
+for (node_id, props, label) in nodes {
+    self.upsert_node(node_id, props, label)?;  // Individual query per node
+}
+```
+
+**Current Edge Insertion:**
+```rust
+// upsert_edge requires internal ID lookup via Cypher MATCH
+self.graph.upsert_edge(&source_id, &target_id, props, rel_type)?;
+```
+
+**Benchmark Results (muninn codebase - 50 files):**
+- Parse time (tree-sitter): 214ms
+- Store time (graphqlite): 29,315ms
+- **99.3% of indexing time is spent in graph storage**
+
+The bottleneck is not SQLite itself (which can handle millions of inserts per second), but the per-insert overhead of:
+1. Cypher query parsing
+2. Property map construction
+3. For edges: MATCH query to resolve external IDs to internal row IDs
+
+## Proposed Solution
+
+### 1. Bulk Node Insert
+
+```rust
+/// Insert multiple nodes in a single transaction with minimal overhead.
+/// Returns a map of external_id -> internal_id for subsequent edge insertion.
+fn insert_nodes_bulk<I, N, P, K, V, L>(
+    &self,
+    nodes: I,
+) -> Result<HashMap<String, i64>>
+where
+    I: IntoIterator<Item = (N, P, L)>,
+    N: AsRef<str>,           // external node ID
+    P: IntoIterator<Item = (K, V)>,
+    K: AsRef<str>,
+    V: Into<Value>,
+    L: AsRef<str>,           // label
+```
+
+**Implementation approach:**
+- Begin transaction
+- Batch INSERT into `nodes` table
+- Batch INSERT into `node_labels` table
+- Batch INSERT into `node_props_*` tables
+- Commit transaction
+- Return external_id -> internal_id mapping
+
+### 2. Bulk Edge Insert (with ID mapping)
+
+```rust
+/// Insert multiple edges using pre-resolved internal IDs.
+/// Use the mapping returned from insert_nodes_bulk.
+fn insert_edges_bulk<I, P, K, V, R>(
+    &self,
+    edges: I,
+    id_map: &HashMap<String, i64>,
+) -> Result<()>
+where
+    I: IntoIterator<Item = (String, String, P, R)>,  // (source_ext_id, target_ext_id, props, rel_type)
+    P: IntoIterator<Item = (K, V)>,
+    K: AsRef<str>,
+    V: Into<Value>,
+    R: AsRef<str>,
+```
+
+**Implementation approach:**
+- Begin transaction
+- Look up internal IDs from provided mapping (in-memory, no DB query)
+- Batch INSERT into `edges` table
+- Batch INSERT into `edge_props_*` tables
+- Commit transaction
+
+### 3. Alternative: Raw SQL Access
+
+If bulk methods are complex to implement, exposing raw SQL execution would allow users to optimize their specific use case:
+
+```rust
+/// Execute raw SQL for advanced use cases.
+fn execute_sql(&self, sql: &str) -> Result<()>;
+
+/// Execute raw SQL with parameters.
+fn execute_sql_params(&self, sql: &str, params: &[Value]) -> Result<()>;
+```
+
+## Example Usage
+
+```rust
+// Build graph from parsed source code
+let symbols: Vec<Symbol> = parse_files(&files);
+let edges: Vec<Edge> = extract_relationships(&symbols);
+
+// Bulk insert nodes, get ID mapping
+let id_map = graph.insert_nodes_bulk(
+    symbols.iter().map(|s| (s.id(), s.properties(), s.label()))
+)?;
+
+// Bulk insert edges using the mapping
+graph.insert_edges_bulk(
+    edges.iter().map(|e| (e.source_id, e.target_id, e.properties(), e.rel_type)),
+    &id_map,
+)?;
+```
+
+## Expected Performance Improvement
+
+Based on SQLite's raw insert performance and our current bottleneck analysis:
+
+| Operation | Current | Expected with Bulk |
+|-----------|---------|-------------------|
+| 1600 nodes | ~10s | <100ms |
+| 7300 edges | ~20s | <500ms |
+| **Total** | ~30s | <1s |
+
+This would make graph indexing fast enough to run on every file save in watch mode.
+
+## Workaround Attempted
+
+We tried using raw Cypher with batched CREATE statements:
+
+```cypher
+CREATE (n0:Function {id: 'x', ...}), (n1:Struct {id: 'y', ...}), ...
+```
+
+This works for nodes but hits SQLite limits:
+- `too many FROM clause terms, max: 200`
+- `at most 64 tables in a join`
+
+For edges, any MATCH-based approach triggers expensive joins:
+```cypher
+MATCH (s0 {id: 'x'}), (t0 {id: 'y'}) CREATE (s0)-[:CALLS]->(t0)
+-- Each node match = table join
+```
+
+## Backlog Item Details
+
+### Type
+- [x] Feature - New functionality or enhancement
+
+### Priority
+- [x] P1 - High (important for user experience)
+
+### Business Justification
+- **User Value**: Enables practical use of graphqlite for code indexing and other large-scale graph construction use cases
+- **Business Value**: Unlocks the primary use case for muninn (code graph indexing for AI-assisted development)
+- **Effort Estimate**: L
+
+## Acceptance Criteria
+
+- [ ] `insert_nodes_bulk` method implemented with batch INSERT operations
+- [ ] `insert_edges_bulk` method implemented using in-memory ID mapping
+- [ ] Both methods wrapped in transactions for atomicity
+- [ ] Python bindings exposed for bulk operations
+- [ ] Benchmark shows 30x+ improvement for 1000+ node/edge insertions
+- [ ] Documentation with usage examples
+
+## Implementation Notes
+
+### Technical Approach
+1. Add bulk insert methods to core Rust `Graph` struct
+2. Use prepared statements with batch parameter binding
+3. Return HashMap for external->internal ID mapping from node bulk insert
+4. Expose via Python bindings with appropriate type conversions
+
+### Dependencies
+- Related to GQLITE-T-0094 (transaction-based batch bindings)
+
+### Risk Considerations
+- Schema evolution: bulk inserts bypass Cypher so must directly match table structure
+- Memory usage: collecting ID mappings for very large graphs may need streaming approach
+
+## Context
+
+- **Project**: muninn - code graph indexing for AI-assisted development
+- **Scale**: Typical codebase has 100-1000 files, 10k-100k symbols, 50k-500k edges
+- **Use case**: Index on startup, incremental updates on file change
+
+## Status Updates
+
+### 2026-01-10: Initial Implementation Complete
+
+Implemented bulk insert operations for both Rust and Python bindings:
+
+**New API Methods:**
+- `insert_nodes_bulk(nodes)` - Insert nodes, returns HashMap<external_id, rowid>
+- `insert_edges_bulk(edges, id_map)` - Insert edges using ID map
+- `insert_graph_bulk(nodes, edges)` - Convenience method for both
+- `resolve_node_ids(ids)` - Resolve existing node IDs
+
+**Performance Results (in-memory, 1000 nodes + 5000 edges):**
+
+| Language | Nodes | Edges | Total |
+|----------|-------|-------|-------|
+| Rust | 15.6ms (64k/s) | 140ms (35k/s) | 156ms |
+| Python | 11ms (94k/s) | 39ms (128k/s) | 49ms |
+
+**Improvement vs Original:**
+- Original approach: ~29 seconds for similar workload
+- New bulk insert: ~50-156ms
+- **Speedup: 185-580x faster**
+
+**Files Added/Modified:**
+- `bindings/rust/src/graph/bulk.rs` - Rust implementation
+- `bindings/rust/src/graph/mod.rs` - Module export
+- `bindings/rust/src/lib.rs` - Public export
+- `bindings/python/src/graphqlite/graph/bulk.py` - Python implementation
+- `bindings/python/src/graphqlite/graph/__init__.py` - Module export
+- `bindings/python/src/graphqlite/__init__.py` - Public export
@@ -0,0 +1,132 @@
+---
+id: update-batch-bindings-to-use
+level: task
+title: "Update batch bindings to use transactions instead of for loops"
+short_code: "GQLITE-T-0094"
+created_at: 2026-01-10T04:16:05.171987+00:00
+updated_at: 2026-01-10T13:55:47.109934+00:00
+parent: 
+blocked_by: []
+archived: false
+
+tags:
+  - "#task"
+  - "#tech-debt"
+  - "#phase/completed"
+
+
+exit_criteria_met: false
+strategy_id: NULL
+initiative_id: NULL
+---
+
+# Update batch bindings to use transactions instead of for loops
+
+Refactor existing batch methods (`upsert_nodes_batch`, `upsert_edges_batch`) to wrap operations in a single transaction rather than executing individual upserts in a loop.
+
+## Objective
+
+Improve batch operation performance by wrapping multiple upsert calls in a single SQLite transaction, reducing fsync overhead and providing atomicity guarantees.
+
+## Problem
+
+The current batch methods are implemented as simple for loops:
+
+```rust
+// Current implementation - no transaction wrapping
+pub fn upsert_nodes_batch(...) {
+    for (node_id, props, label) in nodes {
+        self.upsert_node(node_id, props, label)?;  // Each call is its own transaction
+    }
+}
+```
+
+Without explicit transaction wrapping, SQLite auto-commits after each statement. This means:
+1. Each insert triggers an fsync to disk (slow)
+2. No atomicity - partial failures leave inconsistent state
+3. Unnecessary overhead from repeated transaction begin/commit
+
+## Proposed Solution
+
+Wrap batch operations in explicit transactions:
+
+```rust
+pub fn upsert_nodes_batch(...) -> Result<()> {
+    self.begin_transaction()?;
+    for (node_id, props, label) in nodes {
+        if let Err(e) = self.upsert_node(node_id, props, label) {
+            self.rollback()?;
+            return Err(e);
+        }
+    }
+    self.commit()?;
+    Ok(())
+}
+```
+
+## Backlog Item Details
+
+### Type
+- [x] Tech Debt - Code improvement or refactoring
+
+### Priority
+- [x] P1 - High (important for user experience)
+
+### Technical Debt Impact
+- **Current Problems**: Batch operations are slow due to per-operation transaction overhead; no atomicity guarantees
+- **Benefits of Fixing**: 5-10x performance improvement for batch operations; atomic batch inserts (all-or-nothing)
+- **Risk Assessment**: Low risk - straightforward refactoring with clear semantics
+
+## Acceptance Criteria
+
+## Acceptance Criteria
+
+## Acceptance Criteria
+
+## Acceptance Criteria
+
+- [ ] `upsert_nodes_batch` wraps all operations in a single transaction
+- [ ] `upsert_edges_batch` wraps all operations in a single transaction
+- [ ] Transaction rolls back on any individual operation failure
+- [ ] Python bindings maintain the same API (transparent improvement)
+- [ ] Benchmark shows measurable improvement for 100+ item batches
+- [ ] Unit tests verify atomicity (partial failure = full rollback)
+
+## Implementation Notes
+
+### Technical Approach
+1. Add `begin_transaction()`, `commit()`, and `rollback()` methods to Graph if not already present
+2. Modify `upsert_nodes_batch` to wrap operations in transaction
+3. Modify `upsert_edges_batch` to wrap operations in transaction
+4. Ensure proper error handling with rollback on failure
+5. Consider adding optional transaction parameter for caller-controlled transactions
+
+### Dependencies
+- None - can be implemented independently
+- Related to GQLITE-T-0093 (bulk insert feature) which will need similar transaction handling
+
+### Risk Considerations
+- Nested transaction handling if caller is already in a transaction
+- Large batches may hold locks longer - consider chunking for very large batches
+
+## Status Updates
+
+### Resolution (2026-01-10)
+
+**Outcome**: Resolved differently than originally planned.
+
+Transaction wrapping for batch methods conflicts with the Cypher extension's internal transaction management, causing syntax errors and rollback failures.
+
+**Solution implemented**:
+1. **Bulk insert methods** (GQLITE-T-0093) provide the high-performance atomic batch operations users need
+2. **Batch methods** remain as convenience wrappers with documented limitations
+
+**Key differences**:
+| Aspect | `upsert_*_batch` | `insert_*_bulk` |
+|--------|------------------|-----------------|
+| Semantics | Upsert (MERGE) | Insert only |
+| Atomicity | No | Yes |
+| Performance | ~1x (no improvement) | 100-500x faster |
+| Use case | Mixed workloads | Building new graphs |
+
+**Documentation updated** to clearly state that batch methods do not provide atomicity, and users should use bulk methods for atomic operations.