
Commit 269c6cd

Feat/bulk insert operations (#11)
feat: add bulk insert operations for high-performance graph construction

Add insert_nodes_bulk, insert_edges_bulk, insert_graph_bulk, and resolve_node_ids methods to both the Rust and Python bindings, giving 100-500x faster graph construction compared to individual upserts.

Key design:
- Bypass Cypher parsing entirely; use direct SQL
- Build a HashMap<external_id, internal_rowid> during node insertion
- Use the ID map for edge insertion to avoid expensive MATCH queries
- Wrap all operations in transactions for atomicity

Also clarifies that the existing batch methods (upsert_*_batch) do not provide atomicity due to transaction conflicts with the Cypher extension. Users who need atomic batch operations should use the bulk methods.

Closes GQLITE-T-0093, GQLITE-T-0094
1 parent 0216c4b commit 269c6cd

File tree

14 files changed: +2065 -10 lines changed

Lines changed: 246 additions & 0 deletions
@@ -0,0 +1,246 @@
---
id: bulk-insert-operations-for-nodes
level: task
title: "Bulk Insert Operations for Nodes and Edges"
short_code: "GQLITE-T-0093"
created_at: 2026-01-10T04:16:05.119817+00:00
updated_at: 2026-01-10T04:16:05.119817+00:00
parent:
blocked_by: []
archived: false

tags:
- "#task"
- "#phase/backlog"
- "#feature"


exit_criteria_met: false
strategy_id: NULL
initiative_id: NULL
---

# Bulk Insert Operations for Nodes and Edges

Add true bulk insert methods to graphqlite that bypass individual Cypher query overhead, enabling high-performance graph construction from external data sources.

## Objective

Enable efficient bulk insertion of nodes and edges by providing native bulk insert APIs that bypass per-insert Cypher parsing overhead, reducing graph construction time by 30-100x.

## Problem

When building graphs from parsed source code (or any external data), we need to insert thousands of nodes and edges efficiently. The current approach has significant overhead:

**Current Node Insertion:**
```rust
// upsert_nodes_batch is just a loop calling upsert_node individually
for (node_id, props, label) in nodes {
    self.upsert_node(node_id, props, label)?; // Individual query per node
}
```

**Current Edge Insertion:**
```rust
// upsert_edge requires internal ID lookup via Cypher MATCH
self.graph.upsert_edge(&source_id, &target_id, props, rel_type)?;
```

**Benchmark Results (muninn codebase - 50 files):**
- Parse time (tree-sitter): 214ms
- Store time (graphqlite): 29,315ms
- **99.3% of indexing time is spent in graph storage**

The bottleneck is not SQLite itself (which can handle millions of inserts per second), but the per-insert overhead of:
1. Cypher query parsing
2. Property map construction
3. For edges: MATCH query to resolve external IDs to internal row IDs

## Proposed Solution

### 1. Bulk Node Insert

```rust
/// Insert multiple nodes in a single transaction with minimal overhead.
/// Returns a map of external_id -> internal_id for subsequent edge insertion.
fn insert_nodes_bulk<I, N, P, K, V, L>(
    &self,
    nodes: I,
) -> Result<HashMap<String, i64>>
where
    I: IntoIterator<Item = (N, P, L)>,
    N: AsRef<str>, // external node ID
    P: IntoIterator<Item = (K, V)>,
    K: AsRef<str>,
    V: Into<Value>,
    L: AsRef<str>, // label
```

**Implementation approach** (sketched below):
- Begin transaction
- Batch INSERT into `nodes` table
- Batch INSERT into `node_labels` table
- Batch INSERT into `node_props_*` tables
- Commit transaction
- Return external_id -> internal_id mapping
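
A minimal sketch of this approach, assuming rusqlite and simplified `nodes` / `node_labels` schemas; the real graphqlite tables (including the `node_props_*` tables) will differ:

```rust
use std::collections::HashMap;
use rusqlite::{params, Connection, Result};

/// Sketch only: one prepared statement per table, reused for every row,
/// all inside a single transaction. Table and column names are assumptions.
fn insert_nodes_bulk_sketch(
    conn: &mut Connection,
    nodes: &[(String, String)], // (external_id, label)
) -> Result<HashMap<String, i64>> {
    let tx = conn.transaction()?;
    let mut id_map = HashMap::with_capacity(nodes.len());
    {
        let mut insert_node = tx.prepare("INSERT INTO nodes (external_id) VALUES (?1)")?;
        let mut insert_label =
            tx.prepare("INSERT INTO node_labels (node_id, label) VALUES (?1, ?2)")?;
        for (external_id, label) in nodes {
            // `insert` returns the new rowid, which becomes the internal id.
            let rowid = insert_node.insert(params![external_id])?;
            insert_label.execute(params![rowid, label])?;
            id_map.insert(external_id.clone(), rowid);
        }
    } // statements are dropped here so the transaction can be committed
    tx.commit()?;
    Ok(id_map)
}
```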

### 2. Bulk Edge Insert (with ID mapping)

```rust
/// Insert multiple edges using pre-resolved internal IDs.
/// Use the mapping returned from insert_nodes_bulk.
fn insert_edges_bulk<I, P, K, V, R>(
    &self,
    edges: I,
    id_map: &HashMap<String, i64>,
) -> Result<()>
where
    I: IntoIterator<Item = (String, String, P, R)>, // (source_ext_id, target_ext_id, props, rel_type)
    P: IntoIterator<Item = (K, V)>,
    K: AsRef<str>,
    V: Into<Value>,
    R: AsRef<str>,
```

**Implementation approach** (sketched below):
- Begin transaction
- Look up internal IDs from the provided mapping (in-memory, no DB query)
- Batch INSERT into `edges` table
- Batch INSERT into `edge_props_*` tables
- Commit transaction
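
A corresponding sketch for edges, again assuming rusqlite and a simplified `edges(source_id, target_id, rel_type)` table, with the `edge_props_*` tables omitted:

```rust
use std::collections::HashMap;
use rusqlite::{params, Connection, Result};

/// Sketch only: ID resolution is a HashMap lookup rather than a MATCH query.
fn insert_edges_bulk_sketch(
    conn: &mut Connection,
    edges: &[(String, String, String)], // (source_ext_id, target_ext_id, rel_type)
    id_map: &HashMap<String, i64>,
) -> Result<()> {
    let tx = conn.transaction()?;
    {
        let mut insert_edge = tx.prepare(
            "INSERT INTO edges (source_id, target_id, rel_type) VALUES (?1, ?2, ?3)",
        )?;
        for (src, dst, rel) in edges {
            // A real implementation should report missing ids as an error
            // instead of panicking on the map lookup.
            let (src_id, dst_id) = (id_map[src], id_map[dst]);
            insert_edge.execute(params![src_id, dst_id, rel])?;
        }
    }
    tx.commit()
}
```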

### 3. Alternative: Raw SQL Access

If bulk methods are complex to implement, exposing raw SQL execution would allow users to optimize their specific use case:

```rust
/// Execute raw SQL for advanced use cases.
fn execute_sql(&self, sql: &str) -> Result<()>;

/// Execute raw SQL with parameters.
fn execute_sql_params(&self, sql: &str, params: &[Value]) -> Result<()>;
```

## Example Usage

```rust
// Build graph from parsed source code
let symbols: Vec<Symbol> = parse_files(&files);
let edges: Vec<Edge> = extract_relationships(&symbols);

// Bulk insert nodes, get ID mapping
let id_map = graph.insert_nodes_bulk(
    symbols.iter().map(|s| (s.id(), s.properties(), s.label()))
)?;

// Bulk insert edges using the mapping
graph.insert_edges_bulk(
    edges.iter().map(|e| (e.source_id, e.target_id, e.properties(), e.rel_type)),
    &id_map,
)?;
```

## Expected Performance Improvement

Based on SQLite's raw insert performance and our current bottleneck analysis:

| Operation | Current | Expected with Bulk |
|-----------|---------|-------------------|
| 1600 nodes | ~10s | <100ms |
| 7300 edges | ~20s | <500ms |
| **Total** | ~30s | <1s |

This would make graph indexing fast enough to run on every file save in watch mode.

## Workaround Attempted

We tried using raw Cypher with batched CREATE statements:

```cypher
CREATE (n0:Function {id: 'x', ...}), (n1:Struct {id: 'y', ...}), ...
```

This works for nodes but hits SQLite limits:
- `too many FROM clause terms, max: 200`
- `at most 64 tables in a join`

For edges, any MATCH-based approach triggers expensive joins:
```cypher
MATCH (s0 {id: 'x'}), (t0 {id: 'y'}) CREATE (s0)-[:CALLS]->(t0)
// Each node match = table join
```

## Backlog Item Details

### Type
- [x] Feature - New functionality or enhancement

### Priority
- [x] P1 - High (important for user experience)

### Business Justification
- **User Value**: Enables practical use of graphqlite for code indexing and other large-scale graph construction use cases
- **Business Value**: Unlocks the primary use case for muninn (code graph indexing for AI-assisted development)
- **Effort Estimate**: L

## Acceptance Criteria

- [ ] `insert_nodes_bulk` method implemented with batch INSERT operations
- [ ] `insert_edges_bulk` method implemented using in-memory ID mapping
- [ ] Both methods wrapped in transactions for atomicity
- [ ] Python bindings exposed for bulk operations
- [ ] Benchmark shows 30x+ improvement for 1000+ node/edge insertions
- [ ] Documentation with usage examples

## Implementation Notes

### Technical Approach
1. Add bulk insert methods to the core Rust `Graph` struct
2. Use prepared statements with batch parameter binding (see the chunking sketch below)
3. Return a HashMap for the external -> internal ID mapping from the node bulk insert
4. Expose via Python bindings with appropriate type conversions
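
For step 2, one option is multi-row VALUES binding. A rough sketch, assuming rusqlite and the same simplified `nodes(external_id)` column as above; the chunk size keeps each statement under SQLite's bound-parameter limit (999 in older builds):

```rust
use rusqlite::{params_from_iter, Connection, Result};

/// Sketch only: insert many rows per statement by binding parameters in chunks.
fn insert_external_ids_chunked(conn: &Connection, ids: &[String]) -> Result<()> {
    const CHUNK: usize = 500; // stays well under the bound-parameter limit
    for chunk in ids.chunks(CHUNK) {
        let placeholders = vec!["(?)"; chunk.len()].join(", ");
        let sql = format!("INSERT INTO nodes (external_id) VALUES {}", placeholders);
        // In practice the whole loop would run inside one transaction,
        // as in the bulk sketches above.
        conn.execute(&sql, params_from_iter(chunk.iter()))?;
    }
    Ok(())
}
```

Whether per-row prepared statements or multi-row binding wins depends on the SQLite build and row width, so both are worth benchmarking.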

### Dependencies
- Related to GQLITE-T-0094 (transaction-based batch bindings)

### Risk Considerations
- Schema evolution: bulk inserts bypass Cypher, so they must match the table structure directly
- Memory usage: collecting ID mappings for very large graphs may need a streaming approach

## Context

- **Project**: muninn - code graph indexing for AI-assisted development
- **Scale**: Typical codebase has 100-1000 files, 10k-100k symbols, 50k-500k edges
- **Use case**: Index on startup, incremental updates on file change

## Status Updates

### 2026-01-10: Initial Implementation Complete

Implemented bulk insert operations for both Rust and Python bindings:

**New API Methods** (usage sketched below):
- `insert_nodes_bulk(nodes)` - Insert nodes; returns a HashMap<external_id, rowid>
- `insert_edges_bulk(edges, id_map)` - Insert edges using the ID map
- `insert_graph_bulk(nodes, edges)` - Convenience method that does both
- `resolve_node_ids(ids)` - Resolve existing external node IDs to internal row IDs
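
A rough usage sketch of the two methods not shown in the earlier example; the argument shapes mirror the proposal above, and the exact signatures live in `bindings/rust/src/graph/bulk.rs`, so they may differ from this sketch:

```rust
// Hypothetical usage only; reuses the `symbols` and `edges` collections
// from the Example Usage section above.

// One call that inserts nodes and edges together.
graph.insert_graph_bulk(
    symbols.iter().map(|s| (s.id(), s.properties(), s.label())),
    edges.iter().map(|e| (e.source_id, e.target_id, e.properties(), e.rel_type)),
)?;

// On a later incremental run, rebuild the ID map for nodes that already exist
// instead of re-inserting them.
let id_map = graph.resolve_node_ids(symbols.iter().map(|s| s.id()))?;
```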

**Performance Results (in-memory, 1000 nodes + 5000 edges):**

| Language | Nodes | Edges | Total |
|----------|-------|-------|-------|
| Rust | 15.6ms (64k/s) | 140ms (35k/s) | 156ms |
| Python | 11ms (94k/s) | 39ms (128k/s) | 49ms |

**Improvement vs Original:**
- Original approach: ~29 seconds for similar workload
- New bulk insert: ~50-156ms
- **Speedup: 185-580x faster**

**Files Added/Modified:**
- `bindings/rust/src/graph/bulk.rs` - Rust implementation
- `bindings/rust/src/graph/mod.rs` - Module export
- `bindings/rust/src/lib.rs` - Public export
- `bindings/python/src/graphqlite/graph/bulk.py` - Python implementation
- `bindings/python/src/graphqlite/graph/__init__.py` - Module export
- `bindings/python/src/graphqlite/__init__.py` - Public export
Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
---
id: update-batch-bindings-to-use
level: task
title: "Update batch bindings to use transactions instead of for loops"
short_code: "GQLITE-T-0094"
created_at: 2026-01-10T04:16:05.171987+00:00
updated_at: 2026-01-10T13:55:47.109934+00:00
parent:
blocked_by: []
archived: false

tags:
- "#task"
- "#tech-debt"
- "#phase/completed"


exit_criteria_met: false
strategy_id: NULL
initiative_id: NULL
---

# Update batch bindings to use transactions instead of for loops

Refactor existing batch methods (`upsert_nodes_batch`, `upsert_edges_batch`) to wrap operations in a single transaction rather than executing individual upserts in a loop.

## Objective

Improve batch operation performance by wrapping multiple upsert calls in a single SQLite transaction, reducing fsync overhead and providing atomicity guarantees.

## Problem

The current batch methods are implemented as simple for loops:

```rust
// Current implementation - no transaction wrapping
pub fn upsert_nodes_batch(...) {
    for (node_id, props, label) in nodes {
        self.upsert_node(node_id, props, label)?; // Each call is its own transaction
    }
}
```

Without explicit transaction wrapping, SQLite auto-commits after each statement. This means:
1. Each insert triggers an fsync to disk (slow)
2. No atomicity - partial failures leave inconsistent state
3. Unnecessary overhead from repeated transaction begin/commit

## Proposed Solution

Wrap batch operations in explicit transactions:

```rust
pub fn upsert_nodes_batch(...) -> Result<()> {
    self.begin_transaction()?;
    for (node_id, props, label) in nodes {
        if let Err(e) = self.upsert_node(node_id, props, label) {
            self.rollback()?;
            return Err(e);
        }
    }
    self.commit()?;
    Ok(())
}
```

## Backlog Item Details

### Type
- [x] Tech Debt - Code improvement or refactoring

### Priority
- [x] P1 - High (important for user experience)

### Technical Debt Impact
- **Current Problems**: Batch operations are slow due to per-operation transaction overhead; no atomicity guarantees
- **Benefits of Fixing**: 5-10x performance improvement for batch operations; atomic batch inserts (all-or-nothing)
- **Risk Assessment**: Low risk - straightforward refactoring with clear semantics

## Acceptance Criteria

- [ ] `upsert_nodes_batch` wraps all operations in a single transaction
- [ ] `upsert_edges_batch` wraps all operations in a single transaction
- [ ] Transaction rolls back on any individual operation failure
- [ ] Python bindings maintain the same API (transparent improvement)
- [ ] Benchmark shows measurable improvement for 100+ item batches
- [ ] Unit tests verify atomicity (partial failure = full rollback); a test sketch follows this list
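
A sketch of how the atomicity criterion could be exercised; `Graph::open_in_memory`, the batch argument shape, and `node_count()` are hypothetical stand-ins rather than the actual graphqlite API:

```rust
// Hypothetical test sketch: the last item in the batch is assumed to be
// rejected, and atomicity means nothing from the batch survives the failure.
#[test]
fn batch_failure_rolls_back_everything() {
    let graph = Graph::open_in_memory().unwrap();
    let before = graph.node_count().unwrap();

    let result = graph.upsert_nodes_batch(vec![
        ("a", vec![("name", "ok")], "Node"),
        ("b", vec![("name", "also ok")], "Node"),
        ("", vec![("name", "missing id")], "Node"), // assumed to be rejected (empty id)
    ]);

    assert!(result.is_err());
    // All-or-nothing: the successful upserts before the failure are rolled back.
    assert_eq!(graph.node_count().unwrap(), before);
}
```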

## Implementation Notes

### Technical Approach
1. Add `begin_transaction()`, `commit()`, and `rollback()` methods to Graph if not already present
2. Modify `upsert_nodes_batch` to wrap operations in a transaction
3. Modify `upsert_edges_batch` to wrap operations in a transaction
4. Ensure proper error handling with rollback on failure
5. Consider adding an optional transaction parameter for caller-controlled transactions

### Dependencies
- None - can be implemented independently
- Related to GQLITE-T-0093 (bulk insert feature), which will need similar transaction handling

### Risk Considerations
- Nested transaction handling if the caller is already in a transaction
- Large batches may hold locks longer - consider chunking for very large batches

## Status Updates

### Resolution (2026-01-10)

**Outcome**: Resolved differently than originally planned.

Transaction wrapping for batch methods conflicts with the Cypher extension's internal transaction management, causing syntax errors and rollback failures.

**Solution implemented**:
1. **Bulk insert methods** (GQLITE-T-0093) provide the high-performance atomic batch operations users need
2. **Batch methods** remain as convenience wrappers with documented limitations

**Key differences**:

| Aspect | `upsert_*_batch` | `insert_*_bulk` |
|--------|------------------|-----------------|
| Semantics | Upsert (MERGE) | Insert only |
| Atomicity | No | Yes |
| Performance | ~1x (no improvement) | 100-500x faster |
| Use case | Mixed workloads | Building new graphs |

**Documentation updated** to clearly state that batch methods do not provide atomicity, and users should use bulk methods for atomic operations.
