Commit 586b1cc

v0.6.0 LlamaIndex or LangChain, 15 graph (8 + ArangoDB, AGE, Cosmos, SurrealDB, Spanner, HugeGraph, TigerGraph), 4 RDF (3 + Neptune RDF), 10 vector, 3 search

1. 15 total property graph databases: 8 existing LlamaIndex stores with LangChain versions added (Neo4j, ArcadeDB, FalkorDB, Memgraph, NebulaGraph, Neptune, Neptune Analytics, LadybugDB); 6 new LC-only stores (ArangoDB, Apache AGE, Azure Cosmos DB Gremlin, Apache HugeGraph, SurrealDB, TigerGraph); 1 new LI-only store (Google Cloud Spanner).
2. 10 LangChain vector backends (Qdrant, Elasticsearch, Milvus, Weaviate, LanceDB, Chroma, Pinecone, pgvector, OpenSearch, Neo4j vector); 3 LangChain search backends (Elasticsearch, OpenSearch, BM25); 4 RDF/triple-store backends (Fuseki, GraphDB, Oxigraph, plus new Amazon Neptune RDF with IAM SigV4 auth).
3. flexible-graphrag now runs fully on LlamaIndex, fully on LangChain, or any mix; both frameworks are first-class peers. Each pipeline stage is independently configurable: CHUNKER_BACKEND, KG_EXTRACTOR_BACKEND, GRAPH_BACKEND, VECTOR_BACKEND, SEARCH_BACKEND, RETRIEVAL_FUSION, LLM_PROVIDER, EMBEDDING_KIND. Note: document readers / data sources remain LlamaIndex-based (first pipeline stage). LangChain-only graph stores auto-select GRAPH_BACKEND=langchain.
4. Retrievers for both LI and LC, with fusion support for both frameworks (RETRIEVAL_FUSION=llamaindex uses QueryFusionRetriever; RETRIEVAL_FUSION=langchain uses EnsembleRetriever when all stores are LC-backed). LangChain retrievers include: Synonym Exploder (expands query terms for vector search), pg_vector + neighborhood traversal for Neo4j (LANGCHAIN_PG_VECTOR_SEARCH, USE_PG_NEIGHBORHOOD), and text-to-query graph QA for all LC property graph stores (generates Cypher for Neo4j/ArcadeDB/Memgraph/FalkorDB/Ladybug/AGE, GQL for HugeGraph, SurrealQL for SurrealDB, AQL for ArangoDB, SPARQL for all RDF stores).
5. Matrix test support: run_matrix.py / run_all_profiles.py; 24+ integration test profiles covering all PG, vector, search, RDF, and chunker combinations.
6. Docling OCR: DOCLING_OCR=true + DOCLING_OCR_ENGINE (auto / rapidocr / easyocr / tesseract_cli / tesserocr / ocrmac); optional extras for easyocr, tesserocr, ocrmac.
7. Incremental update (add, delete, modify) end-to-end across property graph, RDF, vector, and search databases on both LlamaIndex and LangChain backends.
8. scripts/cleanup.py: all 15 property graph stores have native-client cleanup; early skip when a store stage is disabled, to improve speed; PostgreSQL document state / datasource config table cleanup is skipped when incremental updates are disabled (ENABLE_INCREMENTAL_UPDATES=false).
9. Observability: upgraded OpenLIT + OpenInference LangChain instrumentation; both OTLP producers (LlamaIndex via OpenLIT, LangChain via OpenInference) work simultaneously.
10. Docs site (zensical 0.0.40) + major doc updates: ARCHITECTURE.md (15 PG stores, 13 LLM providers, framework backends section), per-store setup guides (Cosmos Gremlin, Neptune, Spanner), CONFIG-PROPERTY-GRAPH, DATABASE-CONFIGURATION, UI-TAB-SEARCH, MCP-TOOLS; all broken links fixed, README.md updated.
11. Per-store config isolation: each database and LLM/embedding provider has its own typed config env var ({TYPE}_GRAPH_DB_CONFIG, {TYPE}_VECTOR_DB_CONFIG, {TYPE}_SEARCH_DB_CONFIG, {KIND}_EMBEDDING_MODEL, etc.); per-store config takes precedence over generic fallback; no shared config collisions across stores.
12. Time logging now separates KG extraction time from graph storage time.
1 parent f1814fd commit 586b1cc

287 files changed

Lines changed: 35844 additions & 8568 deletions

.gitignore

Lines changed: 24 additions & 0 deletions
@@ -188,6 +188,19 @@ flexible-graphrag/log/
ladybug/
*.lbug

# Integration / matrix test data directories (created during test runs)
flexible-graphrag/lancedb_integration_test/
flexible-graphrag/lancedb_matrix_test/
flexible-graphrag/*_integration_test/
flexible-graphrag/*_matrix_test/
lancedb_integration_test/
lancedb_matrix_test/
*_integration_test/
*_matrix_test/
tests/integration/logs/
tests/integration/envs/
tests/integration/results/

# Document processing output directories
flexible-graphrag/parsing_output/
parsing_output/
@@ -209,3 +222,14 @@ memory-bank/
flexible-graphrag/.claude/settings.local.json
.claude/settings.local.json

# GCS service account credentials (machine-specific, never commit)
flexible-graphrag/gcs.json
gcs.json

# Google Vertex AI / gen-lang client credentials (machine-specific, never commit)
flexible-graphrag/gen-lang-client.json
gen-lang-client.json


# pyright
pyrightconfig.json

CHANGELOG.md

Lines changed: 167 additions & 0 deletions
@@ -2,6 +2,173 @@

All notable changes to this project will be documented in this file.

## [2026-05-08] - Incremental Delete Fixes, Docling OCR, Dependency Compatibility

### Added

Configurable Docling OCR: `DOCLING_OCR=true` + `DOCLING_OCR_ENGINE` (auto / rapidocr / easyocr / tesseract_cli / tesserocr / ocrmac). Optional extras `docling-ocr-easyocr`, `docling-ocr-tesserocr`, `docling-ocr-ocrmac`. `Docling OCR config (app):` log line records the requested engine separately from Docling's own runtime selection message.

### Fixed

**Incremental delete** now works end-to-end for all active databases in LangChain-backed mode:
- Qdrant: `QdrantVectorAdapter.delete()` queries `metadata.ref_doc_id`, `metadata.doc_id`, and flat variants
- Elasticsearch: `ElasticsearchSearchAdapter.delete()` uses `delete_by_query` across both `ref_doc_id` and `doc_id` metadata keys
- Neo4j: `ref_doc_id` injected into entity node properties and stamped on `Chunk` nodes after `add_graph_documents`
- RDF (GraphDB): SPARQL DELETE confirmed working; debug logging added to adapter
- LlamaIndex Elasticsearch: uses `await store.adelete()` to avoid event-loop conflict in async engine path
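
The multi-key delete described in the list above can be illustrated with a minimal sketch. It only builds a query body that matches a document under any of several ID fields. The function name `build_delete_query` and the exact field paths are assumptions for illustration, not the adapter's actual code.

```python
def build_delete_query(doc_id: str) -> dict:
    """Build a query body matching a document under any known ID field.

    The field paths below are assumed (nested and flat variants of the
    `ref_doc_id` / `doc_id` keys named in the changelog entry).
    """
    fields = ("metadata.ref_doc_id", "metadata.doc_id", "ref_doc_id", "doc_id")
    return {
        "query": {
            "bool": {
                # One exact-match clause per candidate field.
                "should": [{"term": {field: doc_id}} for field in fields],
                # At least one clause must match for a hit.
                "minimum_should_match": 1,
            }
        }
    }


body = build_delete_query("doc-1")
```

A body like this would typically be handed to a `delete_by_query` call. Whether `term` clauses match depends on the index mapping, which this sketch does not cover.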

**`langchain-age` Python 3.14**: pinned to `langchain-age==0.1.2` (standalone package; requires `antlr4>=4.11`). The PyPI `0.2.0` used `antlr4<4.11`, causing `TypeError: ord()` on startup. New `age-extras` optional group.

**`omegaconf` / `antlr4` conflict**: `extras-overrides.txt` pins `omegaconf>=2.4.0.dev10` so `rapidocr` (bundled in `docling-slim[standard]`) and `langchain-age==0.1.2` coexist in one environment.

**SPARQL hallucination**: `_GraphDBQAChain` and `_GenericSparqlQAChain` return an empty string (skip LLM call) when a query yields 0 rows; previously the LLM answered from training data.
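
The zero-row guard can be sketched as follows. This is not the code of `_GraphDBQAChain` or `_GenericSparqlQAChain`. The function `answer_from_rows` and the `llm` callable are hypothetical stand-ins, used only to show the idea of skipping the LLM call when a query returns nothing.

```python
from typing import Callable, Sequence


def answer_from_rows(
    question: str,
    rows: Sequence[dict],
    llm: Callable[[str], str],  # hypothetical: any prompt -> text callable
) -> str:
    """Return "" without calling the LLM when the query produced no rows."""
    if not rows:
        # Nothing retrieved: skip the LLM so it cannot answer from training data.
        return ""
    context = "\n".join(str(row) for row in rows)
    prompt = (
        f"Question: {question}\n"
        f"SPARQL results:\n{context}\n"
        "Answer using only these results."
    )
    return llm(prompt)
```

An empty return value lets the caller treat the result as "no graph answer" instead of passing along an ungrounded one.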

**`check_elasticsearch.py`**: reads document text from `text` key (LangChain) in addition to `content`/`page_content`; `--content-len` arg; all metadata fields printed.

---

## [2026-05-07] - v0.6.0: Spanner + Cosmos Gremlin Cloud Graphs, Cleanup for All 15 PG Stores

### Added

`PG_GRAPH_DB=spanner`: Google Cloud Spanner Graph via LlamaIndex (`llama-index-spanner`), `spanner-extras` dependency group. Auto-creates `{graph_name}_NODE` / `{graph_name}_EDGE` tables and property graph on first ingest.

Auto-create graph container on first ingest for Cosmos Gremlin (uses `ClientSecretCredential` to avoid antivirus false positives) and Spanner; no manual setup required.

Per-store setup guides: `COSMOS-GREMLIN-SETUP.md`, `NEPTUNE-SETUP.md` (rewritten), `SPANNER-SETUP.md`, all linked from `DATABASE-CONFIGURATION.md` and docs nav.

### Fixed / Improved

`scripts/cleanup.py`: all 15 property graph stores now have native-client cleanup (previously LC-only stores had no automated path). Spanner table names corrected (`{graph_name}_NODE` / `_EDGE`). Early exit before slow imports when `VECTOR_DB=none` / `SEARCH_DB=none` / `PG_GRAPH_DB=none`. PostgreSQL incremental state cleanup skipped when `ENABLE_INCREMENTAL_UPDATES=false`. Windows `ProactorEventLoop` teardown error suppressed.

### Updated

`pyproject.toml` version `0.6.0` for flexible-graphrag and flexible-graphrag-mcp. `PG_GRAPH_DB` picker in `.env` / `env-sample.txt` reorganised: **LI + LC** (8), **LI only** (Spanner), **LC only** (6).

---

## [2026-05-06] - Spanner LI Adapter, Cosmos Gremlin Cloud Config, Neptune Analytics + Neptune RDF, Namespace From Config

### Added

`PG_GRAPH_DB=spanner`: Google Cloud Spanner property graph now uses **LlamaIndex** via `llama-index-spanner` (`SpannerPropertyGraphStore`). Supports cloud and emulator. New `spanner-extras` optional dependency group (`uv pip install -e ".[spanner-extras]"`). Config keys: `project_id`, `instance_id`, `database_id`, `graph_name`, `credentials_file` (optional; uses ADC if absent). This supersedes the LC-only `langchain-google-spanner`, which requires `langchain-core<1.0` and is incompatible.

`PG_GRAPH_DB=cosmos_gremlin` cloud config documented. `COSMOS_GREMLIN_GRAPH_DB_CONFIG` cloud format added to `.env`, `env-sample.txt`, and `CONFIG-PROPERTY-GRAPH.md`: `{"url": "wss://my-cosmos.gremlin.cosmos.azure.com:443/", "username": "/dbs/<db>/colls/<graph>", "password": "<primary_key>"}`.

`RDF_GRAPH_DB=neptune_rdf`: Amazon Neptune RDF/SPARQL backend with IAM SigV4 auth. Included in the integration test matrix.

`PG_GRAPH_DB=neptune_analytics` LangChain backend: `NeptuneAnalyticsAdapter` passes explicit AWS credentials via `SecretStr` (no env-var race) and wraps `NeptuneAnalyticsGraph` in `_NeptuneGraphWithWrite` to add `add_graph_documents` support. Both LlamaIndex and LangChain backends validated end-to-end.

### Fixed

SPARQL namespace prefixes and graph URIs now use `RDF_BASE_NAMESPACE` / `RDF_ONTOLOGY_NAMESPACE` from `config.py` across all RDF adapters (Fuseki, Oxigraph, GraphDB, Neptune). Defaults are unchanged.

### Updated

Property graph database counts corrected to **15 total**: 8 both LI+LC, 1 LI-only (Spanner), 6 LC-only (ArangoDB, Apache AGE, Cosmos Gremlin, HugeGraph, SurrealDB, TigerGraph). Updated in `DATABASE-CONFIGURATION.md`, `CONFIG-PROPERTY-GRAPH.md`, and `README.md`.

---

## [2026-04-06 → 2026-05-06] - LangChain as Full Peer Framework, New Databases, Retriever Refactor, Docs Site, Matrix Testing

### Added - LangChain as Full Peer Framework

Every pipeline stage can independently run on LlamaIndex or LangChain via env var pickers: `GRAPH_BACKEND`, `VECTOR_BACKEND`, `SEARCH_BACKEND`, `CHUNKER_BACKEND`, `KG_EXTRACTOR_BACKEND`, `RETRIEVAL_FUSION`. LangChain-only graph stores auto-select `GRAPH_BACKEND=langchain`.
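
The auto-selection rule can be sketched roughly as below. The store names are the six LC-only `PG_GRAPH_DB` values listed in the 2026-05-06 entry. The function name `resolve_graph_backend` and the `"llamaindex"` default are assumptions for illustration, not the project's actual code.

```python
# LC-only PG_GRAPH_DB values, taken from the changelog's store counts.
LC_ONLY_GRAPH_STORES = {
    "arangodb",
    "apache_age",
    "cosmos_gremlin",
    "hugegraph",
    "surrealdb",
    "tigergraph",
}


def resolve_graph_backend(pg_graph_db: str, requested: str = "llamaindex") -> str:
    """Pick the graph backend; LangChain-only stores force 'langchain'."""
    if pg_graph_db.lower() in LC_ONLY_GRAPH_STORES:
        return "langchain"
    # Stores supported by both frameworks keep whatever was requested.
    return requested
```

For example, `resolve_graph_backend("surrealdb")` yields `"langchain"` regardless of the requested value, while `resolve_graph_backend("neo4j")` keeps the requested backend.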

`LC_SPLITTER_TYPE` selects from 6 LangChain text splitters: `recursive`, `character`, `token`, `markdown`, `python`, `sentence_transformers`.

`skip_graph` parameter added to `POST /api/ingest-text`, `POST /api/test-sample`, and corresponding MCP tools; it ingests into vector/search without KG extraction.

### Added - New LangChain-Only Property Graph Backends

Seven new graph databases (auto-select `GRAPH_BACKEND=langchain`):

| `PG_GRAPH_DB` | Database | Port |
|---|---|---|
| `arangodb` | ArangoDB | 8529 |
| `apache_age` | Apache AGE (PostgreSQL + Cypher) | 5434 |
| `cosmos_gremlin` | Azure Cosmos DB Gremlin / TinkerPop | 8182 |
| `hugegraph` | Apache HugeGraph | 8082 |
| `surrealdb` | SurrealDB | 8010 |
| `tigergraph` | TigerGraph | 9002 / 14240 |
| `spanner` | Google Cloud Spanner (emulator supported) | 9010 / 9020 |

LangChain-backed ingestion + retrieval also added for all existing LlamaIndex-supported stores: `neo4j`, `arcadedb`, `falkordb`, `memgraph`, `nebula`, `neptune`, `neptune_analytics`, `ladybug`.

### Added - New LangChain Vector and Search Backends

New LC vector adapters (`VECTOR_BACKEND=langchain`): Milvus, Weaviate, LanceDB, Chroma, Pinecone, pgvector, OpenSearch, alongside existing Qdrant, Elasticsearch, Neo4j paths.

New LC search adapters (`SEARCH_BACKEND=langchain`): BM25 (in-memory), Elasticsearch, OpenSearch.

### Added - Adapter Layer Refactor

- **`adapters/`**: framework-neutral ABCs and factories for graph, vector, search, process, and LLM subsystems
- **`llamaindex/`**: all LlamaIndex-specific implementations extracted into subpackages (`graph/`, `llm/`, `vector/`, `search/`, `process/`)
- **`langchain/graph/pg_store_adapters/`**: one file per LC property graph store (15 stores)
- **`langchain/graph/retrievers/`**: `li_`/`lc_` two-layer retriever classes; `langchain/retriever_bridge.py` bridge classes
- **`langchain/vector/`**, **`langchain/search/`**, **`langchain/process/`**: LC adapters for each subsystem
- **`ingest/`**: fully modular pipeline steps (`run_chunk_pipeline`, `update_pg_graph`, `update_rdf_graph`, `update_vector`, `update_search`, `ingest_lc_graph`)

### Added - New Docker Containers

| Service | Port(s) |
|---|---|
| Apache AGE (PostgreSQL + Cypher) | 5434 |
| Apache HugeGraph + Hubble UI | 8082, 8085 |
| SurrealDB + Surrealist UI | 8010, 8011 |
| TigerGraph Community 4.2.2 | 9002, 14240 |
| Apache TinkerPop Gremlin Server | 8182 |
| Google Spanner emulator | 9010, 9020 |

Standalone pgvector container added at port 5433 (separate from Alfresco PostgreSQL at 5432).

### Added - Retriever Architecture and Search Quality

- Result deduplication and rank-based re-scoring after fusion (`query_engine.py`)
- Source database label on every search result (e.g. *"file.txt | Qdrant vector"*, *"file.txt | Ontotext GraphDB rdf graph"*)
- SPARQL bi-directional fallback retry and improved keyword extraction for zero-result queries
- Cypher tri-part UNION pattern for organizational structure queries (Neo4j, ArcadeDB, Memgraph)
- NebulaGraph dynamic schema patch: `ALTER TAG/EDGE ... ADD` on `SemanticError` during arbitrary property insertion
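
The first item in the list above (deduplication plus rank-based re-scoring) might look like the sketch below. It assumes hits arrive as dicts with a `text` key, already ordered by the fusion step, and scores them as `1 / rank`. Both the data shape and the scoring formula are assumptions, not the logic in `query_engine.py`.

```python
def dedupe_and_rescore(results: list[dict]) -> list[dict]:
    """Drop repeated texts, then assign scores from final rank position."""
    seen: set[str] = set()
    unique: list[dict] = []
    for hit in results:  # assumed to be ordered best-first by the fusion step
        key = hit["text"].strip()
        if key in seen:
            continue  # same passage returned by more than one store
        seen.add(key)
        unique.append(hit)
    # Rank-based score: 1.0 for the first hit, 0.5 for the second, and so on.
    return [{**hit, "score": 1.0 / rank} for rank, hit in enumerate(unique, start=1)]


hits = [
    {"text": "Alice works at Acme.", "source": "Qdrant vector"},
    {"text": "Alice works at Acme.", "source": "Elasticsearch search"},
    {"text": "Acme is based in Berlin.", "source": "Neo4j graph"},
]
ranked = dedupe_and_rescore(hits)
```

Scoring by rank rather than raw score sidesteps the problem that scores from different stores are not on a common scale.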

### Added - Matrix Integration Testing

Full matrix test runner covering 44 backend profiles across both frameworks:
- `tests/integration/run_all_profiles.py`: sequential profile runner with `--clean`, `--include`/`--exclude`, per-profile backend logs, JSON result file
- 44 profiles: 17 property graph × 2 frameworks, 3 RDF stores, 10 vector stores, 4 search stores, combined and LC-pipe profiles
- Test files: `test_ingest_search.py`, `test_incremental.py`, `test_lc_pipeline.py`

### Added - Observability

- **LangChain OpenInference tracing**: `openinference-instrumentation-langchain` added to `observability` and `observability-dual` pyproject.toml groups; `LangChainInstrumentor` initialised alongside `LlamaIndexInstrumentor` in `telemetry_setup.py`; all `custom_hooks.py` span decorators made framework-agnostic with a `framework` kwarg and duck-typed result extraction for both LlamaIndex and LangChain response shapes
- **OpenLIT minimum version bumped to `>=1.41.2`**: this release fixed the long-standing `openai` downgrade (2.x → 1.x); all OpenLIT pyproject.toml groups (`observability-openlit`, `observability-dual`) updated; openai downgrade warnings removed from docs and README
- **Separate KG extraction and graph store timings**: ingestion log now shows distinct elapsed times for the KG extraction phase (LLM calls) and the graph store write phase, making it easier to identify whether latency comes from the LLM or the database
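
Per-phase timing of this kind can be sketched with a small context manager. The phase names and the stand-in work inside each block are illustrative assumptions; the project's actual logging code is not shown in this commit view.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(phase: str, timings: dict):
    """Record the wall-clock duration of the enclosed block under `phase`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[phase] = time.perf_counter() - start


timings: dict = {}

with timed("kg_extraction", timings):
    # Stand-in for the LLM-driven extraction step.
    triples = [("Alice", "WORKS_AT", "Acme")]

with timed("graph_store", timings):
    # Stand-in for the database write step.
    stored = len(triples)

for phase, seconds in timings.items():
    print(f"{phase}: {seconds:.3f}s")
```

Recording the two phases under separate keys is what makes it possible to tell LLM latency apart from database latency.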

### Changed - Configuration

- **DB pickers**: `PG_GRAPH_DB` (15 stores), `RDF_GRAPH_DB` (4 stores), `VECTOR_DB` (10 stores), `SEARCH_DB` (4 stores); these replace the old generic env vars
- **Per-store config precedence**: `{TYPE}_GRAPH_DB_CONFIG` / `{TYPE}_VECTOR_DB_CONFIG` / `{TYPE}_SEARCH_DB_CONFIG` take priority over generic blobs
- **Per-kind embedding vars**: `OPENAI_EMBEDDING_MODEL`, `OLLAMA_EMBEDDING_MODEL`, `GOOGLE_EMBEDDING_MODEL`, etc. override generic `EMBEDDING_MODEL` per provider
- **`.env` / `env-sample.txt`** restructured: DB selection section first, then framework config section
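
The per-store precedence rule can be sketched as a lookup that prefers the typed variable and falls back to a generic one. The `{TYPE}_GRAPH_DB_CONFIG` pattern comes from the entry above. The generic name `GRAPH_DB_CONFIG` and the function `resolve_graph_config` are hypothetical, used here only to make the fallback concrete.

```python
import json
from typing import Mapping


def resolve_graph_config(store: str, env: Mapping[str, str]) -> dict:
    """Return the store's JSON config, preferring the per-store variable."""
    specific = env.get(f"{store.upper()}_GRAPH_DB_CONFIG")
    generic = env.get("GRAPH_DB_CONFIG")  # hypothetical generic fallback name
    raw = specific or generic or "{}"
    return json.loads(raw)


env = {
    "NEO4J_GRAPH_DB_CONFIG": '{"url": "bolt://localhost:7687"}',
    "GRAPH_DB_CONFIG": '{"url": "generic"}',
}
neo4j_config = resolve_graph_config("neo4j", env)
memgraph_config = resolve_graph_config("memgraph", env)
```

Here `neo4j_config` comes from the typed variable while `memgraph_config` falls back to the generic blob. In practice `os.environ` could be passed as `env`.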
153+
154+
### Changed — Documentation
155+
156+
- `README.md` — LangChain framework config, new graph/vector/search databases, framework pickers, updated project structure
157+
- `docs/ADVANCED/LANGCHAIN/LANGCHAIN-GRAPH-INTEGRATION.md` — LangChain architecture, new graph store adapters
158+
- `docs/ADVANCED/PORT-MAPPINGS.md` — new database service ports (AGE, HugeGraph, SurrealDB, TigerGraph, Gremlin, Spanner)
159+
- `docs/CONFIGURATION/CONFIG-PROPERTY-GRAPH.md``PG_GRAPH_DB` picker, all 15 stores, per-store config
160+
- `docs/CONFIGURATION/CONFIG-SEARCH-DATABASES.md``SEARCH_DB` picker, LC search backends
161+
- `docs/CONFIGURATION/CONFIG-VECTOR-DATABASES.md``VECTOR_DB` picker, LC vector backends
162+
- `docs/CONFIGURATION/LANGCHAIN-CONFIGURATION.md` — framework backend pickers, LC splitter types
163+
- `docs/DATABASES/DATABASE-CONFIGURATION.md` — updated database overview
164+
- `docs/DATABASES/GRAPH-DATABASES/NEBULA-SETUP.md` — dynamic schema patch notes
165+
- `docs/DATABASES/GRAPH-DATABASES/NEBULA-LANGCHAIN-SETUP.md` — new: NebulaGraph LangChain backend setup guide
166+
- `docs/GETTING-STARTED/ENVIRONMENT-CONFIGURATION.md` — restructured env sections
167+
- `docs/HOME/HOME-DATABASES.md` — new LC-only graph databases
168+
- `docs/MCP/MCP-TOOLS.md``skip_graph` parameter on `ingest_text` and `test_with_sample`
169+
170+
---
171+
5172
## [2026-04-16] - With existing / new md content will how have a Zensical documentation website, including a user guide with coverage of the 13 data source forms and 4 tabs
6173

7174
### Added
