Build a complete GraphRAG pipeline from scratch using SEC 10-K financial filing data. Three notebooks progress from loading raw data into a knowledge graph to running graph-enriched retrieval queries.
- Data Loading: Build a two-layer knowledge graph — structured entities (Company, Product, RiskFactor) and unstructured document chunks — connected by cross-links
- Embeddings: Generate vector embeddings with Amazon Nova and create a vector index for semantic search
- VectorCypherRetriever: Combine vector similarity search with Cypher graph traversal to enrich LLM context with entity data
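The document-chunk layer from the first bullet can be sketched without any graph dependencies. The helper below is a hypothetical illustration (names like `chunk_text` are not from the lab notebooks): it splits a filing's text into fixed-size overlapping chunks and records the sequential ordering that the NEXT_CHUNK relationships capture in the graph.

```python
# Hypothetical sketch of the chunk layer: split a filing's text into
# overlapping windows and record the NEXT_CHUNK ordering between them.
# All names here are illustrative, not taken from the lab code.
from dataclasses import dataclass


@dataclass
class Chunk:
    index: int
    text: str


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[Chunk]:
    """Fixed-size sliding-window chunking with overlap between neighbors."""
    step = size - overlap
    chunks = []
    for i, start in enumerate(range(0, max(len(text) - overlap, 1), step)):
        chunks.append(Chunk(index=i, text=text[start:start + size]))
    return chunks


def next_chunk_pairs(chunks: list[Chunk]) -> list[tuple[int, int]]:
    """Ordered (from, to) index pairs mirroring NEXT_CHUNK relationships."""
    return [(a.index, b.index) for a, b in zip(chunks, chunks[1:])]
```

The overlap keeps sentence fragments from being lost at chunk boundaries, which matters later when each chunk is embedded independently.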
Before starting this lab, make sure you have:
- Completed Lab 1 (Neo4j Aura instance created and running)
- Filled in your credentials in CONFIG.txt at the project root
- A running environment (SageMaker Studio or GitHub Codespace)
Note: This lab wipes and rebuilds the graph from scratch using financial_data.json. It does not depend on data from previous labs.
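The wipe step the note describes typically begins with a full graph reset. A common Cypher pattern for this (an assumption about the notebook's approach, not a quote from it) is:

```cypher
// Delete every node along with its relationships, emptying the graph
MATCH (n) DETACH DELETE n
```

Rebuilding then proceeds by loading financial_data.json into fresh nodes, so nothing from earlier labs is required or preserved.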
| Notebook | Title | What You Build |
|---|---|---|
| 01_data_loading.ipynb | Data Loading | Company, Product, RiskFactor, Document, and Chunk nodes linked by OFFERS, FACES_RISK, FILED, FROM_DOCUMENT, NEXT_CHUNK, and FROM_CHUNK relationships |
| 02_embeddings.ipynb | Embeddings | Vector embeddings on every Chunk node, plus a chunkEmbeddings vector index |
| 03_vector_cypher_retriever.ipynb | VectorCypher Retriever | A VectorCypherRetriever that enriches each vector match with document metadata and connected entities via Cypher, wired into a GraphRAG pipeline |
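As a library-free illustration of the two-stage retrieval the last table row describes — vector similarity first, then graph enrichment — the sketch below fakes both steps over toy in-memory data. All data and names are hypothetical stand-ins; the real notebook runs the equivalent logic against Neo4j, with Cypher performing the enrichment traversal.

```python
# Library-free sketch of VectorCypher-style retrieval:
# 1) rank Chunk embeddings by cosine similarity to the query vector,
# 2) enrich each match with the entities "connected" to it in the graph.
# Toy data throughout; the lab does this with Neo4j and a vector index.
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm


# Toy "graph": chunk id -> (embedding, text)
chunks = {
    "c1": ([1.0, 0.0], "Revenue grew due to cloud product sales."),
    "c2": ([0.0, 1.0], "Supply chain disruption remains a key risk."),
}

# Mimics FROM_CHUNK edges from chunks to extracted entities
entity_links = {
    "c1": ["Product: CloudSuite"],
    "c2": ["RiskFactor: Supply chain disruption"],
}


def retrieve(query_vec: list[float], top_k: int = 1) -> list[dict]:
    """Vector search, then entity enrichment per match."""
    ranked = sorted(
        chunks, key=lambda cid: cosine(query_vec, chunks[cid][0]), reverse=True
    )
    return [
        {"chunk": chunks[cid][1], "entities": entity_links[cid]}
        for cid in ranked[:top_k]
    ]
```

The payoff of the second stage is that the LLM sees not just the matched text but the structured entities attached to it, which is what distinguishes this retriever from plain vector search.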