Feature/vector store #54
```diff
@@ -1,33 +1,55 @@
 import logging

 from langchain_core.documents import Document

 from app.core.config import settings
+from app.core.database_connection import SessionLocal
+from app.models.document import Document
 from app.services.chunking_service import document_to_chunks
 from app.services.embedding_service import chunks_to_embeddings
 from app.services.pdf_processor import pdf_to_document
+from app.services.vector_store import store_chunks_with_embeddings

 logger = logging.getLogger(__name__)


+def _create_document_record(filename: str, minio_path: str) -> int:
+    """
+    Create a Document record in the database.
+
+    Args:
+        filename: Original filename of the PDF
+        minio_path: Path to the PDF in MinIO bucket
+
+    Returns:
+        int: The created document's ID
+    """
+    db = SessionLocal()
+    try:
+        document = Document(filename=filename, minio_path=minio_path)
+        db.add(document)
+        db.commit()
+        db.refresh(document)
+        logger.info(f"Created document record with id={document.id}")
+        return document.id
+    finally:
+        db.close()
+
+
 def process_pdf_pipeline(object_name: str) -> int:
     """
     Orchestrates the PDF processing pipeline.

     This function coordinates the three-stage pipeline:
     1. PDF to LangChain Document
     2. Document to Chunks
-    3. Chunks to Embeddings
-    4. Store in database (to be implemented)
+    3. Embed and Store in database using PGVector

     Args:
         object_name: Path/name of the PDF object in the MinIO bucket

     Returns:
-        int: document_id of the created document (mock value for now)
+        int: document_id of the created document

     Raises:
-        NotImplementedError: If any of the pipeline stages are not yet implemented
+        Exception: If any of the pipeline stages fail
     """
     logger.info(f"Starting PDF processing pipeline for object: {object_name}")
```
@@ -44,27 +66,28 @@ def process_pdf_pipeline(object_name: str) -> int: | |
| chunks = document_to_chunks(document, settings.chunk_size, settings.chunk_overlap) | ||
| logger.info(f"Stage 2 completed successfully. Created {len(chunks)} chunks") | ||
|
|
||
| # Stage 3: Chunks to Embeddings | ||
| logger.info("Stage 3: Generating embeddings for chunks") | ||
| embeddings = chunks_to_embeddings(chunks) | ||
| logger.info(f"Stage 3 completed successfully. Generated {len(embeddings)} embeddings") | ||
|
|
||
| # Stage 4: Store in database (placeholder - not implemented yet) | ||
| logger.info("Stage 4: Storing chunks and embeddings in database") | ||
| # TODO: Implement database storage | ||
| # This will: | ||
| # 1. Create a Document record in the documents table | ||
| # 2. Create DocumentChunk records with embeddings in the document_chunks table | ||
| # 3. Return the document_id | ||
| raise NotImplementedError("Database storage will be implemented later") | ||
|
|
||
| except NotImplementedError as e: | ||
| logger.warning(f"Pipeline stage not implemented: {e}") | ||
| # Return a mock document_id for now | ||
| # In production, this should be replaced with actual database storage | ||
| mock_document_id = 1 | ||
| logger.info(f"Pipeline completed with mock document_id: {mock_document_id}") | ||
| return mock_document_id | ||
| # Stage 3: Embed and Store in database | ||
| # First, create the document record to get the document_id | ||
| logger.info("Stage 3: Embedding and storing chunks in database") | ||
|
|
||
| # Extract filename from object_name (e.g., "folder/file.pdf" -> "file.pdf") | ||
| filename = object_name.split("/")[-1] if "/" in object_name else object_name | ||
|
|
||
| # Create document record in the documents table | ||
| document_id = _create_document_record(filename=filename, minio_path=object_name) | ||
|
|
||
| # Store chunks with embeddings using PGVector | ||
| # This generates embeddings via OpenAI and stores in the vector database | ||
| chunks_stored = store_chunks_with_embeddings( | ||
| document_id=document_id, | ||
| filename=filename, | ||
| chunks=chunks, | ||
| ) | ||
| logger.info(f"Stage 3 completed successfully. Stored {chunks_stored} chunks with embeddings") | ||
|
|
||
| logger.info(f"Pipeline completed successfully. Document ID: {document_id}") | ||
| return document_id | ||
|
|
||
| except Exception as e: | ||
| logger.error(f"Error in PDF processing pipeline: {e}") | ||
| raise | ||
|
|
||
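A small aside on the filename extraction added in Stage 3: the conditional `object_name.split("/")[-1] if "/" in object_name else object_name` can be written without a branch. A minimal sketch (the helper name is hypothetical, not part of the PR), assuming plain "/"-separated MinIO object keys:

```python
def filename_from_object_name(object_name: str) -> str:
    """Return the final path segment of a MinIO object key.

    rsplit with maxsplit=1 already returns the whole string when no "/"
    is present, so no separate branch is needed.
    """
    return object_name.rsplit("/", 1)[-1]

print(filename_from_object_name("folder/file.pdf"))  # file.pdf
print(filename_from_object_name("file.pdf"))         # file.pdf
```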
```diff
@@ -0,0 +1,162 @@
+"""
+Vector Store Service - Handles embedding generation and storage using LangChain PGVector.
+
+This service provides functionality to:
+1. Initialize PGVector connection with OpenAI embeddings
+2. Store document chunks with their embeddings in batches
+3. Convert database URLs to psycopg3 format required by langchain-postgres
+"""
+
+import logging
+from urllib.parse import urlparse, urlunparse
+
+from langchain_core.documents import Document
+from langchain_openai import OpenAIEmbeddings
+from langchain_postgres import PGVector
+
+from app.core.config import settings
+
+logger = logging.getLogger(__name__)
+
+# Collection name for the vector store
+COLLECTION_NAME = "document_chunks"
```
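The module docstring above says chunks are stored "in batches", though the batching code itself is not shown in this excerpt. As a sketch only (an assumption about how such a service might bound its write sizes, with a hypothetical helper name), the batching could look like:

```python
from typing import Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: List[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive fixed-size batches; the last batch may be shorter."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Each batch could then be handed to the vector store in a single call, keeping individual requests to the embeddings API and the database bounded.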
|
Comment on lines +18 to +23

Suggested change:

```diff
-logger = logging.getLogger(__name__)
-
-# Collection name for the vector store
-COLLECTION_NAME = "document_chunks"
+from app.agents.nodes.retriever import COLLECTION_NAME
+
+logger = logging.getLogger(__name__)
```
Copilot AI · Dec 17, 2025

The function `_convert_database_url_to_psycopg` is duplicated in `app/agents/nodes/retriever.py` (lines 22-48); the logic is identical. Consider extracting this function to a shared utility module (e.g., `app/core/database_utils.py` or `app/utils/database.py`) to avoid code duplication and ensure consistency across the codebase.
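A shared helper along the lines the reviewer proposes might look like the following. The module path and public name are the reviewer's suggestion, not code from the PR, and the scheme rewrite assumes SQLAlchemy-style URLs:

```python
# Hypothetical shared module, e.g. app/core/database_utils.py
from urllib.parse import urlparse, urlunparse


def convert_database_url_to_psycopg(url: str) -> str:
    """Normalize a PostgreSQL SQLAlchemy URL to the psycopg3 driver.

    "postgresql://..." and "postgresql+psycopg2://..." both become
    "postgresql+psycopg://..."; non-PostgreSQL URLs pass through unchanged.
    """
    parsed = urlparse(url)
    if parsed.scheme.split("+", 1)[0] != "postgresql":
        return url
    return urlunparse(parsed._replace(scheme="postgresql+psycopg"))


print(convert_database_url_to_psycopg("postgresql+psycopg2://user:pw@db:5432/app"))
# postgresql+psycopg://user:pw@db:5432/app
```

Both `vector_store.py` and `retriever.py` could then import this one function instead of each carrying a private copy.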
Copilot AI · Dec 17, 2025

The function `_get_embeddings` creates a new `OpenAIEmbeddings` instance on every call. In `app/agents/nodes/retriever.py` (lines 16-19), the embeddings are initialized once as a module-level variable. Consider using the same pattern here to avoid creating redundant embedding instances, which improves consistency and avoids unnecessary resource churn.
Copilot AI · Dec 17, 2025

The function `_get_vector_store` creates a new `PGVector` instance on every call, and the same pattern exists in `app/agents/nodes/retriever.py` (lines 51-72). Repeatedly creating vector store instances is inefficient. Consider a singleton pattern, a cached instance, or a module-level variable so the same `PGVector` instance is reused across calls.
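The two comments above point at the same fix: construct each client once and reuse it. One way to sketch it, using `functools.lru_cache` in place of the module-level variables `retriever.py` uses; the classes below are stand-ins so the snippet runs without langchain installed, and in the real module they would be `OpenAIEmbeddings` and `PGVector`:

```python
from functools import lru_cache


class Embeddings:  # stand-in for langchain_openai.OpenAIEmbeddings
    pass


class VectorStore:  # stand-in for langchain_postgres.PGVector
    def __init__(self, embeddings: Embeddings) -> None:
        self.embeddings = embeddings


@lru_cache(maxsize=1)
def get_embeddings() -> Embeddings:
    # Built on the first call, then reused; replaces per-call construction.
    return Embeddings()


@lru_cache(maxsize=1)
def get_vector_store() -> VectorStore:
    # Reuses the single cached embeddings instance.
    return VectorStore(get_embeddings())


assert get_vector_store() is get_vector_store()  # same cached instance
```

Compared with a bare module-level variable, the cached factory keeps construction lazy, which matters if creating the client connects to external services at import time.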
The comment states "Use psycopg2 (sync driver), NOT asyncpg" and the example uses `postgresql+psycopg2://`, but the pyproject.toml dependency was changed from `psycopg2-binary` to `psycopg[binary]>=3.0.0` (psycopg3). This is a mismatch: the documentation suggests psycopg2 while the installed driver is psycopg3. Either update the comment to clarify that the `DATABASE_URL` may use the psycopg2 format and the vector store converts it to psycopg3 format internally, or change the example to `postgresql://` without a driver specification.