Fix/chunking service dependency #50

Merged
Germanadjemian merged 12 commits into main from fix/chunking-service-dependency on Dec 17, 2025

@JuanPalbo
Collaborator

No description provided.

JuanPalbo and others added 11 commits December 13, 2025 10:59
- Add error handling for MinIO get_object and read operations, with validation and clear logging for bucket/object issues
- Wrap pdfplumber.open and page.extract_text in try/except to handle corrupted or password-protected PDFs and log errors
- Refactor table extraction: add per-row and per-cell sanitization, robustly handle non-iterable or malformed tables, and log all exceptions with context
- Warn on large PDF files loaded into memory
- Ensure all error cases are logged and do not break pipeline processing
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
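
The per-row and per-cell sanitization described in the commit message could look roughly like the sketch below. It is an illustration only: the names `sanitize_cell` and `sanitize_table` are hypothetical, and it assumes the extracted table arrives as a list of rows whose cells may be `None`, which is the shape pdfplumber's table extraction typically produces. The actual code in this PR may differ.

```python
import logging
from typing import Any, List

logger = logging.getLogger(__name__)


def sanitize_cell(cell: Any) -> str:
    """Turn one raw cell into a single-line string; None becomes ''."""
    if cell is None:
        return ""
    # Collapse embedded newlines and runs of whitespace into single spaces.
    return " ".join(str(cell).split())


def sanitize_table(raw_table: Any) -> List[List[str]]:
    """Return cleaned rows, logging and skipping anything malformed."""
    if not isinstance(raw_table, (list, tuple)):
        logger.warning("Skipping non-iterable table: %r", type(raw_table))
        return []
    rows: List[List[str]] = []
    for index, row in enumerate(raw_table):
        if not isinstance(row, (list, tuple)):
            logger.warning("Skipping malformed row %d: %r", index, type(row))
            continue
        cleaned = [sanitize_cell(cell) for cell in row]
        if any(cleaned):  # drop rows whose cells are all empty
            rows.append(cleaned)
    return rows
```

Malformed input is logged and skipped rather than raised, in line with the stated goal that error cases should not break pipeline processing.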
Copilot AI review requested due to automatic review settings December 17, 2025 00:47
@github-actions

🔍 PR Validation Results

| Check | Status |
| --- | --- |
| Build | ✅ success |
| Trivy | Check Security tab |

View detailed results

@Germanadjemian Germanadjemian merged commit e66ac5f into main Dec 17, 2025
4 checks passed
@Germanadjemian Germanadjemian deleted the fix/chunking-service-dependency branch December 17, 2025 00:50
Contributor

Copilot AI left a comment

Pull request overview

This PR refactors the PDF processing pipeline to support table-aware chunking, preventing tables from being split across chunks. The refactoring extracts MinIO client logic into a separate module and introduces a content block abstraction that distinguishes between text and table content types.

Key Changes:

  • Added langchain-text-splitters dependency and removed duplicate dependencies in pyproject.toml
  • Extracted MinIO client configuration into a dedicated minio_client.py module with get_minio_client() and download_object() utility functions
  • Refactored PDF processing to extract content as typed blocks (text vs. table), with tables converted to markdown format and preserved as atomic units during chunking
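
As a rough illustration of the typed-block idea in the last bullet, the sketch below models content as text or table blocks, renders a table to markdown, and keeps each table as a single chunk. `ContentBlock`, `table_to_markdown`, and `chunk_blocks` are hypothetical names, not taken from the PR, and a naive fixed-width split stands in for whichever splitter from langchain-text-splitters the real code uses.

```python
from dataclasses import dataclass
from typing import List, Literal


@dataclass
class ContentBlock:
    kind: Literal["text", "table"]
    content: str


def table_to_markdown(rows: List[List[str]]) -> str:
    """Render rows as a markdown table, treating the first row as the header.

    Cells are assumed to be plain strings without '|' characters.
    """
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of columns.
    padded = [row + [""] * (width - len(row)) for row in rows]
    header, *body = padded
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def chunk_blocks(blocks: List[ContentBlock], chunk_size: int = 200) -> List[str]:
    """Split text blocks by size; emit every table block as one chunk."""
    chunks: List[str] = []
    for block in blocks:
        if block.kind == "table":
            # Tables stay atomic, even when longer than chunk_size.
            chunks.append(block.content)
            continue
        text = block.content
        # Stand-in splitter: fixed-width slices of the text.
        for start in range(0, len(text), chunk_size):
            chunks.append(text[start:start + chunk_size])
    return chunks
```

The trade-off is that a very large table can produce a chunk well above the size limit, which is the price of never splitting a table mid-row.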

Reviewed changes

Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| RAGManager/pyproject.toml | Added langchain-text-splitters dependency; removed duplicate typing-extensions and uvicorn entries |
| RAGManager/app/services/minio_client.py | New module containing extracted MinIO client creation and object download utilities with proper error handling |
| RAGManager/app/services/pdf_processor.py | Refactored to extract content as typed blocks (text/table); tables converted to markdown; context extraction for tables; simplified error handling |
| RAGManager/app/services/chunking_service.py | Updated to handle table-aware chunking; tables remain atomic regardless of size; added small chunk merging logic; replaced hardcoded values with configurable parameters |
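
The small chunk merging mentioned for chunking_service.py is not shown in this conversation. One plausible shape, sketched under assumptions, is below: the function name `merge_small_chunks` and the parameters `min_chunk_size` and `max_chunk_size` are hypothetical, and chunks are modelled as `(kind, content)` tuples. A short text chunk is folded into the preceding text chunk when the result still fits the size limit, and table chunks are never merged.

```python
from typing import List, Tuple

Chunk = Tuple[str, str]  # (kind, content), where kind is "text" or "table"


def merge_small_chunks(
    chunks: List[Chunk],
    min_chunk_size: int = 100,
    max_chunk_size: int = 1000,
    separator: str = "\n\n",
) -> List[Chunk]:
    """Fold undersized text chunks into the preceding text chunk."""
    merged: List[Chunk] = []
    for kind, content in chunks:
        previous = merged[-1] if merged else None
        can_merge = (
            previous is not None
            and kind == "text"
            and previous[0] == "text"
            and len(content) < min_chunk_size
            and len(previous[1]) + len(separator) + len(content) <= max_chunk_size
        )
        if can_merge:
            merged[-1] = ("text", previous[1] + separator + content)
        else:
            # Tables, and text that cannot be merged, pass through unchanged.
            merged.append((kind, content))
    return merged
```

Passing the thresholds as parameters rather than constants mirrors the move away from hardcoded values noted in the table above.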

3 participants