Fix/chunking service dependency by JuanPalbo · Pull Request #50 · ucudal/reto-xmas-2025-goland-ia-backend

JuanPalbo · 2025-12-17T00:47:21Z

No description provided.

- Add error handling for MinIO get_object and read operations, with validation and clear logging for bucket/object issues
- Wrap pdfplumber.open and page.extract_text in try/except to handle corrupted or password-protected PDFs and log errors
- Refactor table extraction: add per-row and per-cell sanitization, robustly handle non-iterable or malformed tables, and log all exceptions with context
- Warn on large PDF files loaded into memory
- Ensure all error cases are logged and do not break pipeline processing
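For readers skimming this commit message, the per-row and per-cell sanitization it mentions could look roughly like the sketch below. This is not the code from the PR: `sanitize_cell`, `sanitize_table`, and the `page_number` parameter are hypothetical names, and the input is assumed to be a list of rows of the kind a PDF table extractor typically returns.

```python
import logging
from typing import Any, List, Optional

logger = logging.getLogger(__name__)


def sanitize_cell(cell: Any) -> str:
    """Turn one raw cell into a single-line string (None becomes "")."""
    if cell is None:
        return ""
    # Collapse newlines and repeated whitespace so the cell stays on one line.
    return " ".join(str(cell).split())


def sanitize_table(raw_table: Any, page_number: int) -> Optional[List[List[str]]]:
    """Return cleaned rows, or None when the table is unusable.

    Hypothetical helper: malformed input is logged and skipped instead of
    raising, so one bad table does not stop the rest of the pipeline.
    """
    if not isinstance(raw_table, (list, tuple)):
        logger.warning("Page %d: table is not a list of rows, skipping", page_number)
        return None

    rows: List[List[str]] = []
    for row_index, row in enumerate(raw_table):
        if not isinstance(row, (list, tuple)):
            logger.warning(
                "Page %d: row %d is malformed, skipping", page_number, row_index
            )
            continue
        cleaned = [sanitize_cell(cell) for cell in row]
        if any(cleaned):  # drop rows whose cells are all empty
            rows.append(cleaned)
    return rows or None
```

Returning `None` rather than raising keeps the "log and continue" behaviour described in the last bullet; the caller only has to skip tables that come back as `None`.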

…or improved readability and maintainability

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

github-actions · 2025-12-17T00:50:03Z

🔍 PR Validation Results

Check	Status
Build	✅ success
Trivy	Check Security tab


Copilot

Pull request overview

This PR refactors the PDF processing pipeline to support table-aware chunking, preventing tables from being split across chunks. The refactoring extracts MinIO client logic into a separate module and introduces a content block abstraction that distinguishes between text and table content types.

Key Changes:

- Added langchain-text-splitters dependency and removed duplicate dependencies in pyproject.toml
- Extracted MinIO client configuration into a dedicated minio_client.py module with get_minio_client() and download_object() utility functions
- Refactored PDF processing to extract content as typed blocks (text vs. table), with tables converted to markdown format and preserved as atomic units during chunking
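To make the "typed blocks" idea concrete, here is a minimal sketch of a block type and a table-to-markdown conversion. It is an illustration under assumptions, not the code in pdf_processor.py: `ContentBlock`, its fields, and `table_to_markdown` are hypothetical names, and the first row is assumed to be the header.

```python
from dataclasses import dataclass
from typing import List, Literal


@dataclass
class ContentBlock:
    """Hypothetical typed block: either free text or one whole table."""
    kind: Literal["text", "table"]
    content: str
    page: int


def table_to_markdown(rows: List[List[str]]) -> str:
    """Render rows as a markdown table, treating the first row as the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def format_row(row: List[str]) -> str:
        # Pad short rows so every line has the same number of columns.
        padded = list(row) + [""] * (width - len(row))
        # Escape pipes so cell text cannot break the table layout.
        escaped = [cell.replace("|", "\\|") for cell in padded]
        return "| " + " | ".join(escaped) + " |"

    separator = "| " + " | ".join(["---"] * width) + " |"
    lines = [format_row(rows[0]), separator]
    lines.extend(format_row(row) for row in rows[1:])
    return "\n".join(lines)


# Usage: wrap a converted table in a block so later stages can keep it whole.
block = ContentBlock(
    kind="table",
    content=table_to_markdown([["item", "qty"], ["bolt", "4"]]),
    page=1,
)
```

Carrying the `kind` alongside the content is what lets a later chunking step decide whether a block may be split.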

Reviewed changes

Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.

File	Description
RAGManager/pyproject.toml	Added langchain-text-splitters dependency; removed duplicate typing-extensions and uvicorn entries
RAGManager/app/services/minio_client.py	New module containing extracted MinIO client creation and object download utilities with proper error handling
RAGManager/app/services/pdf_processor.py	Refactored to extract content as typed blocks (text/table); tables converted to markdown; context extraction for tables; simplified error handling
RAGManager/app/services/chunking_service.py	Updated to handle table-aware chunking; tables remain atomic regardless of size; added small chunk merging logic; replaced hardcoded values with configurable parameters
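
The chunking behaviour summarised for chunking_service.py (tables stay atomic, small chunks get merged, sizes are configurable) can be sketched as follows. This is a simplified stand-in, not the PR's implementation: it splits text on blank lines instead of using langchain-text-splitters, leaves oversized paragraphs whole for brevity, and `chunk_blocks`, `chunk_size`, and `min_chunk_size` are hypothetical names.

```python
from typing import List, Tuple

# (kind, content) where kind is "text" or "table"; a hypothetical block shape.
Block = Tuple[str, str]


def chunk_blocks(
    blocks: List[Block],
    chunk_size: int = 1000,
    min_chunk_size: int = 100,
) -> List[Block]:
    """Chunk text by paragraph, merge small pieces, and never split a table."""
    chunks: List[Block] = []
    for kind, content in blocks:
        if kind == "table":
            # Tables are emitted whole, even when longer than chunk_size.
            chunks.append(("table", content))
            continue

        for paragraph in content.split("\n\n"):
            paragraph = paragraph.strip()
            if not paragraph:
                continue
            previous = chunks[-1] if chunks else None
            can_merge = (
                previous is not None
                and previous[0] == "text"  # never merge text into a table
                and (
                    len(previous[1]) < min_chunk_size
                    or len(paragraph) < min_chunk_size
                )
                and len(previous[1]) + 2 + len(paragraph) <= chunk_size
            )
            if can_merge:
                chunks[-1] = ("text", previous[1] + "\n\n" + paragraph)
            else:
                chunks.append(("text", paragraph))
    return chunks
```

Because a table chunk is never a merge target, text on either side of a table stays in separate chunks, which matches the "atomic unit" description above.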


JuanPalbo and others added 11 commits December 13, 2025 10:59

pdf-processor

5086552

Merge branch 'main' into feature/pdf-processor

108a36a

fix error handling when opening corrupted PDFs

5d3b0cd

Added check for empty PDFs

ba32b9f

Setup MinIO client timeout and retries

e0619c1

Refactor PDF processing to utilize dedicated MinIO client functions f…

056cd66

…or improved readability and maintainability

Merge branch 'main' into fix/minio-client

94a61f2

Update RAGManager/app/services/minio_client.py

20e2004

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Updated chunking service to properly process text and tables.

af6522f

fixed missing dependency

7b3657f

Copilot AI review requested due to automatic review settings December 17, 2025 00:47

Copilot started reviewing on behalf of JuanPalbo December 17, 2025 00:47

Merge branch 'main' into fix/chunking-service-dependency

e1bb8af

Germanadjemian approved these changes Dec 17, 2025


Germanadjemian merged commit e66ac5f into main Dec 17, 2025
4 checks passed

Germanadjemian deleted the fix/chunking-service-dependency branch December 17, 2025 00:50

Copilot AI reviewed Dec 17, 2025


Germanadjemian merged 12 commits into main from fix/chunking-service-dependency


3 participants
