
pdf-processor#19

Merged
brunoalbin23 merged 6 commits into main from feature/pdf-processor
Dec 14, 2025

Conversation

@JuanPalbo
Collaborator

No description provided.

Copilot AI review requested due to automatic review settings December 13, 2025 14:01
@coderabbitai
Contributor

coderabbitai Bot commented Dec 13, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Integrated cloud object storage for PDF document processing with dual extraction modes: individual pages or complete documents.
    • Implemented robust table extraction from PDFs with enhanced error handling and detailed processing logs.
    • Added comprehensive document metadata tracking including source and page information.
  • Chores

    • Improved environment file configuration handling.

✏️ Tip: You can customize this high-level summary in your review settings.

Walkthrough

Adds MinIO-backed PDF retrieval and per-page parsing (text and tables) with sanitization, metadata, error handling, and two public API functions; renames pipeline parameter from minio_url to object_name; adds .env to .gitignore.

Changes

  • Configuration (.gitignore): Added .env to ignored files.
  • PDF processing (RAGManager/app/services/pdf_processor.py): Implemented MinIO client helper get_minio_client(); added pdf_to_document(object_name, bucket_name=None, minio_client=None) returning a per-page Document list and pdf_to_single_document(...) returning a single concatenated Document; added _sanitize_cell() and _extract_tables_safely(); performs in-memory MinIO download, pdfplumber parsing, per-page text/table extraction, metadata enrichment, and robust error handling/logging.
  • Pipeline (RAGManager/app/services/pipeline.py): Renamed entry parameter and references from minio_url to object_name; updated docstring and call site to use pdf_to_document(object_name).

Sequence Diagram

sequenceDiagram
    participant Pipeline as Pipeline
    participant Processor as PDF Processor
    participant MinIO as MinIO Storage
    participant PDFReader as pdfplumber

    Pipeline->>Processor: process_pdf_pipeline(object_name)
    Processor->>Processor: get_minio_client()
    Processor->>MinIO: GET object (object_name, bucket)
    MinIO-->>Processor: return PDF bytes
    Processor->>PDFReader: open PDF bytes
    PDFReader-->>Processor: pages
    loop per page
        Processor->>Processor: extract text
        Processor->>Processor: _extract_tables_safely()
        Processor->>Processor: _sanitize_cell() for table cells
        Processor->>Processor: assemble Document with metadata
    end
    Processor-->>Pipeline: return list[Document] or single Document

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Focus review on RAGManager/app/services/pdf_processor.py (MinIO client configuration, in-memory byte handling, pdfplumber usage, table-extraction/error handling, metadata fields).
  • Confirm caller updates for process_pdf_pipeline parameter rename in RAGManager/app/services/pipeline.py and any other call sites.
  • Inspect logging messages and exception handling paths for clarity and consistency.

Poem

🐰 I hopped through bytes with whiskers bright,

Pulled pages from a bucket late at night,
Tidied cells and stitched each line with care,
Tagged every page with where and when and where,
A rabbit's nod — the parser's set to flight.

Pre-merge checks and finishing touches

❌ Failed checks (2 inconclusive)
  • Title check — ❓ Inconclusive. The title 'pdf-processor' is vague and generic, failing to convey meaningful information about the specific changes made in the changeset. Resolution: replace it with a more descriptive title, such as 'Implement MinIO-backed PDF processing with per-page extraction' or 'Add PDF processor service with MinIO integration'.
  • Description check — ❓ Inconclusive. No pull request description was provided by the author, making it impossible to assess whether the description relates to the changeset. Resolution: add a pull request description explaining the changes, such as the implementation of MinIO-backed PDF processing, per-page document extraction, and pipeline parameter updates.
✅ Passed checks (1 passed)
  • Docstring Coverage — ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feature/pdf-processor

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ba32b9f and e0619c1.

📒 Files selected for processing (1)
  • RAGManager/app/services/pdf_processor.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
RAGManager/app/services/pdf_processor.py (1)
RAGManager/app/models/document.py (1)
  • Document (10-21)
🔇 Additional comments (6)
RAGManager/app/services/pdf_processor.py (6)

1-13: LGTM!

Imports are well-organized and appropriate for the functionality. The use of langchain_core.documents.Document (for RAG/vector store operations) is correctly distinguished from app.models.document.Document (ORM model for database storage).


31-58: LGTM!

Robust error handling with type validation at each level (tables, table, row). The defensive approach with per-row exception handling ensures that one malformed row won't break extraction of other tables.


61-88: Potential conflict between cert_reqs and secure=False.

When settings.minio_secure is False (HTTP connection), the cert_reqs="CERT_REQUIRED" setting on the PoolManager may cause issues or be misleading. Consider making cert requirements conditional on the secure setting.

     # Create PoolManager with timeout, retry, and CA bundle
     http_client = urllib3.PoolManager(
         timeout=timeout,
         retries=retry,
         maxsize=10,
-        cert_reqs="CERT_REQUIRED",
-        ca_certs=certifi.where(),
+        cert_reqs="CERT_REQUIRED" if settings.minio_secure else "CERT_NONE",
+        ca_certs=certifi.where() if settings.minio_secure else None,
     )

Verify your MinIO configuration to ensure this doesn't cause connection issues in development/test environments with secure=False.


150-155: Good handling of empty PDFs.

The check for empty PDFs with a clear error message is well-implemented. This addresses the edge case previously flagged.


188-219: LGTM!

The pdf_to_single_document function is a clean wrapper that properly delegates to pdf_to_document and combines the results. All parameters including minio_client are correctly forwarded.


204-206: Verified: minio_client is forwarded to pdf_to_document.

An initial reading suggested the minio_client parameter was accepted but never forwarded; checking the call site shows all three parameters are passed positionally:

documents = pdf_to_document(object_name, bucket_name, minio_client)

This is correct, so no change is needed.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

Copilot AI left a comment


Pull request overview

This PR implements PDF processing functionality to convert PDF files from MinIO storage into LangChain Document objects. The implementation resolves merge conflicts in dependencies and replaces a placeholder function with a fully functional PDF processor using pdfplumber for text and table extraction.

  • Implements two PDF processing functions: pdf_to_document() for page-by-page extraction and pdf_to_single_document() for combined content
  • Adds MinIO client integration with configuration from environment settings
  • Resolves merge conflicts in pyproject.toml dependencies

Reviewed changes

Copilot reviewed 1 out of 2 changed files in this pull request and generated 5 comments.

  • RAGManager/pyproject.toml: Resolves merge conflict markers in the dependencies section
  • RAGManager/app/services/pdf_processor.py: Implements complete PDF processing with MinIO integration, text/table extraction, and document generation with metadata
  • .gitignore: Adds .env file to ignore list for environment variable protection


Comment thread RAGManager/app/services/pdf_processor.py
Comment on lines +43 to +48
response = minio_client.get_object(bucket_name, object_name)
try:
    pdf_bytes = response.read()
finally:
    response.close()
    response.release_conn()

Copilot AI Dec 13, 2025


Missing error handling for MinIO operations. If the object doesn't exist, bucket doesn't exist, or there are authentication/permission issues, the get_object call will raise an exception that isn't caught here. Consider adding try-except blocks to provide more informative error messages for common failure scenarios (e.g., S3Error for missing objects or permission denied).
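The wrapping the comment asks for could look roughly like this; download_object is a hypothetical helper (not code from the PR), and real code would catch minio.error.S3Error specifically rather than a bare Exception:

```python
import logging

logger = logging.getLogger(__name__)


def download_object(minio_client, bucket_name: str, object_name: str) -> bytes:
    """Fetch an object's bytes, surfacing a clear error on failure (illustrative)."""
    try:
        response = minio_client.get_object(bucket_name, object_name)
    except Exception as e:  # real code: catch minio.error.S3Error (missing object, permissions)
        logger.error("Failed to get '%s' from bucket '%s': %s", object_name, bucket_name, e)
        raise ValueError(f"Failed to retrieve '{object_name}' from bucket '{bucket_name}'") from e
    try:
        return response.read()
    finally:
        # MinIO responses must be closed and the connection released
        response.close()
        response.release_conn()
```

The ValueError keeps MinIO-specific failure details in the logs while giving callers one predictable exception type to handle.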

Collaborator Author


@copilot open a new pull request to apply changes based on this feedback

response.release_conn()

# Open PDF from bytes using pdfplumber
with pdfplumber.open(io.BytesIO(pdf_bytes)) as pdf:

Copilot AI Dec 13, 2025


Missing validation for PDF file integrity. The code directly opens the bytes as a PDF without verifying it's actually a valid PDF file. Malformed or non-PDF data could cause pdfplumber to raise exceptions or behave unexpectedly. Consider adding validation or wrapping the pdfplumber.open call in a try-except block to catch and handle PDFSyntaxError or similar exceptions with a clear error message.
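A cheap pre-check is possible before handing the bytes to pdfplumber. This sketch (looks_like_pdf is a hypothetical helper) only tests for the PDF header magic; parser exceptions such as pdfminer's PDFSyntaxError would still need to be caught for files that pass the header check but are otherwise corrupt:

```python
def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check: a PDF begins with the '%PDF-' magic bytes.

    Some real-world files prepend a few junk bytes before the header,
    so we scan the first 1 KB rather than requiring the magic at offset 0.
    """
    return b"%PDF-" in data[:1024]
```

In pdf_to_document, failing this check could raise a ValueError with a clear message before pdfplumber.open is even attempted.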

Collaborator Author


@copilot open a new pull request to apply changes based on this feedback

# Download the PDF from MinIO into memory
response = minio_client.get_object(bucket_name, object_name)
try:
    pdf_bytes = response.read()

Copilot AI Dec 13, 2025


Loading entire PDF into memory could cause issues with large files. For very large PDFs (e.g., hundreds of megabytes), calling response.read() will load the entire file into memory at once, which could lead to memory issues or performance degradation. Consider processing the PDF in a streaming fashion if large files are expected, or add file size limits with appropriate error handling.
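One way to enforce such a limit is to check the object's size before reading it into memory. The cap and helper below are illustrative, not PR code; the MinIO Python SDK's stat_object returns metadata including a size attribute, so the check can happen before any download:

```python
MAX_PDF_BYTES = 100 * 1024 * 1024  # illustrative 100 MB cap, not a value from the PR


def check_pdf_size(size_bytes: int, max_bytes: int = MAX_PDF_BYTES) -> None:
    """Raise if a PDF would exceed the in-memory size budget."""
    if size_bytes > max_bytes:
        mb = size_bytes / (1024 * 1024)
        limit_mb = max_bytes / (1024 * 1024)
        raise ValueError(f"PDF is {mb:.1f} MB, exceeding the {limit_mb:.0f} MB limit")


# Intended call site (sketch): size comes from stat_object before get_object
# stat = minio_client.stat_object(bucket_name, object_name)
# check_pdf_size(stat.size)
```

Rejecting oversized objects up front avoids the partial-download-then-fail path entirely.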

Collaborator Author


@copilot open a new pull request to apply changes based on this feedback

"object_name": object_name,
"page": page_num,
"total_pages": len(pdf.pages),
"filename": object_name.split("/")[-1],

Copilot AI Dec 13, 2025


Inconsistent filename extraction logic that's duplicated. The expression object_name.split("/")[-1] is repeated in both functions at lines 75 and 110. This could be extracted into a helper function or variable to improve maintainability and ensure consistency if the logic needs to change. Also, this approach doesn't handle edge cases like Windows-style paths or objects without a filename component.

Collaborator Author


@copilot open a new pull request to apply changes based on this feedback

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 298c6fd and 5086552.

📒 Files selected for processing (3)
  • .gitignore (1 hunks)
  • RAGManager/app/services/pdf_processor.py (1 hunks)
  • RAGManager/pyproject.toml (0 hunks)
💤 Files with no reviewable changes (1)
  • RAGManager/pyproject.toml
🧰 Additional context used
🧬 Code graph analysis (1)
RAGManager/app/services/pdf_processor.py (1)
RAGManager/app/models/document.py (1)
  • Document (10-21)
🔇 Additional comments (6)
.gitignore (1)

1-1: LGTM! Standard security practice.

Adding .env to .gitignore is essential to prevent committing sensitive credentials (MinIO access keys, etc.) to version control.

RAGManager/app/services/pdf_processor.py (5)

1-6: LGTM! Clean imports.

All imports are necessary and properly organized.


18-38: LGTM! Well-structured function signature and defaults.

The function has clear type hints, comprehensive documentation, and proper default parameter handling.


67-80: LGTM! Comprehensive per-page metadata.

The Document objects are well-structured with rich metadata including page numbers, source URL, and total pages. The minio:// protocol in the source field provides a clear identifier for the document origin.


10-16: The MinIO settings are already validated at startup through pydantic's BaseSettings. When RAGManager/app/core/config.py is imported, the Settings() class is instantiated at module level (line 48), which automatically validates all required fields (minio_endpoint, minio_access_key, minio_secret_key) against environment variables. If any required setting is missing, pydantic raises a ValidationError during application startup before get_minio_client() is ever invoked. No additional validation or error handling is needed in the function itself.


98-113: The current code already handles empty PDFs gracefully. If pdf_to_document() returns an empty list (for a 0-page PDF), the result will be an empty page_content string and total_pages: 0, which propagates correctly through the pipeline without errors. LangChain's split_documents() handles empty document lists without raising exceptions. No explicit error handling is required unless your design explicitly rejects empty PDFs, which is not indicated in the codebase.

Comment thread RAGManager/app/services/pdf_processor.py Outdated
Comment on lines +56 to +60
tables = page.extract_tables()
table_text = ""
for table in tables:
    for row in table:
        table_text += " | ".join(str(cell) if cell else "" for cell in row) + "\n"
Contributor

@coderabbitai coderabbitai Bot Dec 13, 2025


🧹 Nitpick | 🔵 Trivial

Consider more robust table formatting.

The current table-to-text conversion is basic and may not preserve table structure well. The simple pipe-delimited format could be ambiguous if cells contain pipe characters.

Consider alternative approaches:

  1. Use CSV-style formatting with proper escaping
  2. Add Markdown table formatting
  3. Include row/column headers if available from the table structure

Example with Markdown tables:

table_text = ""
for table in tables:
    if not table:
        continue
    # Add header separator for Markdown tables
    for i, row in enumerate(table):
        cells = [str(cell).strip() if cell else "" for cell in row]
        table_text += "| " + " | ".join(cells) + " |\n"
        if i == 0 and len(table) > 1:  # Add separator after header
            table_text += "|" + "|".join([" --- " for _ in cells]) + "|\n"
    table_text += "\n"
🤖 Prompt for AI Agents
In RAGManager/app/services/pdf_processor.py around lines 56 to 60, the current
table-to-text conversion concatenates cells with a raw pipe delimiter which
breaks when cells contain pipes and loses header structure; replace the simple
join with a more robust formatter such as CSV-style output with proper escaping
or Markdown table formatting: iterate tables skipping empty ones, normalize cell
strings (trim and replace/escape pipe and newline characters), output rows with
surrounding separators (for Markdown: prefix/suffix "| " and " |", and insert a
header separator row after the first row when appropriate) or use Python's csv
module to generate escaped CSV lines, and ensure a blank line between tables for
readability.
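The Markdown sketch in the review comment still leaves raw pipes ambiguous, the very problem it flags. A variant that actually escapes cell delimiters could look like this (table_to_markdown is a hypothetical helper, not code from the PR):

```python
def table_to_markdown(table: list[list]) -> str:
    """Render one extracted table as Markdown, escaping cell delimiters."""

    def clean(cell) -> str:
        text = "" if cell is None else str(cell)
        # Escape pipes and flatten newlines so a cell can't break row structure
        return text.replace("|", r"\|").replace("\n", " ").strip()

    lines = []
    for i, row in enumerate(table):
        cells = [clean(c) for c in row]
        lines.append("| " + " | ".join(cells) + " |")
        if i == 0 and len(table) > 1:  # separator after the header row
            lines.append("|" + "|".join(" --- " for _ in cells) + "|")
    return "\n".join(lines) + "\n"
```

Downstream chunkers then see one line per table row, which splits cleanly on newline boundaries.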

Collaborator Author


@copilot Does that work for chunking further down the pipeline?

Contributor


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

"object_name": object_name,
"page": page_num,
"total_pages": len(pdf.pages),
"filename": object_name.split("/")[-1],
Contributor


🧹 Nitpick | 🔵 Trivial

Filename extraction could be more robust.

Using object_name.split("/")[-1] works for Unix-style paths but could be more explicit and handle edge cases better.

Consider using pathlib for clearer intent:

+from pathlib import PurePosixPath
+
 # ... in the metadata dict:
-                    "filename": object_name.split("/")[-1],
+                    "filename": PurePosixPath(object_name).name,

This makes the intent clearer and handles edge cases like trailing slashes.

🤖 Prompt for AI Agents
In RAGManager/app/services/pdf_processor.py around line 75, the filename
extraction using object_name.split("/")[-1] is fragile; replace it with a
pathlib-based approach: import pathlib at the top and use
pathlib.PurePosixPath(object_name).name (or pathlib.Path if you prefer OS-native
behavior) to get the filename safely and handle trailing slashes and edge cases,
and ensure the import is added if missing.
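For reference, a quick comparison of the two basename approaches on object keys (a sketch, not code from the PR):

```python
from pathlib import PurePosixPath


def object_filename(object_name: str) -> str:
    # Object keys in MinIO/S3 use forward slashes regardless of host OS,
    # so PurePosixPath is the right flavor here
    return PurePosixPath(object_name).name


# split("/")[-1] and PurePosixPath agree on the common case:
#   "reports/2024/q3.pdf" -> "q3.pdf"
# but diverge on trailing slashes:
#   "reports/2024/".split("/")[-1]      -> "" (empty string)
#   PurePosixPath("reports/2024/").name -> "2024"
```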


Copilot AI commented Dec 13, 2025

@JuanPalbo I've opened a new pull request, #20, to work on those changes. Once the pull request is ready, I'll request review from you.


Copilot AI commented Dec 13, 2025

@JuanPalbo I've opened a new pull request, #21, to work on those changes. Once the pull request is ready, I'll request review from you.


Copilot AI commented Dec 13, 2025

@JuanPalbo I've opened a new pull request, #22, to work on those changes. Once the pull request is ready, I'll request review from you.


Copilot AI commented Dec 13, 2025

@JuanPalbo I've opened a new pull request, #23, to work on those changes. Once the pull request is ready, I'll request review from you.


Copilot AI commented Dec 13, 2025

@JuanPalbo I've opened a new pull request, #24, to work on those changes. Once the pull request is ready, I'll request review from you.


Copilot AI commented Dec 13, 2025

@JuanPalbo I've opened a new pull request, #27, to work on those changes. Once the pull request is ready, I'll request review from you.

- Add error handling for MinIO get_object and read operations, with validation and clear logging for bucket/object issues
- Wrap pdfplumber.open and page.extract_text in try/except to handle corrupted or password-protected PDFs and log errors
- Refactor table extraction: add per-row and per-cell sanitization, robustly handle non-iterable or malformed tables, and log all exceptions with context
- Warn on large PDF files loaded into memory
- Ensure all error cases are logged and do not break pipeline processing
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
RAGManager/app/services/pipeline.py (1)

37-44: Type mismatch: pdf_to_document now returns list[Document], not Document.

This was previously flagged. The pdf_to_document function now returns a list of Documents (one per page), but this code assigns it to document (singular) and passes it directly to document_to_chunks. Depending on how document_to_chunks is implemented, this will either fail or produce incorrect results.

Update the pipeline to handle the list:

         # Stage 1: PDF to Document
         logger.info("Stage 1: Converting PDF to LangChain Document")
-        document = pdf_to_document(object_name)
+        documents = pdf_to_document(object_name)
         logger.info("Stage 1 completed successfully")

         # Stage 2: Document to Chunks
         logger.info(
             f"Stage 2: Splitting document into chunks (size={settings.chunk_size}, overlap={settings.chunk_overlap})"
         )
-        chunks = document_to_chunks(document, settings.chunk_size, settings.chunk_overlap)
+        # Process each page document and combine chunks
+        chunks = []
+        for doc in documents:
+            chunks.extend(document_to_chunks(doc, settings.chunk_size, settings.chunk_overlap))
         logger.info(f"Stage 2 completed successfully. Created {len(chunks)} chunks")

Alternatively, if you want a single document, use pdf_to_single_document(object_name) instead.

RAGManager/app/services/pdf_processor.py (1)

146-146: Filename extraction is duplicated at lines 146 and 183.

This was previously flagged. Consider using pathlib.PurePosixPath(object_name).name or extracting to a helper for consistency and edge case handling.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5086552 and 95646fb.

📒 Files selected for processing (2)
  • RAGManager/app/services/pdf_processor.py (1 hunks)
  • RAGManager/app/services/pipeline.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
RAGManager/app/services/pipeline.py (1)
RAGManager/app/services/pdf_processor.py (1)
  • pdf_to_document (66-153)
RAGManager/app/services/pdf_processor.py (1)
RAGManager/app/models/document.py (1)
  • Document (10-21)
🔇 Additional comments (6)
RAGManager/app/services/pdf_processor.py (5)

1-10: LGTM!

Imports and logger setup are appropriate for the PDF processing functionality.


13-25: LGTM!

Good defensive approach with the fallback chain (str() → repr() → empty string) to handle any cell type safely.
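The fallback chain described above could be sketched like this (a guess at the shape of _sanitize_cell, not the PR's exact code):

```python
def sanitize_cell(cell) -> str:
    """Best-effort string conversion: str() -> repr() -> '' fallback chain."""
    if cell is None:
        return ""
    try:
        return str(cell)
    except Exception:
        # str() can fail on objects with a broken __str__; fall back to repr()
        try:
            return repr(cell)
        except Exception:
            # give up and drop the cell rather than break the whole table
            return ""
```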


28-55: LGTM!

Robust error handling with proper type validation at each level and contextual logging. The graceful degradation approach (log and continue) is appropriate for table extraction.


58-64: LGTM!

Clean factory function for creating MinIO clients from settings.


155-186: LGTM!

Clean composition that delegates to pdf_to_document and aggregates results. The metadata consolidation is appropriate for the single-document use case.

RAGManager/app/services/pipeline.py (1)

13-32: LGTM!

Good semantic improvement renaming minio_url to object_name for consistency with the updated pdf_processor API.

Comment thread RAGManager/app/services/pdf_processor.py Outdated
Copilot AI review requested due to automatic review settings December 13, 2025 15:47
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
RAGManager/app/services/pdf_processor.py (1)

150-150: Consider using pathlib.PurePosixPath for filename extraction.

This was noted in previous reviews. Using PurePosixPath(object_name).name would handle edge cases like trailing slashes more robustly and make the intent clearer.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 95646fb and 5d3b0cd.

📒 Files selected for processing (1)
  • RAGManager/app/services/pdf_processor.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
RAGManager/app/services/pdf_processor.py (1)
RAGManager/app/models/document.py (1)
  • Document (10-21)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Agent
🔇 Additional comments (6)
RAGManager/app/services/pdf_processor.py (6)

1-10: LGTM!

Imports are appropriate and well-organized. Module-level logger initialization follows standard Python practices.


13-25: LGTM!

Robust cell sanitization with appropriate fallbacks. The defensive approach handles edge cases like objects with broken __str__ or __repr__ methods gracefully.


28-55: LGTM!

Excellent defensive implementation with type validation at each level (tables, table, row) and granular exception handling. The logging provides good context for debugging extraction issues.


58-64: LGTM!

Clean factory function for MinIO client creation. Configuration is properly externalized to settings.


66-157: Well-implemented PDF processing with comprehensive error handling.

The implementation properly addresses previous review concerns:

  • Input validation for object_name
  • Consistent ValueError raising for all failure cases
  • Large file warnings
  • Proper resource cleanup with try/finally blocks
  • Per-page error handling with graceful degradation

159-190: LGTM!

Clean delegation pattern that reuses pdf_to_document and aggregates results appropriately. The metadata is correctly summarized for the combined document.

Comment thread RAGManager/app/services/pdf_processor.py
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 3 changed files in this pull request and generated 7 comments.



Comment thread RAGManager/app/services/pdf_processor.py
    response = minio_client.get_object(bucket_name, object_name)
except Exception as e:
    logger.error(f"Failed to get object from MinIO - bucket: '{bucket_name}', object: '{object_name}': {e}")
    raise ValueError(f"Failed to retrieve '{object_name}' from bucket '{bucket_name}': {e}") from e

Copilot AI Dec 13, 2025


The error message includes the exception details from MinIO which might expose internal system information. Consider sanitizing the error message before raising it, especially for production environments. For example, you could log the detailed error but raise a more generic message to the caller.

Suggested change
-raise ValueError(f"Failed to retrieve '{object_name}' from bucket '{bucket_name}': {e}") from e
+raise ValueError(f"Failed to retrieve '{object_name}' from bucket '{bucket_name}'.") from e

Comment on lines +66 to +82
def pdf_to_document(
    object_name: str,
    bucket_name: str | None = None,
    minio_client: Minio | None = None,
) -> list[Document]:
    """
    Load a PDF file from MinIO and return a list of Document objects.
    Each page becomes a separate Document with metadata.

    Args:
        minio_url: URL pointing to the PDF file in MinIO
        object_name: Path/name of the PDF object in the bucket
        bucket_name: Name of the MinIO bucket (defaults to settings.minio_bucket)
        minio_client: Optional MinIO client (creates one if not provided)

    Returns:
        Document: LangChain Document object containing the PDF content
        List of Document objects, one per page
    """

Copilot AI Dec 13, 2025


The return type annotation indicates list[Document] is returned, but the docstring says "return a list of Document objects" without specifying that it's one Document per page. Consider clarifying in the docstring that the function returns exactly one Document object per page of the PDF for better API clarity.

        # Stage 1: PDF to Document
        logger.info("Stage 1: Converting PDF to LangChain Document")
-       document = pdf_to_document(minio_url)
+       document = pdf_to_document(object_name)

Copilot AI Dec 13, 2025


The function pdf_to_document now returns list[Document] (as shown in the new implementation), but this code treats it as a single Document. This will cause a type error when trying to pass a list to document_to_chunks, which expects a single Document. You should either use pdf_to_single_document if you want a single document, or update the pipeline to handle a list of documents.

Comment on lines +98 to +104
except Exception as e:
    logger.error(f"Failed to get object from MinIO - bucket: '{bucket_name}', object: '{object_name}': {e}")
    raise ValueError(f"Failed to retrieve '{object_name}' from bucket '{bucket_name}': {e}") from e

try:
    pdf_bytes = response.read()
    # Optional: warn if file is very large (e.g., > 100MB)

Copilot AI Dec 13, 2025


The entire PDF is loaded into memory with response.read(), and then a warning is issued if the file is larger than 100MB. Consider checking the file size before loading it into memory to avoid potential out-of-memory errors. You can use the Content-Length header from the response object to determine the file size before reading it.

Suggested change
-    except Exception as e:
-        logger.error(f"Failed to get object from MinIO - bucket: '{bucket_name}', object: '{object_name}': {e}")
-        raise ValueError(f"Failed to retrieve '{object_name}' from bucket '{bucket_name}': {e}") from e
-    try:
-        pdf_bytes = response.read()
-        # Optional: warn if file is very large (e.g., > 100MB)
+    except Exception as e:
+        logger.error(f"Failed to get object from MinIO - bucket: '{bucket_name}', object: '{object_name}': {e}")
+        raise ValueError(f"Failed to retrieve '{object_name}' from bucket '{bucket_name}': {e}") from e
+    # Check Content-Length header before reading into memory
+    content_length = None
+    if hasattr(response, "getheader"):
+        content_length = response.getheader("Content-Length")
+    elif hasattr(response, "headers") and "Content-Length" in response.headers:
+        content_length = response.headers["Content-Length"]
+    if content_length is not None:
+        try:
+            file_size_bytes = int(content_length)
+            file_size_mb = file_size_bytes / (1024 * 1024)
+            if file_size_mb > 100:
+                logger.warning(f"Large PDF (Content-Length): {file_size_mb:.1f} MB for '{object_name}' - loading into memory")
+        except Exception as e:
+            logger.warning(f"Could not parse Content-Length header for '{object_name}': {e}")
+    else:
+        logger.info(f"Content-Length header not found for '{object_name}', proceeding to read file into memory")
+    try:
+        pdf_bytes = response.read()
+        # Optional: warn if file is very large (e.g., > 100MB) (fallback if Content-Length was not available)

"object_name": object_name,
"page": page_num,
"total_pages": len(pdf.pages),
"filename": object_name.split("/")[-1],

Copilot AI Dec 13, 2025


The filename extraction using object_name.split("/")[-1] is duplicated in both pdf_to_document and pdf_to_single_document functions. Consider extracting this into a helper function or variable to avoid duplication and make the code more maintainable.

Comment thread RAGManager/app/services/pdf_processor.py
coderabbitai[bot]
coderabbitai Bot previously requested changes Dec 13, 2025
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (5)
RAGManager/app/services/pdf_processor.py (5)

13-25: Simplify _sanitize_cell() (redundant type check) and consider newline/pipe normalization.
Right now isinstance(...) is redundant since you already str(cell) in the try anyway; also cells containing \n or | will degrade downstream table formatting/chunking.

 def _sanitize_cell(cell) -> str:
     """Safely convert a cell value to string."""
     if cell is None:
         return ""
-    if isinstance(cell, (str, int, float, bool)):
-        return str(cell)
     try:
-        return str(cell)
+        s = str(cell)
     except Exception:
         try:
-            return repr(cell)
+            s = repr(cell)
         except Exception:
             return ""
+    # Keep tables one-line-per-row and avoid delimiter collisions.
+    return s.replace("\n", " ").replace("|", "\\|").strip()

28-55: Table serialization is ambiguous; escape delimiters or use CSV/Markdown formatting.
" | ".join(...) breaks if cell text contains | and loses structure; prefer a deliberate format (CSV writer, Markdown table with escaping) so downstream chunking/search isn’t polluted.


94-114: Don’t include raw MinIO exception text in raised ValueError (potential info leak); log details instead.
If this bubbles to an API boundary, : {e} can expose internal bucket/object naming, endpoint details, or server error metadata. Keep detailed info in logs and raise a stable/sanitized message.

     try:
         response = minio_client.get_object(bucket_name, object_name)
     except Exception as e:
-        logger.error(f"Failed to get object from MinIO - bucket: '{bucket_name}', object: '{object_name}': {e}")
-        raise ValueError(f"Failed to retrieve '{object_name}' from bucket '{bucket_name}': {e}") from e
+        logger.exception(
+            "Failed to get object from MinIO - bucket=%r object=%r",
+            bucket_name,
+            object_name,
+        )
+        raise ValueError(f"Failed to retrieve '{object_name}' from bucket '{bucket_name}'.") from e
@@
     except Exception as e:
-        logger.error(f"Failed to read PDF content from MinIO - bucket: '{bucket_name}', object: '{object_name}': {e}")
-        raise ValueError(f"Failed to read content of '{object_name}' from bucket '{bucket_name}': {e}") from e
+        logger.exception(
+            "Failed to read PDF content from MinIO - bucket=%r object=%r",
+            bucket_name,
+            object_name,
+        )
+        raise ValueError(f"Failed to read content of '{object_name}' from bucket '{bucket_name}'.") from e

102-108: Large PDFs: warn-after-read is too late; check size before reading (or enforce a limit).
You only know size after loading into memory; prefer checking stat/headers first or explicitly enforcing a max size to avoid OOM.


147-156: Avoid duplicated/fragile filename extraction (split("/")[-1]); use PurePosixPath + helper.
This is duplicated across both functions and mishandles trailing slashes; a helper also centralizes future changes.

+from pathlib import PurePosixPath
@@
+def _filename_from_object_name(object_name: str) -> str:
+    return PurePosixPath(object_name).name
@@
                     "total_pages": len(pdf.pages),
-                    "filename": object_name.split("/")[-1],
+                    "filename": _filename_from_object_name(object_name),
@@
             "object_name": object_name,
-            "filename": object_name.split("/")[-1],
+            "filename": _filename_from_object_name(object_name),
             "total_pages": len(documents),
         },

Also applies to: 186-195
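A quick check of why `PurePosixPath.name` is the safer choice for object keys, including the trailing-slash case the comment mentions:

```python
# Comparing PurePosixPath.name with split("/")[-1] on MinIO-style keys.
from pathlib import PurePosixPath

print(PurePosixPath("reports/2025/q4.pdf").name)  # q4.pdf
print(PurePosixPath("reports/2025/").name)        # 2025 (trailing slash handled)
print("reports/2025/".split("/")[-1] == "")       # True (split yields an empty string)
```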

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5d3b0cd and ba32b9f.

📒 Files selected for processing (1)
  • RAGManager/app/services/pdf_processor.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
RAGManager/app/services/pdf_processor.py (1)
RAGManager/app/models/document.py (1)
  • Document (10-21)

Comment thread RAGManager/app/services/pdf_processor.py
Copilot AI review requested due to automatic review settings December 13, 2025 16:04
@JuanPalbo JuanPalbo dismissed coderabbitai[bot]’s stale review December 13, 2025 16:06

The CodeRabbit review rate limit was exceeded

@JPAmorin
Collaborator

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented Dec 13, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@JPAmorin
Collaborator

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented Dec 13, 2025

✅ Actions performed

Review triggered.


Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.


    return documents

def pdf_to_single_document(
Member


is this being used?

@JPAmorin
Collaborator

JPAmorin commented Dec 14, 2025

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented Dec 14, 2025

✅ Actions performed

Review triggered.


Collaborator

@Germanadjemian Germanadjemian left a comment


Review the AI-suggested changes; I think there are parts that are not used.

@brunoalbin23 brunoalbin23 merged commit e7de51f into main Dec 14, 2025
6 of 7 checks passed
@brunoalbin23 brunoalbin23 deleted the feature/pdf-processor branch December 14, 2025 17:31