
pdf-processor#19

Merged
brunoalbin23 merged 6 commits into main from feature/pdf-processor
Dec 14, 2025

Conversation

@JuanPalbo
Collaborator

No description provided.

Copilot AI review requested due to automatic review settings December 13, 2025 14:01
@coderabbitai
Contributor

coderabbitai Bot commented Dec 13, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Integrated cloud object storage for PDF document processing with dual extraction modes: individual pages or complete documents.
    • Implemented robust table extraction from PDFs with enhanced error handling and detailed processing logs.
    • Added comprehensive document metadata tracking including source and page information.
  • Chores

    • Improved environment file configuration handling.

✏️ Tip: You can customize this high-level summary in your review settings.

Walkthrough

Adds MinIO-backed PDF retrieval and per-page parsing (text and tables) with sanitization, metadata, error handling, and two public API functions; renames pipeline parameter from minio_url to object_name; adds .env to .gitignore.

Changes

  • Configuration (.gitignore): Added .env to ignored files.
  • PDF processing (RAGManager/app/services/pdf_processor.py): Implemented MinIO client helper get_minio_client(); added pdf_to_document(object_name, bucket_name=None, minio_client=None) returning a per-page Document list and pdf_to_single_document(...) returning a single concatenated Document; added _sanitize_cell() and _extract_tables_safely(); performs in-memory MinIO download, pdfplumber parsing, per-page text/table extraction, metadata enrichment, and robust error handling/logging.
  • Pipeline (RAGManager/app/services/pipeline.py): Renamed entry parameter and references from minio_url to object_name; updated docstring and call site to use pdf_to_document(object_name).

Sequence Diagram

sequenceDiagram
    participant Pipeline as Pipeline
    participant Processor as PDF Processor
    participant MinIO as MinIO Storage
    participant PDFReader as pdfplumber

    Pipeline->>Processor: process_pdf_pipeline(object_name)
    Processor->>Processor: get_minio_client()
    Processor->>MinIO: GET object (object_name, bucket)
    MinIO-->>Processor: return PDF bytes
    Processor->>PDFReader: open PDF bytes
    PDFReader-->>Processor: pages
    loop per page
        Processor->>Processor: extract text
        Processor->>Processor: _extract_tables_safely()
        Processor->>Processor: _sanitize_cell() for table cells
        Processor->>Processor: assemble Document with metadata
    end
    Processor-->>Pipeline: return list[Document] or single Document

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Focus review on RAGManager/app/services/pdf_processor.py (MinIO client configuration, in-memory byte handling, pdfplumber usage, table-extraction/error handling, metadata fields).
  • Confirm caller updates for process_pdf_pipeline parameter rename in RAGManager/app/services/pipeline.py and any other call sites.
  • Inspect logging messages and exception handling paths for clarity and consistency.

Poem

🐰 I hopped through bytes with whiskers bright,

Pulled pages from a bucket late at night,
Tidied cells and stitched each line with care,
Tagged every page with where and when and where,
A rabbit's nod — the parser's set to flight.

Pre-merge checks and finishing touches

❌ Failed checks (2 inconclusive)
  • Title check — ❓ Inconclusive. The title 'pdf-processor' is vague and generic, failing to convey meaningful information about the specific changes made in the changeset. Resolution: replace it with a more descriptive title, such as 'Implement MinIO-backed PDF processing with per-page extraction' or 'Add PDF processor service with MinIO integration'.
  • Description check — ❓ Inconclusive. No pull request description was provided by the author, making it impossible to assess whether the description relates to the changeset. Resolution: add a pull request description explaining the changes, such as the implementation of MinIO-backed PDF processing, per-page document extraction, and pipeline parameter updates.
✅ Passed checks (1 passed)
  • Docstring Coverage — ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feature/pdf-processor

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ba32b9f and e0619c1.

📒 Files selected for processing (1)
  • RAGManager/app/services/pdf_processor.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
RAGManager/app/services/pdf_processor.py (1)
RAGManager/app/models/document.py (1)
  • Document (10-21)
🔇 Additional comments (6)
RAGManager/app/services/pdf_processor.py (6)

1-13: LGTM!

Imports are well-organized and appropriate for the functionality. The use of langchain_core.documents.Document (for RAG/vector store operations) is correctly distinguished from app.models.document.Document (ORM model for database storage).


31-58: LGTM!

Robust error handling with type validation at each level (tables, table, row). The defensive approach with per-row exception handling ensures that one malformed row won't break extraction of other tables.


61-88: Potential conflict between cert_reqs and secure=False.

When settings.minio_secure is False (HTTP connection), the cert_reqs="CERT_REQUIRED" setting on the PoolManager may cause issues or be misleading. Consider making cert requirements conditional on the secure setting.

     # Create PoolManager with timeout, retry, and CA bundle
     http_client = urllib3.PoolManager(
         timeout=timeout,
         retries=retry,
         maxsize=10,
-        cert_reqs="CERT_REQUIRED",
-        ca_certs=certifi.where(),
+        cert_reqs="CERT_REQUIRED" if settings.minio_secure else "CERT_NONE",
+        ca_certs=certifi.where() if settings.minio_secure else None,
     )

Verify your MinIO configuration to ensure this doesn't cause connection issues in development/test environments with secure=False.


150-155: Good handling of empty PDFs.

The check for empty PDFs with a clear error message is well-implemented. This addresses the edge case previously flagged.


188-219: LGTM!

The pdf_to_single_document function is a clean wrapper that properly delegates to pdf_to_document and combines the results. All parameters including minio_client are correctly forwarded.


204-206: Verified: minio_client is forwarded to pdf_to_document.

An initial reading suggested the minio_client parameter was accepted but never forwarded; checking the call site shows all three parameters are passed positionally:

documents = pdf_to_document(object_name, bucket_name, minio_client)

This is correct, so no change is needed.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

Copilot AI left a comment


Pull request overview

This PR implements PDF processing functionality to convert PDF files from MinIO storage into LangChain Document objects. The implementation resolves merge conflicts in dependencies and replaces a placeholder function with a fully functional PDF processor using pdfplumber for text and table extraction.

  • Implements two PDF processing functions: pdf_to_document() for page-by-page extraction and pdf_to_single_document() for combined content
  • Adds MinIO client integration with configuration from environment settings
  • Resolves merge conflicts in pyproject.toml dependencies

Reviewed changes

Copilot reviewed 1 out of 2 changed files in this pull request and generated 5 comments.

  • RAGManager/pyproject.toml: Resolves merge conflict markers in the dependencies section
  • RAGManager/app/services/pdf_processor.py: Implements complete PDF processing with MinIO integration, text/table extraction, and document generation with metadata
  • .gitignore: Adds .env file to ignore list for environment variable protection


Comment thread RAGManager/app/services/pdf_processor.py
Comment on lines +43 to +48
response = minio_client.get_object(bucket_name, object_name)
try:
    pdf_bytes = response.read()
finally:
    response.close()
    response.release_conn()

Copilot AI Dec 13, 2025


Missing error handling for MinIO operations. If the object doesn't exist, bucket doesn't exist, or there are authentication/permission issues, the get_object call will raise an exception that isn't caught here. Consider adding try-except blocks to provide more informative error messages for common failure scenarios (e.g., S3Error for missing objects or permission denied).
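The wrapping the comment asks for could look roughly like this; download_object is a hypothetical helper (not code from the PR), and real code would catch minio.error.S3Error specifically rather than a bare Exception:

```python
import logging

logger = logging.getLogger(__name__)


def download_object(minio_client, bucket_name: str, object_name: str) -> bytes:
    """Fetch an object's bytes, surfacing a clear error on failure (illustrative)."""
    try:
        response = minio_client.get_object(bucket_name, object_name)
    except Exception as e:  # real code: catch minio.error.S3Error (missing object, permissions)
        logger.error("Failed to get '%s' from bucket '%s': %s", object_name, bucket_name, e)
        raise ValueError(f"Failed to retrieve '{object_name}' from bucket '{bucket_name}'") from e
    try:
        return response.read()
    finally:
        # MinIO responses must be closed and the connection released
        response.close()
        response.release_conn()
```

The ValueError keeps MinIO-specific failure details in the logs while giving callers one predictable exception type to handle.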

Collaborator Author


@copilot open a new pull request to apply changes based on this feedback

response.release_conn()

# Open PDF from bytes using pdfplumber
with pdfplumber.open(io.BytesIO(pdf_bytes)) as pdf:

Copilot AI Dec 13, 2025


Missing validation for PDF file integrity. The code directly opens the bytes as a PDF without verifying it's actually a valid PDF file. Malformed or non-PDF data could cause pdfplumber to raise exceptions or behave unexpectedly. Consider adding validation or wrapping the pdfplumber.open call in a try-except block to catch and handle PDFSyntaxError or similar exceptions with a clear error message.
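A cheap pre-check is possible before handing the bytes to pdfplumber. This sketch (looks_like_pdf is a hypothetical helper) only tests for the PDF header magic; parser exceptions such as pdfminer's PDFSyntaxError would still need to be caught for files that pass the header check but are otherwise corrupt:

```python
def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check: a PDF begins with the '%PDF-' magic bytes.

    Some real-world files prepend a few junk bytes before the header,
    so we scan the first 1 KB rather than requiring the magic at offset 0.
    """
    return b"%PDF-" in data[:1024]
```

In pdf_to_document, failing this check could raise a ValueError with a clear message before pdfplumber.open is even attempted.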

Collaborator Author


@copilot open a new pull request to apply changes based on this feedback

# Download the PDF from MinIO into memory
response = minio_client.get_object(bucket_name, object_name)
try:
    pdf_bytes = response.read()

Copilot AI Dec 13, 2025


Loading entire PDF into memory could cause issues with large files. For very large PDFs (e.g., hundreds of megabytes), calling response.read() will load the entire file into memory at once, which could lead to memory issues or performance degradation. Consider processing the PDF in a streaming fashion if large files are expected, or add file size limits with appropriate error handling.
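One way to enforce such a limit is to check the object's size before reading it into memory. The cap and helper below are illustrative, not PR code; the MinIO Python SDK's stat_object returns metadata including a size attribute, so the check can happen before any download:

```python
MAX_PDF_BYTES = 100 * 1024 * 1024  # illustrative 100 MB cap, not a value from the PR


def check_pdf_size(size_bytes: int, max_bytes: int = MAX_PDF_BYTES) -> None:
    """Raise if a PDF would exceed the in-memory size budget."""
    if size_bytes > max_bytes:
        mb = size_bytes / (1024 * 1024)
        limit_mb = max_bytes / (1024 * 1024)
        raise ValueError(f"PDF is {mb:.1f} MB, exceeding the {limit_mb:.0f} MB limit")


# Intended call site (sketch): size comes from stat_object before get_object
# stat = minio_client.stat_object(bucket_name, object_name)
# check_pdf_size(stat.size)
```

Rejecting oversized objects up front avoids the partial-download-then-fail path entirely.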

Collaborator Author


@copilot open a new pull request to apply changes based on this feedback

"object_name": object_name,
"page": page_num,
"total_pages": len(pdf.pages),
"filename": object_name.split("/")[-1],

Copilot AI Dec 13, 2025


Inconsistent filename extraction logic that's duplicated. The expression object_name.split("/")[-1] is repeated in both functions at lines 75 and 110. This could be extracted into a helper function or variable to improve maintainability and ensure consistency if the logic needs to change. Also, this approach doesn't handle edge cases like Windows-style paths or objects without a filename component.

Collaborator Author


@copilot open a new pull request to apply changes based on this feedback

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 298c6fd and 5086552.

📒 Files selected for processing (3)
  • .gitignore (1 hunks)
  • RAGManager/app/services/pdf_processor.py (1 hunks)
  • RAGManager/pyproject.toml (0 hunks)
💤 Files with no reviewable changes (1)
  • RAGManager/pyproject.toml
🧰 Additional context used
🧬 Code graph analysis (1)
RAGManager/app/services/pdf_processor.py (1)
RAGManager/app/models/document.py (1)
  • Document (10-21)
🔇 Additional comments (6)
.gitignore (1)

1-1: LGTM! Standard security practice.

Adding .env to .gitignore is essential to prevent committing sensitive credentials (MinIO access keys, etc.) to version control.

RAGManager/app/services/pdf_processor.py (5)

1-6: LGTM! Clean imports.

All imports are necessary and properly organized.


18-38: LGTM! Well-structured function signature and defaults.

The function has clear type hints, comprehensive documentation, and proper default parameter handling.


67-80: LGTM! Comprehensive per-page metadata.

The Document objects are well-structured with rich metadata including page numbers, source URL, and total pages. The minio:// protocol in the source field provides a clear identifier for the document origin.


10-16: The MinIO settings are already validated at startup through pydantic's BaseSettings. When RAGManager/app/core/config.py is imported, the Settings() class is instantiated at module level (line 48), which automatically validates all required fields (minio_endpoint, minio_access_key, minio_secret_key) against environment variables. If any required setting is missing, pydantic raises a ValidationError during application startup before get_minio_client() is ever invoked. No additional validation or error handling is needed in the function itself.


98-113: The current code already handles empty PDFs gracefully. If pdf_to_document() returns an empty list (for a 0-page PDF), the result will be an empty page_content string and total_pages: 0, which propagates correctly through the pipeline without errors. LangChain's split_documents() handles empty document lists without raising exceptions. No explicit error handling is required unless your design explicitly rejects empty PDFs, which is not indicated in the codebase.

Comment thread RAGManager/app/services/pdf_processor.py Outdated
Comment on lines +56 to +60
tables = page.extract_tables()
table_text = ""
for table in tables:
    for row in table:
        table_text += " | ".join(str(cell) if cell else "" for cell in row) + "\n"
Contributor

@coderabbitai coderabbitai Bot Dec 13, 2025


🧹 Nitpick | 🔵 Trivial

Consider more robust table formatting.

The current table-to-text conversion is basic and may not preserve table structure well. The simple pipe-delimited format could be ambiguous if cells contain pipe characters.

Consider alternative approaches:

  1. Use CSV-style formatting with proper escaping
  2. Add Markdown table formatting
  3. Include row/column headers if available from the table structure

Example with Markdown tables:

table_text = ""
for table in tables:
    if not table:
        continue
    # Add header separator for Markdown tables
    for i, row in enumerate(table):
        cells = [str(cell).strip() if cell else "" for cell in row]
        table_text += "| " + " | ".join(cells) + " |\n"
        if i == 0 and len(table) > 1:  # Add separator after header
            table_text += "|" + "|".join([" --- " for _ in cells]) + "|\n"
    table_text += "\n"
🤖 Prompt for AI Agents
In RAGManager/app/services/pdf_processor.py around lines 56 to 60, the current
table-to-text conversion concatenates cells with a raw pipe delimiter which
breaks when cells contain pipes and loses header structure; replace the simple
join with a more robust formatter such as CSV-style output with proper escaping
or Markdown table formatting: iterate tables skipping empty ones, normalize cell
strings (trim and replace/escape pipe and newline characters), output rows with
surrounding separators (for Markdown: prefix/suffix "| " and " |", and insert a
header separator row after the first row when appropriate) or use Python's csv
module to generate escaped CSV lines, and ensure a blank line between tables for
readability.
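The Markdown sketch in the review comment still leaves raw pipes ambiguous, the very problem it flags. A variant that actually escapes cell delimiters could look like this (table_to_markdown is a hypothetical helper, not code from the PR):

```python
def table_to_markdown(table: list[list]) -> str:
    """Render one extracted table as Markdown, escaping cell delimiters."""

    def clean(cell) -> str:
        text = "" if cell is None else str(cell)
        # Escape pipes and flatten newlines so a cell can't break row structure
        return text.replace("|", r"\|").replace("\n", " ").strip()

    lines = []
    for i, row in enumerate(table):
        cells = [clean(c) for c in row]
        lines.append("| " + " | ".join(cells) + " |")
        if i == 0 and len(table) > 1:  # separator after the header row
            lines.append("|" + "|".join(" --- " for _ in cells) + "|")
    return "\n".join(lines) + "\n"
```

Downstream chunkers then see one line per table row, which splits cleanly on newline boundaries.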

Collaborator Author


@copilot Does that work for chunking further down the pipeline?

Contributor


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

"object_name": object_name,
"page": page_num,
"total_pages": len(pdf.pages),
"filename": object_name.split("/")[-1],
Contributor


🧹 Nitpick | 🔵 Trivial

Filename extraction could be more robust.

Using object_name.split("/")[-1] works for Unix-style paths but could be more explicit and handle edge cases better.

Consider using pathlib for clearer intent:

+from pathlib import PurePosixPath
+
 # ... in the metadata dict:
-                    "filename": object_name.split("/")[-1],
+                    "filename": PurePosixPath(object_name).name,

This makes the intent clearer and handles edge cases like trailing slashes.

🤖 Prompt for AI Agents
In RAGManager/app/services/pdf_processor.py around line 75, the filename
extraction using object_name.split("/")[-1] is fragile; replace it with a
pathlib-based approach: import pathlib at the top and use
pathlib.PurePosixPath(object_name).name (or pathlib.Path if you prefer OS-native
behavior) to get the filename safely and handle trailing slashes and edge cases,
and ensure the import is added if missing.
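For reference, a quick comparison of the two basename approaches on object keys (a sketch, not code from the PR):

```python
from pathlib import PurePosixPath


def object_filename(object_name: str) -> str:
    # Object keys in MinIO/S3 use forward slashes regardless of host OS,
    # so PurePosixPath is the right flavor here
    return PurePosixPath(object_name).name


# split("/")[-1] and PurePosixPath agree on the common case:
#   "reports/2024/q3.pdf" -> "q3.pdf"
# but diverge on trailing slashes:
#   "reports/2024/".split("/")[-1]      -> "" (empty string)
#   PurePosixPath("reports/2024/").name -> "2024"
```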


Copilot AI commented Dec 13, 2025

@JuanPalbo I've opened a new pull request, #20, to work on those changes. Once the pull request is ready, I'll request review from you.


Copilot AI commented Dec 13, 2025

@JuanPalbo I've opened a new pull request, #21, to work on those changes. Once the pull request is ready, I'll request review from you.


Copilot AI commented Dec 13, 2025

@JuanPalbo I've opened a new pull request, #22, to work on those changes. Once the pull request is ready, I'll request review from you.


Copilot AI commented Dec 13, 2025

@JuanPalbo I've opened a new pull request, #23, to work on those changes. Once the pull request is ready, I'll request review from you.


Copilot AI commented Dec 13, 2025

@JuanPalbo I've opened a new pull request, #24, to work on those changes. Once the pull request is ready, I'll request review from you.


Copilot AI commented Dec 13, 2025

@JuanPalbo I've opened a new pull request, #27, to work on those changes. Once the pull request is ready, I'll request review from you.

- Add error handling for MinIO get_object and read operations, with validation and clear logging for bucket/object issues
- Wrap pdfplumber.open and page.extract_text in try/except to handle corrupted or password-protected PDFs and log errors
- Refactor table extraction: add per-row and per-cell sanitization, robustly handle non-iterable or malformed tables, and log all exceptions with context
- Warn on large PDF files loaded into memory
- Ensure all error cases are logged and do not break pipeline processing
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
RAGManager/app/services/pipeline.py (1)

37-44: Type mismatch: pdf_to_document now returns list[Document], not Document.

This was previously flagged. The pdf_to_document function now returns a list of Documents (one per page), but this code assigns it to document (singular) and passes it directly to document_to_chunks. Depending on how document_to_chunks is implemented, this will either fail or produce incorrect results.

Update the pipeline to handle the list:

         # Stage 1: PDF to Document
         logger.info("Stage 1: Converting PDF to LangChain Document")
-        document = pdf_to_document(object_name)
+        documents = pdf_to_document(object_name)
         logger.info("Stage 1 completed successfully")

         # Stage 2: Document to Chunks
         logger.info(
             f"Stage 2: Splitting document into chunks (size={settings.chunk_size}, overlap={settings.chunk_overlap})"
         )
-        chunks = document_to_chunks(document, settings.chunk_size, settings.chunk_overlap)
+        # Process each page document and combine chunks
+        chunks = []
+        for doc in documents:
+            chunks.extend(document_to_chunks(doc, settings.chunk_size, settings.chunk_overlap))
         logger.info(f"Stage 2 completed successfully. Created {len(chunks)} chunks")

Alternatively, if you want a single document, use pdf_to_single_document(object_name) instead.

RAGManager/app/services/pdf_processor.py (1)

146-146: Filename extraction is duplicated at lines 146 and 183.

This was previously flagged. Consider using pathlib.PurePosixPath(object_name).name or extracting to a helper for consistency and edge case handling.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5086552 and 95646fb.

📒 Files selected for processing (2)
  • RAGManager/app/services/pdf_processor.py (1 hunks)
  • RAGManager/app/services/pipeline.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
RAGManager/app/services/pipeline.py (1)
RAGManager/app/services/pdf_processor.py (1)
  • pdf_to_document (66-153)
RAGManager/app/services/pdf_processor.py (1)
RAGManager/app/models/document.py (1)
  • Document (10-21)
🔇 Additional comments (6)
RAGManager/app/services/pdf_processor.py (5)

1-10: LGTM!

Imports and logger setup are appropriate for the PDF processing functionality.


13-25: LGTM!

Good defensive approach with the fallback chain (str() → repr() → empty string) to handle any cell type safely.
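The fallback chain described above could be sketched like this (a guess at the shape of _sanitize_cell, not the PR's exact code):

```python
def sanitize_cell(cell) -> str:
    """Best-effort string conversion: str() -> repr() -> '' fallback chain."""
    if cell is None:
        return ""
    try:
        return str(cell)
    except Exception:
        # str() can fail on objects with a broken __str__; fall back to repr()
        try:
            return repr(cell)
        except Exception:
            # give up and drop the cell rather than break the whole table
            return ""
```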


28-55: LGTM!

Robust error handling with proper type validation at each level and contextual logging. The graceful degradation approach (log and continue) is appropriate for table extraction.


58-64: LGTM!

Clean factory function for creating MinIO clients from settings.


155-186: LGTM!

Clean composition that delegates to pdf_to_document and aggregates results. The metadata consolidation is appropriate for the single-document use case.

RAGManager/app/services/pipeline.py (1)

13-32: LGTM!

Good semantic improvement renaming minio_url to object_name for consistency with the updated pdf_processor API.

Comment thread RAGManager/app/services/pdf_processor.py Outdated
Copilot AI review requested due to automatic review settings December 13, 2025 15:47
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
RAGManager/app/services/pdf_processor.py (1)

150-150: Consider using pathlib.PurePosixPath for filename extraction.

This was noted in previous reviews. Using PurePosixPath(object_name).name would handle edge cases like trailing slashes more robustly and make the intent clearer.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 95646fb and 5d3b0cd.

📒 Files selected for processing (1)
  • RAGManager/app/services/pdf_processor.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
RAGManager/app/services/pdf_processor.py (1)
RAGManager/app/models/document.py (1)
  • Document (10-21)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Agent
🔇 Additional comments (6)
RAGManager/app/services/pdf_processor.py (6)

1-10: LGTM!

Imports are appropriate and well-organized. Module-level logger initialization follows standard Python practices.


13-25: LGTM!

Robust cell sanitization with appropriate fallbacks. The defensive approach handles edge cases like objects with broken __str__ or __repr__ methods gracefully.


28-55: LGTM!

Excellent defensive implementation with type validation at each level (tables, table, row) and granular exception handling. The logging provides good context for debugging extraction issues.


58-64: LGTM!

Clean factory function for MinIO client creation. Configuration is properly externalized to settings.


66-157: Well-implemented PDF processing with comprehensive error handling.

The implementation properly addresses previous review concerns:

  • Input validation for object_name
  • Consistent ValueError raising for all failure cases
  • Large file warnings
  • Proper resource cleanup with try/finally blocks
  • Per-page error handling with graceful degradation

159-190: LGTM!

Clean delegation pattern that reuses pdf_to_document and aggregates results appropriately. The metadata is correctly summarized for the combined document.

Comment thread RAGManager/app/services/pdf_processor.py
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 3 changed files in this pull request and generated 7 comments.



Comment thread RAGManager/app/services/pdf_processor.py
    response = minio_client.get_object(bucket_name, object_name)
except Exception as e:
    logger.error(f"Failed to get object from MinIO - bucket: '{bucket_name}', object: '{object_name}': {e}")
    raise ValueError(f"Failed to retrieve '{object_name}' from bucket '{bucket_name}': {e}") from e

Copilot AI Dec 13, 2025


The error message includes the exception details from MinIO which might expose internal system information. Consider sanitizing the error message before raising it, especially for production environments. For example, you could log the detailed error but raise a more generic message to the caller.

Suggested change
-raise ValueError(f"Failed to retrieve '{object_name}' from bucket '{bucket_name}': {e}") from e
+raise ValueError(f"Failed to retrieve '{object_name}' from bucket '{bucket_name}'.") from e

Comment on lines +66 to +82
def pdf_to_document(
    object_name: str,
    bucket_name: str | None = None,
    minio_client: Minio | None = None,
) -> list[Document]:
    """
    Load a PDF file from MinIO and return a list of Document objects.
    Each page becomes a separate Document with metadata.

    Args:
        minio_url: URL pointing to the PDF file in MinIO
        object_name: Path/name of the PDF object in the bucket
        bucket_name: Name of the MinIO bucket (defaults to settings.minio_bucket)
        minio_client: Optional MinIO client (creates one if not provided)

    Returns:
        Document: LangChain Document object containing the PDF content
        List of Document objects, one per page
    """

Copilot AI Dec 13, 2025


The return type annotation indicates list[Document] is returned, but the docstring says "return a list of Document objects" without specifying that it's one Document per page. Consider clarifying in the docstring that the function returns exactly one Document object per page of the PDF for better API clarity.

        # Stage 1: PDF to Document
        logger.info("Stage 1: Converting PDF to LangChain Document")
-       document = pdf_to_document(minio_url)
+       document = pdf_to_document(object_name)

Copilot AI Dec 13, 2025


The function pdf_to_document now returns list[Document] (as shown in the new implementation), but this code treats it as a single Document. This will cause a type error when trying to pass a list to document_to_chunks, which expects a single Document. You should either use pdf_to_single_document if you want a single document, or update the pipeline to handle a list of documents.

Comment on lines +98 to +104
except Exception as e:
    logger.error(f"Failed to get object from MinIO - bucket: '{bucket_name}', object: '{object_name}': {e}")
    raise ValueError(f"Failed to retrieve '{object_name}' from bucket '{bucket_name}': {e}") from e

try:
    pdf_bytes = response.read()
    # Optional: warn if file is very large (e.g., > 100MB)

Copilot AI Dec 13, 2025


The entire PDF is loaded into memory with response.read(), and then a warning is issued if the file is larger than 100MB. Consider checking the file size before loading it into memory to avoid potential out-of-memory errors. You can use the Content-Length header from the response object to determine the file size before reading it.

Suggested change
-    except Exception as e:
-        logger.error(f"Failed to get object from MinIO - bucket: '{bucket_name}', object: '{object_name}': {e}")
-        raise ValueError(f"Failed to retrieve '{object_name}' from bucket '{bucket_name}': {e}") from e
-    try:
-        pdf_bytes = response.read()
-        # Optional: warn if file is very large (e.g., > 100MB)
+    except Exception as e:
+        logger.error(f"Failed to get object from MinIO - bucket: '{bucket_name}', object: '{object_name}': {e}")
+        raise ValueError(f"Failed to retrieve '{object_name}' from bucket '{bucket_name}': {e}") from e
+    # Check Content-Length header before reading into memory
+    content_length = None
+    if hasattr(response, "getheader"):
+        content_length = response.getheader("Content-Length")
+    elif hasattr(response, "headers") and "Content-Length" in response.headers:
+        content_length = response.headers["Content-Length"]
+    if content_length is not None:
+        try:
+            file_size_bytes = int(content_length)
+            file_size_mb = file_size_bytes / (1024 * 1024)
+            if file_size_mb > 100:
+                logger.warning(f"Large PDF (Content-Length): {file_size_mb:.1f} MB for '{object_name}' - loading into memory")
+        except Exception as e:
+            logger.warning(f"Could not parse Content-Length header for '{object_name}': {e}")
+    else:
+        logger.info(f"Content-Length header not found for '{object_name}', proceeding to read file into memory")
+    try:
+        pdf_bytes = response.read()
+        # Optional: warn if file is very large (e.g., > 100MB) (fallback if Content-Length was not available)

"object_name": object_name,
"page": page_num,
"total_pages": len(pdf.pages),
"filename": object_name.split("/")[-1],

Copilot AI Dec 13, 2025


The filename extraction using object_name.split("/")[-1] is duplicated in both pdf_to_document and pdf_to_single_document functions. Consider extracting this into a helper function or variable to avoid duplication and make the code more maintainable.

Comment thread RAGManager/app/services/pdf_processor.py
coderabbitai[bot]
coderabbitai Bot previously requested changes Dec 13, 2025
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (5)
RAGManager/app/services/pdf_processor.py (5)

13-25: Simplify _sanitize_cell() (redundant type check) and consider newline/pipe normalization.
Right now isinstance(...) is redundant since you already str(cell) in the try anyway; also cells containing \n or | will degrade downstream table formatting/chunking.

 def _sanitize_cell(cell) -> str:
     """Safely convert a cell value to string."""
     if cell is None:
         return ""
-    if isinstance(cell, (str, int, float, bool)):
-        return str(cell)
     try:
-        return str(cell)
+        s = str(cell)
     except Exception:
         try:
-            return repr(cell)
+            s = repr(cell)
         except Exception:
             return ""
+    # Keep tables one-line-per-row and avoid delimiter collisions.
+    return s.replace("\n", " ").replace("|", "\\|").strip()

28-55: Table serialization is ambiguous; escape delimiters or use CSV/Markdown formatting.
" | ".join(...) breaks if cell text contains | and loses structure; prefer a deliberate format (CSV writer, Markdown table with escaping) so downstream chunking/search isn’t polluted.


94-114: Don’t include raw MinIO exception text in raised ValueError (potential info leak); log details instead.
If this bubbles to an API boundary, : {e} can expose internal bucket/object naming, endpoint details, or server error metadata. Keep detailed info in logs and raise a stable/sanitized message.

     try:
         response = minio_client.get_object(bucket_name, object_name)
     except Exception as e:
-        logger.error(f"Failed to get object from MinIO - bucket: '{bucket_name}', object: '{object_name}': {e}")
-        raise ValueError(f"Failed to retrieve '{object_name}' from bucket '{bucket_name}': {e}") from e
+        logger.exception(
+            "Failed to get object from MinIO - bucket=%r object=%r",
+            bucket_name,
+            object_name,
+        )
+        raise ValueError(f"Failed to retrieve '{object_name}' from bucket '{bucket_name}'.") from e
@@
     except Exception as e:
-        logger.error(f"Failed to read PDF content from MinIO - bucket: '{bucket_name}', object: '{object_name}': {e}")
-        raise ValueError(f"Failed to read content of '{object_name}' from bucket '{bucket_name}': {e}") from e
+        logger.exception(
+            "Failed to read PDF content from MinIO - bucket=%r object=%r",
+            bucket_name,
+            object_name,
+        )
+        raise ValueError(f"Failed to read content of '{object_name}' from bucket '{bucket_name}'.") from e

102-108: Large PDFs: warn-after-read is too late; check size before reading (or enforce a limit).
You only know size after loading into memory; prefer checking stat/headers first or explicitly enforcing a max size to avoid OOM.


147-156: Avoid duplicated/fragile filename extraction (split("/")[-1]); use PurePosixPath + helper.
This is duplicated across both functions and mishandles trailing slashes; a helper also centralizes future changes.

+from pathlib import PurePosixPath
@@
+def _filename_from_object_name(object_name: str) -> str:
+    return PurePosixPath(object_name).name
@@
                     "total_pages": len(pdf.pages),
-                    "filename": object_name.split("/")[-1],
+                    "filename": _filename_from_object_name(object_name),
@@
             "object_name": object_name,
-            "filename": object_name.split("/")[-1],
+            "filename": _filename_from_object_name(object_name),
             "total_pages": len(documents),
         },

Also applies to: 186-195
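A quick check of why `PurePosixPath.name` is the safer choice for object keys, including the trailing-slash case the comment mentions:

```python
# Comparing PurePosixPath.name with split("/")[-1] on MinIO-style keys.
from pathlib import PurePosixPath

print(PurePosixPath("reports/2025/q4.pdf").name)  # q4.pdf
print(PurePosixPath("reports/2025/").name)        # 2025 (trailing slash handled)
print("reports/2025/".split("/")[-1] == "")       # True (split yields an empty string)
```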

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5d3b0cd and ba32b9f.

📒 Files selected for processing (1)
  • RAGManager/app/services/pdf_processor.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
RAGManager/app/services/pdf_processor.py (1)
RAGManager/app/models/document.py (1)
  • Document (10-21)

Comment thread RAGManager/app/services/pdf_processor.py
Copilot AI review requested due to automatic review settings December 13, 2025 16:04
@JuanPalbo JuanPalbo dismissed coderabbitai[bot]’s stale review December 13, 2025 16:06

The CodeRabbit review rate limit was exceeded

@JPAmorin
Collaborator

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented Dec 13, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@JPAmorin
Collaborator

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented Dec 13, 2025

✅ Actions performed

Review triggered.


Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.


    return documents

def pdf_to_single_document(
Member


is this being used?

@JPAmorin
Collaborator

JPAmorin commented Dec 14, 2025

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented Dec 14, 2025

✅ Actions performed

Review triggered.


Collaborator

@Germanadjemian Germanadjemian left a comment


Review the AI-suggested changes; I think there are parts that are not used.

@brunoalbin23 brunoalbin23 merged commit e7de51f into main Dec 14, 2025
6 of 7 checks passed
@brunoalbin23 brunoalbin23 deleted the feature/pdf-processor branch December 14, 2025 17:31