Merged · Changes from 2 commits
1 change: 1 addition & 0 deletions .gitignore

@@ -0,0 +1 @@
+.env

115 changes: 103 additions & 12 deletions RAGManager/app/services/pdf_processor.py

@@ -1,23 +1,114 @@
+import io
+import pdfplumber
 from langchain_core.documents import Document
+from minio import Minio
+
+from app.core.config import settings


-def pdf_to_document(minio_url: str) -> Document:
-    """
-    Placeholder function - to be implemented later.
-
-    This function will:
-    1. Download the PDF file from MinIO using the provided URL
-    2. Parse the PDF content
-    3. Convert it to a LangChain Document object
-
-    Args:
-        minio_url: URL pointing to the PDF file in MinIO
-
-    Returns:
-        Document: LangChain Document object containing the PDF content
-
-    Raises:
-        NotImplementedError: This function is not yet implemented
-    """
-    raise NotImplementedError("This function will be implemented later")
+def get_minio_client() -> Minio:
+    return Minio(
+        endpoint=settings.minio_endpoint,
+        access_key=settings.minio_access_key,
+        secret_key=settings.minio_secret_key,
+        secure=settings.minio_secure,
+    )
+
+
+def pdf_to_document(
+    object_name: str,
+    bucket_name: str | None = None,
+    minio_client: Minio | None = None,
+) -> list[Document]:
+    """
+    Load a PDF file from MinIO and return a list of Document objects.
+    Each page becomes a separate Document with metadata.
+
+    Args:
+        object_name: Path/name of the PDF object in the bucket
+        bucket_name: Name of the MinIO bucket (defaults to settings.minio_bucket)
+        minio_client: Optional MinIO client (creates one if not provided)
+
+    Returns:
+        List of Document objects, one per page
+    """
Comment on lines +90 to +106 (Copilot AI, Dec 13, 2025):
The return type annotation indicates list[Document] is returned, but the docstring says "return a list of Document objects" without specifying that it's one Document per page. Consider clarifying in the docstring that the function returns exactly one Document object per page of the PDF for better API clarity.
+    if bucket_name is None:
+        bucket_name = settings.minio_bucket
+    if minio_client is None:
+        minio_client = get_minio_client()
+
+    documents: list[Document] = []
+
+    # Download the PDF from MinIO into memory
+    response = minio_client.get_object(bucket_name, object_name)
+    try:
+        pdf_bytes = response.read()

Comment (Copilot AI, Dec 13, 2025):
Loading entire PDF into memory could cause issues with large files. For very large PDFs (e.g., hundreds of megabytes), calling response.read() will load the entire file into memory at once, which could lead to memory issues or performance degradation. Consider processing the PDF in a streaming fashion if large files are expected, or add file size limits with appropriate error handling.

Reply (Collaborator Author):
@copilot open a new pull request to apply changes based on this feedback

+    finally:
+        response.close()
+        response.release_conn()
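A minimal sketch of the size guard that the review comment above suggests. `max_bytes` and `PdfTooLargeError` are illustrative names, not part of the PR; `stat_object` is MinIO's metadata lookup, and the client parameter is duck-typed so any object exposing the same `stat_object`/`get_object` interface works:

```python
class PdfTooLargeError(ValueError):
    """Raised when the stored object exceeds the configured limit (hypothetical)."""


def read_pdf_bytes(client, bucket_name: str, object_name: str,
                   max_bytes: int = 50 * 1024 * 1024) -> bytes:
    # stat_object is a cheap metadata lookup, so oversized objects can be
    # rejected before any bytes are downloaded; max_bytes is an assumed limit
    info = client.stat_object(bucket_name, object_name)
    if info.size > max_bytes:
        raise PdfTooLargeError(
            f"{object_name} is {info.size} bytes, over the {max_bytes}-byte limit"
        )
    response = client.get_object(bucket_name, object_name)
    try:
        return response.read()
    finally:
        response.close()
        response.release_conn()
```

This keeps the existing read-into-memory flow for acceptable sizes and fails fast otherwise; true streaming would need pdfplumber to accept a seekable stream backed by ranged reads.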
Comment (Copilot AI, Dec 13, 2025):
Missing error handling for MinIO operations. If the object doesn't exist, the bucket doesn't exist, or there are authentication or permission issues, the get_object call will raise an exception that isn't caught here. Consider adding try-except blocks to provide more informative error messages for common failure scenarios (e.g., S3Error for missing objects or permission denied).

Reply (Collaborator Author):
@copilot open a new pull request to apply changes based on this feedback

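One way the error wrapping suggested in this thread might look, as a sketch. `PdfFetchError` and `get_pdf_object` are hypothetical names; `minio.error.S3Error` is the client's real exception type, and the import fallback only exists to keep the sketch importable where minio is not installed:

```python
try:
    from minio.error import S3Error  # minio's real exception type
except ImportError:  # fallback so the sketch runs without minio installed
    class S3Error(Exception):
        pass


class PdfFetchError(RuntimeError):
    """Wraps MinIO failures with a clearer message (hypothetical name)."""


def get_pdf_object(client, bucket_name: str, object_name: str):
    try:
        return client.get_object(bucket_name, object_name)
    except S3Error as exc:
        # Missing objects, missing buckets, and permission problems all
        # surface as S3Error, so one except clause covers the common cases
        raise PdfFetchError(
            f"failed to fetch minio://{bucket_name}/{object_name}: {exc}"
        ) from exc
```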
+    # Open PDF from bytes using pdfplumber
+    with pdfplumber.open(io.BytesIO(pdf_bytes)) as pdf:
Comment (Copilot AI, Dec 13, 2025):
Missing validation for PDF file integrity. The code directly opens the bytes as a PDF without verifying it's actually a valid PDF file. Malformed or non-PDF data could cause pdfplumber to raise exceptions or behave unexpectedly. Consider adding validation or wrapping the pdfplumber.open call in a try-except block to catch and handle PDFSyntaxError or similar exceptions with a clear error message.

Reply (Collaborator Author):
@copilot open a new pull request to apply changes based on this feedback

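A cheap form of the validation this thread asks for can be sketched as a header check before the bytes reach pdfplumber. `InvalidPdfError` and `ensure_pdf` are illustrative names; the check rejects obvious non-PDF payloads but not subtler corruption, so a try-except around pdfplumber.open would still be worthwhile:

```python
class InvalidPdfError(ValueError):
    """Raised when downloaded bytes do not look like a PDF (hypothetical name)."""


def ensure_pdf(pdf_bytes: bytes) -> bytes:
    # Every well-formed PDF starts with a "%PDF-" version header,
    # e.g. b"%PDF-1.7", so its absence means the object is not a PDF
    if not pdf_bytes.startswith(b"%PDF-"):
        raise InvalidPdfError("object does not start with a %PDF- header")
    return pdf_bytes
```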
+        for page_num, page in enumerate(pdf.pages, start=1):
+            text = page.extract_text() or ""
+
+            # Extract tables and convert to text format
+            tables = page.extract_tables()
+            table_text = ""
+            for table in tables:
+                for row in table:
+                    table_text += " | ".join(str(cell) if cell else "" for cell in row) + "\n"
Comment (coderabbitai, Dec 13, 2025):
🧹 Nitpick | 🔵 Trivial

Consider more robust table formatting.

The current table-to-text conversion is basic and may not preserve table structure well. The simple pipe-delimited format could be ambiguous if cells contain pipe characters.

Consider alternative approaches:

1. Use CSV-style formatting with proper escaping
2. Add Markdown table formatting
3. Include row/column headers if available from the table structure

Example with Markdown tables:

table_text = ""
for table in tables:
    if not table:
        continue
    # Add header separator for Markdown tables
    for i, row in enumerate(table):
        cells = [str(cell).strip() if cell else "" for cell in row]
        table_text += "| " + " | ".join(cells) + " |\n"
        if i == 0 and len(table) > 1:  # Add separator after header
            table_text += "|" + "|".join([" --- " for _ in cells]) + "|\n"
    table_text += "\n"

Reply (Collaborator Author):
@copilot Does that work for chunking further down the pipeline?

Reply (coderabbitai):
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

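The CSV-style alternative mentioned in the thread can be sketched with the stdlib csv module, which handles the escaping the pipe format lacks. `table_to_csv_text` is an illustrative name:

```python
import csv
import io


def table_to_csv_text(table: list[list]) -> str:
    # csv.writer quotes cells containing delimiters, quotes, or newlines,
    # so commas and pipes inside cells stay unambiguous
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in table:
        writer.writerow("" if cell is None else str(cell).strip() for cell in row)
    return buf.getvalue()
```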

+            # Combine text and tables
+            full_content = text
+            if table_text:
+                full_content += f"\n\n[Tables]\n{table_text}"
+
+            doc = Document(
+                page_content=full_content,
+                metadata={
+                    "source": f"minio://{bucket_name}/{object_name}",
+                    "bucket": bucket_name,
+                    "object_name": object_name,
+                    "page": page_num,
+                    "total_pages": len(pdf.pages),
+                    "filename": object_name.split("/")[-1],
Comment (Copilot AI, Dec 13, 2025):
Inconsistent filename extraction logic that's duplicated. The expression object_name.split("/")[-1] is repeated in both functions at lines 75 and 110. This could be extracted into a helper function or variable to improve maintainability and ensure consistency if the logic needs to change. Also, this approach doesn't handle edge cases like Windows-style paths or objects without a filename component.

Reply (Collaborator Author):
@copilot open a new pull request to apply changes based on this feedback

Comment (coderabbitai, Dec 13, 2025):
🧹 Nitpick | 🔵 Trivial

Filename extraction could be more robust.

Using object_name.split("/")[-1] works for Unix-style paths but could be more explicit and handle edge cases better.

Consider using pathlib for clearer intent:

+from pathlib import PurePosixPath
+
 # ... in the metadata dict:
-                    "filename": object_name.split("/")[-1],
+                    "filename": PurePosixPath(object_name).name,

This makes the intent clearer and handles edge cases like trailing slashes.

Comment (Copilot AI, Dec 13, 2025):
The filename extraction using object_name.split("/")[-1] is duplicated in both pdf_to_document and pdf_to_single_document functions. Consider extracting this into a helper function or variable to avoid duplication and make the code more maintainable.
+                },
+            )
+            documents.append(doc)
+
+    return documents
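The pathlib-based helper the threads above converge on could look like the sketch below. `object_filename` is an illustrative name; note that `PurePosixPath("reports/").name` is `"reports"`, where `split("/")[-1]` would give `""`:

```python
from pathlib import PurePosixPath


def object_filename(object_name: str) -> str:
    # PurePosixPath treats the key as a POSIX path regardless of the host
    # OS, and .name ignores trailing slashes (unlike split("/")[-1])
    return PurePosixPath(object_name).name
```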

+
+def pdf_to_single_document(
+    object_name: str,
+    bucket_name: str | None = None,
+    minio_client: Minio | None = None,
+) -> Document:
+    """
+    Load a PDF file from MinIO and return a single Document with all pages combined.
+
+    Args:
+        object_name: Path/name of the PDF object in the bucket
+        bucket_name: Name of the MinIO bucket (defaults to settings.minio_bucket)
+        minio_client: Optional MinIO client (creates one if not provided)
+
+    Returns:
+        Single Document object with all content
+    """

Comment (Member):
is this being used?
+    if bucket_name is None:
+        bucket_name = settings.minio_bucket
+    documents = pdf_to_document(object_name, bucket_name, minio_client)
+
+    combined_content = "\n\n".join(doc.page_content for doc in documents)
+
+    return Document(
+        page_content=combined_content,
+        metadata={
+            "source": f"minio://{bucket_name}/{object_name}",
+            "bucket": bucket_name,
+            "object_name": object_name,
+            "filename": object_name.split("/")[-1],
+            "total_pages": len(documents),
+        },
+    )
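On the chunking question raised in the table-formatting thread: a toy fixed-window splitter shows how per-page content could be cut up downstream. All names here are illustrative, and a real pipeline would more likely use a dedicated splitter such as LangChain's RecursiveCharacterTextSplitter, which respects paragraph and sentence boundaries:

```python
def chunk_pages(pages: list[str], chunk_size: int = 500, overlap: int = 50) -> list[str]:
    # Naive fixed-window splitter with overlap between adjacent chunks;
    # stands in for a real text splitter, ignoring word boundaries
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks: list[str] = []
    for text in pages:
        start = 0
        while start < len(text):
            chunks.append(text[start:start + chunk_size])
            start += chunk_size - overlap
    return chunks
```

Because the pipe- or Markdown-formatted table text is appended under a "[Tables]" marker, a fixed-window splitter can cut through a table row; a structure-aware splitter would keep rows intact.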