
Fix #289 #10

Merged
fabnemEPFL merged 6 commits into fix/288 from fix/289
May 19, 2026

Conversation

@fabnemEPFL
Owner

This pull request improves the file upload and indexing endpoint by making the mapping between uploaded files and their custom IDs more robust, enhancing the returned API response with more detailed metadata, and updating tests to verify these changes. The main focus is on ensuring accurate handling of file IDs, filenames, and chunk counts for each uploaded file.

Enhancements to file upload and indexing logic:

  • Improved parsing and sanitization of the listIds parameter to support multiple comma-separated ID strings and remove whitespace.
  • Introduced explicit tracking of uploaded files and a mapping from filenames to file IDs, ensuring that processed documents are correctly matched to their original uploads.
  • Updated the file saving and document processing logic to use the correct filenames and maintain accurate associations between files and their IDs.
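The parsing and mapping steps above can be sketched roughly as follows. This is not the code from `src/mmore/run_index_api.py`; the helper names `parse_list_ids` and `map_filenames_to_ids` are hypothetical, and dropping empty entries and rejecting a count mismatch are assumptions about reasonable behavior rather than something the PR states.

```python
from typing import Dict, List


def parse_list_ids(raw_values: List[str]) -> List[str]:
    """Flatten one or more comma-separated ID strings and trim whitespace."""
    ids: List[str] = []
    for raw in raw_values:
        for part in raw.split(","):
            part = part.strip()
            if part:  # assumption: skip empty entries such as a trailing comma
                ids.append(part)
    return ids


def map_filenames_to_ids(filenames: List[str], ids: List[str]) -> Dict[str, str]:
    """Pair each uploaded filename with the ID at the same position."""
    if len(filenames) != len(ids):
        # assumption: a mismatch is treated as a client error
        raise ValueError(f"expected {len(filenames)} IDs, got {len(ids)}")
    return dict(zip(filenames, ids))


ids = parse_list_ids(["id_1, id_2"])
id_by_filename = map_filenames_to_ids(["mmore_a.pdf", "mmore_b.pdf"], ids)
```

Keying the mapping by filename only works if filenames are unique within one request, which this sketch does not check.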

API response improvements:

  • Enhanced the response structure to include, for each uploaded file: its ID, filename, a text preview, and the number of chunks indexed, providing more transparency to clients.
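As a rough illustration of the per-file response described above, the sketch below assembles one entry per uploaded file. The names `uploaded_files` and `text_by_file_id` and the 50-character preview follow the diff in this PR; `build_bulk_response` and `chunk_file_ids` (one file ID per indexed chunk) are hypothetical, and the real endpoint may count chunks differently.

```python
from collections import Counter
from typing import Dict, List


def build_bulk_response(
    uploaded_files: List[Dict[str, str]],
    chunk_file_ids: List[str],
    text_by_file_id: Dict[str, str],
) -> dict:
    """Build the response payload with one entry per uploaded file."""
    # Count how many indexed chunks belong to each file ID.
    chunks_by_file_id = Counter(chunk_file_ids)
    return {
        "status": "success",
        "message": f"Successfully processed and indexed {len(uploaded_files)} files",
        "documents": [
            {
                "fileId": info["fileId"],
                "filename": info["filename"],
                # Short preview of the processed text, as in the diff.
                "text": text_by_file_id.get(info["fileId"], "")[:50] + "...",
                # Counter returns 0 for a file that produced no chunks.
                "chunks": chunks_by_file_id[info["fileId"]],
            }
            for info in uploaded_files
        ],
    }
```

Iterating over `uploaded_files` rather than over the processed chunks is what keeps the response at one entry per upload, even when a file is split into many chunks.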

Testing improvements:

  • Updated the bulk file upload test to verify the new response fields (filename and chunks) and ensure correct mapping between file IDs and uploaded files.
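The kind of check such a test might perform can be sketched as below. This is not the code in `tests/test_live_retriever_api.py`; `check_bulk_response` is a hypothetical helper, and requiring at least one chunk per file is an assumption.

```python
from typing import Dict


def check_bulk_response(payload: dict, expected: Dict[str, str]) -> None:
    """Assert that each expected fileId is returned with its filename and chunks."""
    assert payload["status"] == "success"
    documents = {doc["fileId"]: doc for doc in payload["documents"]}
    # Every uploaded file should appear exactly once, keyed by its ID.
    assert set(documents) == set(expected)
    assert len(payload["documents"]) == len(expected)
    for file_id, filename in expected.items():
        doc = documents[file_id]
        assert doc["filename"] == filename
        # assumption: a successfully indexed file yields at least one chunk
        assert isinstance(doc["chunks"], int) and doc["chunks"] >= 1
```

Exact chunk counts are deliberately not asserted, since they can vary between runs of the parser.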

@fabnemEPFL fabnemEPFL requested a review from Copilot May 19, 2026 12:34

Copilot AI left a comment


Pull request overview

This pull request updates the Indexer API bulk upload endpoint to more robustly associate uploaded files with client-provided IDs, and enhances the API response to return per-file metadata (including filename and chunk counts), with corresponding test updates.

Changes:

  • Improved parsing of listIds to support multiple comma-separated entries and whitespace trimming.
  • Reworked bulk upload processing to map processed chunks back to the originating uploaded file and return per-file response entries with filename and chunks.
  • Updated bulk upload tests to validate the new response fields and chunk counting behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File | Description
src/mmore/run_index_api.py | Updates /v1/files/bulk ID parsing, processed-doc ↔ upload mapping, and response payload to include filename + chunk counts.
tests/test_live_retriever_api.py | Extends the bulk upload test to cover multi-chunk behavior and the new response schema fields.
Comments suppressed due to low confidence (1)

src/mmore/run_index_api.py:216

  • The matching logic uses id_by_filename keyed by the raw upload file.filename, but later looks up using FilePath(doc.metadata.file_path).name (basename only). If the client-supplied filename includes any directory components, this lookup will fail and the endpoint will return a 500 response. Normalize filenames consistently (e.g., always use the basename for both the temp save name and the mapping key), and consider returning the normalized filename in the API response too.
                    filename = FilePath(doc.metadata.file_path).name
                    doc_id = id_by_filename.get(filename)
                    if doc_id is None:
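One way to apply the normalization suggested in this comment is sketched below. The helper `normalize_upload_name` is hypothetical and not part of the PR; it uses `pathlib.PureWindowsPath` because that class treats both `/` and `\` as separators, which is an illustrative choice rather than what the endpoint does.

```python
from pathlib import PureWindowsPath


def normalize_upload_name(raw_filename: str) -> str:
    """Return only the final path component of a client-supplied filename."""
    # PureWindowsPath splits on both "/" and "\\", so either style is handled.
    name = PureWindowsPath(raw_filename).name
    if not name:
        raise ValueError(f"cannot derive a filename from {raw_filename!r}")
    return name


# Use the same normalized name as the mapping key and as the temp save name,
# so the later basename lookup finds it.
upload_name = normalize_upload_name("docs/2026/report.pdf")
id_by_filename = {upload_name: "id_1"}

# Later, the processed document's path is reduced to its basename for lookup.
processed_path = "/tmp/upload_dir/report.pdf"  # hypothetical temp path
doc_id = id_by_filename.get(PureWindowsPath(processed_path).name)
```

Saving under the basename also keeps a client-supplied path from escaping the temp directory.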


Comment on lines +173 to 176
filename = file.filename
uploaded_files.append({"fileId": file_id, "filename": filename})
id_by_filename[filename] = file_id

Comment on lines 185 to 188
  # Save to temp directory
- file_name = FilePath(temp_dir) / file.filename
+ file_name = FilePath(temp_dir) / filename
  with file_name.open("wb") as buffer:
      shutil.copyfileobj(file.file, buffer)
Comment on lines 238 to +246
  return {
      "status": "success",
-     "message": f"Successfully processed and indexed {len(modified_documents)} documents",
+     "message": f"Successfully processed and indexed {len(uploaded_files)} files",
      "documents": [
-         {"fileId": doc.document_id, "text": doc.text[:50] + "..."}
-         for doc in modified_documents
+         {
+             "fileId": file_info["fileId"],
+             "filename": file_info["filename"],
+             "text": text_by_file_id.get(file_info["fileId"], "")[:50]
+             + "...",

@JCHAVEROT JCHAVEROT left a comment


I tested the live retrieval API using the code from your branch, and your fixes work great: an HTTP POST to the /v1/files/bulk endpoint now lists the uploaded files as expected 👍

cp /Users/chaverot/Downloads/mmore.pdf /tmp/mmore_a.pdf
cp /Users/chaverot/Downloads/mmore.pdf /tmp/mmore_b.pdf

curl -s http://localhost:8000/v1/files/bulk \
  --form 'listIds=id_1,id_2' \
  --form files=@/tmp/mmore_a.pdf \
  --form files=@/tmp/mmore_b.pdf | jq
{
  "status": "success",
  "message": "Successfully processed and indexed 2 files",
  "documents": [
    {
      "fileId": "id_1",
      "filename": "mmore_a.pdf",
      "text": "## 000 001 002 003 004 005 006 007\n\n## 008 009 010...",
      "chunks": 42
    },
    {
      "fileId": "id_2",
      "filename": "mmore_b.pdf",
      "text": "## 001 002\n\n#### 003 004 005 006\n\n### 008 009 010 ...",
      "chunks": 48
    }
  ]
}

(I was a bit surprised by the differences in text and chunk counts between the two processed documents, since they both come from the same PDF. The differences come from the PDF-to-Markdown parser, which is non-deterministic because it relies on machine learning.)

fabnemEPFL and others added 3 commits May 19, 2026 16:35
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
@fabnemEPFL fabnemEPFL merged commit 8d8d036 into fix/288 May 19, 2026
3 checks passed
@fabnemEPFL fabnemEPFL deleted the fix/289 branch May 19, 2026 15:48

3 participants