Fix #289 by fabnemEPFL · Pull Request #10 · fabnemEPFL/mmore

fabnemEPFL · 2026-05-19T12:33:40Z

This pull request improves the file upload and indexing endpoint by making the mapping between uploaded files and their custom IDs more robust, enhancing the returned API response with more detailed metadata, and updating tests to verify these changes. The main focus is on ensuring accurate handling of file IDs, filenames, and chunk counts for each uploaded file.

Enhancements to file upload and indexing logic:

Improved parsing and sanitization of the listIds parameter to support multiple comma-separated ID strings and remove whitespace.
Introduced explicit tracking of uploaded files and a mapping from filenames to file IDs, ensuring that processed documents are correctly matched to their original uploads. [1] [2]
Updated the file saving and document processing logic to use the correct filenames and maintain accurate associations between files and their IDs. [1] [2]
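The listIds handling described above could look roughly like the following. This is a minimal sketch, not the code from this PR: the helper name `parse_list_ids` is hypothetical, and dropping empty entries is an assumption about the intended behavior.

```python
from typing import List


def parse_list_ids(raw_values: List[str]) -> List[str]:
    """Flatten one or more comma-separated ID strings into a clean list.

    Hypothetical helper: the name and the choice to skip empty entries
    are assumptions, not taken from the PR diff.
    """
    ids: List[str] = []
    for raw in raw_values:
        for part in raw.split(","):
            cleaned = part.strip()  # remove surrounding whitespace
            if cleaned:  # assumption: ignore empty entries such as "a,,b"
                ids.append(cleaned)
    return ids


# Two form values, one of them holding several IDs.
example_ids = parse_list_ids(["id_1, id_2", " id_3 "])
```

Accepting a list of strings covers both a single `listIds=id_1,id_2` form field and a repeated field, at the cost of not preserving positions for empty entries.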

API response improvements:

Enhanced the response structure to include, for each uploaded file: its ID, filename, a text preview, and the number of chunks indexed, providing more transparency to clients.
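As an illustration of the response shape, here is a small sketch that assembles the per-file entries. The names `uploaded_files` and `text_by_file_id` appear in the diff; `build_bulk_response` and `chunks_by_file_id` are hypothetical names introduced only for this example, and defaulting a missing chunk count to 0 is an assumption.

```python
from typing import Any, Dict, List


def build_bulk_response(
    uploaded_files: List[Dict[str, str]],
    text_by_file_id: Dict[str, str],
    chunks_by_file_id: Dict[str, int],  # hypothetical name, not from the diff
) -> Dict[str, Any]:
    """Build one response entry per uploaded file, in upload order."""
    return {
        "status": "success",
        "message": (
            f"Successfully processed and indexed {len(uploaded_files)} files"
        ),
        "documents": [
            {
                "fileId": info["fileId"],
                "filename": info["filename"],
                # First 50 characters of the text as a preview.
                "text": text_by_file_id.get(info["fileId"], "")[:50] + "...",
                # Assumption: a file with no indexed chunks reports 0.
                "chunks": chunks_by_file_id.get(info["fileId"], 0),
            }
            for info in uploaded_files
        ],
    }


response = build_bulk_response(
    [{"fileId": "id_1", "filename": "mmore_a.pdf"}],
    {"id_1": "hello world"},
    {"id_1": 3},
)
```

Iterating over the uploaded files rather than over the processed documents keeps one entry per file even when a file yields many chunks.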

Testing improvements:

Updated the bulk file upload test to verify the new response fields (filename and chunks) and ensure correct mapping between file IDs and uploaded files. [1] [2]

Copilot

Pull request overview

This pull request updates the Indexer API bulk upload endpoint to more robustly associate uploaded files with client-provided IDs, and enhances the API response to return per-file metadata (including filename and chunk counts), with corresponding test updates.

Changes:

Improved parsing of listIds to support multiple comma-separated entries and whitespace trimming.
Reworked bulk upload processing to map processed chunks back to the originating uploaded file and return per-file response entries with filename and chunks.
Updated bulk upload tests to validate the new response fields and chunk counting behavior.
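Mapping processed chunks back to their originating upload might be sketched as follows. The `Chunk` dataclass and `aggregate_chunks` function are hypothetical stand-ins (the real documents expose the path as `doc.metadata.file_path`), and raising `KeyError` on an unknown filename is an assumption about error handling.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Tuple


@dataclass
class Chunk:
    """Hypothetical stand-in for a processed document chunk."""

    file_path: str
    text: str


def aggregate_chunks(
    chunks: List[Chunk], id_by_filename: Dict[str, str]
) -> Tuple[Dict[str, int], Dict[str, str]]:
    """Count chunks per file ID and keep the first chunk's text as preview."""
    chunks_by_file_id: Dict[str, int] = {}
    text_by_file_id: Dict[str, str] = {}
    for chunk in chunks:
        filename = Path(chunk.file_path).name  # basename of the saved file
        file_id = id_by_filename.get(filename)
        if file_id is None:
            # Assumption: an unmatched chunk is treated as an error.
            raise KeyError(f"No uploaded file matches {filename!r}")
        chunks_by_file_id[file_id] = chunks_by_file_id.get(file_id, 0) + 1
        text_by_file_id.setdefault(file_id, chunk.text)
    return chunks_by_file_id, text_by_file_id


counts, previews = aggregate_chunks(
    [
        Chunk("/tmp/work/mmore_a.pdf", "first part"),
        Chunk("/tmp/work/mmore_a.pdf", "second part"),
        Chunk("/tmp/work/mmore_b.pdf", "other file"),
    ],
    {"mmore_a.pdf": "id_1", "mmore_b.pdf": "id_2"},
)
```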

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
`src/mmore/run_index_api.py`	Updates `/v1/files/bulk` ID parsing, processed-doc ↔ upload mapping, and response payload to include filename + chunk counts.
`tests/test_live_retriever_api.py`	Extends the bulk upload test to cover multi-chunk behavior and the new response schema fields.

Comments suppressed due to low confidence (1)

src/mmore/run_index_api.py:216

The matching logic uses id_by_filename keyed by the raw upload file.filename, but later looks up using Path(doc.metadata.file_path).name (basename only). If the client-supplied filename includes any directory components, this lookup will fail and return a 500. Normalize filenames consistently (e.g., always use the basename for both the temp save name and the mapping key), and consider returning the normalized filename in the API response too.

                    filename = FilePath(doc.metadata.file_path).name
                    doc_id = id_by_filename.get(filename)
                    if doc_id is None:
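
One way to apply the normalization suggested in this comment is to reduce every client-supplied filename to its basename before using it as a mapping key or save name. The sketch below is illustrative only: `normalize_upload_name` is a hypothetical helper, and treating backslashes as path separators is an assumption about what clients may send.

```python
from pathlib import PurePosixPath
from typing import Dict


def normalize_upload_name(raw_filename: str) -> str:
    """Return only the final path component of a client-supplied filename.

    Hypothetical helper. Backslashes are converted first so that
    Windows-style paths are also reduced to their basename (assumption).
    """
    return PurePosixPath(raw_filename.replace("\\", "/")).name


# Use the same normalized key when storing and when looking up.
id_by_filename: Dict[str, str] = {}
for raw_name, file_id in [("nested/dir/mmore_a.pdf", "id_1"),
                          ("C:\\docs\\mmore_b.pdf", "id_2")]:
    id_by_filename[normalize_upload_name(raw_name)] = file_id

# Later lookup from the processed document's path uses the same rule.
found_id = id_by_filename.get(normalize_upload_name("/tmp/work/mmore_a.pdf"))
```

Two uploads that share a basename would still collide under this scheme, so it only addresses the directory-component mismatch described above.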

+                    filename = file.filename
+                    uploaded_files.append({"fileId": file_id, "filename": filename})
+                    id_by_filename[filename] = file_id



                    # Save to temp directory
-                    file_name = FilePath(temp_dir) / file.filename
+                    file_name = FilePath(temp_dir) / filename
                    with file_name.open("wb") as buffer:
                        shutil.copyfileobj(file.file, buffer)


                return {
                    "status": "success",
-                    "message": f"Successfully processed and indexed {len(modified_documents)} documents",
+                    "message": f"Successfully processed and indexed {len(uploaded_files)} files",
                    "documents": [
-                        {"fileId": doc.document_id, "text": doc.text[:50] + "..."}
-                        for doc in modified_documents
+                        {
+                            "fileId": file_info["fileId"],
+                            "filename": file_info["filename"],
+                            "text": text_by_file_id.get(file_info["fileId"], "")[:50]
+                            + "...",


JCHAVEROT

I tested the live retrieval API using the code from your branch, and your fixes work great: an HTTP POST to the /v1/files/bulk endpoint now lists the uploaded files as expected 👍

cp /Users/chaverot/Downloads/mmore.pdf /tmp/mmore_a.pdf
cp /Users/chaverot/Downloads/mmore.pdf /tmp/mmore_b.pdf

curl -s http://localhost:8000/v1/files/bulk \
  --form 'listIds=id_1,id_2' \
  --form files=@/tmp/mmore_a.pdf \
  --form files=@/tmp/mmore_b.pdf | jq
{
  "status": "success",
  "message": "Successfully processed and indexed 2 files",
  "documents": [
    {
      "fileId": "id_1",
      "filename": "mmore_a.pdf",
      "text": "## 000 001 002 003 004 005 006 007\n\n## 008 009 010...",
      "chunks": 42
    },
    {
      "fileId": "id_2",
      "filename": "mmore_b.pdf",
      "text": "## 001 002\n\n#### 003 004 005 006\n\n### 008 009 010 ...",
      "chunks": 48
    }
  ]
}

(I was a bit surprised by the difference in text and chunk counts between the two processed documents, since both come from the same PDF. It turns out the difference comes from the PDF-to-Markdown parser, which is non-deterministic because it relies on machine learning.)

Co-authored-by: Copilot <copilot@github.com>

fix

91cfe65

fabnemEPFL requested a review from Copilot May 19, 2026 12:34

Copilot AI reviewed May 19, 2026

fabnemEPFL and others added 2 commits May 19, 2026 14:45

fixes?

9c556ad

Merge branch 'fix/288' into fix/289

a806bba

JCHAVEROT approved these changes May 19, 2026

fabnemEPFL and others added 3 commits May 19, 2026 16:35

Merge branch 'fix/288' into fix/289

af72253

Merge branch 'fix/289' of github.com:fabnemEPFL/mmore into fix/289

32f24ec

Co-authored-by: Copilot <copilot@github.com>

fix

41204d9

Co-authored-by: Copilot <copilot@github.com>

fabnemEPFL merged commit 8d8d036 into fix/288 May 19, 2026
3 checks passed

fabnemEPFL deleted the fix/289 branch May 19, 2026 15:48

fabnemEPFL merged 6 commits into fix/288 from fix/289
