Fix #289#10
Pull request overview
This pull request updates the Indexer API bulk upload endpoint to more robustly associate uploaded files with client-provided IDs, and enhances the API response to return per-file metadata (including filename and chunk counts), with corresponding test updates.
Changes:
- Improved parsing of `listIds` to support multiple comma-separated entries and whitespace trimming.
- Reworked bulk upload processing to map processed chunks back to the originating uploaded file and return per-file response entries with `filename` and `chunks`.
- Updated bulk upload tests to validate the new response fields and chunk counting behavior.
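To make the `listIds` change concrete, here is a minimal sketch of comma splitting with whitespace trimming. The helper name `parse_list_ids` and the choice to drop empty entries are assumptions for illustration; the actual code in `run_index_api.py` may differ.

```python
from typing import Iterable, List


def parse_list_ids(raw_entries: Iterable[str]) -> List[str]:
    """Flatten comma-separated ID strings into a clean list of IDs.

    Hypothetical helper: the real endpoint may handle this differently.
    """
    ids: List[str] = []
    for entry in raw_entries:
        for part in entry.split(","):
            part = part.strip()
            if part:  # assumption: silently drop empty entries such as "a,,b"
                ids.append(part)
    return ids
```

With this sketch, `parse_list_ids(["id_1, id_2", "id_3"])` yields `["id_1", "id_2", "id_3"]`, so a single form field and repeated form fields are handled the same way.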
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `src/mmore/run_index_api.py` | Updates `/v1/files/bulk` ID parsing, processed-doc ↔ upload mapping, and response payload to include filename + chunk counts. |
| `tests/test_live_retriever_api.py` | Extends the bulk upload test to cover multi-chunk behavior and the new response schema fields. |
Comments suppressed due to low confidence (1)
src/mmore/run_index_api.py:216
- The matching logic uses `id_by_filename` keyed by the raw upload `file.filename`, but later looks up using `Path(doc.metadata.file_path).name` (basename only). If the client-supplied filename includes any directory components, this lookup will fail and return a 500. Normalize filenames consistently (e.g., always use the basename for both the temp save name and the mapping key), and consider returning the normalized filename in the API response too.
filename = FilePath(doc.metadata.file_path).name
doc_id = id_by_filename.get(filename)
if doc_id is None:
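One way to apply the normalization suggested in this comment is sketched below. The helper `normalize_upload_name` is hypothetical and not part of the PR; treating backslashes as separators is an added assumption to cover clients that send Windows-style paths.

```python
from pathlib import PurePosixPath


def normalize_upload_name(raw_name: str) -> str:
    """Return only the basename of a client-supplied filename.

    Hypothetical helper; backslashes are treated as separators too,
    since some clients send Windows-style paths.
    """
    return PurePosixPath(raw_name.replace("\\", "/")).name


# Use the same normalized key when saving the upload and when looking it up.
id_by_filename = {normalize_upload_name("reports/2024/mmore_a.pdf"): "id_1"}
doc_id = id_by_filename.get(normalize_upload_name("/tmp/upload/mmore_a.pdf"))
```

Because both sides of the lookup go through the same function, a filename with directory components no longer produces a missing key.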
+filename = file.filename
+uploaded_files.append({"fileId": file_id, "filename": filename})
+id_by_filename[filename] = file_id

 # Save to temp directory
-file_name = FilePath(temp_dir) / file.filename
+file_name = FilePath(temp_dir) / filename
 with file_name.open("wb") as buffer:
     shutil.copyfileobj(file.file, buffer)
 return {
     "status": "success",
-    "message": f"Successfully processed and indexed {len(modified_documents)} documents",
+    "message": f"Successfully processed and indexed {len(uploaded_files)} files",
     "documents": [
-        {"fileId": doc.document_id, "text": doc.text[:50] + "..."}
-        for doc in modified_documents
+        {
+            "fileId": file_info["fileId"],
+            "filename": file_info["filename"],
+            "text": text_by_file_id.get(file_info["fileId"], "")[:50]
+            + "...",
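The per-file aggregation behind this response can be sketched roughly as follows. The function `build_documents` and the `(file_id, text)` chunk shape are assumptions for illustration, not the project's actual types. Unlike the hunk above, this sketch appends `"..."` only when the text is actually truncated.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_documents(
    uploaded_files: List[Dict[str, str]],
    chunks: List[Tuple[str, str]],
    preview_len: int = 50,
) -> List[Dict[str, object]]:
    """Build one response entry per uploaded file.

    `chunks` is a list of (file_id, text) pairs; this shape is an
    assumption for the sketch, not the project's document type.
    """
    chunk_counts = Counter(file_id for file_id, _ in chunks)
    first_text: Dict[str, str] = {}
    for file_id, text in chunks:
        first_text.setdefault(file_id, text)  # keep the first chunk's text

    documents = []
    for info in uploaded_files:
        text = first_text.get(info["fileId"], "")
        preview = text[:preview_len] + ("..." if len(text) > preview_len else "")
        documents.append(
            {
                "fileId": info["fileId"],
                "filename": info["filename"],
                "text": preview,
                "chunks": chunk_counts[info["fileId"]],
            }
        )
    return documents
```

Iterating over `uploaded_files` rather than over the chunks keeps the response at one entry per upload, in upload order, even when a file yields many chunks or none.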
JCHAVEROT left a comment
I tested the live retrieval API using the code from your branch, and your fixes work great: an HTTP POST to the /v1/files/bulk endpoint now lists the uploaded files 👍
cp /Users/chaverot/Downloads/mmore.pdf /tmp/mmore_a.pdf
cp /Users/chaverot/Downloads/mmore.pdf /tmp/mmore_b.pdf
curl -s http://localhost:8000/v1/files/bulk \
--form 'listIds=id_1,id_2' \
--form files=@/tmp/mmore_a.pdf \
--form files=@/tmp/mmore_b.pdf | jq
{
"status": "success",
"message": "Successfully processed and indexed 2 files",
"documents": [
{
"fileId": "id_1",
"filename": "mmore_a.pdf",
"text": "## 000 001 002 003 004 005 006 007\n\n## 008 009 010...",
"chunks": 42
},
{
"fileId": "id_2",
"filename": "mmore_b.pdf",
"text": "## 001 002\n\n#### 003 004 005 006\n\n### 008 009 010 ...",
"chunks": 48
}
]
}

(I was a bit surprised by the differences in text and chunk counts between the two processed documents, since both come from the same PDF. The cause is the PDF-to-Markdown parser, which is non-deterministic because it relies on machine learning.)
Co-authored-by: Copilot <copilot@github.com>
This pull request improves the file upload and indexing endpoint by making the mapping between uploaded files and their custom IDs more robust, enhancing the returned API response with more detailed metadata, and updating tests to verify these changes. The main focus is on ensuring accurate handling of file IDs, filenames, and chunk counts for each uploaded file.
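Enhancements to file upload and indexing logic:
- Improved parsing of the `listIds` parameter to support multiple comma-separated ID strings and remove whitespace.

API response improvements:
- Per-file response entries now include `filename` and `chunks`.

Testing improvements:
- Updated bulk upload tests to validate the new response fields (`filename` and `chunks`) and ensure correct mapping between file IDs and uploaded files.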
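A check along these lines could validate the new response fields. It is a sketch, not the test in `tests/test_live_retriever_api.py`: the helper `check_bulk_response` is hypothetical, it assumes entries come back in upload order, and the sample payload is abbreviated from the reviewer's output above.

```python
from typing import Any, Dict, List


def check_bulk_response(payload: Dict[str, Any], expected_ids: List[str]) -> None:
    """Assert that a bulk-upload response carries the per-file fields."""
    assert payload["status"] == "success"
    documents = payload["documents"]
    # Assumption: entries are returned in the same order as the uploads.
    assert [doc["fileId"] for doc in documents] == expected_ids
    for doc in documents:
        assert isinstance(doc["filename"], str) and doc["filename"]
        assert isinstance(doc["chunks"], int) and doc["chunks"] >= 1


sample = {
    "status": "success",
    "message": "Successfully processed and indexed 2 files",
    "documents": [
        {"fileId": "id_1", "filename": "mmore_a.pdf", "chunks": 42},
        {"fileId": "id_2", "filename": "mmore_b.pdf", "chunks": 48},
    ],
}
check_bulk_response(sample, ["id_1", "id_2"])
```

Exact chunk counts are deliberately not asserted here, since the reviewer observed that they vary between runs for the same PDF.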