Asynchronous multi-GPU file processing and indexing #328

Merged
fabnemEPFL merged 22 commits into EPFLiGHT:master from JCHAVEROT:feat/parallelization
Jun 25, 2026
Conversation

JCHAVEROT commented Jun 19, 2026

Collaborator

Summary

Related issues: #324, #326

In production, the live-retrieval API processed uploads synchronously, so a single user uploading a document blocked every other request until processing finished. This PR makes the processing and indexing pipelines asynchronous and parallel across GPUs: uploads now return immediately with a job id, and a job queue lets documents be processed and indexed concurrently. To handle concurrent writes to the Milvus database, the deployment was switched from Milvus Lite to Milvus Standalone.

Changes

  • Fixed a bug where files uploaded through the API were processed twice
  • Added a JobQueue (src/mmore/job_queue.py) that schedules jobs on GPUs, with a queue size limit and eviction of finished jobs
  • Made file upload asynchronous: the POST /v1/files, POST /v1/files/bulk and PUT /v1/files/{fileId} endpoints now return 202 Accepted with a jobId instead of blocking. Duplicate ids are rejected with 409, and a saturated queue returns 503
  • Added job status endpoints: GET /v1/jobs/{jobId} for a one-time status and GET /v1/jobs/{jobId}/events for a server-pushed status stream over Server-Sent Events (SSE)
  • Parallelized processing and indexing across GPUs: per-device model copies in the PDF and media processors (artifacts_by_device / pipelines_by_device), and per-device embedding models (dense + splade)
  • Updated configs (jobs_per_gpu, max_queue_size, Milvus Standalone uri) and the API documentation
  • Added new tests for the job queue and the changed endpoints
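To make the queue behavior described above concrete, here is a minimal sketch of an in-memory queue with a size limit, duplicate rejection and eviction of finished jobs. It is not the code from src/mmore/job_queue.py: the class name TinyJobQueue, the exception names, the status strings and the max_finished parameter are all hypothetical, chosen only for illustration.

```python
import threading
from collections import OrderedDict, deque


class QueueFull(Exception):
    """Queue is saturated (the API would map this to 503)."""


class DuplicateJob(Exception):
    """Job id is already known (the API would map this to 409)."""


class TinyJobQueue:
    """Hypothetical stand-in for JobQueue, for illustration only."""

    def __init__(self, devices, jobs_per_gpu=1, max_queue_size=None, max_finished=100):
        self.devices = list(devices)
        self.n_workers = len(self.devices) * jobs_per_gpu
        self.max_queue_size = max_queue_size or self.n_workers * 10
        self.max_finished = max_finished
        self._pending = deque()         # job ids waiting for a worker
        self._status = {}               # job id -> "queued" | "running"
        self._finished = OrderedDict()  # job id -> final status, oldest first
        self._lock = threading.Lock()

    def submit(self, job_id):
        with self._lock:
            if job_id in self._status or job_id in self._finished:
                raise DuplicateJob(job_id)
            if len(self._pending) >= self.max_queue_size:
                raise QueueFull(job_id)
            self._pending.append(job_id)
            self._status[job_id] = "queued"

    def start_next(self):
        """Move the oldest queued job to 'running'; return its id or None."""
        with self._lock:
            if not self._pending:
                return None
            job_id = self._pending.popleft()
            self._status[job_id] = "running"
            return job_id

    def finish(self, job_id, status="done"):
        with self._lock:
            del self._status[job_id]
            self._finished[job_id] = status
            while len(self._finished) > self.max_finished:
                self._finished.popitem(last=False)  # evict the oldest finished job

    def status(self, job_id):
        with self._lock:
            return self._status.get(job_id) or self._finished.get(job_id)
```

In this sketch an evicted job id becomes unknown again, so a status lookup returns None and the id could be resubmitted. Whether the real queue behaves the same way is not stated in this PR description.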

Improvements

  • Non-blocking API: one user's upload no longer stalls everyone else; the request returns immediately with a job id
  • Multi-GPU: documents are processed and indexed in parallel across all available GPUs instead of one at a time
  • Observability: logs report the queue state and per-job progress, and clients can either poll GET /v1/jobs/{jobId} or subscribe to the SSE stream
  • Configurable concurrency: jobs_per_gpu (defaults to 1) lets you overlap CPU and GPU work for higher utilization, but each replica loads its own models, so the value needs tuning against the available VRAM
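For clients that subscribe to the SSE stream rather than poll, the wire format is plain text: each event is a group of field lines terminated by a blank line. The sketch below parses the data fields of such a stream using only the standard library. The JSON payload shape ({"status": ...}) and the status values are assumptions for illustration; the actual payload emitted by GET /v1/jobs/{jobId}/events is not shown in this PR description.

```python
import json


def iter_sse_data(lines):
    """Yield the data payload of each complete event in a text/event-stream."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.


# Hypothetical stream contents; the payload shape is an assumption.
sample = [
    ": keep-alive\n",
    'data: {"status": "queued"}\n',
    "\n",
    'data: {"status": "running"}\n',
    "\n",
    'data: {"status": "done"}\n',
    "\n",
]
statuses = [json.loads(payload)["status"] for payload in iter_sse_data(sample)]
```

The same generator can be fed the line iterator of a streaming HTTP response, which keeps the parsing logic independent of the HTTP client in use.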

Milvus Standalone server

  • Concurrent writes require a real Milvus Standalone server (the production config now points at http://localhost:19530), not Milvus Lite, which is single-process

To create it:

wget https://github.com/milvus-io/milvus/releases/download/v2.5.4/milvus-standalone-docker-compose.yml -O docker-compose.yml
docker compose up -d

Alternatively:

curl -sfL https://raw.githubusercontent.com/milvus-io/milvus/master/scripts/standalone_embed.sh -o standalone_embed.sh
bash standalone_embed.sh start

The port exposed by default is 19530.
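Before starting the API, it can help to confirm that something is listening on that port. The following is a minimal sketch using only the standard library; the helper name port_is_open is hypothetical, and an open port only shows that a TCP listener exists, not that Milvus itself is healthy.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For the setup above, the call would be port_is_open("localhost", 19530).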


DEMO

The experiment runs on a server with 4 NVIDIA Tesla V100 32 GB GPUs, of which only CUDA devices 1, 2 and 3 are available to the indexer (device 0 is reserved for other containers). The workload is a single HTTP request to the bulk-upload endpoint carrying 4 large PDFs (18 MB, 450 pages each), i.e. one more than the number of available GPUs.

experiment.mp4

Notes:

  • To restrict the API to devices 1-3, it was started with the environment variable CUDA_VISIBLE_DEVICES=1,2,3. Inside the process these are re-indexed as cuda:0, cuda:1 and cuda:2, and _detect_devices() sees 3 GPUs
  • Because the GPUs work in parallel, the API logs are frequently overwritten, which explains the behavior seen in the video
  • With jobs_per_gpu=1 and 3 visible GPUs, the 4th PDF queues until one of the first three finishes, then runs
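The device re-indexing and the queuing of the 4th PDF described in these notes can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the function names are hypothetical, CUDA_VISIBLE_DEVICES is assumed to hold integer indices (it can also hold UUIDs), and jobs are assumed to be started in submission order.

```python
def logical_to_physical(cuda_visible_devices):
    """Map in-process device names (cuda:0, ...) to physical GPU indices."""
    ids = [int(part) for part in cuda_visible_devices.split(",") if part.strip()]
    return {f"cuda:{logical}": physical for logical, physical in enumerate(ids)}


def split_running_and_queued(jobs, n_gpus, jobs_per_gpu=1):
    """Return (jobs that start immediately, jobs that wait for a free slot)."""
    slots = n_gpus * jobs_per_gpu
    return jobs[:slots], jobs[slots:]


mapping = logical_to_physical("1,2,3")
running, queued = split_running_and_queued(
    ["a.pdf", "b.pdf", "c.pdf", "d.pdf"], n_gpus=len(mapping), jobs_per_gpu=1
)
```

Here mapping is {"cuda:0": 1, "cuda:1": 2, "cuda:2": 3}, three files start at once and the fourth waits. The file names are placeholders, not the PDFs used in the demo.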

JCHAVEROT self-assigned this Jun 19, 2026
JCHAVEROT added the bug, documentation and enhancement labels Jun 19, 2026
This was linked to issues Jun 19, 2026
JCHAVEROT changed the title from "Make file processing run asynchronously across multiple GPUs" to "Make file processing and indexing run asynchronously across multiple GPUs" Jun 19, 2026
JCHAVEROT force-pushed the feat/parallelization branch 3 times, most recently from ba4f60a to b35d8ae June 19, 2026 20:01
JCHAVEROT changed the title from "Make file processing and indexing run asynchronously across multiple GPUs" to "Asynchronous multi-GPU file processing and indexing" Jun 19, 2026
fabnemEPFL requested a review from Copilot June 19, 2026 20:22

Copilot AI left a comment

Contributor

Pull request overview

This PR refactors the live-retrieval/indexer API to make uploads asynchronous and to parallelize processing/indexing across multiple GPUs via an in-memory job queue, adding job-status polling/SSE endpoints and updating Milvus configuration/docs accordingly.

Changes:

  • Introduces an in-memory JobQueue and job-status endpoints (/v1/jobs/{jobId} + SSE stream) to decouple uploads from processing/indexing.
  • Splits processing/indexing work across GPUs by pinning jobs and model replicas per device (processors + dense/sparse embedding models).
  • Updates API responses/status codes, production/example configs, and documentation; adds tests for the new async behavior.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| tests/test_live_retriever_api.py | Updates API tests for async uploads and adds job/SSE coverage. |
| tests/test_job_queue.py | Adds unit tests for the new in-memory job queue. |
| src/mmore/utils.py | Caches indexers per (collection, device) and threads device/output_path through processing. |
| src/mmore/run_index_api.py | Converts upload/update endpoints to async job submission, adds job status + SSE endpoints, and per-device subprocess processing. |
| src/mmore/rag/retriever.py | Adds queue/concurrency settings to RetrieverConfig. |
| src/mmore/rag/model/sparse/splade.py | Allows passing an explicit device for sparse embeddings. |
| src/mmore/rag/model/sparse/base.py | Threads device through sparse model construction. |
| src/mmore/rag/model/dense/base.py | Threads device through HF embedding model construction. |
| src/mmore/process/processors/pdf_processor.py | Adds per-device model caching/loading for parallel PDF processing. |
| src/mmore/process/processors/media_processor.py | Adds per-device pipeline caching/loading for parallel media processing. |
| src/mmore/process/execution_state.py | Makes execution state init/shutdown concurrency-safe with refcounting. |
| src/mmore/process/dispatcher.py | Adds optional device pinning and lazily initializes the shared multiprocessing pool. |
| src/mmore/job_queue.py | Implements the in-memory queue, device checkout, and retention/eviction. |
| src/mmore/index/indexer.py | Loads embedding models on a given device and tolerates concurrent collection creation. |
| production-config/retriever_api/config.yaml | Switches to Milvus Standalone config and adds queue settings. |
| examples/retriever_api/config.yaml | Notes Standalone Milvus URI option in the example config. |
| docs/source/developer_documentation/index_api.md | Updates API docs to describe async uploads, job status endpoints, and concurrency knobs. |

fabnemEPFL force-pushed the feat/parallelization branch from 71fb1ee to 85fc0fd June 20, 2026 12:19
Comment thread src/mmore/job_queue.py Outdated
    ):
        self.devices = devices or _detect_devices()
        self.n_workers = len(self.devices) * jobs_per_gpu
        self.max_queue_size = max_queue_size or self.n_workers * 10

Collaborator

or self.n_workers * 10 sounds arbitrary, why?

Collaborator Author

Yes, it is completely arbitrary. It gives a default of 10 buffered jobs per worker, which seems reasonable (neither too small nor too big).

In a production environment it may be too low, depending on how many users we have; let me know if I should increase the default value.

In any case, it can be changed in the config files.

Collaborator

I guess we can leave it this way if it works with this rule of thumb, it would be cool to test this some day when urgent things have been handled
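For reference, the default discussed in this thread works out as follows. This is a standalone sketch of the same arithmetic as the snippet above, wrapped in a hypothetical helper (default_queue_size) so it can be run in isolation; it is not part of job_queue.py.

```python
def default_queue_size(n_devices, jobs_per_gpu, max_queue_size=None):
    """Mirror the 'max_queue_size or n_workers * 10' rule from the snippet."""
    n_workers = n_devices * jobs_per_gpu
    return max_queue_size or n_workers * 10


sizes = [
    default_queue_size(3, 1),                    # 3 workers -> 30
    default_queue_size(3, 2),                    # 6 workers -> 60
    default_queue_size(3, 1, max_queue_size=5),  # explicit value wins -> 5
    default_queue_size(3, 1, max_queue_size=0),  # 0 is falsy -> falls back to 30
]
```

One consequence of using `or` here is that an explicit max_queue_size of 0 is treated as unset and falls back to the default, because 0 is falsy in Python.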

fabnemEPFL left a comment

Collaborator

sounds good to me, let's wait for feedback from the Moove team

tharvik commented Jun 24, 2026

the current feedback is "migration is longer than hoped"

@fabnemEPFL

Collaborator

🥲

fabnemEPFL merged commit 74ee9eb into EPFLiGHT:master Jun 25, 2026
4 checks passed
fabnemEPFL deleted the feat/parallelization branch June 25, 2026 20:52
Development

Successfully merging this pull request may close these issues.

can't retrieve when the rag is indexing
only uses a single GPU

4 participants