Asynchronous multi-GPU file processing and indexing by JCHAVEROT · Pull Request #328 · EPFLiGHT/mmore

JCHAVEROT · 2026-06-19T13:46:47Z

Summary

Related issues: #324, #326

In production, the live-retrieval API processed uploads synchronously, so a single user uploading a document blocked every other request until processing finished. This PR makes the processing and indexing pipelines asynchronous and parallel across GPUs: uploads now return immediately with a job id, and documents are processed and indexed concurrently through a job queue. To handle concurrent writes to the Milvus database, the deployment was switched from Milvus Lite to Milvus Standalone.

Changes

Fixed a bug where files uploaded through the API were processed twice
Added a JobQueue (src/mmore/job_queue.py) that schedules jobs on GPUs, with a size limit and eviction of finished jobs
Made file upload asynchronous: the POST /v1/files, POST /v1/files/bulk and PUT /v1/files/{fileId} endpoints now return 202 Accepted with a jobId instead of blocking. Duplicate ids are rejected with 409, and a saturated queue returns 503
Added job status endpoints: GET /v1/jobs/{jobId} for a one-time status and GET /v1/jobs/{jobId}/events for a server-pushed status stream over Server-Sent Events (SSE)
Parallelized processing and indexing across GPUs: per-device model copies in the PDF and media processors (artifacts_by_device / pipelines_by_device), and per-device embedding models (dense + splade)
Updated configs (jobs_per_gpu, max_queue_size, Milvus Standalone uri) and the API documentation
Added new tests for the job queue and the changed endpoints
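
As a rough illustration of the queue behaviour listed above (bounded size, duplicate rejection, eviction of finished jobs), here is a minimal sketch. It is not the code in src/mmore/job_queue.py: the class name TinyJobQueue, the two exception types and the max_finished retention parameter are hypothetical, and translating the exceptions into 409 and 503 responses is left to the caller.

```python
from collections import OrderedDict
from dataclasses import dataclass


class DuplicateJob(Exception):
    """Hypothetical error; an API layer could translate it to 409."""


class QueueFull(Exception):
    """Hypothetical error; an API layer could translate it to 503."""


@dataclass
class Job:
    job_id: str
    status: str = "queued"  # only "queued" and "done" are used in this sketch


class TinyJobQueue:
    """Illustrative bounded queue; not the JobQueue from the PR."""

    def __init__(self, max_queue_size: int, max_finished: int):
        self.max_queue_size = max_queue_size
        self.max_finished = max_finished  # hypothetical retention limit
        self.jobs: "OrderedDict[str, Job]" = OrderedDict()

    def submit(self, job_id: str) -> Job:
        if job_id in self.jobs:
            raise DuplicateJob(job_id)
        pending = sum(j.status == "queued" for j in self.jobs.values())
        if pending >= self.max_queue_size:
            raise QueueFull(job_id)
        job = self.jobs[job_id] = Job(job_id)
        return job

    def finish(self, job_id: str) -> None:
        self.jobs[job_id].status = "done"
        # Evict the earliest-submitted finished jobs beyond the retention limit.
        finished = [j.job_id for j in self.jobs.values() if j.status == "done"]
        for old_id in finished[: max(0, len(finished) - self.max_finished)]:
            del self.jobs[old_id]
```

Keeping finished jobs around for a while lets a status lookup still succeed shortly after completion, while the retention limit keeps memory bounded.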

Improvements

Non-blocking API: one user's upload no longer stalls everyone else; the request returns immediately with a job id
Multi-GPU: documents are processed and indexed in parallel across all available GPUs instead of one at a time
Observability: logs report the queue state and per-job progress, and clients can either poll GET /v1/jobs/{jobId} or subscribe to the SSE stream
Configurable concurrency: jobs_per_gpu (defaults to 1) lets you overlap CPU and GPU work for higher utilization, but each replica loads its own models, so the value has to be tuned against the available VRAM
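
For clients that subscribe to GET /v1/jobs/{jobId}/events, the response follows the Server-Sent Events text format: `event:` and `data:` fields, with a blank line closing each event. The sketch below parses such a stream from an iterable of lines. The payload shape (a JSON object with a `status` key) and the sample events are assumptions for illustration, since the PR description does not show the actual payload.

```python
import json
from typing import Dict, Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per SSE event from an iterable of text lines."""
    event: Dict[str, str] = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:  # a blank line terminates the current event
            if event:
                yield event
                event = {}
            continue
        if line.startswith(":"):  # comment / keep-alive line
            continue
        field, _, value = line.partition(":")
        value = value[1:] if value.startswith(" ") else value
        if field == "data" and "data" in event:
            event["data"] += "\n" + value  # multi-line data is joined
        else:
            event[field] = value


# Hypothetical stream; the real payload shape is not shown in this PR.
stream = [
    ": keep-alive\n",
    'data: {"status": "queued"}\n',
    "\n",
    "event: update\n",
    'data: {"status": "done"}\n',
    "\n",
]
statuses = [json.loads(e["data"])["status"] for e in parse_sse(stream)]
print(statuses)  # ['queued', 'done']
```

In practice the lines would come from a streaming HTTP response rather than a list; polling GET /v1/jobs/{jobId} remains the simpler option when a live stream is not needed.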

Milvus Standalone server

Concurrent writes require a real Milvus Standalone server (the production config now points at http://localhost:19530) rather than Milvus Lite, which is single-process

To create it:

wget https://github.com/milvus-io/milvus/releases/download/v2.5.4/milvus-standalone-docker-compose.yml -O docker-compose.yml
docker compose up -d

Alternatively:

curl -sfL https://raw.githubusercontent.com/milvus-io/milvus/master/scripts/standalone_embed.sh -o standalone_embed.sh
bash standalone_embed.sh start

The port exposed by default is 19530
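
Before starting the API, it can help to confirm that something is listening on that port. The sketch below uses only the Python standard library; is_port_open is a hypothetical helper, and a successful TCP connection only shows that a listener exists, not that Milvus itself is healthy.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 19530 is the default port mentioned above.
    print("Milvus port reachable:", is_port_open("localhost", 19530))
```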

DEMO

The experiment runs on a server with 4 NVIDIA Tesla V100 32 GB GPUs, of which only CUDA devices 1, 2 and 3 are available to the indexer (device 0 is reserved for other containers). The workload is a single HTTP request to the bulk-upload endpoint carrying 4 large PDFs (18 MB, 450 pages each), one more than the 3 GPUs available to the indexer

experiment.mp4

Notes:

The API was started with the environment variable CUDA_VISIBLE_DEVICES=1,2,3 to restrict it to devices 1-3. Inside the process these are re-indexed as cuda:0, cuda:1 and cuda:2, so _detect_devices() sees 3 GPUs
Because the GPUs work in parallel, their log lines frequently overwrite one another, which explains the log behavior seen in the video
With jobs_per_gpu=1 and 3 visible GPUs, the 4th PDF queues until one of the first three finishes, then runs
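
The re-indexing and queueing described in these notes can be sketched without a GPU. CUDA numbers the visible devices from zero in the order they are listed, which is what the hypothetical helper logical_to_physical reproduces; it only handles integer indices (CUDA_VISIBLE_DEVICES also accepts device UUIDs, which this sketch ignores), and the job names are made up.

```python
from typing import Dict


def logical_to_physical(cuda_visible_devices: str) -> Dict[str, int]:
    """Map in-process names (cuda:0, cuda:1, ...) to physical GPU indices.

    Assumes a comma-separated list of integer indices, as in the demo.
    """
    physical = [int(tok) for tok in cuda_visible_devices.split(",") if tok.strip()]
    return {f"cuda:{i}": gpu for i, gpu in enumerate(physical)}


mapping = logical_to_physical("1,2,3")
print(mapping)  # {'cuda:0': 1, 'cuda:1': 2, 'cuda:2': 3}

# With jobs_per_gpu=1, the number of jobs that can run at once equals the
# number of visible devices, so a 4th job has to wait.
jobs = ["pdf-1", "pdf-2", "pdf-3", "pdf-4"]  # hypothetical job names
running, waiting = jobs[: len(mapping)], jobs[len(mapping):]
print(running, waiting)  # ['pdf-1', 'pdf-2', 'pdf-3'] ['pdf-4']
```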

Copilot

Pull request overview

This PR refactors the live-retrieval/indexer API to make uploads asynchronous and to parallelize processing/indexing across multiple GPUs via an in-memory job queue, adding job-status polling/SSE endpoints and updating Milvus configuration/docs accordingly.

Changes:

Introduces an in-memory JobQueue and job-status endpoints (/v1/jobs/{jobId} + SSE stream) to decouple uploads from processing/indexing.
Splits processing/indexing work across GPUs by pinning jobs and model replicas per device (processors + dense/sparse embedding models).
Updates API responses/status codes, production/example configs, and documentation; adds tests for the new async behavior.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.

File	Description
tests/test_live_retriever_api.py	Updates API tests for async uploads and adds job/SSE coverage.
tests/test_job_queue.py	Adds unit tests for the new in-memory job queue.
src/mmore/utils.py	Caches indexers per `(collection, device)` and threads device/output_path through processing.
src/mmore/run_index_api.py	Converts upload/update endpoints to async job submission, adds job status + SSE endpoints, and per-device subprocess processing.
src/mmore/rag/retriever.py	Adds queue/concurrency settings to `RetrieverConfig`.
src/mmore/rag/model/sparse/splade.py	Allows passing an explicit device for sparse embeddings.
src/mmore/rag/model/sparse/base.py	Threads `device` through sparse model construction.
src/mmore/rag/model/dense/base.py	Threads `device` through HF embedding model construction.
src/mmore/process/processors/pdf_processor.py	Adds per-device model caching/loading for parallel PDF processing.
src/mmore/process/processors/media_processor.py	Adds per-device pipeline caching/loading for parallel media processing.
src/mmore/process/execution_state.py	Makes execution state init/shutdown concurrency-safe with refcounting.
src/mmore/process/dispatcher.py	Adds optional device pinning and lazily initializes the shared multiprocessing pool.
src/mmore/job_queue.py	Implements the in-memory queue, device checkout, and retention/eviction.
src/mmore/index/indexer.py	Loads embedding models on a given device and tolerates concurrent collection creation.
production-config/retriever_api/config.yaml	Switches to Milvus Standalone config and adds queue settings.
examples/retriever_api/config.yaml	Notes Standalone Milvus URI option in the example config.
docs/source/developer_documentation/index_api.md	Updates API docs to describe async uploads, job status endpoints, and concurrency knobs.
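
Several rows above describe the same pattern: keep one model copy per device and make loading safe when jobs run concurrently. A minimal sketch of that idea follows. PerDeviceCache and fake_loader are hypothetical names, the loader stands in for whatever actually loads a model onto a device, and none of this is taken from pdf_processor.py or media_processor.py.

```python
import threading
from typing import Callable, Dict, Generic, TypeVar

T = TypeVar("T")


class PerDeviceCache(Generic[T]):
    """Load one object per device, at most once, even under concurrency."""

    def __init__(self, loader: Callable[[str], T]):
        self._loader = loader  # hypothetical: e.g. loads models onto `device`
        self._by_device: Dict[str, T] = {}
        self._lock = threading.Lock()

    def get(self, device: str) -> T:
        # Holding the lock while loading serializes model loading, which
        # trades some startup parallelism for simplicity.
        with self._lock:
            if device not in self._by_device:
                self._by_device[device] = self._loader(device)
            return self._by_device[device]


calls = []


def fake_loader(device: str) -> str:
    calls.append(device)  # record each real load
    return f"model-on-{device}"


cache = PerDeviceCache(fake_loader)
threads = [
    threading.Thread(target=cache.get, args=(f"cuda:{i % 2}",)) for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(calls))  # ['cuda:0', 'cuda:1']
```

Eight concurrent requests across two devices trigger exactly two loads, one per device.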

fabnemEPFL · 2026-06-20T14:13:49Z

+    ):
+        self.devices = devices or _detect_devices()
+        self.n_workers = len(self.devices) * jobs_per_gpu
+        self.max_queue_size = max_queue_size or self.n_workers * 10


or self.n_workers * 10 sounds arbitrary, why?

It is completely arbitrary, yes: it allows 10 buffered jobs per worker by default, which seems reasonable (neither too small nor too big)

In a production environment it may be too low depending on how many users we have, so let me know if I should increase the default value

In any case it can be changed in the config files

I guess we can leave it this way if it works with this rule of thumb, it would be cool to test this some day when urgent things have been handled
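
For concreteness, the default discussed in this thread works out as follows. queue_limits is a hypothetical wrapper around the three quoted lines, with the device list passed in explicitly instead of coming from _detect_devices(). One detail worth noting: because the fallback uses `or`, an explicit max_queue_size of 0 is treated the same as not setting it.

```python
from typing import List, Optional, Tuple


def queue_limits(
    devices: List[str], jobs_per_gpu: int = 1, max_queue_size: Optional[int] = None
) -> Tuple[int, int]:
    """Return (n_workers, max_queue_size) using the rule quoted above."""
    n_workers = len(devices) * jobs_per_gpu
    # `or` falls back to the default for None and also for 0.
    return n_workers, max_queue_size or n_workers * 10


print(queue_limits(["cuda:0", "cuda:1", "cuda:2"]))     # (3, 30)
print(queue_limits(["cuda:0", "cuda:1", "cuda:2"], 2))  # (6, 60)
print(queue_limits(["cuda:0"], max_queue_size=5))       # (1, 5)
print(queue_limits(["cuda:0"], max_queue_size=0))       # (1, 10)
```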

fabnemEPFL

sounds good to me, let's wait for feedback from the Moove team

tharvik · 2026-06-24T12:29:16Z

the current feedback is "migration is longer than hoped"

fabnemEPFL · 2026-06-24T13:05:03Z

🥲

JCHAVEROT self-assigned this Jun 19, 2026

JCHAVEROT added the bug, documentation and enhancement labels Jun 19, 2026

This was linked to issues Jun 19, 2026

only uses a single GPU #324

Closed

can't retrieve when the rag is indexing #326

Closed

JCHAVEROT changed the title ~~Make file processing run asynchronously across multiple GPUs~~ Make file processing and indexing run asynchronously across multiple GPUs Jun 19, 2026

JCHAVEROT force-pushed the feat/parallelization branch 3 times, most recently from ba4f60a to b35d8ae June 19, 2026 20:01

JCHAVEROT changed the title ~~Make file processing and indexing run asynchronously across multiple GPUs~~ Asynchronous multi-GPU file processing and indexing Jun 19, 2026

fabnemEPFL requested a review from Copilot June 19, 2026 20:22

Copilot started reviewing on behalf of fabnemEPFL June 19, 2026 20:23

Copilot AI reviewed Jun 19, 2026

Comment thread tests/test_job_queue.py

Comment thread src/mmore/run_index_api.py

Comment thread src/mmore/run_index_api.py

Comment thread src/mmore/run_index_api.py

Comment thread src/mmore/job_queue.py

JCHAVEROT added 16 commits June 20, 2026 14:19

fix: files processed twice when uploaded through API

519522f

feat: add first version of the job queue for parallelization

52f4f05

feat: add per-GPU device pinning in processing pipeline (pdf and media)

7aba5c2

feat: process file uploads asynchronously via a job queue

ccaed02

feat: add job status endpoints and fix tests

60d22a4

chores: update config files

bf1e830

chores: update API documentation

21d0faf

fix: correct Milvus Standalone db name

db5bbbc

chores: improve JobQueue logs

f583bf1

fix: make ExecutionState compatible for concurrent dispatches

e0fa821

fix: serialize processor models loading to handle concurrency

7bd458f

fix: correct device number once a gpu starts ingesting

16d1151

fix: run processing in a device subprocess

e6ad342

fix: tests not passing

d297ada

feat: make indexing parallelizable across GPUs

84d37f1

fix: correct wrong stats in JobQueue logs

f1c0df4

JCHAVEROT added 2 commits June 20, 2026 14:19

perf: lazy create shared pool as gpus process not using it

31ff215

chores: fix type check errors in tests

85fc0fd

fabnemEPFL force-pushed the feat/parallelization branch from 71fb1ee to 85fc0fd June 20, 2026 12:19

fabnemEPFL requested changes Jun 20, 2026

JCHAVEROT added 4 commits June 20, 2026 18:27

review: make changes following Fabrice's review

f07340c

docs: update the Swagger doc for the retrieval API

f82668a

review: make changes following copilot's review

00b1905

review: mention CUDA_VISIBLE_DEVICES var in docs

51ca4bb

fabnemEPFL reviewed Jun 22, 2026

fabnemEPFL approved these changes Jun 25, 2026

fabnemEPFL merged commit 74ee9eb into EPFLiGHT:master Jun 25, 2026
4 checks passed

fabnemEPFL deleted the feat/parallelization branch June 25, 2026 20:52

4 participants
