RankLLM has many demos that double as integration tests for the e2e flow, but running them is manual and expensive.
1- Change them to take parameters, like rerank_qwen_async and rerank_qwen_local.
2- Make sure there are CLIs for the same tests.
3- Write skills for running smoke tests that automate the e2e testing process with a single query or a handful of queries: something that generates a report like the one below as output and looks at regressions in metrics, as well as the invocation history, latency, etc.
This will allow complete testing of all the commands and code paths, to make sure nothing regresses.
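As a rough illustration of the regression check in item 3, a smoke-test skill could compare the metrics of a fresh run against a stored baseline. This is only a sketch: the function name `find_regressions`, the tolerance, and the dictionary layout are assumptions for illustration, not existing RankLLM code. The baseline values are the Qwen3-4B numbers from the report below.

```python
# Hypothetical helper: flag metrics that dropped below a stored baseline.
def find_regressions(baseline, current, tolerance=1e-4):
    """Return {metric: (baseline_value, current_value)} for every regression.

    A metric counts as regressed when it is missing from `current` or
    falls below the baseline by more than `tolerance`.
    """
    regressions = {}
    for name, base_value in baseline.items():
        new_value = current.get(name)
        if new_value is None or new_value < base_value - tolerance:
            regressions[name] = (base_value, new_value)
    return regressions


# Baseline taken from the Qwen3-4B column of the report.
BASELINE_4B = {"nDCG@10": 0.7821, "MAP@100": 0.1192, "Recall@20": 0.0758}

if __name__ == "__main__":
    fresh_run = dict(BASELINE_4B, **{"nDCG@10": 0.5257})  # simulated drop
    for metric, (old, new) in find_regressions(BASELINE_4B, fresh_run).items():
        print(f"REGRESSION {metric}: {old} -> {new}")
```

The same comparison could be extended to latency or parser statistics by choosing a direction (higher or lower is better) per metric.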
RankLLM single-query smoke / verification report
Goal
Validate that the new sampling_kwargs plumbing in RankListwiseOSLLM
works end-to-end on both code paths — the in-process vLLM engine
(rerank_qwen_local.py) and the OpenAI-compatible vLLM HTTP server
(rerank_qwen_async.py) — and that the in-tree unit tests still pass.
All runs use:
- Dataset: TREC DL-19 (dl19), single query qid=264014 ("how long is life cycle of flea")
- Retrieval: BM25 prebuilt index, --k 100
- Listwise window: --window-size 20 --stride 10 (→ 9 sliding windows)
- Sampling JSON: {"temperature":0.7,"top_p":0.9,"top_k":50,"repetition_penalty":1.05,"seed":42}
- Env: rankllm-2 conda env or in-repo .venv (vLLM 0.20.0, transformers 5.7.0)
- Hardware: 1× NVIDIA RTX 6000 Ada (48 GB), CUDA_VISIBLE_DEVICES=0
PYTHONNOUSERSITE=1 is set to bypass a broken ~/.local/lib/python3.11/site-packages/transformers.
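The "9 sliding windows" figure follows from the settings above: with 100 candidates, a window of 20 and a stride of 10, the window slides from the tail of the list to the head. The sketch below reproduces that arithmetic; `sliding_windows` is a hypothetical helper written for this note, not the function RankLLM uses internally.

```python
# Hypothetical helper: enumerate listwise windows from the tail to the head.
def sliding_windows(num_docs, window_size, stride):
    """Return (start, end) index pairs, last window first."""
    windows = []
    end = num_docs
    start = max(0, end - window_size)
    while True:
        windows.append((start, end))
        if start == 0:  # the head of the list has been covered
            break
        end -= stride
        start = max(0, end - window_size)
    return windows


if __name__ == "__main__":
    windows = sliding_windows(num_docs=100, window_size=20, stride=10)
    print(len(windows), windows[0], windows[-1])
```

Equivalently, (100 - 20) / 10 + 1 = 9 windows, each of which costs one LLM call.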
Commands run
Run 1 — local in-process, Qwen3-0.6B (rankllm-2 env)
cd /u6/s8sharif/rankllm_new/rank_llm
PYTHONNOUSERSITE=1 CUDA_VISIBLE_DEVICES=0 \
python src/rank_llm/demo/rerank_qwen_local.py \
--dataset dl19 --model Qwen/Qwen3-0.6B \
--num-queries 1 --k 100 --batch-size 4 \
--context-size 4096 --window-size 20 --stride 10 --num-gpus 1 \
--output-dir demo_outputs/_single_query_test \
--sampling-json '{"temperature":0.7,"top_p":0.9,"top_k":50,"repetition_penalty":1.05,"seed":42}'
Run 2 — local in-process, Qwen3-4B (.venv)
cd /u6/s8sharif/rankllm_new/rank_llm
PYTHONNOUSERSITE=1 CUDA_VISIBLE_DEVICES=0 \
.venv/bin/python src/rank_llm/demo/rerank_qwen_local.py \
--dataset dl19 --model Qwen/Qwen3-4B \
--num-queries 1 --k 100 --batch-size 4 \
--context-size 4096 --window-size 20 --stride 10 --num-gpus 1 \
--output-dir demo_outputs/_single_query_test \
--sampling-json '{"temperature":0.7,"top_p":0.9,"top_k":50,"repetition_penalty":1.05,"seed":42}'
Run 3 — async client against a vLLM HTTP server, Qwen3-4B (.venv)
Server (background):
cd /u6/s8sharif/rankllm_new/rank_llm
PYTHONNOUSERSITE=1 CUDA_VISIBLE_DEVICES=0 \
.venv/bin/vllm serve Qwen/Qwen3-4B \
--port 8765 --dtype auto \
--gpu-memory-utilization 0.85 \
--max-model-len 8192 \
--enable-prefix-caching \
--served-model-name Qwen/Qwen3-4B \
> logs/vllm_qwen3-4b.log 2>&1 &
Client:
PYTHONNOUSERSITE=1 \
.venv/bin/python src/rank_llm/demo/rerank_qwen_async.py \
--dataset dl19 --model Qwen/Qwen3-4B \
--base-url http://127.0.0.1:8765/v1 \
--num-queries 1 --k 100 --batch-size 4 \
--context-size 4096 --window-size 20 --stride 10 \
--output-dir demo_outputs/_single_query_test_async \
--sampling-json '{"temperature":0.7,"top_p":0.9,"top_k":50,"repetition_penalty":1.05,"seed":42}'
Shutdown:
pgrep -f "vllm serve Qwen/Qwen3-4B" | xargs -r kill
Re-runs after the test/handler fixes
The two Qwen3-4B runs above were repeated unchanged after the source +
test fix; results were bit-for-bit identical to the pre-fix runs (final
permutation, all four metrics).
TREC-eval metrics (single dl19 query, qid=264014)
| Metric | BM25 (k=100) | Local in-proc, Qwen3-0.6B | Local in-proc, Qwen3-4B | Async via vLLM server, Qwen3-4B |
| --- | --- | --- | --- | --- |
| nDCG@10 | 0.5257 | 0.5257 (Δ +0.0000) | 0.7821 (Δ +0.2564) | 0.7821 (Δ +0.2564) |
| MAP@100 | 0.0776 | 0.0891 (Δ +0.0115) | 0.1192 (Δ +0.0416) | 0.1192 (Δ +0.0416) |
| Recall@20 | 0.0664 | 0.0664 (Δ +0.0000) | 0.0758 (Δ +0.0094) | 0.0758 (Δ +0.0094) |
| Recall@100 | 0.2417 | 0.2417 (Δ 0) | 0.2417 (Δ 0) | 0.2417 (Δ 0) |
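The Recall@100 row is unchanged in every column because reranking only permutes the same 100 candidates, while Recall@20 can move when a relevant document crosses the rank-20 boundary. The toy example below illustrates this; the docids and relevance judgments are made up for illustration and are not taken from dl19.

```python
# Toy illustration: recall at the candidate-set size is permutation-invariant.
def recall_at_k(ranked_docids, relevant, k):
    """Fraction of relevant documents found in the top k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for docid in ranked_docids[:k] if docid in relevant)
    return hits / len(relevant)


# Made-up data: 100 candidates, 4 relevant docs, one of them never retrieved.
candidates = [f"d{i}" for i in range(100)]
relevant = {"d3", "d85", "d99", "x1"}
reranked = list(reversed(candidates))  # any permutation would do

if __name__ == "__main__":
    for k in (20, 100):
        print(k, recall_at_k(candidates, relevant, k),
              recall_at_k(reranked, relevant, k))
```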
Top-5 reranked docids per run
| Run | rank 1 → 5 |
| --- | --- |
| Qwen3-0.6B local | 5611210, 6641238, 4834547, 96852, 96854 |
| Qwen3-4B local (in-process) | 6641238, 4834547, 6105572, 96855, 2223171 |
| Qwen3-4B async (vLLM server) | 6641238, 4834547, 6105572, 96855, 2223171 |

The two Qwen3-4B paths produce identical final rankings, confirming
the sampling_kwargs flow is equivalent across in-process vLLM and the
OpenAI-compatible HTTP path.
Response-parser stats (per run, out of 9 sliding-window LLM calls)
| Run | ok | wrong_format | repetition | missing_documents |
| --- | --- | --- | --- | --- |
| Qwen3-0.6B local | 3 | 5 | 1 | 0 |
| Qwen3-4B local (in-process) | 5 | 1 | 3 | 0 |
| Qwen3-4B async (vLLM server) | 5 | 0 | 4 | 0 |

Interpretation:
- Qwen3-0.6B with temperature=0.7 regularly produces invalid permutations (5/9 windows are wrong_format and only 3/9 are ok).
- Qwen3-4B is much cleaner (5/9 ok); the two 4B paths differ only in how the parser classifies "wrong_format" vs "repetition", due to prefix-cache/scheduler differences between the offline and HTTP engines. The final permutation is identical.
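To make the four parser categories concrete, the sketch below classifies a listwise response of the form "[2] > [1] > [3]". It is an assumption about what each label means, written for illustration; the regular expression and the order of checks are not taken from RankLLM's actual response analyzer.

```python
import re

# Assumed response shape: "[2] > [1] > [3]" (one identifier per passage).
_PERMUTATION = re.compile(r"^\[\d+\]( > \[\d+\])*$")


def classify_response(response, num_passages):
    """Label a window response as ok / wrong_format / repetition / missing_documents."""
    text = response.strip()
    if not _PERMUTATION.match(text):
        return "wrong_format"
    ids = [int(match) for match in re.findall(r"\[(\d+)\]", text)]
    if any(i < 1 or i > num_passages for i in ids):
        return "wrong_format"  # identifier outside the window
    if len(set(ids)) < len(ids):
        return "repetition"  # some passage is listed more than once
    if set(ids) != set(range(1, num_passages + 1)):
        return "missing_documents"  # some passage is never listed
    return "ok"


if __name__ == "__main__":
    for sample in ("[2] > [1] > [3]", "2, 1, 3", "[2] > [2] > [1]", "[2] > [1]"):
        print(sample, "->", classify_response(sample, num_passages=3))
```

Because the checks run in a fixed order, a response that is both malformed and repetitive gets a single label, which is one way two runs with the same final permutation can still report different category counts.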
Wall-clock timeline
| Run | Total | Model load / torch.compile | Reranking only |
| --- | --- | --- | --- |
| Qwen3-0.6B local (1st) | 71 s | ~63 s | ~3 s |
| Qwen3-4B local (1st) | 98 s | ~85 s | ~11 s |
| Qwen3-4B local (2nd, caches warm) | 45 s | ~33 s | ~11 s |
| vLLM server cold start (Qwen3-4B) | 100 s (1st), 40 s (warm) | — | — |
| Qwen3-4B async client (server already up) | 22 s | — | ~10 s |
| Qwen3-4B async client (re-run, server warm) | 21 s | — | ~10 s |

The async path wins as soon as you have more than one query: sliding
windows are sequential per query (later windows depend on the previous
window's permutation), but rerank_async overlaps work across queries
on a single live server.
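The concurrency pattern described above can be sketched with plain asyncio: windows of one query run in sequence, while different queries share a semaphore and overlap. This is a simplified model with a fake LLM call; `rerank_one_query`, `rerank_all` and `fake_llm` are hypothetical names, not the rerank_async implementation.

```python
import asyncio


async def rerank_one_query(qid, num_windows, semaphore, call_llm):
    """Process the windows of one query in order; each call waits for a slot."""
    results = []
    for window in range(num_windows):
        async with semaphore:  # bound the number of in-flight requests
            results.append(await call_llm(qid, window))
    return qid, results


async def rerank_all(qids, num_windows, max_concurrency, call_llm):
    """Run all queries concurrently against one shared semaphore."""
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [rerank_one_query(q, num_windows, semaphore, call_llm) for q in qids]
    return dict(await asyncio.gather(*tasks))


async def fake_llm(qid, window):
    """Stand-in for an HTTP call to the server; returns the window index."""
    await asyncio.sleep(0.01)
    return window


if __name__ == "__main__":
    print(asyncio.run(rerank_all(["q1", "q2"], 3, 2, fake_llm)))
```

With a single query there is nothing to overlap, which matches the near-identical "Reranking only" times of the local and async 4B runs.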
Unit tests after the source+test fix
| Suite | Result |
| --- | --- |
| python -m unittest discover -s test/rerank | 84/84 OK |
| python -m unittest discover -s test/analysis | 1/1 OK |
| python -m unittest discover -s test/evaluation | 1/1 OK |
| CLI smoke (with .venv/bin on PATH) | 84/84 OK (4 skipped) |

Two test failures were introduced by the expose sampling kwargs
commit and fixed:
- test_chat_completion_async_success: the new explicit
extra_body=None leaked into the mock call signature. Fixed by
letting extra_body flow through **kwargs instead of popping +
re-passing it.
- test_concurrent_rerank_async_shares_semaphore: the new
AutoTokenizer.from_pretrained(model, …) call before constructing
VllmHandler needed to be mocked. Fixed by patching
rank_llm.rerank.listwise.rank_listwise_os_llm.AutoTokenizer in
the test.
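The first failure mode can be reproduced in isolation with unittest.mock. In the sketch below, `call_explicit` and `call_passthrough` are hypothetical stand-ins for the handler's call before and after the fix; they are not the real RankLLM functions, and the expected call signature is an assumed example.

```python
import asyncio
from unittest.mock import AsyncMock


async def call_explicit(create, **kwargs):
    """Pre-fix shape: pop extra_body and always pass it back, even when None."""
    extra_body = kwargs.pop("extra_body", None)
    return await create(model="m", extra_body=extra_body, **kwargs)


async def call_passthrough(create, **kwargs):
    """Post-fix shape: extra_body only appears if the caller supplied it."""
    return await create(model="m", **kwargs)


def signature_matches(caller):
    """Check whether the mock saw exactly the arguments an older test expects."""
    create = AsyncMock(return_value="ok")
    asyncio.run(caller(create, temperature=0.7))
    try:
        create.assert_awaited_once_with(model="m", temperature=0.7)
    except AssertionError:
        return False
    return True


if __name__ == "__main__":
    print("explicit:", signature_matches(call_explicit))
    print("passthrough:", signature_matches(call_passthrough))
```

The explicit variant fails the assertion because the mock records extra_body=None as an extra keyword argument.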
The CLI-smoke failures seen on a first run were unrelated to the
commit: a stale ~/.local/bin/rank-llm shim was earlier on PATH than
.venv/bin/rank-llm. Verified by re-running with
PATH="$PWD/.venv/bin:$PATH" (all pass).
Outputs saved
demo_outputs/_single_query_test/qwen3-0.6b/dl19/{rerank.jsonl, rerank.txt, invocations.json}
demo_outputs/_single_query_test/qwen3-4b/dl19/{rerank.jsonl, rerank.txt, invocations.json}
demo_outputs/_single_query_test_async/qwen3-4b/dl19/{rerank.jsonl, rerank.txt, invocations.json}
Per-run logs:
/tmp/rerank_qwen_local_single.log
/tmp/rerank_qwen_local_single_4b.log
/tmp/rerank_qwen_local_single_4b_after_fix.log
/tmp/rerank_qwen_async_single_4b.log
/tmp/rerank_qwen_async_single_4b_after_fix.log
logs/vllm_qwen3-4b.log
Conclusions
- --sampling-json / --sampling-json-file (and the SAMPLING_JSON
env var) are correctly threaded through RankListwiseOSLLM(sampling_kwargs=...)
to both backends:
  - In-process vLLM: VllmHandler.generate_output_async(sampling_extra=…)
  - HTTP vLLM: AsyncOpenAI.chat.completions.create(..., extra_body=…)
with OpenAI-native keys split out by split_openai_chat_sampling.
- Both Qwen3-4B paths produced identical metrics and final
permutations under the same seed, validating equivalence of the
two pipelines.
- Reranking is meaningfully useful only at the 4B size on this query
(+0.26 nDCG@10); the 0.6B model only reshuffles existing top-k
without bringing in new relevant docs.
- Recall@100 cannot move (reranking only permutes the input top-100
set); Recall@20 ticks up when Qwen3-4B promotes one extra relevant
doc.
- All in-repo unit tests are green again after the two follow-up fixes.
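For the HTTP path, the key split mentioned in the first conclusion could look roughly like the sketch below. It is a guess at the idea, not the body of split_openai_chat_sampling: the function name `split_sampling_kwargs` and the set of keys treated as OpenAI-native are assumptions. Keys that the OpenAI chat API does not define, such as top_k and repetition_penalty, are the ones vLLM accepts through extra_body.

```python
# Assumed subset of sampling keys that the OpenAI chat completions API defines.
OPENAI_NATIVE_KEYS = frozenset(
    {"temperature", "top_p", "seed", "max_tokens", "stop",
     "presence_penalty", "frequency_penalty"}
)


def split_sampling_kwargs(sampling_kwargs):
    """Split sampling options into (native keyword args, extra_body payload)."""
    native = {k: v for k, v in sampling_kwargs.items() if k in OPENAI_NATIVE_KEYS}
    extra = {k: v for k, v in sampling_kwargs.items() if k not in OPENAI_NATIVE_KEYS}
    return native, extra


if __name__ == "__main__":
    # The sampling JSON used for every run in this report.
    sampling = {"temperature": 0.7, "top_p": 0.9, "top_k": 50,
                "repetition_penalty": 1.05, "seed": 42}
    native, extra = split_sampling_kwargs(sampling)
    print("native:", native)
    print("extra_body:", extra)
```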