make sure all the demos are supported by clis #390

@sahel-sh

RankLLM has many demos that act as integration tests for the e2e flow, but running them is manual and expensive.
1- Change them to take parameters, like rerank_qwen_async and rerank_qwen_local.
2- Make sure there are CLIs for the same tests.
3- Write skills for running smoke tests that automate the e2e testing process with a single query or a handful of queries: something that generates a report like the one below as output and that looks at regressions in metrics, as well as the invocation history, latency, etc.

This will allow complete testing of all the commands and code paths to make sure nothing regresses.

RankLLM single-query smoke / verification report

Goal

Validate that the new sampling_kwargs plumbing in RankListwiseOSLLM
works end-to-end on both code paths — the in-process vLLM engine
(rerank_qwen_local.py) and the OpenAI-compatible vLLM HTTP server
(rerank_qwen_async.py) — and that the in-tree unit tests still pass.
All runs use:

  • Dataset: TREC DL-19 (dl19), single query qid=264014 ("how long is life cycle of flea")
  • Retrieval: BM25 prebuilt index, --k 100
  • Listwise window: --window-size 20 --stride 10 (→ 9 sliding windows)
  • Sampling JSON:
    {"temperature":0.7,"top_p":0.9,"top_k":50,"repetition_penalty":1.05,"seed":42}
  • Env: rankllm-2 conda env or in-repo .venv (vLLM 0.20.0, transformers 5.7.0)
  • Hardware: 1× NVIDIA RTX 6000 Ada (48 GB), CUDA_VISIBLE_DEVICES=0
  • PYTHONNOUSERSITE=1 is set to bypass a broken ~/.local/lib/python3.11/site-packages/transformers.
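
The "9 sliding windows" figure in the list above follows from k=100, window size 20 and stride 10. A minimal sketch of that arithmetic, assuming windows slide from the bottom of the ranked list toward the top and the last window is clamped at position 0 (the function name is hypothetical and is not RankLLM's API):

```python
def sliding_windows(k: int, window_size: int, stride: int) -> list[tuple[int, int]]:
    """Return (start, end) spans, bottom of the list first, end exclusive."""
    if k <= 0 or window_size <= 0 or stride <= 0:
        raise ValueError("k, window_size and stride must be positive")
    windows = []
    end = k
    while True:
        # Clamp the window so it never starts before the top of the list.
        start = max(end - window_size, 0)
        windows.append((start, end))
        if start == 0:
            break
        # Move the window up by one stride.
        end -= stride
    return windows


spans = sliding_windows(k=100, window_size=20, stride=10)
print(len(spans), spans[0], spans[-1])
```

With the settings used in this report the spans run from (80, 100) up to (0, 20), which gives one LLM call per span.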

Commands run

Run 1 — local in-process, Qwen3-0.6B (rankllm-2 env)

cd /u6/s8sharif/rankllm_new/rank_llm
PYTHONNOUSERSITE=1 CUDA_VISIBLE_DEVICES=0 \
python src/rank_llm/demo/rerank_qwen_local.py \
  --dataset dl19 --model Qwen/Qwen3-0.6B \
  --num-queries 1 --k 100 --batch-size 4 \
  --context-size 4096 --window-size 20 --stride 10 --num-gpus 1 \
  --output-dir demo_outputs/_single_query_test \
  --sampling-json '{"temperature":0.7,"top_p":0.9,"top_k":50,"repetition_penalty":1.05,"seed":42}'

Run 2 — local in-process, Qwen3-4B (.venv)

cd /u6/s8sharif/rankllm_new/rank_llm
PYTHONNOUSERSITE=1 CUDA_VISIBLE_DEVICES=0 \
.venv/bin/python src/rank_llm/demo/rerank_qwen_local.py \
  --dataset dl19 --model Qwen/Qwen3-4B \
  --num-queries 1 --k 100 --batch-size 4 \
  --context-size 4096 --window-size 20 --stride 10 --num-gpus 1 \
  --output-dir demo_outputs/_single_query_test \
  --sampling-json '{"temperature":0.7,"top_p":0.9,"top_k":50,"repetition_penalty":1.05,"seed":42}'

Run 3 — async client against a vLLM HTTP server, Qwen3-4B (.venv)

Server (background):

cd /u6/s8sharif/rankllm_new/rank_llm
PYTHONNOUSERSITE=1 CUDA_VISIBLE_DEVICES=0 \
.venv/bin/vllm serve Qwen/Qwen3-4B \
  --port 8765 --dtype auto \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --enable-prefix-caching \
  --served-model-name Qwen/Qwen3-4B \
  > logs/vllm_qwen3-4b.log 2>&1 &

Client:

PYTHONNOUSERSITE=1 \
.venv/bin/python src/rank_llm/demo/rerank_qwen_async.py \
  --dataset dl19 --model Qwen/Qwen3-4B \
  --base-url http://127.0.0.1:8765/v1 \
  --num-queries 1 --k 100 --batch-size 4 \
  --context-size 4096 --window-size 20 --stride 10 \
  --output-dir demo_outputs/_single_query_test_async \
  --sampling-json '{"temperature":0.7,"top_p":0.9,"top_k":50,"repetition_penalty":1.05,"seed":42}'

Shutdown:

pgrep -f "vllm serve Qwen/Qwen3-4B" | xargs -r kill

Re-runs after the test/handler fixes

The two Qwen3-4B runs above were repeated unchanged after the source +
test fix; results were bit-for-bit identical to the pre-fix runs (final
permutation, all four metrics).

TREC-eval metrics (single dl19 query, qid=264014)

Metric      BM25 (k=100)  Local in-proc, Qwen3-0.6B  Local in-proc, Qwen3-4B  Async via vLLM server, Qwen3-4B
nDCG@10     0.5257        0.5257 (Δ +0.0000)         0.7821 (Δ +0.2564)       0.7821 (Δ +0.2564)
MAP@100     0.0776        0.0891 (Δ +0.0115)         0.1192 (Δ +0.0416)       0.1192 (Δ +0.0416)
Recall@20   0.0664        0.0664 (Δ +0.0000)         0.0758 (Δ +0.0094)       0.0758 (Δ +0.0094)
Recall@100  0.2417        0.2417 (Δ 0)               0.2417 (Δ 0)             0.2417 (Δ 0)
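
The Δ values in this table are plain differences against the BM25 baseline, which is also the kind of check an automated smoke test could use to flag regressions. A rough sketch, using the numbers reported above; the helper names and the tolerance are assumptions, not part of RankLLM:

```python
BM25 = {"nDCG@10": 0.5257, "MAP@100": 0.0776, "Recall@20": 0.0664, "Recall@100": 0.2417}
QWEN3_4B = {"nDCG@10": 0.7821, "MAP@100": 0.1192, "Recall@20": 0.0758, "Recall@100": 0.2417}


def metric_deltas(baseline: dict[str, float], run: dict[str, float]) -> dict[str, float]:
    """Per-metric difference (run minus baseline), rounded to 4 decimals."""
    return {name: round(run[name] - baseline[name], 4) for name in baseline}


def regressions(baseline: dict[str, float], run: dict[str, float],
                tolerance: float = 1e-4) -> list[str]:
    """Names of metrics that dropped by more than the (assumed) tolerance."""
    deltas = metric_deltas(baseline, run)
    return [name for name, delta in deltas.items() if delta < -tolerance]


print(metric_deltas(BM25, QWEN3_4B))
print(regressions(BM25, QWEN3_4B))
```

In a real smoke test the baseline would more likely be the metrics stored from a previous known-good run than the BM25 column.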

Top-5 reranked docids per run

Run                            rank 1 → 5
Qwen3-0.6B local               5611210, 6641238, 4834547, 96852, 96854
Qwen3-4B local (in-process)    6641238, 4834547, 6105572, 96855, 2223171
Qwen3-4B async (vLLM server)   6641238, 4834547, 6105572, 96855, 2223171

The two Qwen3-4B paths produce identical final rankings, confirming
the sampling_kwargs flow is equivalent across in-process vLLM and the
OpenAI-compatible HTTP path.

Response-parser stats (per run, out of 9 sliding-window LLM calls)

Run                            ok  wrong_format  repetition  missing_documents
Qwen3-0.6B local               3   5             1           0
Qwen3-4B local (in-process)    5   1             3           0
Qwen3-4B async (vLLM server)   5   0             4           0

Interpretation:
  • Qwen3-0.6B with temperature=0.7 regularly produces invalid permutations (5/9 windows are wrong_format and 1/9 has repetition; only 3/9 parse cleanly).
  • Qwen3-4B is much cleaner (5/9 ok); the two 4B paths differ only in how the parser classifies “wrong_format” vs “repetition”, due to prefix-cache/scheduler differences between the offline and HTTP engines. The final permutation is identical.
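
To make the four parser categories concrete, here is a hypothetical classifier for a listwise response of the form "[2] > [1] > [3]". The rules below are assumptions chosen for illustration; RankLLM's actual response parser may classify (and repair) outputs differently.

```python
import re

# Assumed output format: bracketed 1-based identifiers separated by ">".
PERMUTATION_RE = re.compile(r"\[\d+\](\s*>\s*\[\d+\])*")


def classify_response(response: str, num_docs: int) -> str:
    """Classify one window's LLM output into a parser-stat bucket."""
    text = response.strip()
    if not PERMUTATION_RE.fullmatch(text):
        return "wrong_format"
    ids = [int(m) for m in re.findall(r"\[(\d+)\]", text)]
    # Identifiers outside 1..num_docs are treated as a format error here.
    if any(i < 1 or i > num_docs for i in ids):
        return "wrong_format"
    if len(set(ids)) < len(ids):
        return "repetition"
    if len(ids) < num_docs:
        return "missing_documents"
    return "ok"


for sample in ["[2] > [1] > [3]", "[2] > [2] > [1]", "[2] > [1]", "The ranking is 2, 1, 3"]:
    print(repr(sample), "->", classify_response(sample, num_docs=3))
```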

Wall-clock timeline

Run                                          Total                     Model load / torch.compile  Reranking only
Qwen3-0.6B local (1st)                       71 s                      ~63 s                       ~3 s
Qwen3-4B local (1st)                         98 s                      ~85 s                       ~11 s
Qwen3-4B local (2nd, caches warm)            45 s                      ~33 s                       ~11 s
vLLM server cold start (Qwen3-4B)            100 s (1st), 40 s (warm)  -                           -
Qwen3-4B async client (server already up)    22 s                      -                           ~10 s
Qwen3-4B async client (re-run, server warm)  21 s                      -                           ~10 s

The async path wins as soon as you have more than one query: sliding
windows are sequential per query (later windows depend on the previous
window's permutation), but rerank_async overlaps work across queries
on a single live server.
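
The scheduling pattern described here (sequential windows within a query, overlap across queries, bounded concurrency) can be sketched with asyncio. This is a toy model with a fake LLM call; the function names are hypothetical and it is not the rerank_async implementation.

```python
import asyncio


async def fake_llm_call(query_id: str, window: int) -> str:
    """Stand-in for one listwise LLM request to the server."""
    await asyncio.sleep(0.01)
    return f"{query_id}:w{window}"


async def rerank_one_query(query_id: str, num_windows: int,
                           semaphore: asyncio.Semaphore, log: list[str]) -> None:
    # Windows of one query run strictly in order: each window would
    # depend on the permutation produced by the previous one.
    for window in range(num_windows):
        async with semaphore:
            log.append(await fake_llm_call(query_id, window))


async def rerank_all(query_ids: list[str], num_windows: int = 9,
                     max_concurrency: int = 4) -> list[str]:
    # One shared semaphore caps the number of in-flight requests.
    semaphore = asyncio.Semaphore(max_concurrency)
    log: list[str] = []
    await asyncio.gather(
        *(rerank_one_query(q, num_windows, semaphore, log) for q in query_ids)
    )
    return log


log = asyncio.run(rerank_all(["q1", "q2", "q3"], num_windows=3))
print(log)
```

The completion log interleaves the three queries, while the windows of any single query still appear in order.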

Unit tests after the source+test fix

Suite                                            Result
python -m unittest discover -s test/rerank       84/84 OK
python -m unittest discover -s test/analysis     1/1 OK
python -m unittest discover -s test/evaluation   1/1 OK
CLI smoke (with .venv/bin on PATH)               84/84 OK (4 skipped)

Two test failures were introduced by the "expose sampling kwargs"
commit and have been fixed:
  1. test_chat_completion_async_success — the new explicit
    extra_body=None leaked into the mock call signature. Fixed by
    letting extra_body flow through **kwargs instead of popping +
    re-passing it.
  2. test_concurrent_rerank_async_shares_semaphore — the new
    AutoTokenizer.from_pretrained(model, …) call before constructing
    VllmHandler needed to be mocked. Fixed by patching
    rank_llm.rerank.listwise.rank_listwise_os_llm.AutoTokenizer in
    the test.

The CLI-smoke failures seen on a first run were unrelated to the
commit: a stale ~/.local/bin/rank-llm shim was earlier on PATH than
.venv/bin/rank-llm. Verified by re-running with
PATH="$PWD/.venv/bin:$PATH" (all pass).
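
The first failure is easy to reproduce in isolation with unittest.mock. In the sketch below, both wrapper functions are hypothetical stand-ins (not the RankLLM source); they only illustrate why an explicit extra_body=None shows up in the recorded call on a mock, while plain **kwargs passthrough does not.

```python
import asyncio
from unittest.mock import AsyncMock


async def create_explicit(client, model, messages, **kwargs):
    """Pops extra_body and always re-passes it, even when it is None."""
    extra_body = kwargs.pop("extra_body", None)
    return await client.create(
        model=model, messages=messages, extra_body=extra_body, **kwargs
    )


async def create_passthrough(client, model, messages, **kwargs):
    """Forwards extra_body only when the caller actually supplied it."""
    return await client.create(model=model, messages=messages, **kwargs)


def recorded_kwargs(wrapper) -> dict:
    """Run a wrapper against a mock client and return the kwargs it recorded."""
    client = AsyncMock()
    asyncio.run(wrapper(client, "some-model", [], temperature=0.0))
    return client.create.call_args.kwargs


print(sorted(recorded_kwargs(create_explicit)))
print(sorted(recorded_kwargs(create_passthrough)))
```

A test that asserts the exact call signature, for example with assert_called_once_with, passes against the passthrough variant and fails against the explicit one because of the extra extra_body=None keyword.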

Outputs saved

demo_outputs/_single_query_test/qwen3-0.6b/dl19/{rerank.jsonl, rerank.txt, invocations.json}
demo_outputs/_single_query_test/qwen3-4b/dl19/{rerank.jsonl, rerank.txt, invocations.json}
demo_outputs/_single_query_test_async/qwen3-4b/dl19/{rerank.jsonl, rerank.txt, invocations.json}

Per-run logs:

/tmp/rerank_qwen_local_single.log
/tmp/rerank_qwen_local_single_4b.log
/tmp/rerank_qwen_local_single_4b_after_fix.log
/tmp/rerank_qwen_async_single_4b.log
/tmp/rerank_qwen_async_single_4b_after_fix.log
logs/vllm_qwen3-4b.log

Conclusions

  • --sampling-json / --sampling-json-file (and the SAMPLING_JSON
    env var) are correctly threaded through RankListwiseOSLLM(sampling_kwargs=...)
    to both backends:
    • In-process vLLM: VllmHandler.generate_output_async(sampling_extra=…)
    • HTTP vLLM: AsyncOpenAI.chat.completions.create(..., extra_body=…)
      with OpenAI-native keys split out by split_openai_chat_sampling.
  • Both Qwen3-4B paths produced identical metrics and final
    permutations under the same seed, validating equivalence of the
    two pipelines.
  • Reranking is meaningfully useful only at the 4B size on this query
    (+0.26 nDCG@10); the 0.6B model only reshuffles existing top-k
    without bringing in new relevant docs.
  • Recall@100 cannot move (reranking only permutes the input top-100
    set); Recall@20 ticks up when Qwen3-4B promotes one extra relevant
    doc.
  • All in-repo unit tests are green again after the two follow-up fixes.
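
For the HTTP path, the split between OpenAI-native keys and extra_body can be illustrated on the sampling JSON used in this report. This is only a sketch of the idea: the function name and the key set below are assumptions, and the real split_openai_chat_sampling may use a different list of keys or different handling.

```python
import json

# Assumption: a subset of sampling parameters the OpenAI Chat Completions API
# accepts as top-level arguments. Everything else goes into extra_body.
OPENAI_NATIVE_KEYS = {
    "temperature", "top_p", "seed", "max_tokens",
    "presence_penalty", "frequency_penalty", "stop",
}


def split_sampling(sampling_kwargs: dict) -> tuple[dict, dict]:
    """Split sampling kwargs into (native OpenAI kwargs, extra_body payload)."""
    native = {k: v for k, v in sampling_kwargs.items() if k in OPENAI_NATIVE_KEYS}
    extra_body = {k: v for k, v in sampling_kwargs.items() if k not in OPENAI_NATIVE_KEYS}
    return native, extra_body


sampling = json.loads(
    '{"temperature":0.7,"top_p":0.9,"top_k":50,"repetition_penalty":1.05,"seed":42}'
)
native, extra_body = split_sampling(sampling)
print(native)
print(extra_body)
```

Under these assumptions, temperature, top_p and seed travel as regular arguments, while top_k and repetition_penalty are sent to the vLLM server through extra_body.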
