Make sure all the demos are supported by CLIs (#390)

RankLLM has many demos that double as integration tests for the end-to-end (e2e) flow, but running them is manual and expensive.
1. Change them to take parameters, like `rerank_qwen_async` and `rerank_qwen_local`.
2. Make sure there are CLIs for the same tests.
3. Write skills for running smoke tests that automate the e2e testing process with a single query or a handful of queries. Each run should generate a report like the one below, covering regressions in metrics as well as the invocation history, latency, etc.

This will allow complete testing of all the commands and code paths, so that nothing regresses unnoticed.
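As a rough illustration of the third step, the sketch below compares the metrics of a run against a stored baseline and renders a small Markdown table that flags regressions. The function names, the tolerance, and the dictionary layout are assumptions made for this example and are not part of RankLLM. The numbers come from the report below, with BM25 standing in for the stored baseline.

```python
def compare_metrics(baseline, current, tolerance=1e-4):
    """Return one row per shared metric: (name, baseline, current, delta, regressed)."""
    rows = []
    for name in sorted(baseline.keys() & current.keys()):
        delta = current[name] - baseline[name]
        # A drop larger than the (hypothetical) tolerance counts as a regression.
        rows.append((name, baseline[name], current[name], delta, delta < -tolerance))
    return rows


def render_report(rows):
    """Render the comparison as a Markdown table."""
    lines = [
        "| Metric | Baseline | Current | Delta | Status |",
        "|--------|----------|---------|-------|--------|",
    ]
    for name, base, cur, delta, regressed in rows:
        status = "REGRESSION" if regressed else "ok"
        lines.append(f"| {name} | {base:.4f} | {cur:.4f} | {delta:+.4f} | {status} |")
    return "\n".join(lines)


# Values from the report below: BM25 vs. Qwen3-4B reranking.
bm25 = {"nDCG@10": 0.5257, "MAP@100": 0.0776, "Recall@20": 0.0664, "Recall@100": 0.2417}
qwen3_4b = {"nDCG@10": 0.7821, "MAP@100": 0.1192, "Recall@20": 0.0758, "Recall@100": 0.2417}

rows = compare_metrics(bm25, qwen3_4b)
print(render_report(rows))
```

A smoke-test skill could fail the run whenever any row is flagged, and attach the rendered table to its report.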

# RankLLM single-query smoke / verification report
## Goal
Validate that the new `sampling_kwargs` plumbing in `RankListwiseOSLLM`
works end-to-end on both code paths, the in-process vLLM engine
(`rerank_qwen_local.py`) and the OpenAI-compatible vLLM HTTP server
(`rerank_qwen_async.py`), and that the in-tree unit tests still pass.

All runs use:
- **Dataset:** TREC DL-19 (`dl19`), single query `qid=264014` (`"how long is life cycle of flea"`)
- **Retrieval:** BM25 prebuilt index, `--k 100`
- **Listwise window:** `--window-size 20 --stride 10`  (→ 9 sliding windows)
- **Sampling JSON:**
  ```json
  {"temperature":0.7,"top_p":0.9,"top_k":50,"repetition_penalty":1.05,"seed":42}
  ```
- **Env:** `rankllm-2` conda env or in-repo `.venv` (vLLM 0.20.0, transformers 5.7.0)
- **Hardware:** 1× NVIDIA RTX 6000 Ada (48 GB), `CUDA_VISIBLE_DEVICES=0`

`PYTHONNOUSERSITE=1` is set to bypass a broken `~/.local/lib/python3.11/site-packages/transformers`.

---
## Commands run
### Run 1 — local in-process, Qwen3-0.6B (rankllm-2 env)
```bash
cd /u6/s8sharif/rankllm_new/rank_llm
PYTHONNOUSERSITE=1 CUDA_VISIBLE_DEVICES=0 \
python src/rank_llm/demo/rerank_qwen_local.py \
  --dataset dl19 --model Qwen/Qwen3-0.6B \
  --num-queries 1 --k 100 --batch-size 4 \
  --context-size 4096 --window-size 20 --stride 10 --num-gpus 1 \
  --output-dir demo_outputs/_single_query_test \
  --sampling-json '{"temperature":0.7,"top_p":0.9,"top_k":50,"repetition_penalty":1.05,"seed":42}'
```
### Run 2 — local in-process, Qwen3-4B (.venv)
```bash
cd /u6/s8sharif/rankllm_new/rank_llm
PYTHONNOUSERSITE=1 CUDA_VISIBLE_DEVICES=0 \
.venv/bin/python src/rank_llm/demo/rerank_qwen_local.py \
  --dataset dl19 --model Qwen/Qwen3-4B \
  --num-queries 1 --k 100 --batch-size 4 \
  --context-size 4096 --window-size 20 --stride 10 --num-gpus 1 \
  --output-dir demo_outputs/_single_query_test \
  --sampling-json '{"temperature":0.7,"top_p":0.9,"top_k":50,"repetition_penalty":1.05,"seed":42}'
```
### Run 3 — async client against a vLLM HTTP server, Qwen3-4B (.venv)
Server (background):
```bash
cd /u6/s8sharif/rankllm_new/rank_llm
PYTHONNOUSERSITE=1 CUDA_VISIBLE_DEVICES=0 \
.venv/bin/vllm serve Qwen/Qwen3-4B \
  --port 8765 --dtype auto \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --enable-prefix-caching \
  --served-model-name Qwen/Qwen3-4B \
  > logs/vllm_qwen3-4b.log 2>&1 &
```
Client:
```bash
PYTHONNOUSERSITE=1 \
.venv/bin/python src/rank_llm/demo/rerank_qwen_async.py \
  --dataset dl19 --model Qwen/Qwen3-4B \
  --base-url http://127.0.0.1:8765/v1 \
  --num-queries 1 --k 100 --batch-size 4 \
  --context-size 4096 --window-size 20 --stride 10 \
  --output-dir demo_outputs/_single_query_test_async \
  --sampling-json '{"temperature":0.7,"top_p":0.9,"top_k":50,"repetition_penalty":1.05,"seed":42}'
```
Shutdown:
```bash
pgrep -f "vllm serve Qwen/Qwen3-4B" | xargs -r kill
```
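Run 3 exercises the path where several queries can be in flight against one live server. The sketch below illustrates that pattern only: `fake_window_call` is a stand-in coroutine that sleeps instead of sending an HTTP request, and the names, the concurrency limit, and the delay are assumptions for the example, not RankLLM code. Windows within a query run one after another, while queries share a semaphore and overlap.

```python
import asyncio


async def fake_window_call(semaphore, stats, delay):
    """Stand-in for one LLM request; the shared semaphore caps in-flight calls."""
    async with semaphore:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(delay)  # pretend to wait for the server
        stats["in_flight"] -= 1


async def rerank_one_query(qid, num_windows, semaphore, stats, delay):
    """Windows are sequential: each one starts only after the previous one finished."""
    for _ in range(num_windows):
        await fake_window_call(semaphore, stats, delay)
    return qid


async def rerank_many(qids, num_windows=9, max_in_flight=4, delay=0.01):
    """Queries overlap: all of them are scheduled at once on the same event loop."""
    semaphore = asyncio.Semaphore(max_in_flight)
    stats = {"in_flight": 0, "peak": 0}
    tasks = [rerank_one_query(q, num_windows, semaphore, stats, delay) for q in qids]
    finished = await asyncio.gather(*tasks)
    return finished, stats["peak"]


finished, peak = asyncio.run(rerank_many(["q1", "q2", "q3", "q4"]))
print(finished, peak)  # ['q1', 'q2', 'q3', 'q4'] 4
```

With four queries and a limit of four, the peak number of in-flight calls is four, even though each query still issues its nine windows one at a time.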
### Re-runs after the test/handler fixes
The two Qwen3-4B runs above were repeated unchanged after the source and
test fixes; results were bit-for-bit identical to the pre-fix runs (same
final permutation, same values for all four metrics).

---
## TREC-eval metrics (single dl19 query, qid=264014)
| Metric     | BM25 (k=100) | Local in-proc, Qwen3-0.6B | Local in-proc, Qwen3-4B | Async via vLLM server, Qwen3-4B |
|------------|---------------|---------------------------|--------------------------|----------------------------------|
| nDCG@10    | 0.5257        | 0.5257 (Δ +0.0000)        | **0.7821** (Δ +0.2564)   | **0.7821** (Δ +0.2564)           |
| MAP@100    | 0.0776        | 0.0891 (Δ +0.0115)        | **0.1192** (Δ +0.0416)   | **0.1192** (Δ +0.0416)           |
| Recall@20  | 0.0664        | 0.0664 (Δ +0.0000)        | **0.0758** (Δ +0.0094)   | **0.0758** (Δ +0.0094)           |
| Recall@100 | 0.2417        | 0.2417 (Δ 0)              | 0.2417 (Δ 0)             | 0.2417 (Δ 0)                     |
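The Recall@100 row is flat in every column because reranking only reorders the 100 retrieved candidates and never adds or removes one. A small sketch makes this concrete. The docids and relevance judgments below are invented for illustration and are not the dl19 qrels.

```python
import random


def recall_at_k(ranking, relevant, k):
    """Fraction of all relevant docs that appear in the top-k of the ranking."""
    hits = sum(1 for docid in ranking[:k] if docid in relevant)
    return hits / len(relevant)


# Toy data: 100 retrieved candidates, 12 relevant docs, 5 of which were retrieved.
candidates = [f"d{i}" for i in range(100)]
relevant = {"d3", "d40", "d41", "d77", "d99"} | {f"x{i}" for i in range(7)}

reranked = candidates[:]
random.Random(42).shuffle(reranked)  # any permutation of the same 100 docs

# Recall@100 is identical for both orderings; Recall@20 depends on the order.
print(recall_at_k(candidates, relevant, 100), recall_at_k(reranked, relevant, 100))
print(recall_at_k(candidates, relevant, 20), recall_at_k(reranked, relevant, 20))
```

Metrics with a cutoff below the candidate count, such as Recall@20 or nDCG@10, are the ones a reranker can actually move.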
### Top-5 reranked docids per run
| Run                          | rank 1 → 5                                |
|------------------------------|-------------------------------------------|
| Qwen3-0.6B local             | 5611210, 6641238, 4834547, 96852, 96854   |
| Qwen3-4B local (in-process)  | 6641238, 4834547, 6105572, 96855, 2223171 |
| Qwen3-4B async (vLLM server) | 6641238, 4834547, 6105572, 96855, 2223171 |

The two Qwen3-4B paths produce **identical final rankings**, confirming
the `sampling_kwargs` flow is equivalent across in-process vLLM and the
OpenAI-compatible HTTP path.

---
## Response-parser stats (per run, out of 9 sliding-window LLM calls)
| Run                          | ok | wrong_format | repetition | missing_documents |
|------------------------------|----|--------------|------------|-------------------|
| Qwen3-0.6B local             | 3  | 5            | 1          | 0                 |
| Qwen3-4B local (in-process)  | 5  | 1            | 3          | 0                 |
| Qwen3-4B async (vLLM server) | 5  | 0            | 4          | 0                 |

Interpretation:
- Qwen3-0.6B with `temperature=0.7` regularly produces invalid permutations: only 3/9 windows parse as `ok`, and 5/9 are classified as `wrong_format`.
- Qwen3-4B is much cleaner (5/9 `ok`). The two 4B paths differ only in how the parser classifies one window (`wrong_format` vs. `repetition`), due to prefix-cache/scheduler differences between the offline and HTTP engines. The final permutation is identical.
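The denominator of 9 in the table above follows from the run settings (`--k 100`, `--window-size 20`, `--stride 10`). Here is a minimal sketch of the window arithmetic, assuming windows are laid out from the bottom of the candidate list upward. `sliding_windows` is a hypothetical helper written for this note, not a RankLLM function.

```python
def sliding_windows(num_candidates, window_size, stride):
    """Return (start, end) index pairs, ordered from the bottom of the list up."""
    if stride <= 0 or window_size <= 0:
        raise ValueError("window_size and stride must be positive")
    windows = []
    end = num_candidates
    while True:
        start = max(0, end - window_size)
        windows.append((start, end))
        if start == 0:  # the window that reaches the top of the list is the last one
            break
        end -= stride
    return windows


windows = sliding_windows(num_candidates=100, window_size=20, stride=10)
print(len(windows))             # 9
print(windows[0], windows[-1])  # (80, 100) (0, 20)
```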
---
## Wall-clock timeline
| Run                                          | Total | Model load / torch.compile | Reranking only |
|----------------------------------------------|-------|----------------------------|----------------|
| Qwen3-0.6B local (1st)                       | 71 s  | ~63 s                      | ~3 s           |
| Qwen3-4B local (1st)                         | 98 s  | ~85 s                      | ~11 s          |
| Qwen3-4B local (2nd, caches warm)            | 45 s  | ~33 s                      | ~11 s          |
| vLLM server cold start (Qwen3-4B)            | 100 s (1st), 40 s (warm) | — | — |
| Qwen3-4B async client (server already up)    | 22 s  | —                          | ~10 s          |
| Qwen3-4B async client (re-run, server warm)  | 21 s  | —                          | ~10 s          |

The async path wins as soon as there is more than one query: sliding
windows are sequential per query (each window depends on the previous
window's permutation), but `rerank_async` overlaps work across queries
on a single live server.

---
## Unit tests after the source+test fix
| Suite                                              | Result       |
|----------------------------------------------------|--------------|
| `python -m unittest discover -s test/rerank`       | 84/84 OK     |
| `python -m unittest discover -s test/analysis`     | 1/1 OK       |
| `python -m unittest discover -s test/evaluation`   | 1/1 OK       |
| CLI smoke (with `.venv/bin` on PATH)               | 84/84 OK (4 skipped) |

Two test failures were introduced by the `expose sampling kwargs`
commit and fixed:
1. `test_chat_completion_async_success` — the new explicit
   `extra_body=None` leaked into the mock call signature. Fixed by
   letting `extra_body` flow through `**kwargs` instead of popping +
   re-passing it.
2. `test_concurrent_rerank_async_shares_semaphore` — the new
   `AutoTokenizer.from_pretrained(model, …)` call before constructing
   `VllmHandler` needed to be mocked. Fixed by patching
   `rank_llm.rerank.listwise.rank_listwise_os_llm.AutoTokenizer` in
   the test.
The CLI-smoke failures seen on a first run were unrelated to the
commit: a stale `~/.local/bin/rank-llm` shim was earlier on `PATH` than
`.venv/bin/rank-llm`. Verified by re-running with
`PATH="$PWD/.venv/bin:$PATH"` (all pass).
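The shadowing problem described above can be checked before the CLI smoke tests run. The following sketch uses `shutil.which` to report whether a command resolves to an executable inside an expected directory under a given `PATH`. The helper name `resolves_inside` and the `.venv/bin` location relative to the current directory are assumptions for illustration.

```python
import os
import shutil
from pathlib import Path


def resolves_inside(command, expected_dir, path=None):
    """Return True if `command` resolves to an executable inside `expected_dir`."""
    found = shutil.which(command, path=path)
    if found is None:
        return False
    return Path(found).parent.resolve() == Path(expected_dir).resolve()


# Mirror the fix from this report: put the venv's bin directory first on PATH.
venv_bin = Path.cwd() / ".venv" / "bin"
search_path = os.pathsep.join([str(venv_bin), os.environ.get("PATH", "")])
print(resolves_inside("rank-llm", venv_bin, path=search_path))
```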
---
## Outputs saved
```
demo_outputs/_single_query_test/qwen3-0.6b/dl19/{rerank.jsonl, rerank.txt, invocations.json}
demo_outputs/_single_query_test/qwen3-4b/dl19/{rerank.jsonl, rerank.txt, invocations.json}
demo_outputs/_single_query_test_async/qwen3-4b/dl19/{rerank.jsonl, rerank.txt, invocations.json}
```
Per-run logs:
```
/tmp/rerank_qwen_local_single.log
/tmp/rerank_qwen_local_single_4b.log
/tmp/rerank_qwen_local_single_4b_after_fix.log
/tmp/rerank_qwen_async_single_4b.log
/tmp/rerank_qwen_async_single_4b_after_fix.log
logs/vllm_qwen3-4b.log
```
---
## Conclusions
- `--sampling-json` / `--sampling-json-file` (and the `SAMPLING_JSON`
  env var) are correctly threaded through `RankListwiseOSLLM(sampling_kwargs=...)`
  to both backends:
  - In-process vLLM: `VllmHandler.generate_output_async(sampling_extra=…)`
  - HTTP vLLM: `AsyncOpenAI.chat.completions.create(..., extra_body=…)`
    with OpenAI-native keys split out by `split_openai_chat_sampling`.
- Both Qwen3-4B paths produced **identical metrics and final
  permutations** under the same seed, validating equivalence of the
  two pipelines.
- Reranking is meaningfully useful only at the 4B size on this query
  (+0.26 nDCG@10); the 0.6B model only reshuffles existing top-k
  without bringing in new relevant docs.
- Recall@100 cannot move (reranking only permutes the input top-100
  set); Recall@20 ticks up when Qwen3-4B promotes one extra relevant
  doc.
- All in-repo unit tests are green again after the two follow-up fixes.
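To illustrate the split mentioned in the first bullet: the OpenAI chat-completions API accepts `temperature`, `top_p`, and `seed` as named parameters, while `top_k` and `repetition_penalty` are vLLM-specific and have to travel in `extra_body`. The sketch below is a hypothetical stand-in for `split_openai_chat_sampling`, whose actual implementation is not shown in this report. The function name and the set of native keys are assumptions.

```python
import json

# Assumed subset of sampling keys that the OpenAI chat-completions API takes natively.
OPENAI_NATIVE_KEYS = {
    "temperature", "top_p", "seed", "max_tokens",
    "presence_penalty", "frequency_penalty", "stop",
}


def split_sampling(sampling_kwargs):
    """Split sampling kwargs into (native kwargs, extra_body for server-specific keys)."""
    native = {k: v for k, v in sampling_kwargs.items() if k in OPENAI_NATIVE_KEYS}
    extra = {k: v for k, v in sampling_kwargs.items() if k not in OPENAI_NATIVE_KEYS}
    return native, extra


# The sampling JSON used for every run in this report.
sampling = json.loads(
    '{"temperature":0.7,"top_p":0.9,"top_k":50,"repetition_penalty":1.05,"seed":42}'
)
native, extra_body = split_sampling(sampling)
print(native)      # {'temperature': 0.7, 'top_p': 0.9, 'seed': 42}
print(extra_body)  # {'top_k': 50, 'repetition_penalty': 1.05}
```

Under these assumptions, a client would pass the first dictionary as keyword arguments and the second as `extra_body` when calling `chat.completions.create`.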

