
Anserini Fatjar Regressions (v1.7.1)

Fetch the fatjar:

wget https://repo1.maven.org/maven2/io/anserini/anserini/1.7.1/anserini-1.7.1-fatjar.jar

Start by setting the OUTPUT_DIR and JAVA_OPTS:

export OUTPUT_DIR="."

# for zsh
export JAVA_OPTS=(-cp `ls anserini-*-fatjar.jar` --add-modules jdk.incubator.vector)

# for bash
export JAVA_OPTS="-cp `ls anserini-*-fatjar.jar` --add-modules jdk.incubator.vector"
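Note that in the bash form, the backticks expand at assignment time, so the fatjar must already be in the current directory. As a quick sanity check, you can verify that the -cp entry inside JAVA_OPTS points at a real file. The following is a minimal sketch using a placeholder jar name (not the real fatjar):

```shell
# Minimal sketch: verify the -cp entry in JAVA_OPTS exists on disk.
# The jar name here is a placeholder; in practice JAVA_OPTS is set as above.
touch anserini-demo-fatjar.jar
JAVA_OPTS="-cp anserini-demo-fatjar.jar --add-modules jdk.incubator.vector"
jar="${JAVA_OPTS#*-cp }"   # drop everything up to and including "-cp "
jar="${jar%% *}"           # keep the first remaining token: the jar path
if [ -f "$jar" ]; then
  echo "classpath OK: $jar"
else
  echo "missing jar: $jar"
fi
rm -f anserini-demo-fatjar.jar
```

With the real JAVA_OPTS set as above, the same two parameter expansions recover the jar path.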

❗ Anserini ships with a number of prebuilt indexes, which it'll automagically download for you. This is a great feature, but the indexes can take up a lot of space. See this guide on prebuilt indexes for more details.

Contents

Extracting queries and documents: converting TREC runs into jsonl structures that include both queries and candidate documents, which can, for example, feed downstream rerankers.

TREC RAG

The MS MARCO V2.1 collections were created for the TREC RAG Track. They served as the official corpora for the 2024 evaluation and will remain so for 2025. There are two separate MS MARCO V2.1 "variants", documents and segmented documents:

  • The segmented documents corpus (segments = passages) is the one actually used for the TREC RAG evaluations. It contains 113,520,750 passages.
  • The documents corpus is the source of the segments and useful as a point of reference (but not actually used in the TREC evaluations). It contains 10,960,555 documents.

Here, we focus on the segmented documents corpus.

With Anserini, you can reproduce baseline runs on the TREC 2024 RAG test queries using BM25 and ArcticEmbed-L embeddings. Using the UMBRELA qrels, these are the evaluation numbers you'd get:

Dataset / Metric BM25 ArcticEmbed-L
RAG24 Test (UMBRELA): nDCG@20 0.3198 0.5497
RAG24 Test (UMBRELA): nDCG@100 0.2563 0.4855
RAG24 Test (UMBRELA): Recall@100 0.1395 0.2547

See instructions below on how to reproduce these runs; more details can be found in the following two papers:

Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Daniel Campos, Nick Craswell, Ian Soboroff, and Jimmy Lin. A Large-Scale Study of Relevance Assessments with Large Language Models Using UMBRELA. Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR 2025), pages 358-368, July 2025, Padua, Italy.

Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Daniel Campos, Nick Craswell, Ian Soboroff, Hoa Trang Dang, and Jimmy Lin. A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look. arXiv:2411.08275, November 2024.

This guide covers runs with the official TREC 2024 RAG test queries. See this page for instructions on runs with the TREC 2024 RAG "dev queries".

BM25

For BM25, Anserini provides prebuilt inverted indexes. The following command will reproduce the above results:

❗ Beware, you need lots of space to run these experiments. The msmarco-v2.1-doc-segmented prebuilt index is 83 GB uncompressed. The command below will download the index automatically. See this guide on prebuilt indexes for more details.

java $JAVA_OPTS io.anserini.search.SearchCollection \
  -index msmarco-v2.1-doc-segmented -topics rag24.test -bm25 -hits 1000 \
  -output $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.bm25.rag24.test.txt

And to evaluate:

java $JAVA_OPTS trec_eval -c -m ndcg_cut.20 rag24.test-umbrela-all $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.bm25.rag24.test.txt
java $JAVA_OPTS trec_eval -c -m ndcg_cut.100 rag24.test-umbrela-all $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.bm25.rag24.test.txt
java $JAVA_OPTS trec_eval -c -m recall.100 rag24.test-umbrela-all $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.bm25.rag24.test.txt

ArcticEmbed-L

For ArcticEmbed-L, Anserini also provides prebuilt indexes with ArcticEmbed-L embeddings. The embedding vectors were generated by Snowflake and are freely downloadable on Hugging Face. We provide prebuilt HNSW indexes with int8 quantization, divided into 10 shards, 00 to 09.

❗ Beware, the complete ArcticEmbed-L index for all 10 shards of the MS MARCO V2.1 segmented document collection totals 557 GB! The commands below will download the indexes automatically, so make sure you have plenty of space. See this guide for general information on prebuilt indexes. Additional tips for dealing with space issues are provided below.

Here's how you reproduce results for each shard individually on the TREC 2024 RAG Track test queries, using ONNX to encode queries on the fly (which means you can extend to arbitrary queries):

SHARDS=(00 01 02 03 04 05 06 07 08 09); for shard in "${SHARDS[@]}"
do
    java $JAVA_OPTS io.anserini.search.SearchHnswDenseVectors \
      -threads 32 \
      -index msmarco-v2.1-doc-segmented-shard${shard}.arctic-embed-l.hnsw-int8 \
      -topics rag24.test -topicReader TsvString -topicField title -encoder ArcticEmbedL \
      -hits 250 -efSearch 1000 \
      -output $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.shard${shard}.txt \
      > $OUTPUT_DIR/log.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.shard${shard}.txt 2>&1
done

The same commands, but using cached queries (faster):

SHARDS=(00 01 02 03 04 05 06 07 08 09); for shard in "${SHARDS[@]}"
do
    java $JAVA_OPTS io.anserini.search.SearchHnswDenseVectors \
      -threads 32 \
      -index msmarco-v2.1-doc-segmented-shard${shard}.arctic-embed-l.hnsw-int8 \
      -topics rag24.test.snowflake-arctic-embed-l \
      -hits 250 -efSearch 1000 \
      -output $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.shard${shard}.txt \
      > $OUTPUT_DIR/log.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.shard${shard}.txt 2>&1
done

For evaluation purposes, you can simply concatenate all 10 run files and evaluate:

cat $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.shard0* > $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.txt

java $JAVA_OPTS trec_eval -c -m ndcg_cut.20 rag24.test-umbrela-all $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.txt
java $JAVA_OPTS trec_eval -c -m ndcg_cut.100 rag24.test-umbrela-all $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.txt
java $JAVA_OPTS trec_eval -c -m recall.100 rag24.test-umbrela-all $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.txt

You should arrive at exactly the effectiveness metrics above.

Alternatively, you can use SearchShardedHnswDenseVectors to search all the shards at once. Here, you trade off fine-grained control for convenience. In the following, we use ONNX to encode queries:

java $JAVA_OPTS io.anserini.search.SearchShardedHnswDenseVectors \
  -index "msmarco-v2.1-doc-segmented-shard00.arctic-embed-l.hnsw-int8,msmarco-v2.1-doc-segmented-shard01.arctic-embed-l.hnsw-int8,msmarco-v2.1-doc-segmented-shard02.arctic-embed-l.hnsw-int8,msmarco-v2.1-doc-segmented-shard03.arctic-embed-l.hnsw-int8,msmarco-v2.1-doc-segmented-shard04.arctic-embed-l.hnsw-int8,msmarco-v2.1-doc-segmented-shard05.arctic-embed-l.hnsw-int8,msmarco-v2.1-doc-segmented-shard06.arctic-embed-l.hnsw-int8,msmarco-v2.1-doc-segmented-shard07.arctic-embed-l.hnsw-int8,msmarco-v2.1-doc-segmented-shard08.arctic-embed-l.hnsw-int8,msmarco-v2.1-doc-segmented-shard09.arctic-embed-l.hnsw-int8" \
  -topics rag24.test -topicReader TsvString -topicField title -encoder ArcticEmbedL \
  -hits 250 -efSearch 1000 -threads 4 \
  -output $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.txt \
  > $OUTPUT_DIR/log.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.txt 2>&1

In this case, the output run file contains results from all the shards. You can directly evaluate:

java $JAVA_OPTS trec_eval -c -m ndcg_cut.20 rag24.test-umbrela-all $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.txt
java $JAVA_OPTS trec_eval -c -m ndcg_cut.100 rag24.test-umbrela-all $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.txt
java $JAVA_OPTS trec_eval -c -m recall.100 rag24.test-umbrela-all $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.txt

You should arrive at exactly the effectiveness metrics above.

MS MARCO V1 Passages

Anserini provides support for a variety of models. The table below reports the effectiveness of selected models (dev in terms of RR@10, DL19 and DL20 in terms of nDCG@10):

dev DL19 DL20
BM25 (k1=0.9, b=0.4) 0.1840 0.5058 0.4796
SPLADE-v3: cached queries 0.3999 0.7264 0.7522
SPLADE-v3: ONNX 0.4000 0.7264 0.7522
cosDPR-distil: HNSW, cached queries 0.3887 0.7250 0.7025
cosDPR-distil: HNSW, ONNX 0.3887 0.7250 0.7025
cosDPR-distil: quantized (int8) HNSW, cached queries 0.3897 0.7240 0.7004
cosDPR-distil: quantized (int8) HNSW, ONNX 0.3899 0.7247 0.6996
bge-base-en-v1.5: HNSW, cached queries 0.3574 0.7065 0.6780
bge-base-en-v1.5: HNSW, ONNX 0.3575 0.7016 0.6768
bge-base-en-v1.5: quantized (int8) HNSW, cached queries 0.3572 0.7016 0.6738
bge-base-en-v1.5: quantized (int8) HNSW, ONNX 0.3575 0.7017 0.6767
cohere-embed-english-v3.0: HNSW, cached queries 0.3647 0.6956 0.7245
cohere-embed-english-v3.0: quantized (int8) HNSW, cached queries 0.3656 0.6955 0.7262

The following commands will reproduce runs corresponding to the above models (as well as additional ones not included in the table):

java $JAVA_OPTS io.anserini.reproduce.ReproduceFromPrebuiltIndexes --print-commands --compute-index-size --config msmarco-v1-passage.core
java $JAVA_OPTS io.anserini.reproduce.ReproduceFromPrebuiltIndexes --print-commands --compute-index-size --config msmarco-v1-passage.optional

To print out the commands that will generate the runs without performing the runs, use the options --dry-run --print-commands.

❗ Beware, running msmarco-v1-passage.core will automatically download 5 indexes totaling 65.3 GB; msmarco-v1-passage.optional will automatically download 10 indexes totaling 205.4 GB.

MS MARCO V1 Documents

Anserini provides support for a variety of models. The table below reports the effectiveness of selected models (dev in terms of RR@100, DL19 and DL20 in terms of nDCG@10):

dev DL19 DL20
BM25 complete doc (k1=0.9, b=0.4) 0.2299 0.5176 0.5286
BM25 segmented doc (k1=0.9, b=0.4) 0.2684 0.5302 0.5281
BM25 complete doc with doc2query-T5 0.2880 0.5968 0.5885
BM25 segmented doc with doc2query-T5 0.3179 0.6119 0.5957
uniCOIL (with doc2query-T5): ONNX 0.3531 0.6396 0.6033

The following commands will reproduce runs corresponding to the above models (as well as additional ones not included in the table):

java $JAVA_OPTS io.anserini.reproduce.ReproduceFromPrebuiltIndexes --print-commands --compute-index-size --config msmarco-v1-doc.core
java $JAVA_OPTS io.anserini.reproduce.ReproduceFromPrebuiltIndexes --print-commands --compute-index-size --config msmarco-v1-doc.optional

To print out the commands that will generate the runs without performing the runs, use the options --dry-run --print-commands.

❗ Beware, running msmarco-v1-doc.core will automatically download 5 indexes totaling 45.7 GB; msmarco-v1-doc.optional will automatically download 6 indexes totaling 77.9 GB.

MS MARCO V2 Passages

Anserini provides support for a variety of models. The table below reports the effectiveness of selected models (dev and dev2 in terms of RR@100, DL21-23 in terms of nDCG@10):

dev dev2 DL21 DL22 DL23
BM25 (k1=0.9, b=0.4) 0.0719 0.0802 0.4458 0.2692 0.2627
uniCOIL (with doc2query-T5): ONNX 0.1499 0.1577 0.6159 0.4614 0.3855

The following commands will reproduce runs corresponding to the above models (as well as additional ones not included in the table):

java $JAVA_OPTS io.anserini.reproduce.ReproduceFromPrebuiltIndexes --print-commands --compute-index-size --config msmarco-v2-passage.core
java $JAVA_OPTS io.anserini.reproduce.ReproduceFromPrebuiltIndexes --print-commands --compute-index-size --config msmarco-v2-passage.optional

To print out the commands that will generate the runs without performing the runs, use the options --dry-run --print-commands.

❗ Beware, running msmarco-v2-passage.core will automatically download 3 indexes totaling 91.2 GB; msmarco-v2-passage.optional will automatically download 4 indexes totaling 127.3 GB.

MS MARCO V2 Documents

Anserini provides support for a variety of models. The table below reports the effectiveness of selected models (dev and dev2 in terms of RR@100, DL21-DL23 in terms of nDCG@10):

dev dev2 DL21 DL22 DL23
BM25 complete doc (k1=0.9, b=0.4) 0.1572 0.1659 0.5116 0.2993 0.2946
BM25 segmented doc (k1=0.9, b=0.4) 0.1896 0.1930 0.5776 0.3618 0.3405
BM25 complete doc with doc2query-T5 0.2011 0.2012 0.5792 0.3539 0.3511
BM25 segmented doc with doc2query-T5 0.2226 0.2234 0.6289 0.3975 0.3612
uniCOIL (with doc2query-T5): ONNX 0.2419 0.2445 0.6783 0.4451 0.4150

The following commands will reproduce runs corresponding to the above models (as well as additional ones not included in the table):

java $JAVA_OPTS io.anserini.reproduce.ReproduceFromPrebuiltIndexes --print-commands --compute-index-size --config msmarco-v2-doc.core
java $JAVA_OPTS io.anserini.reproduce.ReproduceFromPrebuiltIndexes --print-commands --compute-index-size --config msmarco-v2-doc.optional

To print out the commands that will generate the runs without performing the runs, use the options --dry-run --print-commands.

❗ Beware, running msmarco-v2-doc.core will automatically download 5 indexes totaling 278.1 GB; msmarco-v2-doc.optional will automatically download 2 indexes totaling 68.5 GB.

MS MARCO V2.1 Segmented Documents

The MS MARCO V2.1 collections were created for the TREC RAG Track. There were two variants: the documents corpus and the segmented documents corpus. The documents corpus served as the source of the segmented documents corpus, but the segmented documents corpus is the one used in official TREC RAG evaluations. The following table reports nDCG@20 scores for various retrieval conditions:

RAG 24 UMBRELA RAG 24 NIST
BM25 0.3198 0.2809
SPLADE-v3: ONNX 0.5167 0.4642
Arctic-embed-l (shard00): quantized (int8) HNSW, ONNX 0.3003 0.2449
Arctic-embed-l (shard01): quantized (int8) HNSW, ONNX 0.2599 0.2184
Arctic-embed-l (shard02): quantized (int8) HNSW, ONNX 0.2661 0.2211
Arctic-embed-l (shard03): quantized (int8) HNSW, ONNX 0.2705 0.2388
Arctic-embed-l (shard04): quantized (int8) HNSW, ONNX 0.2937 0.2253
Arctic-embed-l (shard05): quantized (int8) HNSW, ONNX 0.2590 0.2383
Arctic-embed-l (shard06): quantized (int8) HNSW, ONNX 0.2444 0.2336
Arctic-embed-l (shard07): quantized (int8) HNSW, ONNX 0.2417 0.2255
Arctic-embed-l (shard08): quantized (int8) HNSW, ONNX 0.2847 0.2765
Arctic-embed-l (shard09): quantized (int8) HNSW, ONNX 0.2432 0.2457

The following commands will reproduce runs corresponding to the above models (as well as additional ones not included in the table):

java $JAVA_OPTS io.anserini.reproduce.ReproduceFromPrebuiltIndexes --print-commands --compute-index-size --config msmarco-v2.1-doc-segmented.core
java $JAVA_OPTS io.anserini.reproduce.ReproduceFromPrebuiltIndexes --print-commands --compute-index-size --config msmarco-v2.1-doc-segmented.optional

To print out the commands that will generate the runs without performing the runs, use the options --dry-run --print-commands.

❗ Beware, running msmarco-v2.1-doc-segmented.core will automatically download 12 indexes totaling 698.0 GB; msmarco-v2.1-doc-segmented.optional will automatically download 13 indexes totaling 819.9 GB.

MS MARCO V2.1 Documents

The MS MARCO V2.1 collections were created for the TREC RAG Track. There were two variants: the documents corpus and the segmented documents corpus. The documents corpus served as the source of the segmented documents corpus, but is not otherwise used in any formal evaluations. It primarily served development purposes for the TREC 2024 RAG evaluation, where previous qrels from MS MARCO V2 and DL21-DL23 were "projected over" to this corpus.

The table below reports effectiveness (dev and dev2 in terms of RR@100; DL21-DL23 and RAGgy in terms of nDCG@10):

dev dev2 DL21 DL22 DL23 RAGgy
BM25 doc 0.1654 0.1732 0.5183 0.2991 0.2914 0.3631
BM25 doc-segmented 0.1973 0.2000 0.5778 0.3576 0.3356 0.4227

The following commands will reproduce runs corresponding to the above models (as well as additional ones not included in the table):

java $JAVA_OPTS io.anserini.reproduce.ReproduceFromPrebuiltIndexes --print-commands --compute-index-size --config msmarco-v2.1-doc.core
java $JAVA_OPTS io.anserini.reproduce.ReproduceFromPrebuiltIndexes --print-commands --compute-index-size --config msmarco-v2.1-doc.optional

To print out the commands that will generate the runs without performing the runs, use the options --dry-run --print-commands.

❗ Beware, running msmarco-v2.1-doc.core will automatically download 2 indexes totaling 145.8 GB; msmarco-v2.1-doc.optional will automatically download 4 indexes totaling 326.5 GB.

BEIR

Here is a selection of models that are currently supported in Anserini:

  • BM25 (flat): BM25, "flat" bag-of-words baseline (see paper below)
  • BM25 (MF): BM25, "multifield" bag-of-words baseline (see paper below)
  • SPLADE-v3: SPLADE-v3 with ONNX query encoding
  • BGE (flat): bge-base-en-v1.5 using flat vector indexes, with ONNX query encoding
  • BGE (HNSW): bge-base-en-v1.5 using HNSW indexes, with ONNX query encoding

Ehsan Kamalloo, Nandan Thakur, Carlos Lassance, Xueguang Ma, Jheng-Hong Yang, and Jimmy Lin. Resources for Brewing BEIR: Reproducible Reference Models and Statistical Analyses. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024), pages 1431-1440, July 2024, Washington, D.C.

The table below reports the effectiveness of the models (nDCG@10):

Corpus BM25 (flat) BM25 (MF) SPLADE-v3 BGE (flat) BGE (HNSW)
trec-covid 0.5947 0.6559 0.7299 0.7815 0.7835
bioasq 0.5225 0.4646 0.5142 0.4148 0.4042
nfcorpus 0.3218 0.3254 0.3629 0.3735 0.3735
nq 0.3055 0.3285 0.5842 0.5415 0.5415
hotpotqa 0.6330 0.6027 0.6884 0.7259 0.7241
fiqa 0.2361 0.2361 0.3798 0.4065 0.4065
signal1m 0.3304 0.3304 0.2465 0.2886 0.2869
trec-news 0.3952 0.3977 0.4365 0.4424 0.4410
robust04 0.4070 0.4070 0.4952 0.4435 0.4437
arguana 0.3970 0.4142 0.4862 0.6375 0.6375
webis-touche2020 0.4422 0.3673 0.3086 0.2571 0.2571
cqadupstack-android 0.3801 0.3709 0.4109 0.5076 0.5076
cqadupstack-english 0.3453 0.3321 0.4255 0.4857 0.4855
cqadupstack-gaming 0.4822 0.4418 0.5193 0.5967 0.5967
cqadupstack-gis 0.2901 0.2904 0.3236 0.4131 0.4133
cqadupstack-mathematica 0.2015 0.2046 0.2445 0.3163 0.3163
cqadupstack-physics 0.3214 0.3248 0.3753 0.4724 0.4724
cqadupstack-programmers 0.2802 0.2963 0.3387 0.4238 0.4238
cqadupstack-stats 0.2711 0.2790 0.3137 0.3728 0.3728
cqadupstack-tex 0.2244 0.2086 0.2493 0.3115 0.3115
cqadupstack-unix 0.2749 0.2788 0.3196 0.4220 0.4220
cqadupstack-webmasters 0.3059 0.3008 0.3250 0.4072 0.4072
cqadupstack-wordpress 0.2483 0.2562 0.2807 0.3547 0.3547
quora 0.7886 0.7886 0.8141 0.8876 0.8876
dbpedia-entity 0.3180 0.3128 0.4476 0.4073 0.4076
scidocs 0.1490 0.1581 0.1567 0.2172 0.2172
fever 0.6513 0.7530 0.8015 0.8629 0.8620
climate-fever 0.1651 0.2129 0.2625 0.3117 0.3117
scifact 0.6789 0.6647 0.7140 0.7408 0.7408

The table below reports fusion results (nDCG@10) combining BM25 (flat) and BGE (flat) runs:

  • RRF: Reciprocal Rank Fusion (k=60)
  • Average: Average fusion with min-max normalization
Corpus BM25 (flat) BGE (flat) RRF Average
trec-covid 0.5947 0.7815 0.8041 0.7956
bioasq 0.5225 0.4148 0.5278 0.5427
nfcorpus 0.3218 0.3735 0.3725 0.3782
nq 0.3055 0.5415 0.4831 0.5183
hotpotqa 0.6330 0.7259 0.7389 0.7658
fiqa 0.2361 0.4065 0.3671 0.3942
signal1m 0.3304 0.2886 0.3533 0.3626
trec-news 0.3952 0.4424 0.4855 0.5008
robust04 0.4070 0.4435 0.5070 0.5127
arguana 0.3970 0.6375 0.5626 0.5738
webis-touche2020 0.4422 0.2571 0.3771 0.3755
cqadupstack-android 0.3801 0.5076 0.4652 0.4868
cqadupstack-english 0.3453 0.4857 0.4461 0.4678
cqadupstack-gaming 0.4822 0.5967 0.5615 0.5818
cqadupstack-gis 0.2901 0.4131 0.3679 0.3937
cqadupstack-mathematica 0.2015 0.3163 0.2751 0.2951
cqadupstack-physics 0.3214 0.4724 0.4143 0.4375
cqadupstack-programmers 0.2802 0.4238 0.3715 0.4005
cqadupstack-stats 0.2711 0.3728 0.3414 0.3534
cqadupstack-tex 0.2244 0.3115 0.2931 0.3090
cqadupstack-unix 0.2749 0.4220 0.3597 0.3853
cqadupstack-webmasters 0.3059 0.4072 0.3711 0.3857
cqadupstack-wordpress 0.2483 0.3547 0.3353 0.3546
quora 0.7886 0.8876 0.8682 0.8858
dbpedia-entity 0.3180 0.4073 0.4190 0.4374
scidocs 0.1490 0.2172 0.1948 0.2019
fever 0.6513 0.8629 0.8108 0.8584
climate-fever 0.1651 0.3117 0.2812 0.2946
scifact 0.6789 0.7408 0.7420 0.7472

The following commands will reproduce runs corresponding to the above models (as well as additional ones not included in the table):

java $JAVA_OPTS io.anserini.reproduce.ReproduceFromPrebuiltIndexes --print-commands --compute-index-size --config beir.core
java $JAVA_OPTS io.anserini.reproduce.ReproduceFromPrebuiltIndexes --print-commands --compute-index-size --config beir.optional

To print out the commands that will generate the runs without performing the runs, use the options --dry-run --print-commands.

❗ Beware, running beir.core will automatically download 145 indexes totaling 378.5 GB; beir.optional will automatically download 116 indexes totaling 289.2 GB.

BRIGHT

BRIGHT is a retrieval benchmark described in this paper. The following table reports nDCG@10 scores.

  • BM25: bag-of-words BM25
  • BM25QS: query-side BM25
  • SPLADE-v3: SPLADE-v3 with ONNX query encoding
  • BGE (flat): BGE-large-en-v1.5 using flat vector indexes with ONNX query encoding

Yijun Ge, Sahel Sharifymoghaddam, and Jimmy Lin. Lighting the Way for BRIGHT: Reproducible Baselines with Anserini, Pyserini, and RankLLM. arXiv:2509.02558, 2025.

Corpus BM25 BM25QS SPLADE-v3 BGE (flat)
StackExchange
Biology 0.1824 0.1972 0.2101 0.1242
Earth Science 0.2791 0.2789 0.2670 0.2545
Economics 0.1645 0.1518 0.1604 0.1662
Psychology 0.1342 0.1266 0.1527 0.1805
Robotics 0.1091 0.1390 0.1578 0.1230
Stack Overflow 0.1626 0.1855 0.1290 0.1099
Sustainable Living 0.1613 0.1515 0.1497 0.1440
StackExchange average 0.1705 0.1758 0.1752 0.1575
 
Coding
LeetCode 0.2471 0.2497 0.2603 0.2668
Pony 0.0434 0.0789 0.1440 0.0338
Coding average 0.1453 0.1643 0.2022 0.1503
 
Theorems
AoPS 0.0645 0.0627 0.0692 0.0638
TheoremQA-Q 0.0733 0.1036 0.1113 0.1411
TheoremQA-T 0.0214 0.0492 0.0554 0.0532
Theorems average 0.0531 0.0718 0.0786 0.0860
 
Overall average 0.1369 0.1479 0.1556 0.1384

The following commands will reproduce runs corresponding to the above models (as well as additional ones not included in the table):

java $JAVA_OPTS io.anserini.reproduce.ReproduceFromPrebuiltIndexes --print-commands --compute-index-size --config bright.core
java $JAVA_OPTS io.anserini.reproduce.ReproduceFromPrebuiltIndexes --print-commands --compute-index-size --config bright.optional

To print out the commands that will generate the runs without performing the runs, use the options --dry-run --print-commands.

❗ Beware, running bright.core will automatically download 36 indexes totaling 6.8 GB; bright.optional will automatically download 24 indexes totaling 5.7 GB.

Extracting Queries and Documents

To generate jsonl output containing the raw documents that can be reranked and further processed (for example, using RankLLM), use the ExtractQueriesAndDocumentsFromTrecRun CLI on the desired run file. For example, to generate candidates from the BM25 retrieval results:

java $JAVA_OPTS io.anserini.cli.ExtractQueriesAndDocumentsFromTrecRun \
  --index msmarco-v2.1-doc-segmented --topics rag24.test --hits 20 \
  --run $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.bm25.rag24.test.txt \
  --output $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.bm25.rag24.test.candidates.jsonl

In the above command, we only fetch the top-20 hits. To examine the output, pipe through jq to pretty-print:

$ head -n 1 $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.bm25.rag24.test.candidates.jsonl | jq
{
  "query": {
    "qid": "2024-105741",
    "text": "is it dangerous to have wbc over 15,000 without treatment?"
  },
  "candidates": [
    {
      "docid": "msmarco_v2.1_doc_16_287012450#4_490828734",
      "score": 15.8199,
      "doc": {
        "url": "https://emedicine.medscape.com/article/961169-treatment",
        "title": "Bacteremia Treatment & Management: Medical Care",
        "headings": "Bacteremia Treatment & Management\nBacteremia Treatment & Management\nMedical Care\nHow well do low-risk criteria work?\nEmpiric antibiotics: How well do they work?\nTreatment algorithms\n",
        "segment": "band-to-neutrophil ratio\n< 0.2\n< 20,000/μL\n5-15,000/μL; ABC < 1,000\n5-15,000/μL; ABC < 1,000\nUrine assessment\n< 10 WBCs per HPF; Negative for bacteria\n< 10 WBCs per HPF; Leukocyte esterase negative\n< 10 WBCs per HPF\n< 5 WBCs per HPF\nCSF assessment\n< 8 WBCs per HPF; Negative for bacteria\n< 10 WBCs per HPF\n< 10-20 WBCs per HPF\n…\nChest radiography\nNo infiltrate\nWithin reference range, if obtained\nWithin reference range, if obtained\n…\nStool culture\n< 5 WBCs per HPF\n…\n< 5 WBCs per HPF\n…\n* Acute illness observation score\nHow well do low-risk criteria work? The above guidelines are presented to define a group of febrile young infants who can be treated without antibiotics. Statistically, this translates into a high NPV (ie, a very high proportion of true negative cultures is observed in patients deemed to be at low risk). The NPV of various low-risk criteria for serious bacterial infection and occult bacteremia are as follows [ 10, 14, 16, 19, 74, 75, 76] : Philadelphia NPV - 95-100%\nBoston NPV - 95-98%\nRochester NPV - 98.3-99%\nAAP 1993 - 99-99.8%\nIn basic terms, even by the most stringent criteria, somewhere between 1 in 100 and 1 in 500 low-risk, but bacteremic, febrile infants are missed.",
        "start_char": 2846,
        "end_char": 4049
      }
    },
    {
      "docid": "msmarco_v2.1_doc_16_287012450#3_490827079",
      "score": 15.231,
      "doc": {
        "url": "https://emedicine.medscape.com/article/961169-treatment",
        "title": "Bacteremia Treatment & Management: Medical Care",
        "headings": "Bacteremia Treatment & Management\nBacteremia Treatment & Management\nMedical Care\nHow well do low-risk criteria work?\nEmpiric antibiotics: How well do they work?\nTreatment algorithms\n",
        "segment": "73] Since then, numerous studies have evaluated combinations of age, temperature, history, examination findings, and laboratory results to determine which young infants are at a low risk for bacterial infection. [ 10, 66, 74, 75, 76]\nThe following are the low-risk criteria established by groups from Philadelphia, Boston, and Rochester and the 1993 American Academy of Pediatrics (AAP) guideline. Table 11. Low-Risk Criteria for Infants Younger than 3 Months [ 10, 74, 75, 76] (Open Table in a new window)\nCriterion\nPhiladelphia\nBoston\nRochester\nAAP 1993\nAge\n1-2 mo\n1-2 mo\n0-3 mo\n1-3 mo\nTemperature\n38.2°C\n≥38°C\n≥38°C\n≥38°C\nAppearance\nAIOS * < 15\nWell\nAny\nWell\nHistory\nImmune\nNo antibiotics in the last 24 h; No immunizations in the last 48 h\nPreviously healthy\nPreviously healthy\nExamination\nNonfocal\nNonfocal\nNonfocal\nNonfocal\nWBC count\n< 15,000/μL; band-to-neutrophil ratio\n< 0.2\n< 20,000/μL\n5-15,000/μL; ABC < 1,000\n5-15,000/μL; ABC < 1,000\nUrine assessment\n< 10 WBCs per HPF; Negative for bacteria\n< 10 WBCs per HPF; Leukocyte esterase negative\n< 10 WBCs per HPF\n< 5 WBCs per HPF\nCSF assessment\n< 8 WBCs per HPF;",
        "start_char": 1993,
        "end_char": 3111
      }
    },
    ...
  ]
}

To generate similar output for ArcticEmbed-L, specify the corresponding run file with --run.