Fetch the fatjar:
wget https://repo1.maven.org/maven2/io/anserini/anserini/1.2.1/anserini-1.2.1-fatjar.jar

Let's start by setting the ANSERINI_JAR and the OUTPUT_DIR:
export ANSERINI_JAR="anserini-1.2.1-fatjar.jar"
export OUTPUT_DIR="."

❗ Anserini ships with a number of prebuilt indexes, which it'll automagically download for you. This is a great feature, but the indexes can take up a lot of space. See this guide on prebuilt indexes for more details.
- MS MARCO V2.1 + TREC RAG
- MS MARCO V1 Passage
- MS MARCO V2.1 Segmented Documents
- MS MARCO V2.1 Documents
- BEIR
- BRIGHT
The MS MARCO V2.1 collections were created for the TREC RAG Track. They served as the official corpus in 2024 and will remain the corpus for 2025. There are two separate MS MARCO V2.1 "variants", documents and segmented documents:
- The segmented documents corpus (segments = passages) is the one actually used for the TREC RAG evaluations. It contains 113,520,750 passages.
- The documents corpus is the source of the segments and useful as a point of reference (but not actually used in the TREC evaluations). It contains 10,960,555 documents.
Here, we focus on the segmented documents corpus.
With Anserini, you can reproduce baseline runs on the TREC 2024 RAG test queries with BM25 and with ArcticEmbed-L embeddings. Using the UMBRELA qrels, these are the evaluation numbers you'd get:
| Dataset / Metric | BM25 | ArcticEmbed-L |
|---|---|---|
| RAG24 Test (UMBRELA): nDCG@20 | 0.3198 | 0.5497 |
| RAG24 Test (UMBRELA): nDCG@100 | 0.2563 | 0.4855 |
| RAG24 Test (UMBRELA): Recall@100 | 0.1395 | 0.2547 |
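As a refresher, nDCG@k divides the discounted cumulative gain of the top-k ranking by that of the ideal (descending-relevance) ranking. The numbers above come from trec_eval; the following is only an illustrative sketch of the metric:

```python
import math

def ndcg_at_k(rels, k):
    """nDCG@k for a list of graded relevance judgments, in ranked order.

    DCG@k = sum(rel_i / log2(i + 1)) over ranks i = 1..k; nDCG@k divides
    by the DCG of the ideal (descending) ordering of the same judgments.
    """
    def dcg(scores):
        return sum(rel / math.log2(i + 1) for i, rel in enumerate(scores[:k], start=1))
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

# A ranking that places one relevant document too low scores just below 1.0:
print(round(ndcg_at_k([3, 2, 0, 1], k=4), 4))  # prints 0.9854
```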
See instructions below on how to reproduce these runs; more details can be found in the following two papers:
Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Daniel Campos, Nick Craswell, Ian Soboroff, and Jimmy Lin. A Large-Scale Study of Relevance Assessments with Large Language Models Using UMBRELA. Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR 2025), pages 358-368, July 2025, Padua, Italy.
Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Daniel Campos, Nick Craswell, Ian Soboroff, Hoa Trang Dang, and Jimmy Lin. A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look. arXiv:2411.08275, November 2024.
This guide covers runs with the official TREC 2024 RAG test queries. See this page for instructions on runs with the TREC 2024 RAG "dev queries".
For BM25, Anserini provides prebuilt inverted indexes. The following command will reproduce the above results:
❗ Beware, you need lots of space to run these experiments.
The msmarco-v2.1-doc-segmented prebuilt index is 84 GB uncompressed.
The command below will download the index automatically.
See this guide on prebuilt indexes for more details.
java -cp $ANSERINI_JAR --add-modules jdk.incubator.vector io.anserini.search.SearchCollection \
-index msmarco-v2.1-doc-segmented \
-topics rag24.test \
-output $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.bm25.rag24.test.txt \
-bm25 -hits 1000

And to evaluate:
java -cp $ANSERINI_JAR trec_eval -c -m ndcg_cut.20 rag24.test-umbrela-all $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.bm25.rag24.test.txt
java -cp $ANSERINI_JAR trec_eval -c -m ndcg_cut.100 rag24.test-umbrela-all $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.bm25.rag24.test.txt
java -cp $ANSERINI_JAR trec_eval -c -m recall.100 rag24.test-umbrela-all $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.bm25.rag24.test.txt

For ArcticEmbed-L, Anserini also provides prebuilt indexes with ArcticEmbed-L embeddings.
The embedding vectors were generated by Snowflake and are freely downloadable on Hugging Face.
We provide prebuilt HNSW indexes with int8 quantization, divided into 10 shards, 00 to 09.
❗ Beware, the complete ArcticEmbed-L index for all 10 shards of the MS MARCO V2.1 segmented document collection totals 558 GB! The commands below will download the indexes automatically, so make sure you have plenty of space. See this guide on prebuilt indexes for general info on prebuilt indexes. Additional helpful tips are provided below for dealing with space issues.
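Int8 quantization stores each vector component in a single byte instead of a 4-byte float, which is what keeps these HNSW indexes (relatively) manageable in size. The following is an illustrative sketch of scalar quantization only, not Anserini's or Lucene's exact scheme:

```python
# Illustrative sketch of scalar int8 quantization (NOT the exact scheme used
# by Anserini/Lucene): map each float component in [-1, 1] to an integer in
# [-127, 127], storing one byte per dimension instead of a 4-byte float.

def quantize_int8(vector, scale=127.0):
    """Quantize a float vector (components in [-1, 1]) to int8-range ints."""
    return [max(-127, min(127, round(x * scale))) for x in vector]

def dequantize_int8(qvector, scale=127.0):
    """Recover approximate float components from the quantized values."""
    return [q / scale for q in qvector]

v = [0.12, -0.98, 0.5031, 0.0]
q = quantize_int8(v)            # [15, -124, 64, 0]
roundtrip = dequantize_int8(q)  # each component within ~0.004 of the original
```

The reconstruction error per component is bounded by half a quantization step, which is why int8 indexes trade a small effectiveness drop for a 4x reduction in vector storage.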
Here's how you reproduce results for each shard individually on the TREC 2024 RAG Track test queries, using ONNX to encode queries on the fly (which means you can extend to arbitrary queries):
# RAG24 test
SHARDS=(00 01 02 03 04 05 06 07 08 09); for shard in "${SHARDS[@]}"
do
java -cp $ANSERINI_JAR --add-modules jdk.incubator.vector io.anserini.search.SearchHnswDenseVectors \
  -threads 32 \
  -index msmarco-v2.1-doc-segmented-shard${shard}.arctic-embed-l.hnsw-int8 \
  -topics rag24.test -topicReader TsvString -topicField title \
  -encoder ArcticEmbedL \
  -output $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.shard${shard}.txt \
  -hits 250 -efSearch 1000 \
  > $OUTPUT_DIR/log.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.shard${shard}.txt 2>&1
done

Same commands, but using cached queries (faster):
# RAG24 test
SHARDS=(00 01 02 03 04 05 06 07 08 09); for shard in "${SHARDS[@]}"
do
java -cp $ANSERINI_JAR --add-modules jdk.incubator.vector io.anserini.search.SearchHnswDenseVectors \
  -threads 32 \
  -index msmarco-v2.1-doc-segmented-shard${shard}.arctic-embed-l.hnsw-int8 \
  -topics rag24.test.snowflake-arctic-embed-l \
  -output $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.shard${shard}.txt \
  -hits 250 -efSearch 1000 \
  > $OUTPUT_DIR/log.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.shard${shard}.txt 2>&1
done

For evaluation purposes, you can simply cat all 10 run files together and evaluate:
cat $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.shard0* > $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.txt
java -cp $ANSERINI_JAR trec_eval -c -m ndcg_cut.20 rag24.test-umbrela-all $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.txt
java -cp $ANSERINI_JAR trec_eval -c -m ndcg_cut.100 rag24.test-umbrela-all $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.txt
java -cp $ANSERINI_JAR trec_eval -c -m recall.100 rag24.test-umbrela-all $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.txt

You should arrive at exactly the effectiveness metrics above.
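Simply concatenating the shard runs works for evaluation because trec_eval re-sorts each query's hits by score internally. If you want a properly rank-numbered merged run file (e.g., for downstream processing), a small script can pool the shards and re-rank by score. This is an illustrative sketch assuming the standard six-column TREC run format (qid Q0 docid rank score tag):

```python
# Sketch: merge per-shard TREC run files into a single ranked run per query.
# Assumes the standard six-column run format: qid Q0 docid rank score tag.
from collections import defaultdict

def merge_runs(run_files, k=250, tag="merged"):
    """Pool hits from all shards, re-sort by score, keep the top-k per query."""
    hits = defaultdict(list)
    for path in run_files:
        with open(path) as f:
            for line in f:
                qid, _, docid, _, score, _ = line.split()
                hits[qid].append((float(score), docid))
    out = []
    for qid in sorted(hits):
        for rank, (score, docid) in enumerate(sorted(hits[qid], reverse=True)[:k], start=1):
            out.append(f"{qid} Q0 {docid} {rank} {score} {tag}")
    return out
```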
Alternatively, you can use SearchShardedHnswDenseVectors to search all the shards at once.
Here, you trade off fine-grained control for convenience.
In the following, we use ONNX to encode queries:
# RAG24 test
java -cp $ANSERINI_JAR --add-modules jdk.incubator.vector io.anserini.search.SearchShardedHnswDenseVectors \
-threads 4 \
-index "msmarco-v2.1-doc-segmented-shard00.arctic-embed-l.hnsw-int8,msmarco-v2.1-doc-segmented-shard01.arctic-embed-l.hnsw-int8,msmarco-v2.1-doc-segmented-shard02.arctic-embed-l.hnsw-int8,msmarco-v2.1-doc-segmented-shard03.arctic-embed-l.hnsw-int8,msmarco-v2.1-doc-segmented-shard04.arctic-embed-l.hnsw-int8,msmarco-v2.1-doc-segmented-shard05.arctic-embed-l.hnsw-int8,msmarco-v2.1-doc-segmented-shard06.arctic-embed-l.hnsw-int8,msmarco-v2.1-doc-segmented-shard07.arctic-embed-l.hnsw-int8,msmarco-v2.1-doc-segmented-shard08.arctic-embed-l.hnsw-int8,msmarco-v2.1-doc-segmented-shard09.arctic-embed-l.hnsw-int8" \
-topics rag24.test -topicReader TsvString -topicField title \
-encoder ArcticEmbedL \
-output $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.txt \
-hits 250 -efSearch 1000 \
> $OUTPUT_DIR/log.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.txt 2>&1

In this case, the output run file contains results from all the shards. You can directly evaluate:
java -cp $ANSERINI_JAR trec_eval -c -m ndcg_cut.20 rag24.test-umbrela-all $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.txt
java -cp $ANSERINI_JAR trec_eval -c -m ndcg_cut.100 rag24.test-umbrela-all $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.txt
java -cp $ANSERINI_JAR trec_eval -c -m recall.100 rag24.test-umbrela-all $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.arctic-l.rag24.test.txt

You should arrive at exactly the effectiveness metrics above.
To generate jsonl output containing the raw documents that can be reranked and further processed (for example, using RankLLM), use the GenerateRerankerRequests program on the desired run file.
For example, to generate for the BM25 retrieval results:
java -cp $ANSERINI_JAR --add-modules jdk.incubator.vector io.anserini.rerank.GenerateRerankerRequests \
-index msmarco-v2.1-doc-segmented \
-run $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.bm25.rag24.test.txt \
-topics rag24.test \
-output $OUTPUT_DIR/results.msmarco-v2.1-doc-segmented.bm25.rag24.test.jsonl \
-hits 20

In the above command, we only fetch the top-20 hits.
To examine the output, pipe through jq to pretty-print:
$ head -n 1 $OUTPUT_DIR/results.msmarco-v2.1-doc-segmented.bm25.rag24.test.jsonl | jq
{
"query": {
"qid": "2024-105741",
"text": "is it dangerous to have wbc over 15,000 without treatment?"
},
"candidates": [
{
"docid": "msmarco_v2.1_doc_16_287012450#4_490828734",
"score": 15.8199,
"doc": {
"url": "https://emedicine.medscape.com/article/961169-treatment",
"title": "Bacteremia Treatment & Management: Medical Care",
"headings": "Bacteremia Treatment & Management\nBacteremia Treatment & Management\nMedical Care\nHow well do low-risk criteria work?\nEmpiric antibiotics: How well do they work?\nTreatment algorithms\n",
"segment": "band-to-neutrophil ratio\n< 0.2\n< 20,000/μL\n5-15,000/μL; ABC < 1,000\n5-15,000/μL; ABC < 1,000\nUrine assessment\n< 10 WBCs per HPF; Negative for bacteria\n< 10 WBCs per HPF; Leukocyte esterase negative\n< 10 WBCs per HPF\n< 5 WBCs per HPF\nCSF assessment\n< 8 WBCs per HPF; Negative for bacteria\n< 10 WBCs per HPF\n< 10-20 WBCs per HPF\n…\nChest radiography\nNo infiltrate\nWithin reference range, if obtained\nWithin reference range, if obtained\n…\nStool culture\n< 5 WBCs per HPF\n…\n< 5 WBCs per HPF\n…\n* Acute illness observation score\nHow well do low-risk criteria work? The above guidelines are presented to define a group of febrile young infants who can be treated without antibiotics. Statistically, this translates into a high NPV (ie, a very high proportion of true negative cultures is observed in patients deemed to be at low risk). The NPV of various low-risk criteria for serious bacterial infection and occult bacteremia are as follows [ 10, 14, 16, 19, 74, 75, 76] : Philadelphia NPV - 95-100%\nBoston NPV - 95-98%\nRochester NPV - 98.3-99%\nAAP 1993 - 99-99.8%\nIn basic terms, even by the most stringent criteria, somewhere between 1 in 100 and 1 in 500 low-risk, but bacteremic, febrile infants are missed.",
"start_char": 2846,
"end_char": 4049
}
},
{
"docid": "msmarco_v2.1_doc_16_287012450#3_490827079",
"score": 15.231,
"doc": {
"url": "https://emedicine.medscape.com/article/961169-treatment",
"title": "Bacteremia Treatment & Management: Medical Care",
"headings": "Bacteremia Treatment & Management\nBacteremia Treatment & Management\nMedical Care\nHow well do low-risk criteria work?\nEmpiric antibiotics: How well do they work?\nTreatment algorithms\n",
"segment": "73] Since then, numerous studies have evaluated combinations of age, temperature, history, examination findings, and laboratory results to determine which young infants are at a low risk for bacterial infection. [ 10, 66, 74, 75, 76]\nThe following are the low-risk criteria established by groups from Philadelphia, Boston, and Rochester and the 1993 American Academy of Pediatrics (AAP) guideline. Table 11. Low-Risk Criteria for Infants Younger than 3 Months [ 10, 74, 75, 76] (Open Table in a new window)\nCriterion\nPhiladelphia\nBoston\nRochester\nAAP 1993\nAge\n1-2 mo\n1-2 mo\n0-3 mo\n1-3 mo\nTemperature\n38.2°C\n≥38°C\n≥38°C\n≥38°C\nAppearance\nAIOS * < 15\nWell\nAny\nWell\nHistory\nImmune\nNo antibiotics in the last 24 h; No immunizations in the last 48 h\nPreviously healthy\nPreviously healthy\nExamination\nNonfocal\nNonfocal\nNonfocal\nNonfocal\nWBC count\n< 15,000/μL; band-to-neutrophil ratio\n< 0.2\n< 20,000/μL\n5-15,000/μL; ABC < 1,000\n5-15,000/μL; ABC < 1,000\nUrine assessment\n< 10 WBCs per HPF; Negative for bacteria\n< 10 WBCs per HPF; Leukocyte esterase negative\n< 10 WBCs per HPF\n< 5 WBCs per HPF\nCSF assessment\n< 8 WBCs per HPF;",
"start_char": 1993,
"end_char": 3111
}
},
...
]
}

To generate similar output for ArcticEmbed-L, specify the corresponding run file with -run.
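Each line of the output file is a self-contained JSON object with the schema shown above. As a sketch, here is one way you might load the requests in Python, e.g., to feed the candidates to a reranker:

```python
# Sketch: load reranker requests and pull out each query's top candidate.
import json

def top_hits(jsonl_path):
    """Return (qid, docid, score) of the first candidate for each request."""
    results = []
    with open(jsonl_path) as f:
        for line in f:
            request = json.loads(line)
            top = request["candidates"][0]  # candidates are already rank-ordered
            results.append((request["query"]["qid"], top["docid"], top["score"]))
    return results
```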
❗ Beware, running these experiments will automatically download 9 indexes totaling 203.1 GB.
Currently, Anserini provides support for the following models:
- BM25
- SPLADE-v3: cached queries and ONNX query encoding
- cosDPR-distil: cached queries and ONNX query encoding
- bge-base-en-v1.5: cached queries and ONNX query encoding
- cohere-embed-english-v3.0: cached queries and ONNX query encoding
The table below reports the effectiveness of the models (dev in terms of RR@10, DL19 and DL20 in terms of nDCG@10):
| | dev | DL19 | DL20 |
|---|---|---|---|
| BM25 (k1=0.9, b=0.4) | 0.1840 | 0.5058 | 0.4796 |
| SPLADE-v3 (cached queries) | 0.3999 | 0.7264 | 0.7522 |
| SPLADE-v3 (ONNX) | 0.4000 | 0.7264 | 0.7522 |
| cosDPR-distil w/ HNSW fp32 (cached queries) | 0.3887 | 0.7250 | 0.7025 |
| cosDPR-distil w/ HNSW fp32 (ONNX) | 0.3887 | 0.7250 | 0.7025 |
| cosDPR-distil w/ HNSW int8 (cached queries) | 0.3897 | 0.7240 | 0.7004 |
| cosDPR-distil w/ HNSW int8 (ONNX) | 0.3899 | 0.7247 | 0.6996 |
| bge-base-en-v1.5 w/ HNSW fp32 (cached queries) | 0.3574 | 0.7065 | 0.6780 |
| bge-base-en-v1.5 w/ HNSW fp32 (ONNX) | 0.3575 | 0.7016 | 0.6768 |
| bge-base-en-v1.5 w/ HNSW int8 (cached queries) | 0.3572 | 0.7016 | 0.6738 |
| bge-base-en-v1.5 w/ HNSW int8 (ONNX) | 0.3575 | 0.7017 | 0.6767 |
| cohere-embed-english-v3.0 w/ HNSW fp32 (cached queries) | 0.3647 | 0.6956 | 0.7245 |
| cohere-embed-english-v3.0 w/ HNSW int8 (cached queries) | 0.3656 | 0.6955 | 0.7262 |
The following command will reproduce the above experiments:
java -cp $ANSERINI_JAR io.anserini.reproduce.RunMsMarco -collection msmarco-v1-passage

To print out the commands that will generate the above runs without performing the runs, use the options -dryRun -printCommands.
❗ Beware, running these experiments will automatically download 12 indexes totaling 698.0 GB.
The MS MARCO V2.1 collections were created for the TREC RAG Track. There were two variants: the documents corpus and the segmented documents corpus. The documents corpus served as the source of the segmented documents corpus, but the segmented documents corpus is the one used in official TREC RAG evaluations. The following table reports nDCG@20 scores for various retrieval conditions:
| | RAG 24 UMBRELA | RAG 24 NIST |
|---|---|---|
| baselines | 0.3198 | 0.2809 |
| SPLADE-v3 | 0.5167 | 0.4642 |
| Arctic-embed-l (shard00, HNSW int8 indexes) | 0.3003 | 0.2449 |
| Arctic-embed-l (shard01, HNSW int8 indexes) | 0.2599 | 0.2184 |
| Arctic-embed-l (shard02, HNSW int8 indexes) | 0.2661 | 0.2211 |
| Arctic-embed-l (shard03, HNSW int8 indexes) | 0.2705 | 0.2388 |
| Arctic-embed-l (shard04, HNSW int8 indexes) | 0.2937 | 0.2253 |
| Arctic-embed-l (shard05, HNSW int8 indexes) | 0.2590 | 0.2383 |
| Arctic-embed-l (shard06, HNSW int8 indexes) | 0.2444 | 0.2336 |
| Arctic-embed-l (shard07, HNSW int8 indexes) | 0.2417 | 0.2255 |
| Arctic-embed-l (shard08, HNSW int8 indexes) | 0.2847 | 0.2765 |
| Arctic-embed-l (shard09, HNSW int8 indexes) | 0.2432 | 0.2457 |
The following command will reproduce the above experiments:
java -cp $ANSERINI_JAR io.anserini.reproduce.RunMsMarco -collection msmarco-v2.1-doc-segmented

To print out the commands that will generate the above runs without performing the runs, use the options -dryRun -printCommands.
❗ Beware, running these experiments will automatically download 2 indexes totaling 145.8 GB.
The MS MARCO V2.1 collections were created for the TREC RAG Track. There were two variants: the documents corpus and the segmented documents corpus. The documents corpus served as the source of the segmented documents corpus, but is not otherwise used in any formal evaluations. It primarily served development purposes for the TREC 2024 RAG evaluation, where previous qrels from MS MARCO V2 and DL21-DL23 were "projected over" to this corpus.
The table below reports effectiveness (dev and dev2 in terms of RR@10; DL21-DL23 and RAGgy in terms of nDCG@10):
| | dev | dev2 | DL21 | DL22 | DL23 | RAGgy |
|---|---|---|---|---|---|---|
| BM25 doc | 0.1654 | 0.1732 | 0.5183 | 0.2991 | 0.2914 | 0.3631 |
| BM25 doc-segmented | 0.1973 | 0.2000 | 0.5778 | 0.3576 | 0.3356 | 0.4227 |
The following command will reproduce the above experiments:
java -cp $ANSERINI_JAR io.anserini.reproduce.RunMsMarco -collection msmarco-v2.1-doc

To print out the commands that will generate the above runs without performing the runs, use the options -dryRun -printCommands.
❗ Beware, running these experiments will automatically download 174 indexes totaling 391.5 GB.
Here is a selection of models that are currently supported in Anserini:
- Flat = BM25, "flat" bag-of-words baseline
- MF = BM25, "multifield" bag-of-words baseline
- S = SPLADE-v3
- Bf = bge-base-en-v1.5 (flat)
- Bh = bge-base-en-v1.5 (HNSW)
- 🫙 = cached queries (columns without 🫙 use ONNX query encoding)
The table below reports the effectiveness of the models (nDCG@10):
| Corpus | Flat | MF | S 🫙 | S | Bf 🫙 | Bf | Bh 🫙 | Bh |
|---|---|---|---|---|---|---|---|---|
| trec-covid | 0.5947 | 0.6559 | 0.7299 | 0.7299 | 0.7814 | 0.7815 | 0.7834 | 0.7835 |
| bioasq | 0.5225 | 0.4646 | 0.5142 | 0.5142 | 0.4149 | 0.4148 | 0.4042 | 0.4042 |
| nfcorpus | 0.3218 | 0.3254 | 0.3629 | 0.3629 | 0.3735 | 0.3735 | 0.3735 | 0.3735 |
| nq | 0.3055 | 0.3285 | 0.5842 | 0.5842 | 0.5413 | 0.5415 | 0.5413 | 0.5415 |
| hotpotqa | 0.6330 | 0.6027 | 0.6884 | 0.6884 | 0.7259 | 0.7259 | 0.7242 | 0.7241 |
| fiqa | 0.2361 | 0.2361 | 0.3798 | 0.3798 | 0.4065 | 0.4065 | 0.4065 | 0.4065 |
| signal1m | 0.3304 | 0.3304 | 0.2465 | 0.2465 | 0.2886 | 0.2886 | 0.2869 | 0.2869 |
| trec-news | 0.3952 | 0.3977 | 0.4365 | 0.4365 | 0.4425 | 0.4424 | 0.4411 | 0.4410 |
| robust04 | 0.4070 | 0.4070 | 0.4952 | 0.4952 | 0.4465 | 0.4435 | 0.4467 | 0.4437 |
| arguana | 0.3970 | 0.4142 | 0.4872 | 0.4845 | 0.6361 | 0.6228 | 0.6361 | 0.6228 |
| webis-touche2020 | 0.4422 | 0.3673 | 0.3086 | 0.3086 | 0.2570 | 0.2571 | 0.2570 | 0.2571 |
| cqadupstack-android | 0.3801 | 0.3709 | 0.4109 | 0.4109 | 0.5075 | 0.5076 | 0.5075 | 0.5076 |
| cqadupstack-english | 0.3453 | 0.3321 | 0.4255 | 0.4255 | 0.4857 | 0.4857 | 0.4855 | 0.4855 |
| cqadupstack-gaming | 0.4822 | 0.4418 | 0.5193 | 0.5193 | 0.5965 | 0.5967 | 0.5965 | 0.5967 |
| cqadupstack-gis | 0.2901 | 0.2904 | 0.3236 | 0.3236 | 0.4127 | 0.4131 | 0.4129 | 0.4133 |
| cqadupstack-mathematica | 0.2015 | 0.2046 | 0.2445 | 0.2445 | 0.3163 | 0.3163 | 0.3163 | 0.3163 |
| cqadupstack-physics | 0.3214 | 0.3248 | 0.3753 | 0.3753 | 0.4722 | 0.4724 | 0.4722 | 0.4724 |
| cqadupstack-programmers | 0.2802 | 0.2963 | 0.3387 | 0.3387 | 0.4242 | 0.4238 | 0.4242 | 0.4238 |
| cqadupstack-stats | 0.2711 | 0.2790 | 0.3137 | 0.3137 | 0.3732 | 0.3728 | 0.3732 | 0.3728 |
| cqadupstack-tex | 0.2244 | 0.2086 | 0.2493 | 0.2493 | 0.3115 | 0.3115 | 0.3115 | 0.3115 |
| cqadupstack-unix | 0.2749 | 0.2788 | 0.3196 | 0.3196 | 0.4219 | 0.4220 | 0.4219 | 0.4220 |
| cqadupstack-webmasters | 0.3059 | 0.3008 | 0.3250 | 0.3250 | 0.4065 | 0.4072 | 0.4065 | 0.4072 |
| cqadupstack-wordpress | 0.2483 | 0.2562 | 0.2807 | 0.2807 | 0.3547 | 0.3547 | 0.3547 | 0.3547 |
| quora | 0.7886 | 0.7886 | 0.8141 | 0.8141 | 0.8890 | 0.8876 | 0.8890 | 0.8876 |
| dbpedia-entity | 0.3180 | 0.3128 | 0.4476 | 0.4476 | 0.4074 | 0.4073 | 0.4077 | 0.4076 |
| scidocs | 0.1490 | 0.1581 | 0.1567 | 0.1567 | 0.2170 | 0.2172 | 0.2170 | 0.2172 |
| fever | 0.6513 | 0.7530 | 0.8015 | 0.8015 | 0.8630 | 0.8629 | 0.8620 | 0.8620 |
| climate-fever | 0.1651 | 0.2129 | 0.2625 | 0.2625 | 0.3119 | 0.3117 | 0.3119 | 0.3117 |
| scifact | 0.6789 | 0.6647 | 0.7140 | 0.7140 | 0.7408 | 0.7408 | 0.7408 | 0.7408 |
The following command will reproduce the above experiments:
java -cp $ANSERINI_JAR io.anserini.reproduce.RunBeir

To print out the commands that will generate the above runs without performing the runs, use the options -dryRun -printCommands.
❗ Beware, running these experiments will automatically download 24 indexes totaling 1.7 GB.
BRIGHT is a retrieval benchmark described here. The following table reports nDCG@10 scores.
- Sv3 = SPLADE-v3 with ONNX query encoding
| Corpus | BM25 | Sv3 |
|---|---|---|
| StackExchange | ||
| Biology | 0.1824 | 0.2101 |
| Earth Science | 0.2791 | 0.2670 |
| Economics | 0.1645 | 0.1604 |
| Psychology | 0.1342 | 0.1527 |
| Robotics | 0.1091 | 0.1578 |
| Stack Overflow | 0.1626 | 0.1290 |
| Sustainable Living | 0.1613 | 0.1497 |
| StackExchange average | 0.1705 | 0.1752 |
| Coding | ||
| LeetCode | 0.2471 | 0.2603 |
| Pony | 0.0434 | 0.1440 |
| Coding average | 0.1453 | 0.2022 |
| Theorems | ||
| AoPS | 0.0645 | 0.0692 |
| TheoremQA-Q | 0.0733 | 0.1113 |
| TheoremQA-T | 0.0214 | 0.0554 |
| Theorems average | 0.0531 | 0.0786 |
| Overall average | 0.1369 | 0.1556 |
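The category and overall averages in the table are simple macro-averages (unweighted means) over the per-corpus scores. As a sanity check on the BM25 column (values copied from the table above):

```python
# Reproduce the BRIGHT category averages as unweighted means of the
# per-corpus BM25 scores reported in the table above.
stackexchange = [0.1824, 0.2791, 0.1645, 0.1342, 0.1091, 0.1626, 0.1613]
coding = [0.2471, 0.0434]
theorems = [0.0645, 0.0733, 0.0214]

def macro_avg(scores):
    return sum(scores) / len(scores)

# Agrees with the table to within rounding: StackExchange ~0.1705,
# Coding ~0.1453, Theorems ~0.0531; the overall average ~0.1369 is the
# mean over all 12 corpora (not the mean of the three category averages).
overall = macro_avg(stackexchange + coding + theorems)
```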
The following command will reproduce the above experiments:
java -cp $ANSERINI_JAR io.anserini.reproduce.RunBright

To print out the commands that will generate the above runs without performing the runs, use the options -dryRun -printCommands.