
Reproduction log msmarco combined all [Anserini] #3180

Open
ShanaxWorld wants to merge 8 commits into castorini:master from ShanaxWorld:reproduction-log-msmarco-combined

@ShanaxWorld ShanaxWorld commented Apr 4, 2026

#start-here-repro-log
System:
OS: Ubuntu (WSL2 on Windows)
Python: 3.11 (conda environment)
Java: OpenJDK 21
CPU: Intel Core i5-9400F (x64, 6 cores)
RAM: 16 GB
Graphics card: NVIDIA GTX 1650 Super (4 GB)

Results (wc output; columns are lines, words, bytes):
 1000000  58716381  374524070 collections/msmarco-passage/collection_jsonl/docs00.json
 1000000  59072018  377845773 collections/msmarco-passage/collection_jsonl/docs01.json
 1000000  58895092  375856044 collections/msmarco-passage/collection_jsonl/docs02.json
 1000000  59277129  377452947 collections/msmarco-passage/collection_jsonl/docs03.json
 1000000  59408028  378277584 collections/msmarco-passage/collection_jsonl/docs04.json
 1000000  60659246  383758389 collections/msmarco-passage/collection_jsonl/docs05.json
 1000000  63196730  400184520 collections/msmarco-passage/collection_jsonl/docs06.json
 1000000  56920456  364726419 collections/msmarco-passage/collection_jsonl/docs07.json
  841823  47767342  306155721 collections/msmarco-passage/collection_jsonl/docs08.json
 8841823 523912422 3338781467 total
    7437     29748     143300 collections/msmarco-passage/qrels.dev.small.tsv
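
As a quick sanity check on the counts above, the per-file rows can be summed and compared against the reported `total` row. This is only a sketch: the numbers are copied from the log, and the names `WC_ROWS`, `REPORTED_TOTAL`, and `column_sums` are illustrative, not part of Anserini.

```python
# Per-file (lines, words, bytes) counts copied from the wc output in this log.
WC_ROWS = [
    (1000000, 58716381, 374524070),  # docs00.json
    (1000000, 59072018, 377845773),  # docs01.json
    (1000000, 58895092, 375856044),  # docs02.json
    (1000000, 59277129, 377452947),  # docs03.json
    (1000000, 59408028, 378277584),  # docs04.json
    (1000000, 60659246, 383758389),  # docs05.json
    (1000000, 63196730, 400184520),  # docs06.json
    (1000000, 56920456, 364726419),  # docs07.json
    (841823, 47767342, 306155721),   # docs08.json
]

# The "total" row as reported in the log.
REPORTED_TOTAL = (8841823, 523912422, 3338781467)


def column_sums(rows):
    """Sum each column (lines, words, bytes) across all rows."""
    return tuple(sum(column) for column in zip(*rows))


totals = column_sums(WC_ROWS)
print("totals match:", totals == REPORTED_TOTAL)
```

The first column is the number of JSON lines, so the sum also gives the number of passages in the collection.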

Notes:
Had to set up Ubuntu so that WSL would work on the computer
Downloaded the collection from the MS MARCO website
Installed Python and the Java SDK
Reproduced the results without any significant problems

#3174
Reproduced BM25 baseline for MS MARCO passage ranking (Anserini sparse retrieval).

Environment:

Windows 11 (WSL2 Ubuntu)
Java 21
Python 3.12.3

Results:

MRR@10 = 0.1874 (dev.small)

Notes:

Installed Ubuntu (WSL2) to run the pipeline on Windows
Configured Python so that python maps to Python 3
Encountered CRLF issues when executing scripts from /mnt filesystem (fixed using dos2unix)
Observed file permission warnings (utime) on /mnt/e, but they did not affect processing
Maven build initially failed due to memory limits; resolved by skipping tests
Java 21 required for successful compilation
Successfully built Anserini index and retrieved ~7M results
Everything worked as expected after resolving the above issues.
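
For readers who hit the same CRLF problem without `dos2unix` installed, the line-ending conversion can be sketched in Python. This is an illustrative alternative, not what was run in this log; the helper names `crlf_to_lf` and `normalize_file` are hypothetical.

```python
from pathlib import Path


def crlf_to_lf(data: bytes) -> bytes:
    """Replace Windows (CRLF) line endings with Unix (LF) line endings."""
    return data.replace(b"\r\n", b"\n")


def normalize_file(path: Path) -> bool:
    """Rewrite the file in place if it contains CRLF; return True if it changed."""
    original = path.read_bytes()
    converted = crlf_to_lf(original)
    if converted == original:
        return False
    path.write_bytes(converted)
    return True
```

Calling `normalize_file(Path("some_script.sh"))` (a placeholder file name) rewrites the script only when it actually contains CRLF sequences, so files that are already clean are left untouched.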

#3175
Successfully reproduced BM25 baseline on the MS MARCO passage ranking task using the prebuilt Anserini index.

Environment
OS: Windows 11 (WSL2 Ubuntu)
Java: 21
Python: 3.12.3

Results
MRR@10 = 0.1875 (dev.small)
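
For context on the metric, here is a minimal sketch of how MRR@10 is defined: for each query, take the reciprocal rank of the first relevant passage within the top 10, or 0 if there is none, then average over queries. The function `mrr_at_k` and the toy data are illustrative; the scores in this log came from the evaluation tooling used in the reproduction, not from this code. Averaging over the queries in the qrels is an assumption of this sketch.

```python
def mrr_at_k(qrels, run, k=10):
    """Mean reciprocal rank at cutoff k.

    qrels: dict mapping query id -> set of relevant doc ids
    run:   dict mapping query id -> list of doc ids, best first
    """
    if not qrels:
        return 0.0
    total = 0.0
    for qid, relevant in qrels.items():
        # Only the top-k ranked documents count toward the score.
        for rank, docid in enumerate(run.get(qid, [])[:k], start=1):
            if docid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit matters
    return total / len(qrels)


# Toy example: q1 has its relevant passage at rank 2, q2 has none retrieved.
toy_qrels = {"q1": {"d3"}, "q2": {"d9"}}
toy_run = {"q1": ["d1", "d3", "d7"], "q2": ["d2", "d4"]}
print(mrr_at_k(toy_qrels, toy_run))  # (1/2 + 0) / 2 = 0.25
```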

Notes
Initial runs with default settings (-parallelism 4, -hits 1000) resulted in the process being killed due to memory limits under WSL2.
Resolved by:
Reducing parallelism to 1
Reducing hits to 100
Setting Java heap: export JAVA_OPTS="-Xms512m -Xmx2g"
After these adjustments, BM25 retrieval completed successfully.

Dense Retrieval (BGE + HNSW)

Successfully reproduced dense retrieval using prebuilt HNSW index with BGE-base embeddings.

Results
MRR@10 = 0.3521 (dev.small)

Notes
Downloading the prebuilt dense index (~2–3GB) took approximately 30 minutes, depending on network conditions.
Query processing took ~37 minutes (~3 queries/sec), significantly slower than BM25.
Dense retrieval did not encounter memory issues under the same environment.
Results are consistent with expected performance for BGE-base on MS MARCO.
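
The reported throughput can be cross-checked with simple arithmetic. The sketch below assumes the dev.small query set contains 6,980 queries; that number is not stated in this log and is an assumption here, as are the variable names.

```python
# Assumption (not stated in the log): dev.small contains 6,980 queries.
NUM_QUERIES = 6980
# Elapsed query-processing time reported in the notes above.
ELAPSED_MINUTES = 37


def queries_per_second(num_queries, minutes):
    """Average throughput over the whole run."""
    return num_queries / (minutes * 60)


qps = queries_per_second(NUM_QUERIES, ELAPSED_MINUTES)
print(f"{qps:.2f} queries/sec")  # 6980 / 2220 s
```

Under that assumption the result comes out at roughly 3.1 queries/sec, in line with the "~3 queries/sec" noted above.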

Member

lintool commented Apr 5, 2026

@ShanaxWorld please resolve conflicts. also missing a repro entry.

@ShanaxWorld
Author

@ShanaxWorld please resolve conflicts. also missing a repro entry.

Fixed it
