Reproduction log msmarco combined all [Anserini]#3180
Open
ShanaxWorld wants to merge 8 commits into castorini:master
Member
@ShanaxWorld please resolve conflicts. Also missing a repro entry.
Author
Fixed it
#start-here-repro-log
System:
OS: Ubuntu (WSL2 on Windows)
Python: 3.11 (conda environment)
Java: OpenJDK 21
Hardware: Intel Core i5 9400F (x64, 6 cores)
RAM: 16 GB
Graphics card: Nvidia GTX 1650 Super (4 GB)
Results:
1000000 58716381 374524070 collections/msmarco-passage/collection_jsonl/docs00.json
1000000 59072018 377845773 collections/msmarco-passage/collection_jsonl/docs01.json
1000000 58895092 375856044 collections/msmarco-passage/collection_jsonl/docs02.json
1000000 59277129 377452947 collections/msmarco-passage/collection_jsonl/docs03.json
1000000 59408028 378277584 collections/msmarco-passage/collection_jsonl/docs04.json
1000000 60659246 383758389 collections/msmarco-passage/collection_jsonl/docs05.json
1000000 63196730 400184520 collections/msmarco-passage/collection_jsonl/docs06.json
1000000 56920456 364726419 collections/msmarco-passage/collection_jsonl/docs07.json
841823 47767342 306155721 collections/msmarco-passage/collection_jsonl/docs08.json
8841823 523912422 3338781467 total
7437 29748 143300 collections/msmarco-passage/qrels.dev.small.tsv
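The three columns above are the line, word, and byte counts that `wc` prints. As a rough cross-check, the sketch below recomputes the same three counts for one file and confirms that the per-file rows in the log add up to the reported total. The helper name `wc` is introduced here only for illustration, and word counts are assumed to split on ASCII whitespace, so they may differ slightly from `wc` in some locales.

```python
def wc(path):
    """Return (lines, words, bytes) for a file, mirroring the three wc columns."""
    lines = words = size = 0
    with open(path, "rb") as f:
        for chunk in f:  # iterating a binary file yields one line at a time
            lines += chunk.endswith(b"\n")  # wc counts newline characters
            words += len(chunk.split())     # assumption: ASCII whitespace only
            size += len(chunk)
    return lines, words, size


# Per-file counts copied from the log above (docs00.json to docs08.json).
reported = [
    (1000000, 58716381, 374524070),
    (1000000, 59072018, 377845773),
    (1000000, 58895092, 375856044),
    (1000000, 59277129, 377452947),
    (1000000, 59408028, 378277584),
    (1000000, 60659246, 383758389),
    (1000000, 63196730, 400184520),
    (1000000, 56920456, 364726419),
    (841823, 47767342, 306155721),
]

# Column-wise sums should match the "total" row in the log.
total = tuple(sum(column) for column in zip(*reported))
assert total == (8841823, 523912422, 3338781467)
```

Calling `wc("collections/msmarco-passage/collection_jsonl/docs00.json")` on a local copy should then be comparable with the first row, subject to the whitespace caveat.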
Notes:
Had to set up Ubuntu under WSL2 on the computer
Downloaded the collection from the MS MARCO website
Installed Python and the Java SDK
Results reproduced without much trouble
#3174
Reproduced BM25 baseline for MS MARCO passage ranking (Anserini sparse retrieval).
Environment:
Windows 11 (WSL2 Ubuntu)
Java 21
Python 3.12.3
Results:
MRR@10 = 0.1874 (dev.small)
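For readers unfamiliar with the metric, MRR@10 averages, over all judged queries, the reciprocal rank of the first relevant passage within the top 10 hits (0 if there is none). The following is a minimal sketch with toy data, not the evaluation script used for this run; the function name and the dictionary layout are assumptions made for illustration.

```python
def mrr_at_k(run, qrels, k=10):
    """Mean reciprocal rank at cutoff k.

    run:   dict mapping query id -> ranked list of passage ids (best first)
    qrels: dict mapping query id -> set of relevant passage ids
    """
    total = 0.0
    for qid, relevant in qrels.items():
        # Only the first relevant hit within the top k contributes.
        for rank, pid in enumerate(run.get(qid, [])[:k], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break
    return total / len(qrels)


# Toy data (hypothetical ids): q1 hits at rank 2, q2 at rank 1, q3 has no hit.
qrels = {"q1": {"p7"}, "q2": {"p3"}, "q3": {"p9"}}
run = {
    "q1": ["p1", "p7", "p2"],
    "q2": ["p3", "p4"],
    "q3": ["p5", "p6"],
}
score = mrr_at_k(run, qrels)  # (1/2 + 1 + 0) / 3
```

Queries with no relevant passage in the top 10 still count in the denominator, which is why the toy score is 0.5 rather than 0.75.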
Notes:
Installed Ubuntu (WSL2) to run the pipeline on Windows
Configured Python so that python maps to Python 3
Encountered CRLF issues when executing scripts from /mnt filesystem (fixed using dos2unix)
Observed file permission warnings (utime) on /mnt/e, but they did not affect processing
Maven build initially failed due to memory limits; resolved by skipping tests
Java 21 required for successful compilation
Successfully built Anserini index and retrieved ~7M results
Everything worked as expected after resolving the above issues.
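The CRLF issue noted above arises when scripts checked out on the Windows side carry `\r\n` line endings, which the Linux shell does not accept. `dos2unix` was the fix used here; the sketch below only illustrates the core of that conversion in Python, under the assumption that the file is a plain text script. The function name is hypothetical, and the real tool handles more cases.

```python
from pathlib import Path


def crlf_to_lf(path):
    """Rewrite a file in place with LF line endings; return True if it changed."""
    p = Path(path)
    data = p.read_bytes()
    fixed = data.replace(b"\r\n", b"\n")  # only CRLF pairs are touched
    if fixed != data:
        p.write_bytes(fixed)
        return True
    return False
```

Working in bytes avoids any guess about the file's text encoding, and the return value makes it easy to report which scripts actually needed fixing.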
#3175
Successfully reproduced BM25 baseline on the MS MARCO passage ranking task using the prebuilt Anserini index.
Environment
OS: Windows 11 (WSL2 Ubuntu)
Java: 21
Python: 3.12.3
Results
MRR@10 = 0.1875 (dev.small)
Notes
Initial runs with default settings (-parallelism 4, -hits 1000) resulted in the process being killed due to memory limits under WSL2.
Resolved by:
Reducing parallelism to 1
Reducing hits to 100
Setting Java heap: export JAVA_OPTS="-Xms512m -Xmx2g"
After these adjustments, BM25 retrieval completed successfully.
Dense Retrieval (BGE + HNSW)
Successfully reproduced dense retrieval using prebuilt HNSW index with BGE-base embeddings.
Results
MRR@10 = 0.3521 (dev.small)
Notes
Downloading the prebuilt dense index (~2–3GB) took approximately 30 minutes, depending on network conditions.
Query processing took ~37 minutes (~3 queries/sec), significantly slower than BM25.
Dense retrieval did not encounter memory issues under the same environment.
Results are consistent with expected performance for BGE-base on MS MARCO.
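The throughput figure in the notes can be checked with simple arithmetic. The sketch below assumes the dev.small query set contains 6,980 queries, a number that is not stated in this log, and uses the 37 minute elapsed time from the notes.

```python
NUM_QUERIES = 6980     # assumption: size of the dev.small query set, not from the log
ELAPSED_MINUTES = 37   # query processing time reported in the notes


def queries_per_second(num_queries, minutes):
    """Average throughput over the whole run."""
    return num_queries / (minutes * 60)


qps = queries_per_second(NUM_QUERIES, ELAPSED_MINUTES)
print(f"{qps:.2f} queries/sec")  # prints "3.14 queries/sec"
```

Under that assumption the result agrees with the "~3 queries/sec" reported above.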