This repo contains the code for the ECIR 2025 short paper "Approximate Bag-of-Words Top-k Corpus Graphs" by Lachlan Dunn, Luke Gallagher, and Joel Mackenzie.
@inproceedings{dg+25ecir,
title = {Approximate Bag-of-Words Top-$k$ Corpus Graphs},
author = {L. Dunn and L. Gallagher and J. Mackenzie},
booktitle = {Proc. ECIR},
year = {2025},
pages = {174--182},
}
This work builds on prior work by Kulkarni et al., "Lexically-Accelerated Dense Retrieval", and MacAvaney et al., "Adaptive Re-Ranking with a Corpus Graph".
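The steps below call several command-line tools. A minimal preflight sketch is shown here; the helper name missing_tools is hypothetical and not part of this repo, and the tool list in the usage comment is an assumption drawn from the commands shown in this README, not an official requirements list.

```shell
# Report which of the named commands are not found on PATH.
# Prints one missing name per line; returns non-zero if any are missing.
# (Hypothetical helper, not part of this repo.)
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Example usage (tool list is an assumption based on the commands in this README):
#   missing_tools python3 git cmake make unxz ciff2pisa
```

No output means every named tool was found; otherwise install the listed tools before continuing.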
- Configure Python environment

      $ mkdir -p ~/.venvs
      $ python3 -m venv ~/.venvs/docgraph
      $ source ~/.venvs/docgraph/bin/activate
      $ pip install -r requirements.txt

- Download data

      $ ./tools/download_data.sh

- Setup dependencies.

  Build PISA from revision bb2b3df and apply patch for LimitPairs implemented by Joel Mackenzie.

      $ mkdir -p deps
      $ git clone https://github.com/pisa-engine/pisa deps/pisa
      $ cd deps/pisa
      $ git reset --hard bb2b3df
      $ git submodule update --init --recursive --depth 1
      $ git am ../../graph/0001-joel-limitpairs.patch
      $ mkdir -p build
      $ cd build
      $ cmake -DPISA_ENABLE_TESTING=OFF -D PISA_ENABLE_BENCHMARKING=OFF ..
      $ make -j$(nproc)

  It is assumed the ciff2pisa binary is available. If required, refer to the PISA ciff repo for installation instructions.

- Build inverted indexes.

      $ unxz -v data/msmarco-passage.pisa.bp.ciff.xz data/msmarco-passage.dt5q.pisa.bp.ciff.xz
      $ ./index/build.sh

  The indexes are provided in CIFF format. For reproducibility, note that document reordering was performed using faster graph bisection (revision 4ba3bb2) with the loggap gain function and minimum postings length of 128.

- Run the graph construction timings.

      $ unxz -v data/*.xz
      $ ./graph/build.sh

- Timing results can be found in graph/*.log.

- Run the non-graph baselines.

      $ ./sysrun/baseline

  The system runfiles will be in the runs directory. The stage0 runfiles are the combined BM25 runfiles from each track and are used in the re-ranking experiments.

      runs
      ├── dt5q-bm25-dl19.res.gz
      ├── dt5q-bm25-dl20.res.gz
      ├── dt5q-bm25-tasb-dl19.res.gz
      ├── dt5q-bm25-tasb-dl20.res.gz
      ├── dt5q-stage0.res.gz
      ├── original-bm25-dl19.res.gz
      ├── original-bm25-dl20.res.gz
      ├── original-bm25-tasb-dl19.res.gz
      ├── original-bm25-tasb-dl20.res.gz
      ├── original-stage0.res.gz
      ├── tasb-dl19.res.gz
      └── tasb-dl20.res.gz

- Run the re-ranking phase.

      $ ./sysrun/timing

- Timing results and runfiles can be found in the runs directory.
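Before starting the re-ranking phase, it can help to confirm that the baseline step produced all twelve runfiles listed above. The following is a minimal sketch under that assumption; the function name check_runfiles is hypothetical and not part of this repo, and the check only verifies that each file exists and passes a gzip integrity test, not that its contents are correct.

```shell
# Check that every baseline runfile listed in this README exists in a
# directory and is a readable gzip file.
# Prints one line per problem; returns non-zero if any problem was found.
# (Hypothetical helper, not part of this repo.)
check_runfiles() {
    run_dir=$1
    rc=0
    for name in \
        dt5q-bm25-dl19 dt5q-bm25-dl20 \
        dt5q-bm25-tasb-dl19 dt5q-bm25-tasb-dl20 dt5q-stage0 \
        original-bm25-dl19 original-bm25-dl20 \
        original-bm25-tasb-dl19 original-bm25-tasb-dl20 original-stage0 \
        tasb-dl19 tasb-dl20
    do
        file="$run_dir/$name.res.gz"
        if [ ! -f "$file" ]; then
            echo "missing: $file"
            rc=1
        elif ! gzip -t "$file" 2>/dev/null; then
            echo "corrupt: $file"
            rc=1
        fi
    done
    return "$rc"
}

# Example usage, from the repository root:
#   check_runfiles runs
```

No output and a zero exit status indicate that all twelve files are present and decompress cleanly.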
This work used trec_eval (v9.0.8) for evaluation.
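As a rough sketch of how a single runfile might be scored, the helper below decompresses a .res.gz file to a temporary file and passes it to trec_eval. Several things here are assumptions rather than facts from this repo: the function name eval_run is hypothetical, the qrels path in the usage comment is a placeholder, the default measure ndcg_cut.10 is only an example and not necessarily a measure reported in the paper, and the runfile is decompressed on the assumption that trec_eval expects plain-text input.

```shell
# Evaluate one gzip-compressed runfile with trec_eval.
# Usage: eval_run QRELS RUN.res.gz [MEASURE]
# (Hypothetical helper, not part of this repo.)
eval_run() {
    qrels=$1
    run_gz=$2
    measure=${3:-ndcg_cut.10}   # example default; an assumption, not the paper's setting
    run_tmp=$(mktemp) || return 1
    # Decompress to a temporary plain-text file before evaluation.
    if gzip -dc "$run_gz" > "$run_tmp"; then
        trec_eval -m "$measure" "$qrels" "$run_tmp"
        rc=$?
    else
        rc=1
    fi
    rm -f "$run_tmp"
    return "$rc"
}

# Example usage (the qrels path is a placeholder):
#   eval_run path/to/dl19.qrels runs/dt5q-bm25-dl19.res.gz
#   eval_run path/to/dl19.qrels runs/dt5q-bm25-dl19.res.gz recip_rank
```

The temporary file is removed whether or not the evaluation succeeds, and the exit status of trec_eval is passed back to the caller.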
The query sets can be built from scratch rather than using the pre-computed versions from download_data.sh. Each query reduction heuristic (title+url, tfidf, and dt5q) has an associated script in the tools directory. To build them, run ./tools/build_qryheur.sh.