Reorganize README and add RAM control documentation (#64)
Conversation
Copilot wasn't able to review any files in this pull request.
@copilot You didn't do anything.
Co-authored-by: sergeyf <1874668+sergeyf@users.noreply.github.com>
…ng in Reproducibility
Co-authored-by: sergeyf <1874668+sergeyf@users.noreply.github.com>
The changes were pushed. Here's what was done:

**Reorganization:** Added a table of contents, moved Data and Configuration right after Installation, consolidated the quickstart, grouped usage sections logically (Production Model → Training → Prediction), moved Rust/cache/subblocking/profiling under "Advanced Topics", and moved tests/versioning/docs under "Development".

**Reference features clarification:** Added a "What 'does not use reference features' means" subsection under Production Model that lists the six disabled features and explains that …
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
> To obtain the S2AND dataset, run the following command after the package is installed (from inside the `S2AND` directory):
>
> ```` ```[Expected download size is: 50.4 GiB]``` ````
>
> ## Running Tests
>
> `aws s3 sync --no-sign-request s3://ai2-s2-research-public/s2and-release data/`
The expected download size line is wrapped in a single-line fenced code block (```` ```[Expected download size is: 50.4 GiB]``` ````), which doesn't render correctly as Markdown on GitHub. Consider making this plain text (or a blockquote) and putting the `aws s3 sync ...` command in a normal `bash` fenced block for copy/paste consistency with the rest of the README.
Already applied in 827ca4c. The download size is now plain bold text (`**~50.4 GiB**`) and the `aws s3 sync` command is in a proper `bash` fenced block.
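Per that reply, the fixed README passage presumably reads roughly like this (a sketch assembled from the details in this thread; the actual wording in the README may differ):

````markdown
To obtain the S2AND dataset, run the following command after the package is installed (from inside the `S2AND` directory):

```bash
aws s3 sync --no-sign-request s3://ai2-s2-research-public/s2and-release data/
```

Expected download size is **~50.4 GiB**.
````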
> Run a bundled example with the `tests/qian` fixture (no large data download needed):
>
> ```bash
> uv run --no-project python scripts/tutorial_for_predicting_with_the_prod_model.py \
>     --use-rust 1 \
>     --dataset qian \
>     --data-root tests \
>     --load-name-counts 0
> ```
>
> To run the entire CI suite mimicking the GH Actions, use the following command:
>
> ```bash
> uv run python scripts/run_ci_locally.py
> ```
Quick Start claims "no large data download needed", but `scripts/tutorial_for_predicting_with_the_prod_model.py` loads a production model pickle from `data/production_model_v1.1.pickle` by default, and the repository's `data/` directory doesn't include any `production_model_*.pickle` files. As written, the command will fail on a fresh clone unless the user downloads at least the model artifact(s); consider adding an explicit step/command to fetch the model pickle(s) (without syncing the full 50+ GiB dataset) or adjusting the wording.
Fixed in 827ca4c. Quick Start now clarifies that the model pickle is still required and links to the Data section. Also added a standalone `aws s3 cp` command for downloading just the model pickle (small) without the full 50 GiB dataset, fixed the download size rendering, and added `--dataset qian` to the `s2and_mini` example.
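A pickle-only download along those lines might look like the following (a sketch: the bucket `s3://ai2-s2-research-public/s2and-release` comes from the README's sync command, but the exact object key for the model pickle is an assumption inferred from the default local path `data/production_model_v1.1.pickle`):

```shell
# Fetch only the production model pickle instead of syncing the full ~50.4 GiB dataset.
# ASSUMPTION: the object key mirrors the default local filename; if this 404s,
# list the bucket to find the real key:
#   aws s3 ls --no-sign-request s3://ai2-s2-research-public/s2and-release/
aws s3 cp --no-sign-request \
    s3://ai2-s2-research-public/s2and-release/production_model_v1.1.pickle \
    data/production_model_v1.1.pickle
```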
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…rendering, fix `s2and_mini` command
Co-authored-by: sergeyf <1874668+sergeyf@users.noreply.github.com>
All changes were applied across commits.
@atalyaalon Trying to make the README.md nicer. Please take a look!
Co-authored-by: sergeyf <1874668+sergeyf@users.noreply.github.com>
@atalyaalon Let me know if this all makes sense.
The README lacked clear structure and didn't explain what "does not use reference features" means in practice. It was also missing documentation on the memory-control knobs.
**README reorganization**

**Reference features clarification**

- `references` in `papers.json` can be `null`/omitted for v1.1+ models; all `signatures.json` fields are still required

**RAM control section (Advanced Topics)**

Documents the two primary knobs and their interaction:

- `train_pairs_size`
- `total_ram_bytes`
- `batch_size` (Clusterer)
- `n_jobs`
- `batching_threshold`

Auto-detect applies a 0.8× safety factor, then a 10% margin, sequentially (~72% of detected RAM minus current RSS).
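The auto-detect arithmetic above can be sketched as follows (a hypothetical helper, not S2AND's actual code; the function name and the exact point where RSS is subtracted are assumptions):

```python
def auto_ram_budget(detected_ram_bytes: int, current_rss_bytes: int = 0) -> int:
    """Sketch of the auto-detect described above: subtract the process's
    current RSS from detected RAM, then apply a 0.8x safety factor
    followed by a further 10% margin (0.8 * 0.9 = 0.72, i.e. ~72%).

    Hypothetical helper -- S2AND's real implementation may differ in
    naming and in where exactly the RSS subtraction happens.
    """
    available = detected_ram_bytes - current_rss_bytes
    return int(available * 0.8 * 0.9)


# Example: 16 GiB detected, 1 GiB already resident -> roughly 10.8 GiB budget.
budget = auto_ram_budget(16 * 1024**3, 1 * 1024**3)
print(budget)
```

In practice the detected values would come from something like `psutil.virtual_memory().total` and `psutil.Process().memory_info().rss`; they are passed in explicitly here to keep the sketch dependency-free.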
**Review feedback fixes**

- Added a standalone `aws s3 cp` command for pickle-only download
- Fixed the `s2and_mini` command to include `--dataset qian`