Reorganize README and add RAM control documentation (#64)
Conversation
Copilot wasn't able to review any files in this pull request.
@copilot You didn't do anything.
Co-authored-by: sergeyf <1874668+sergeyf@users.noreply.github.com>
…ng in Reproducibility
Co-authored-by: sergeyf <1874668+sergeyf@users.noreply.github.com>
The changes were pushed. Here's what was done:

**Reorganization:** Added a table of contents, moved Data and Configuration right after Installation, consolidated the quickstart, grouped usage sections logically (Production Model → Training → Prediction), moved Rust/cache/subblocking/profiling under "Advanced Topics", and moved tests/versioning/docs under "Development".

**Reference features clarification:** Added a "What 'does not use reference features' means" subsection under Production Model that lists the six disabled features and explains that …
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
> To obtain the S2AND dataset, run the following command after the package is installed (from inside the `S2AND` directory):
>
> ```` ```[Expected download size is: 50.4 GiB]``` ````
>
> ## Running Tests
>
> `aws s3 sync --no-sign-request s3://ai2-s2-research-public/s2and-release data/`
The expected download size line is wrapped in a single-line fenced code block (```` ```[Expected download size is: 50.4 GiB]``` ````), which doesn't render correctly as Markdown on GitHub. Consider making this plain text (or a blockquote) and putting the `aws s3 sync ...` command in a normal `bash` fenced block for copy/paste consistency with the rest of the README.
Already applied in 827ca4c. The download size is now plain bold text (`**~50.4 GiB**`) and the `aws s3 sync` command is in a proper `bash` fenced block.
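Per that reply, the fixed README passage presumably reads roughly like this (a sketch assembled from the details in this thread; the actual wording in the README may differ):

````markdown
To obtain the S2AND dataset, run the following command after the package is installed (from inside the `S2AND` directory):

```bash
aws s3 sync --no-sign-request s3://ai2-s2-research-public/s2and-release data/
```

Expected download size is **~50.4 GiB**.
````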
> Run a bundled example with the `tests/qian` fixture (no large data download needed):
>
> ```bash
> uv run --no-project python scripts/tutorial_for_predicting_with_the_prod_model.py \
>     --use-rust 1 \
>     --dataset qian \
>     --data-root tests \
>     --load-name-counts 0
> ```
>
> To run the entire CI suite mimicking the GH Actions, use the following command:
>
> ```bash
> uv run python scripts/run_ci_locally.py
> ```
Quick Start claims "no large data download needed", but `scripts/tutorial_for_predicting_with_the_prod_model.py` loads a production model pickle from `data/production_model_v1.1.pickle` by default, and the repository's `data/` directory doesn't include any `production_model_*.pickle` files. As written, the command will fail on a fresh clone unless the user downloads at least the model artifact(s); consider adding an explicit step/command to fetch the model pickle(s) (without syncing the full 50+ GiB dataset) or adjusting the wording.
Fixed in 827ca4c. Quick Start now clarifies that the model pickle is still required and links to the Data section. Also added a standalone `aws s3 cp` command for downloading just the model pickle (small) without the full 50 GiB dataset, fixed the download size rendering, and added `--dataset qian` to the `s2and_mini` example.
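A pickle-only download along those lines might look like the following (a sketch: the bucket `s3://ai2-s2-research-public/s2and-release` comes from the README's sync command, but the exact object key for the model pickle is an assumption inferred from the default local path `data/production_model_v1.1.pickle`):

```shell
# Fetch only the production model pickle instead of syncing the full ~50.4 GiB dataset.
# ASSUMPTION: the object key mirrors the default local filename; if this 404s,
# list the bucket to find the real key:
#   aws s3 ls --no-sign-request s3://ai2-s2-research-public/s2and-release/
aws s3 cp --no-sign-request \
    s3://ai2-s2-research-public/s2and-release/production_model_v1.1.pickle \
    data/production_model_v1.1.pickle
```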
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…rendering, fix `s2and_mini` command
Co-authored-by: sergeyf <1874668+sergeyf@users.noreply.github.com>
All changes were applied across commits.
@atalyaalon Trying to make the README.md nicer. Please take a look!
Co-authored-by: sergeyf <1874668+sergeyf@users.noreply.github.com>
@atalyaalon Let me know if this all makes sense.
The README lacked clear structure and didn't explain what "does not use reference features" means in practice. It was also missing documentation on the memory-control knobs.
**README reorganization**

**Reference features clarification**

- `references` in `papers.json` can be `null`/omitted for v1.1+ models; all `signatures.json` fields are still required

**RAM control section (Advanced Topics)**

Documents the two primary knobs and their interaction:

- `train_pairs_size`
- `total_ram_bytes`
- `batch_size` (Clusterer)
- `n_jobs`
- `batching_threshold`

Auto-detect applies a 0.8× safety factor, then a 10% margin, sequentially (~72% of detected RAM minus current RSS).
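The auto-detect arithmetic above can be sketched as follows (a hypothetical helper, not S2AND's actual code; the function name and the exact point where RSS is subtracted are assumptions):

```python
def auto_ram_budget(detected_ram_bytes: int, current_rss_bytes: int = 0) -> int:
    """Sketch of the auto-detect described above: subtract the process's
    current RSS from detected RAM, then apply a 0.8x safety factor
    followed by a further 10% margin (0.8 * 0.9 = 0.72, i.e. ~72%).

    Hypothetical helper -- S2AND's real implementation may differ in
    naming and in where exactly the RSS subtraction happens.
    """
    available = detected_ram_bytes - current_rss_bytes
    return int(available * 0.8 * 0.9)


# Example: 16 GiB detected, 1 GiB already resident -> roughly 10.8 GiB budget.
budget = auto_ram_budget(16 * 1024**3, 1 * 1024**3)
print(budget)
```

In practice the detected values would come from something like `psutil.virtual_memory().total` and `psutil.Process().memory_info().rss`; they are passed in explicitly here to keep the sketch dependency-free.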
**Review feedback fixes**

- Added a standalone `aws s3 cp` command for pickle-only download
- Fixed the `s2and_mini` command to include `--dataset qian`