
Reorganize README and add RAM control documentation #64

Merged
sergeyf merged 6 commits into main from copilot/reorganize-readme-and-clarify-model-descriptions on Mar 16, 2026


Contributor

Copilot AI commented Mar 5, 2026

The README lacked clear structure and did not explain what "does not use reference features" means in practice. It also had no documentation of the memory-control knobs.

README reorganization

  • Added table of contents, reordered sections logically: Installation → Data → Config → Quick Start → Production Model → Training → Prediction → Advanced Topics → Development
  • Reformatted production model list as a scannable table

Reference features clarification

  • Added "What 'does not use reference features' means" subsection listing the six disabled features
  • Documented that references in papers.json can be null/omitted for v1.1+ models; all signatures.json fields still required
  • Minimal JSON examples for both files
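As a rough illustration of the point about references, the sketch below nulls out the references field before papers are handed to a v1.1+ model. It assumes papers.json is a JSON object keyed by paper ID; that layout, the `strip_references` helper, and the placeholder `title` values are assumptions made for this example and are not taken from the README.

```python
import copy
import json


def strip_references(papers: dict) -> dict:
    """Return a deep copy of `papers` with every "references" value set to None."""
    stripped = copy.deepcopy(papers)
    for paper in stripped.values():
        paper["references"] = None
    return stripped


# Hypothetical two-paper input. Only the "references" key comes from the PR text;
# the IDs and titles are placeholders.
example_papers = {
    "1": {"title": "A placeholder title", "references": [2, 3]},
    "2": {"title": "Another placeholder", "references": []},
}

slim = strip_references(example_papers)
print(json.dumps(slim, indent=2))
```

The input mapping is left unmodified, so the same data can still be used with v1.0, which does need references.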

RAM control section (Advanced Topics)

Documents the two primary knobs (train_pairs_size for training, total_ram_bytes for inference), the secondary knobs, and how they interact:

| Knob | Phase | Controls |
|---|---|---|
| train_pairs_size | Training | Sampled pair count → feature matrix size |
| total_ram_bytes | Inference | Chunk sizes, accumulator limits, Rust batch plans |
| batch_size (Clusterer) | Inference | Max pairs per featurization chunk |
| n_jobs | Both | Parallelism / concurrent memory |
| batching_threshold | Inference | Block-size cap before subblocking |

# Inference: explicit 16 GiB budget
result = clusterer.predict_incremental(
    block_signatures, dataset,
    total_ram_bytes=16 * 1024**3,
)

# Training: control pair count
dataset = ANDData(..., train_pairs_size=30000)

Auto-detect applies a 0.8× safety factor and then a 10% margin, sequentially (0.8 × 0.9 = 0.72, so ~72% of detected RAM minus current RSS).
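
The arithmetic behind the auto-detected budget can be sketched as follows. This is only one reading of the sentence above (scale detected RAM first, then subtract current RSS). The function name `auto_ram_budget` and its parameters are hypothetical and do not come from the library, whose actual implementation may order the steps differently.

```python
GIB = 1024**3


def auto_ram_budget(
    detected_bytes: int,
    rss_bytes: int,
    safety_factor: float = 0.8,  # the 0.8x safety factor from the description
    margin: float = 0.10,        # the 10% margin applied afterwards
) -> int:
    """Hypothetical sketch: scale detected RAM by 0.8, then by 0.9, then subtract RSS."""
    scaled = detected_bytes * safety_factor * (1.0 - margin)
    # Clamp at zero so a large RSS never yields a negative budget.
    return max(int(scaled) - rss_bytes, 0)


# Example: 64 GiB detected, 4 GiB already resident in this process.
budget = auto_ram_budget(64 * GIB, 4 * GIB)
print(f"{budget / GIB:.2f} GiB")
```

Passing total_ram_bytes explicitly, as in the inference example above, sidesteps auto-detection entirely.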

Review feedback fixes

  • Fixed download size rendering (bold text + proper bash block)
  • Quick Start: clarified model pickle is still needed, added aws s3 cp for pickle-only download
  • Fixed s2and_mini command to include --dataset qian


@sergeyf sergeyf marked this pull request as ready for review March 6, 2026 00:05
@sergeyf sergeyf requested a review from Copilot March 6, 2026 00:06
Contributor

Copilot AI left a comment


Copilot wasn't able to review any files in this pull request.



Collaborator

sergeyf commented Mar 6, 2026

@copilot You didn't do anything.
Request is: the README.md is not super organized. maybe you can reorganize it a bit? also, when the production model descriptions say "does not use reference features", provide clear instructions as to what can be left out of the papers or signatures

Copilot stopped work on behalf of sergeyf due to an error March 6, 2026 00:07
Copilot AI and others added 2 commits March 6, 2026 00:15
Co-authored-by: sergeyf <1874668+sergeyf@users.noreply.github.com>
…ng in Reproducibility

Co-authored-by: sergeyf <1874668+sergeyf@users.noreply.github.com>
Copilot AI changed the title on Mar 6, 2026: [WIP] Reorganize README.md for better clarity and structure → Reorganize README and clarify "does not use reference features"
Contributor Author

Copilot AI commented Mar 6, 2026

@copilot You didn't do anything.
Request is: the README.md is not super organized. maybe you can reorganize it a bit? also, when the production model descriptions say "does not use re...

The changes were pushed in 5e5d34f and 536ace1 — the initial commit was just a plan placeholder.

Here's what was done:

Reorganization: Added a table of contents, moved Data and Configuration right after Installation, consolidated the quickstart, grouped usage sections logically (Production Model → Training → Prediction), moved Rust/cache/subblocking/profiling under "Advanced Topics", and moved tests/versioning/docs under "Development".

Reference features clarification: Added a "What 'does not use reference features' means" subsection under Production Model that lists the six disabled features, explains that references in papers.json can be null or omitted (only v1.0 needs it), notes that all signatures.json fields are still required, and provides minimal example JSON for both files. Also reformatted the model list as a scannable table.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.



Comment thread README.md Outdated
Comment on lines +120 to +123
To obtain the S2AND dataset, run the following command after the package is installed (from inside the `S2AND` directory):
```[Expected download size is: 50.4 GiB]```

## Running Tests
`aws s3 sync --no-sign-request s3://ai2-s2-research-public/s2and-release data/`

Copilot AI Mar 6, 2026


The expected download size line is wrapped in a single-line fenced code block ([Expected download size is: 50.4 GiB]), which doesn’t render correctly as Markdown on GitHub. Consider making this plain text (or a blockquote) and putting the aws s3 sync ... command in a normal bash fenced block for copy/paste consistency with the rest of the README.

Collaborator


@copilot apply changes based on this feedback

Contributor Author


Already applied in 827ca4c. The download size is now plain bold text (**~50.4 GiB**) and the aws s3 sync command is in a proper ```bash fenced block.

Comment thread README.md
Comment thread README.md Outdated
Comment on lines 142 to 150
Run a bundled example with the `tests/qian` fixture (no large data download needed):

To run the entire CI suite mimicking the GH Actions, use the following command:
```bash
uv run python scripts/run_ci_locally.py
uv run --no-project python scripts/tutorial_for_predicting_with_the_prod_model.py \
--use-rust 1 \
--dataset qian \
--data-root tests \
--load-name-counts 0
```

Copilot AI Mar 6, 2026


Quick Start claims “no large data download needed”, but scripts/tutorial_for_predicting_with_the_prod_model.py loads a production model pickle from data/production_model_v1.1.pickle by default, and the repository’s data/ directory doesn’t include any production_model_*.pickle files. As written, the command will fail for a fresh clone unless the user downloads at least the model artifact(s); consider adding an explicit step/command to fetch the model pickle(s) (without syncing the full 50+ GiB dataset) or adjust the wording.
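
A fresh clone could guard against this failure with a small preflight check before running the tutorial script. The sketch below is a minimal illustration: the `missing_model_pickles` helper is hypothetical and not part of the repository, and only the default path data/production_model_v1.1.pickle is taken from the comment above.

```python
from pathlib import Path


def missing_model_pickles(data_dir, names=("production_model_v1.1.pickle",)):
    """Return the model pickle filenames that are absent from `data_dir`."""
    root = Path(data_dir)
    return [name for name in names if not (root / name).is_file()]


if __name__ == "__main__":
    # "data" mirrors the default location mentioned in the review comment.
    missing = missing_model_pickles("data")
    if missing:
        print("Missing model artifacts:", ", ".join(missing))
    else:
        print("All model pickles found.")
```

Failing early with a clear message is friendlier than a pickle-loading traceback partway through the tutorial.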

Collaborator


@copilot apply changes based on this feedback

Contributor Author


Fixed in 827ca4c. Quick Start now clarifies that the model pickle is still required and links to the Data section. Also added a standalone aws s3 cp command for downloading just the model pickle (a small file) without the full 50 GiB dataset, fixed the download size rendering, and added --dataset qian to the s2and_mini example.

sergeyf and others added 2 commits March 5, 2026 16:29
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…rendering, fix s2and_mini command

Co-authored-by: sergeyf <1874668+sergeyf@users.noreply.github.com>
Contributor Author

Copilot AI commented Mar 6, 2026

@copilot You didn't do anything.
Request is: the README.md is not super organized. maybe you can reorganize it a bit? also, when the production model descriptions say "does not use re...

All changes were applied across commits 5e5d34f, 536ace1, and 827ca4c. The README was reorganized with a table of contents, logical section ordering, and the "does not use reference features" clarification was added listing the six disabled features, explaining references can be null/omitted in papers.json, and providing minimal JSON examples for both files.

Collaborator

sergeyf commented Mar 6, 2026

@atalyaalon Trying to make the README.md nicer. Please take a look!

Co-authored-by: sergeyf <1874668+sergeyf@users.noreply.github.com>
Copilot AI changed the title on Mar 6, 2026: Reorganize README and clarify "does not use reference features" → Reorganize README and add RAM control documentation
Collaborator

sergeyf commented Mar 10, 2026

@atalyaalon Let me know if this all makes sense.

@sergeyf sergeyf merged commit 0487ce2 into main Mar 16, 2026
12 checks passed
@sergeyf sergeyf deleted the copilot/reorganize-readme-and-clarify-model-descriptions branch March 16, 2026 04:58
3 participants