Skip to content
This repository was archived by the owner on Apr 30, 2026. It is now read-only.

Subset Selection Integration#542

Merged
mergify[bot] merged 35 commits intoinstructlab:mainfrom
eshwarprasadS:subset-selection-integration
Apr 14, 2025
Merged

Subset Selection Integration#542
mergify[bot] merged 35 commits intoinstructlab:mainfrom
eshwarprasadS:subset-selection-integration

Conversation

@eshwarprasadS
Copy link
Copy Markdown
Contributor

@eshwarprasadS eshwarprasadS commented Feb 4, 2025

Minimal Implementation of Subset Selection

This resolves #541

Important Incoming Changes

  • Added h5py>=3.12.1 to requirements
  • Added submodlib source to requirements
  • Added subset_selection.py
  • Added encoders dir housing bge and arctic encoders
  • Added subset_selection_utils.py under src/instructlab/sdg/utils/
  • Support for snowflake-arctic-embed-l-v2.0 enabled

Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
This reverts commit e4a4730.

Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
@mergify mergify Bot added ci-failure dependencies Pull requests that update a dependency file labels Feb 4, 2025
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
@mergify mergify Bot added ci-failure and removed ci-failure labels Feb 4, 2025
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
@mergify mergify Bot removed the ci-failure label Feb 4, 2025
@shivchander shivchander self-requested a review February 5, 2025 20:01
Comment thread requirements.txt Outdated
openai>=1.13.3,<2.0.0
sentencepiece>=0.2.0
# Note: this dependency has to be built from source
submodlib @ git+https://github.com/decile-team/submodlib.git
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what we were doing previously with GPTDolomite and Dolomite Engine. We will likely need to move this to live in its own repo as we did for GPTDolomite so we can do builds.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have contacted the authors and we expect submodlib to have a public release soon, so, I think we can fall back to moving it into its own repo, if that does not go through?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RobotSail is it necessary to move it to its own repo? We're already resorting to forking over some code from another repo in this PR, so I'd hesitate to bring code in from another source that we're also committing to maintain

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@khaledsulayman We would likely maintain it to the same extent that we do for GPTDolomite -- i.e., it's a temporary workaround that we can get rid of once there is a more official and steady package to depend on. Otherwise, if that never happens, we can always just copy the new code over into the repo if we wanted to.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyPI does not accept wheels that have a dependency on a package with a link. You will not be able to upload SDG wheels with this change.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are planning to add a check to the downstream builds that would prevent a dependency with a git URL like this from building, because we can't ensure that it is safe to pull code from an arbitrary git branch. Please work with the submodlib authors to produce a real release of that code so we do not have to express the dependency using a git URL.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has this been resolved and can we replace this dependency with a real release yet? If not, is there progress on this in that upstream repository?

Comment thread src/instructlab/sdg/subset_selection.py Outdated
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
@abhi1092
Copy link
Copy Markdown
Member

abhi1092 commented Feb 7, 2025

@eshwarprasadS can we also add a functional test? Something like this.

#!/usr/bin/env python3
from instructlab.sdg.subset_selection import subset_datasets


if __name__ == "__main__":
    dataset_files = ["<dataset_path>"]
    subset_config = {
    "instruction": "conversation",
    "query_description": "conversation",
    "templates": {
      "message": "{% for msg in messages if msg.role != 'system' %}{{ msg.role }}: {{ msg.content }}\n{% endfor %}",
        "conversation": "{% for conv in conversation %}{{ conv.from }}: {{ conv.value }}\n{% endfor %}",
    },
    "batch_size": 100000,
    "num_folds": 25
    ,
    "subset_sizes": [0.97],
    "seed": 42,
    "template_name": "text",
    "combine_files": False,
    "encoder_type": "bge",
    "encoder_model": "BAAI/bge-base-en"
  }
    subset_datasets(input_files=dataset_files, **subset_config)

Also, are you planning on adding the artic snowflake model?

@eshwarprasadS
Copy link
Copy Markdown
Contributor Author

@eshwarprasadS can we also add a functional test? Something like this.

#!/usr/bin/env python3
from instructlab.sdg.subset_selection import subset_datasets


if __name__ == "__main__":
    dataset_files = ["<dataset_path>"]
    subset_config = {
    "instruction": "conversation",
    "query_description": "conversation",
    "templates": {
      "message": "{% for msg in messages if msg.role != 'system' %}{{ msg.role }}: {{ msg.content }}\n{% endfor %}",
        "conversation": "{% for conv in conversation %}{{ conv.from }}: {{ conv.value }}\n{% endfor %}",
    },
    "batch_size": 100000,
    "num_folds": 25
    ,
    "subset_sizes": [0.97],
    "seed": 42,
    "template_name": "text",
    "combine_files": False,
    "encoder_type": "bge",
    "encoder_model": "BAAI/bge-base-en"
  }
    subset_datasets(input_files=dataset_files, **subset_config)

Also, are you planning on adding the artic snowflake model?

Thanks @abhi1092 for the comments. Yes, I am planning to add a host of unit and functional tests in the coming week for the feature, thanks for the suggestion.

And yes, I am planning to incorporate the arctic-snowflake encoder class in encoders.py and change default encoder to that one. I seem to have gotten hold of the implementation for that from the author, so we should have it in here soon.

Copy link
Copy Markdown
Member

@RobotSail RobotSail left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll need to go through in further detail but here are some comments so far

Comment thread src/instructlab/sdg/encoders.py Outdated
Comment thread src/instructlab/sdg/utils/subset_selection_utils.py Outdated
@eshwarprasadS
Copy link
Copy Markdown
Contributor Author

I'll need to go through in further detail but here are some comments so far

Thanks a lot for looking at this! Good parts of this implementation was adapted directly from the upstream repository, so this really helps to have extra eyes on it.

Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
@mergify mergify Bot added the one-approval label Feb 12, 2025
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
@mergify mergify Bot added the ci-failure label Feb 13, 2025
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
@mergify mergify Bot added ci-failure and removed ci-failure labels Mar 6, 2025
@bbrowning
Copy link
Copy Markdown
Contributor

Note that submodlib does not build properly - see the e2e-medium job logs, with a snippet copied here:

  writing manifest file '/home/tmp/pip-modern-metadata-ljrr7qrp/submodlib_py.egg-info/SOURCES.txt'
  creating '/home/tmp/pip-modern-metadata-ljrr7qrp/submodlib_py-0.0.1.dist-info'
  Preparing metadata (pyproject.toml): finished with status 'done'
  WARNING: Generating metadata for package submodlib produced metadata for project name submodlib-py. Fix your #egg=submodlib fragments.
Discarding git+https://github.com/decile-team/submodlib.git: Requested submodlib-py from git+https://github.com/decile-team/submodlib.git (from instructlab-sdg==0.7.1.dev124) has inconsistent name: expected 'submodlib', but metadata has 'submodlib-py'

...

ERROR: Could not find a version that satisfies the requirement submodlib (unavailable) (from instructlab-sdg) (from versions: none)
ERROR: No matching distribution found for submodlib (unavailable)

…t for subset sizes parameter

Signed-off-by: Oleg Silkin <97077423+RobotSail@users.noreply.github.com>
@mergify mergify Bot added ci-failure and removed ci-failure labels Apr 10, 2025
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
@mergify mergify Bot added ci-failure and removed ci-failure labels Apr 10, 2025
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
@mergify mergify Bot added ci-failure and removed ci-failure labels Apr 11, 2025
@bbrowning
Copy link
Copy Markdown
Contributor

I briefly looked at the failing Mac tests, and it looks like submodlib didn't release anything that works with Macs. Their only release is binary-only and I only see wheels that will work for some x86_64 linux systems. I'm not sure how they created or uploaded this release, but it doesn't appear to be done in the typical way where pypi can build from source for systems that don't have precompiled wheels. We either need to change this back to building from source OR adjust this requirement to only pull in submodlib for linux systems.

@eshwarprasadS
Copy link
Copy Markdown
Contributor Author

I briefly looked at the failing Mac tests, and it looks like submodlib didn't release anything that works with Macs. Their only release is binary-only and I only see wheels that will work for some x86_64 linux systems. I'm not sure how they created or uploaded this release, but it doesn't appear to be done in the typical way where pypi can build from source for systems that don't have precompiled wheels. We either need to change this back to building from source OR adjust this requirement to only pull in submodlib for linux systems.

Yes @bbrowning, I just realized that the failures are localized to MacOS. I was thinking along the lines of the python versions, since we had an issue with the wheels only working for 3.10 last week.

Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
@mergify mergify Bot added ci-failure and removed ci-failure labels Apr 12, 2025
Subset selection only works on Linux and CUDA devices for now, so
don't add it as a requirement except on Linux machines.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
@mergify mergify Bot added ci-failure and removed ci-failure labels Apr 14, 2025
@bbrowning
Copy link
Copy Markdown
Contributor

I adjusted the submodlib requirement to only get pulled in for Linux, since that's the only OS its pypi binary is built for. And, I marked the functional test that uses it as one that requires GPU, since the test uses cuda stuff directly. That got CI green, so things should be good to merge if the others involved are comfortable with this.

Copy link
Copy Markdown
Contributor

@aakankshaduggal aakankshaduggal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, lgtm!

@mergify mergify Bot merged commit b5e2ae6 into instructlab:main Apr 14, 2025
28 checks passed
@mergify mergify Bot removed the one-approval label Apr 14, 2025
@eshwarprasadS
Copy link
Copy Markdown
Contributor Author

Thanks @bbrowning for adjusting the requirements. Thanks everyone for the reviews along the way as well.

@RobotSail
Copy link
Copy Markdown
Member

Congrats on getting this in @eshwarprasadS !!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

dependencies Pull requests that update a dependency file testing Relates to testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Minimal Integration of Subset Selection into SDG

9 participants