Adds fine tuning contrib example #667

Merged
skrawcz merged 5 commits into main from contrib/fine-tuning
Jan 30, 2024

Collaborator

@skrawcz skrawcz commented Jan 28, 2024

This module shows how to perform fine-tuning using Hamilton. It is a basic example that uses the transformers library to connect to Hugging Face, pull a FLAN model, and fine-tune it.

One should be able to adapt this code to their needs.

For new dataflows:

Do you have the following?

  • Added a directory mapping to my GitHub username in the contrib/hamilton/contrib/user directory.
    • If my author name contains hyphens, I have replaced them with underscores.
    • If my author name starts with a number, I have prefixed it with an underscore.
    • If my author name is a Python reserved keyword, I have reached out to the maintainers for help.
    • Added an author.md file under my username directory and filled it out.
    • Added an __init__.py file under my username directory.
  • Added a new folder for my dataflow under my username directory.
    • Added a README.md file under my dataflow directory that follows the standard headings and is filled out.
    • Added an __init__.py file under my dataflow directory that contains the Hamilton code.
    • Added a requirements.txt under my dataflow directory that contains the required packages outside of Hamilton.
    • Added tags.json under my dataflow directory to curate my dataflow.
    • Added valid_configs.jsonl under my dataflow directory to specify the valid configurations.
    • Added a dag.png that shows one possible configuration of my dataflow.
  • I hereby acknowledge that, to the best of my ability, the code I have contributed contains correct attribution
    and notices as appropriate.
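
The naming rules and file list in the checklist above can be checked mechanically. The following is a minimal sketch, not part of Hamilton: `normalize_author_name` and `missing_contrib_files` are hypothetical helper names, and the only facts taken from the checklist are the naming rules and the file names.

```python
import keyword
from pathlib import Path

# File names taken from the checklist above.
USER_DIR_FILES = ("author.md", "__init__.py")
DATAFLOW_DIR_FILES = (
    "README.md",
    "__init__.py",
    "requirements.txt",
    "tags.json",
    "valid_configs.jsonl",
    "dag.png",
)


def normalize_author_name(github_user: str) -> str:
    """Hypothetical helper: apply the checklist's directory-naming rules."""
    name = github_user.replace("-", "_")  # hyphens become underscores
    if name[:1].isdigit():  # a leading digit gets an underscore prefix
        name = "_" + name
    if keyword.iskeyword(name):
        # The checklist says to ask the maintainers in this case.
        raise ValueError(f"{name!r} is a Python keyword; ask the maintainers")
    return name


def missing_contrib_files(user_dir: Path, dataflow_name: str) -> list[str]:
    """Hypothetical helper: list checklist files that are not on disk yet."""
    expected = [user_dir / f for f in USER_DIR_FILES]
    expected += [user_dir / dataflow_name / f for f in DATAFLOW_DIR_FILES]
    return [p.relative_to(user_dir).as_posix() for p in expected if not p.is_file()]
```

Calling `missing_contrib_files` with the username directory and the dataflow folder name returns the relative paths that still need to be added; an empty list means every file named in the checklist is present.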

How I tested this

Ran it locally in a docker container

Notes

We might want to invest in letting a single contribution span multiple modules, because that would help segment the code better.

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Dataflow documentation has been updated if adding/changing functionality.

Collaborator

@elijahbenizzy elijahbenizzy left a comment

I think we can simplify a bit by:

  1. Removing inference from this -- that should be a snippet of code in the README or an ipynb or something
  2. Processing features prior to splitting the dataset
  3. Fixing up/removing some of the docstrings

Collaborator

@elijahbenizzy elijahbenizzy left a comment

Can we tokenize the dataset before splitting? That'll kill a lot of duplicate code and make it easier to read.
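
A rough sketch of what tokenizing before splitting could look like, written as Hamilton-style functions (each function is a node, and its parameter names point at upstream nodes). This is not the code in the PR: the whitespace tokenizer stands in for a transformers tokenizer, plain lists of dicts stand in for a Hugging Face dataset, and the `question`/`answer` column names, `test_fraction`, and `seed` are assumptions for illustration.

```python
import random


def _tokenize(text: str) -> list[str]:
    """Stand-in tokenizer (lowercase + whitespace split); an assumption, not the PR's tokenizer."""
    return text.lower().split()


def tokenized_dataset(raw_dataset: list[dict]) -> list[dict]:
    """Tokenize every record once, before any split exists."""
    return [
        {
            "input_tokens": _tokenize(record["question"]),
            "label_tokens": _tokenize(record["answer"]),
        }
        for record in raw_dataset
    ]


def train_test_split(
    tokenized_dataset: list[dict], test_fraction: float = 0.25, seed: int = 0
) -> dict[str, list[dict]]:
    """Shuffle and split the already-tokenized records."""
    records = list(tokenized_dataset)
    random.Random(seed).shuffle(records)
    n_test = max(1, int(len(records) * test_fraction))
    return {"test": records[:n_test], "train": records[n_test:]}


def train_set(train_test_split: dict[str, list[dict]]) -> list[dict]:
    """Training portion; no tokenization code is repeated here."""
    return train_test_split["train"]


def holdout_set(train_test_split: dict[str, list[dict]]) -> list[dict]:
    """Held-out portion; it reuses the same tokenized records."""
    return train_test_split["test"]


# Manual wiring for illustration; a Hamilton driver would resolve this from the names.
raw = [{"question": f"What is {i} + {i}?", "answer": str(2 * i)} for i in range(8)]
split = train_test_split(tokenized_dataset(raw))
```

Because tokenization happens in one node upstream of the split, the train and held-out branches need no separate tokenize functions, which is where the duplicate code would otherwise come from.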

Otherwise this looks fine, I think you should break it out cause it's doing too much and I find that hard to follow, but let's just get it out.

Collaborator

@elijahbenizzy elijahbenizzy left a comment

Looks fine, maybe _preprocess_function should do the dataset .map call

@elijahbenizzy elijahbenizzy self-requested a review January 30, 2024 01:06
So that people know to change it to match
their dataset.
With latest contrib additions.
@skrawcz skrawcz merged commit 4fa02a1 into main Jan 30, 2024
2 checks passed
@skrawcz skrawcz deleted the contrib/fine-tuning branch January 30, 2024 01:24