New inference-time approach for Private MedHelm Tasks #3913
Open
sronaghi wants to merge 42 commits into stanford-crfm:main from
Conversation
MiguelAFH requested changes · Oct 22, 2025
@@ -0,0 +1,192 @@
# MedHELM RunSpecs for the private benchmarks from Stanford.
Collaborator
@yifanmai what are your thoughts on adding this file?
sronaghi (Author) commented · Oct 22, 2025
I've made edits based on @MiguelAFH's comments.
…es_medhelm_private_proxy_tuning.conf
yifanmai (Collaborator) requested changes · Oct 24, 2025
In general:
- The files need more documentation, which can be placed as a module-level docstring in proxy_tuning_client.py, in the comments in model_metadata.yaml and model_deployments.yaml, and in the comment at the top of run_entries_medhelm_private_proxy_tuning.conf.
- If this is experimental code, rather than intended for general use, your documentation should clearly say so.
- Please run the linter:
pip install black==24.3.0 mypy==1.16.0 flake8==5.0.4
./pre-commit.sh

I did not look at your model code too closely; let me know if there are any specific things you would like me to look at.
yifanmai reviewed · Oct 24, 2025
This addition allows the proxy-tuning class to run for MedHELM scenarios. After creating the conda environment, you only need to run pip install -U "crfm-helm[proxy_tuning]".
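A sketch of that setup as shell commands, assuming a fresh conda environment (the environment name and Python version below are assumptions, not from the PR):

```shell
# Hypothetical setup; the env name and Python version are assumptions.
conda create -n crfm-helm python=3.10 -y
conda activate crfm-helm
pip install -U "crfm-helm[proxy_tuning]"
```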
sronaghi (Author) commented:
@yifanmai @MiguelAFH @aunell @suhana13 @HennyJie I ran the formatting check and added documentation. Please let me know what else to do for this PR!
MiguelAFH approved these changes · Oct 31, 2025
I provide the code for testing a new inference-time approach that combines general and clinical-domain LMs for some private MedHELM tasks.
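For context on the approach: proxy tuning is typically formulated as shifting a large base model's next-token logits by the difference between a small tuned expert and its untuned counterpart. A minimal NumPy sketch of that logit arithmetic, not the code in this PR:

```python
import numpy as np

def proxy_tuned_logits(base, expert, anti_expert, alpha=1.0):
    """Shift the base model's next-token logits toward the expert's behavior.

    alpha scales how strongly the expert/anti-expert contrast is applied.
    """
    return np.asarray(base) + alpha * (np.asarray(expert) - np.asarray(anti_expert))

# Toy vocabulary of 3 tokens.
base = [2.0, 1.0, 0.5]         # general-domain base LM
expert = [1.0, 3.0, 0.5]       # clinical-domain expert
anti_expert = [1.5, 1.0, 0.5]  # untuned counterpart of the expert

combined = proxy_tuned_logits(base, expert, anti_expert)
# combined == [1.5, 3.0, 0.5]: tokens are boosted (or suppressed) exactly
# where the expert disagrees with its untuned counterpart.
```

With alpha=0 the base model is unchanged; larger alpha pushes decoding further toward the clinical expert.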
I want to test my method on CLEAR, PatientInstruct, and NoteExtract.
Running the models requires downloading the following models locally and changing the model paths at the top of proxy_tuning_client.py. I can provide a script to download them onto Carina as well. Here are the models and download locations:
Below are the model configurations and the number of A100 40GB GPUs each uses:
I have added each model configuration to the model_metadata.yaml, model_deployments.yaml, and tokenizer_config.yaml files in both prod_env and src/helm/config. run_entries_medhelm_private_proxy_tuning.conf contains the model run entries for each task. I can also create separate conf files based on the number of GPUs needed.
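For readers unfamiliar with HELM run-entries files: a .conf file is a list of entries, each pairing a run-spec description with a priority. A hypothetical sketch of what one entry in run_entries_medhelm_private_proxy_tuning.conf might look like (the scenario and model names are illustrative, not the PR's actual values):

```conf
# Hypothetical entry; actual scenario and model names in the PR may differ.
entries: [
  {description: "clear:model=stanford/proxy-tuned-llama-13b", priority: 1}
]
```

Such a file is then passed to helm-run via --conf-paths, e.g. helm-run --conf-paths run_entries_medhelm_private_proxy_tuning.conf --suite my-suite -n 1 (the suite name here is a placeholder).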
Each model takes me ~7-22 hours per task. I run the models with the -n 1 flag, as my code doesn't support multi-threading.
I ended up using basic_summarization_metrics because I couldn't configure what was needed in my helm_env while maintaining compatibility with my code. If there are conda environment issues, I can share my env file and the modified run_specs.