
[Executorch][llm] Enable local global attention in export_llama script #10612


Open · wants to merge 2 commits into base: gh/kimishpatel/189/base

Conversation

@kimishpatel (Contributor) commented May 1, 2025

Stack from ghstack (oldest at bottom):

Added a new option, --local_global_attention, that takes a pattern of sizes determining which layers use local (sliding-window) attention.
For example, [0, 256, 256, 0, 256, 256] can be used for a 6-layer transformer, or you can pass a shorter pattern such as [0, 256, 256] to be repeated across the layers.

Differential Revision: D73891423

cc @larryliu0820 @mergennachin @cccclai @helunwencser @jackzhxng
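
The PR text does not spell out exactly how a shorter pattern is applied across the layers, so here is a minimal sketch of the tiling behavior described above. The function name and signature are illustrative assumptions, not the actual export_llama implementation.

```python
# Illustrative sketch only: shows how a short local/global pattern could be
# tiled across a model's layers. Names and behavior are assumptions, not the
# actual export_llama implementation.
from typing import List


def expand_attention_pattern(pattern: List[int], n_layers: int) -> List[int]:
    """Repeat `pattern` until it covers `n_layers` entries.

    An entry of 0 means the layer keeps full (global) attention; a positive
    value is the sliding-window size for that layer's local attention.
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    return [pattern[i % len(pattern)] for i in range(n_layers)]


# For a 6-layer transformer, [0, 256, 256] expands to [0, 256, 256, 0, 256, 256]:
print(expand_attention_pattern([0, 256, 256], n_layers=6))
```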


pytorch-bot bot commented May 1, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/10612

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 6 New Failures

As of commit 678ba9f with merge base cd3b53d:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

kimishpatel added a commit that referenced this pull request May 1, 2025

Differential Revision: [D73891423](https://our.internmc.facebook.com/intern/diff/D73891423/)

ghstack-source-id: 281455704
Pull Request resolved: #10612
@facebook-github-bot added the CLA Signed label (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.) on May 1, 2025
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D73891423

kimishpatel added a commit that referenced this pull request May 5, 2025
Pull Request resolved: #10612

ghstack-source-id: 282013415
@exported-using-ghexport

Differential Revision: [D73891423](https://our.internmc.facebook.com/intern/diff/D73891423/)
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D73891423

@kimishpatel added the release notes: llm (To capture llm specific changes in release notes) and module: llm (Issues related to LLM examples and apps, and to the extensions/llm/ code) labels on May 5, 2025
@jackzhxng (Contributor) left a comment


From my understanding, local/global attention is tied closely to the model and isn't something you can adjust for a given checkpoint, unlike other export options such as qmode, enable_kv_cache, etc. I think it would be better to have this in model_args instead, so that we can represent it in the config JSON file and make it part of the model configuration rather than the export configuration.

@kimishpatel (Contributor, Author)

> From my understanding, local/global attention is tied closely to the model and isn't something you can adjust for a given checkpoint, unlike other export options such as qmode, enable_kv_cache, etc. I think it would be better to have this in model_args instead, so that we can represent it in the config JSON file and make it part of the model configuration rather than the export configuration.

You can also make all the layers do sliding-window attention, for example, so you can also view this as a pure optimization.

I haven't thought a lot about making this a model arg vs. an export arg. I think your point definitely makes sense, but I think it can be configurable as well if you want to support infinite generation, as many runners do.
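
To make the optimization/infinite-generation point concrete, here is a small sketch (plain PyTorch, not ExecuTorch code, and not part of this PR) of how a sliding-window causal mask differs from a full causal mask; bounding the window is what allows a runner to keep a fixed-size KV cache.

```python
# Illustrative sketch only: a sliding-window causal mask bounds how far back
# each token attends, which is what lets a runner keep a fixed-size KV cache
# for "infinite" generation. Not ExecuTorch code.
import torch


def causal_mask(seq_len: int, window: int = 0) -> torch.Tensor:
    """Return a boolean mask where True means "query may attend to key".

    window == 0 -> ordinary causal (global) attention;
    window > 0  -> each query attends to at most `window` positions,
                   counting itself and the most recent previous tokens.
    """
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (seq_len, 1)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions,   shape (1, seq_len)
    mask = j <= i  # causal constraint
    if window > 0:
        mask &= (i - j) < window  # local sliding-window constraint
    return mask


print(causal_mask(5))            # full causal attention
print(causal_mask(5, window=2))  # sliding window of size 2
```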

Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported module: llm Issues related to LLM examples and apps, and to the extensions/llm/ code release notes: llm To capture llm specific changes in release notes