[Executorch][llm] Enable local global attention in export_llama script #10612
base: gh/kimishpatel/189/base
Conversation
Added a new option, --local_global_attention, that takes a pattern of sizes to determine which layers use local sliding-window attention. For example, [0, 256, 256, 0, 256, 256] can be used for a 6-layer transformer, or you can pass [0, 256, 256] as a pattern to be repeated. Differential Revision: [D73891423](https://our.internmc.facebook.com/intern/diff/D73891423/)
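For illustration, a minimal sketch of how such a pattern could be expanded across layers; `expand_attention_pattern` is a hypothetical helper, not the actual export_llama implementation:

```python
from typing import List

def expand_attention_pattern(pattern: List[int], n_layers: int) -> List[int]:
    """Repeat `pattern` until it covers every layer.

    A value of 0 means a layer keeps full (global) attention; a positive
    value is the sliding-window size used for local attention.
    """
    return [pattern[i % len(pattern)] for i in range(n_layers)]

# [0, 256, 256] repeated over a 6-layer model yields [0, 256, 256, 0, 256, 256]:
# layers 0 and 3 stay global, the rest use a 256-token sliding window.
print(expand_attention_pattern([0, 256, 256], 6))
```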
Dr. CI: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/10612. As of commit 678ba9f with merge base cd3b53d, there are 6 new failures and 1 active SEV. This comment updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D73891423
…llama script" Added a new option of --local_global_attention that takes in pattern of sizes to determine which layers are using local sliding window attention. For example, [0, 256, 256, 0, 256, 256] can be used for 6 layers transformer. Or you can also use [0, 256, 256] as pattern you want to repeat. Differential Revision: [D73891423](https://our.internmc.facebook.com/intern/diff/D73891423/) [ghstack-poisoned]
Pull Request resolved: #10612 Added a new option of --local_global_attention that takes in pattern of sizes to determine which layers are using local sliding window attention. For example, [0, 256, 256, 0, 256, 256] can be used for 6 layers transformer. Or you can also use [0, 256, 256] as pattern you want to repeat. ghstack-source-id: 282013415 @exported-using-ghexport Differential Revision: [D73891423](https://our.internmc.facebook.com/intern/diff/D73891423/)
This pull request was exported from Phabricator. Differential Revision: D73891423 |
From my understanding, local/global attention is tied closely to the model and isn't something you can adjust for a given checkpoint, unlike other export options such as qmode, enable_kv_cache, etc. I think it would be better to put this in model_args instead, so that we can represent it in the config JSON file and it becomes part of the model configuration rather than the export configuration.
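For context, a rough sketch of what carrying the pattern in the model configuration might look like; the `ModelArgs` shape and field name below are assumptions for illustration, not the actual ExecuTorch schema:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ModelArgs:  # illustrative subset only
    n_layers: int = 6
    # None -> every layer uses global attention; otherwise a per-layer
    # (or repeating) pattern of sliding-window sizes, with 0 meaning global.
    local_global_attention: Optional[List[int]] = None

# Could be loaded from a config JSON such as:
# {"n_layers": 6, "local_global_attention": [0, 256, 256]}
args = ModelArgs(n_layers=6, local_global_attention=[0, 256, 256])
```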
You can also make all the layers do sliding-window attention, for example, so you can view this as a pure optimization as well. I haven't thought a lot about making this a model arg vs. an export arg. I think your point definitely makes sense, but it can stay configurable too if you want to support infinite generation, as many runners do.
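As a rough illustration of why sliding-window layers help with unbounded generation, here is a simplified toy of a ring-buffer KV cache whose memory stays fixed at the window size; this is an assumption-laden sketch, not the runner's actual cache implementation:

```python
import torch

class SlidingWindowKVCache:
    """Toy ring-buffer cache: keeps only the last `window` key/value rows,
    so memory stays constant no matter how many tokens are generated."""

    def __init__(self, window: int, head_dim: int):
        self.window = window
        self.k = torch.zeros(window, head_dim)
        self.v = torch.zeros(window, head_dim)
        self.pos = 0  # total tokens seen so far

    def update(self, k_new: torch.Tensor, v_new: torch.Tensor) -> None:
        slot = self.pos % self.window  # overwrite the oldest entry
        self.k[slot] = k_new
        self.v[slot] = v_new
        self.pos += 1
```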
Stack from ghstack (oldest at bottom):
Added a new option, --local_global_attention, that takes a pattern of sizes to determine which layers use local sliding-window attention.
For example, [0, 256, 256, 0, 256, 256] can be used for a 6-layer transformer, or you can pass [0, 256, 256] as a pattern to be repeated.
Differential Revision: D73891423
cc @larryliu0820 @mergennachin @cccclai @helunwencser @jackzhxng