✨ allow single-shard paged attention #86

Open
joerunde wants to merge 2 commits into main
Conversation

joerunde (Collaborator) commented on May 6, 2024

This is a small change to allow llama and bigcode models to work with paged attention on a single shard. Currently, if FLASH_ATTENTION is not also set, it will raise.
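
For context, a minimal sketch of the kind of single-shard guard this touches, assuming boolean FLASH_ATTENTION / PAGED_ATTENTION flags read from the environment and the NONTP_FLASH_TYPES allow-list shown in the diff below; the surrounding structure, helper name, and model-type strings are illustrative, not the repository's exact code.

```python
import os

# Illustrative flag parsing; the real server reads similar environment
# variables, but the exact parsing may differ.
FLASH_ATTENTION = os.getenv("FLASH_ATTENTION", "").lower() in ("1", "true")
PAGED_ATTENTION = os.getenv("PAGED_ATTENTION", "").lower() in ("1", "true")

# Illustrative allow-list; the real NONTP_FLASH_TYPES lives in the model
# loading code and may contain different identifiers.
NONTP_FLASH_TYPES = {"llama", "gpt_bigcode"}


def check_single_shard_attention(model_type: str) -> None:
    # With a change like this one, PAGED_ATTENTION alone is enough to take
    # the paged attention path for supported model types; previously the
    # paged path was only reachable when FLASH_ATTENTION was also set.
    if FLASH_ATTENTION:
        if model_type not in NONTP_FLASH_TYPES:
            raise NotImplementedError(
                f"Flash attention currently only supported by the following "
                f"model types: {NONTP_FLASH_TYPES}"
            )
    elif PAGED_ATTENTION:
        if model_type not in NONTP_FLASH_TYPES:
            raise NotImplementedError(
                f"Paged attention currently only supported by the following "
                f"model types: {NONTP_FLASH_TYPES}"
            )
```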

tdoublep (Member) commented on May 8, 2024

> Currently, if FLASH_ATTENTION is not also set, it will raise.

Not 100% sure, but I think we do actually want FLASH_ATTENTION to be set in addition to PAGED_ATTENTION. I can't remember why exactly... going to look into it.

```diff
@@ -52,6 +53,11 @@ def __init__(
             raise NotImplementedError(
                 f"Flash attention currently only supported by the following model types: {NONTP_FLASH_TYPES}"
             )
+        elif PAGED_ATTENTION:
```
tdoublep (Member) commented on this diff:
I think right now we require both PAGED_ATTENTION and FLASH_ATTENTION to be set, so not sure if this should be elif.
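
If both flags really are required together, the stricter gating described here could be sketched as an early rejection of the lone PAGED_ATTENTION case instead of a separate elif branch; this is illustrative, not the repository's actual check.

```python
# Stricter variant: paged attention is only valid when flash attention is
# also enabled, so reject PAGED_ATTENTION without FLASH_ATTENTION up front.
if PAGED_ATTENTION and not FLASH_ATTENTION:
    raise ValueError(
        "PAGED_ATTENTION currently also requires FLASH_ATTENTION to be set"
    )
```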

joerunde (Collaborator, Author) commented on May 9, 2024

@tdoublep Ah, I was assuming that they were mutually exclusive. If they both need to be set, then let me know if you find out why!

Xaenalt pushed a commit to Xaenalt/text-generation-inference that referenced this pull request on Sep 16, 2024: "Sync release to main branches for 2.11"