✨ allow single-shard paged attention #86

Open
joerunde wants to merge 2 commits into main
Conversation

joerunde (Collaborator) commented on May 6, 2024

This is a small change to allow llama and bigcode models to work with paged attention on a single shard. Currently, if FLASH_ATTENTION is not also set, it will raise.
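
For context, a minimal sketch of the kind of single-shard guard this touches, assuming boolean FLASH_ATTENTION / PAGED_ATTENTION flags read from the environment and the NONTP_FLASH_TYPES allow-list shown in the diff below; the surrounding structure, helper name, and model-type strings are illustrative, not the repository's exact code.

```python
import os

# Illustrative flag parsing; the real server reads similar environment
# variables, but the exact parsing may differ.
FLASH_ATTENTION = os.getenv("FLASH_ATTENTION", "").lower() in ("1", "true")
PAGED_ATTENTION = os.getenv("PAGED_ATTENTION", "").lower() in ("1", "true")

# Illustrative allow-list; the real NONTP_FLASH_TYPES lives in the model
# loading code and may contain different identifiers.
NONTP_FLASH_TYPES = {"llama", "gpt_bigcode"}


def check_single_shard_attention(model_type: str) -> None:
    # With a change like this one, PAGED_ATTENTION alone is enough to take
    # the paged attention path for supported model types; previously the
    # paged path was only reachable when FLASH_ATTENTION was also set.
    if FLASH_ATTENTION:
        if model_type not in NONTP_FLASH_TYPES:
            raise NotImplementedError(
                f"Flash attention currently only supported by the following "
                f"model types: {NONTP_FLASH_TYPES}"
            )
    elif PAGED_ATTENTION:
        if model_type not in NONTP_FLASH_TYPES:
            raise NotImplementedError(
                f"Paged attention currently only supported by the following "
                f"model types: {NONTP_FLASH_TYPES}"
            )
```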

tdoublep (Member) commented on May 8, 2024

> Currently, if FLASH_ATTENTION is not also set, it will raise.

Not 100% sure, but I think we do actually want FLASH_ATTENTION to be set in addition to PAGED_ATTENTION. I can't remember why exactly... going to look into it.

```diff
@@ -52,6 +53,11 @@ def __init__(
             raise NotImplementedError(
                 f"Flash attention currently only supported by the following model types: {NONTP_FLASH_TYPES}"
             )
+        elif PAGED_ATTENTION:
```
tdoublep (Member) commented on this diff:
I think right now we require both PAGED_ATTENTION and FLASH_ATTENTION to be set, so not sure if this should be elif.
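
If both flags really are required together, the stricter gating described here could be sketched as an early rejection of the lone PAGED_ATTENTION case instead of a separate elif branch; this is illustrative, not the repository's actual check.

```python
# Stricter variant: paged attention is only valid when flash attention is
# also enabled, so reject PAGED_ATTENTION without FLASH_ATTENTION up front.
if PAGED_ATTENTION and not FLASH_ATTENTION:
    raise ValueError(
        "PAGED_ATTENTION currently also requires FLASH_ATTENTION to be set"
    )
```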

joerunde (Collaborator, Author) commented on May 9, 2024

@tdoublep Ah, I was assuming that they were mutually exclusive. If they both need to be set, then let me know if you find out why!

Xaenalt pushed a commit to Xaenalt/text-generation-inference that referenced this pull request on Sep 16, 2024: "Sync release to main branches for 2.11"