
Add support for Llama 3.1 8B #655

Open
AnishPahilajani wants to merge 2 commits into vllm-project:main from AnishPahilajani:Llama3.1-enablement

Conversation

@AnishPahilajani

Description

Added support for the Llama 3.1 8B model.

@github-actions

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: make sure that your code passes all the linting checks, otherwise your PR can't be merged. To do so, run ./format.sh.
Now you are good to go 🚀.

We also recommend installing prek and configuring it to check your code before every local commit.

Signed-off-by: AnishPahilajani <anishhp13@gmail.com>

    # Log once upfront that we detected the model
    logger.info(
        "Llama 3.1 8b dense model with tensor parallel size 4 detected. "
Collaborator


Is this a dense model?

Author


Yes

Signed-off-by: AnishPahilajani <anishhp13@gmail.com>
        )

    @classmethod
    def configure_llama_3_1_8b(cls, vllm_config: VllmConfig):
Collaborator


It looks like this is all copy-pasted, which I would rather not do. I think @tjohnson31415 has been working on cleaning this up a bit to be more reusable; we should sync up on that.
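
To illustrate the kind of reuse being suggested, here is a minimal sketch of a data-driven helper that several models could share instead of per-model copy-pasted classmethods. ModelSpec, MODEL_SPECS, and apply_model_spec are hypothetical names, not the actual vllm-spyre code or the #669 refactor; the only source-grounded values are the Llama 3.1 8B TP=4 settings quoted elsewhere in this thread.

    # A minimal sketch of the reuse idea: per-model defaults live in data and
    # a single helper applies them. All names here are hypothetical.
    from __future__ import annotations

    import os
    from dataclasses import dataclass, field


    @dataclass
    class ModelSpec:
        """Per-model defaults that would otherwise be hard-coded per classmethod."""
        max_model_len: int
        max_num_seqs: int
        env_vars: dict[str, str] = field(default_factory=dict)


    # Hypothetical registry keyed by (model name, tensor parallel size).
    MODEL_SPECS: dict[tuple[str, int], ModelSpec] = {
        ("meta-llama/Llama-3.1-8B-Instruct", 4): ModelSpec(
            max_model_len=32768,
            max_num_seqs=32,
            env_vars={"VLLM_DT_MAX_BATCH_TKV_LIMIT": "131072"},
        ),
    }


    def apply_model_spec(model: str, tp_size: int) -> ModelSpec | None:
        """Look up the spec for a model and export its environment defaults."""
        spec = MODEL_SPECS.get((model, tp_size))
        if spec is None:
            return None
        for key, value in spec.env_vars.items():
            # Only provide defaults; never override what the user already set.
            os.environ.setdefault(key, value)
        return spec

The point is only the shape: per-model differences live in data, and one code path applies them.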

@tjohnson31415
Collaborator

RE: #655 (comment)

After the configuration refactor from #669 is complete and merged, adding new model support should just be a bit of YAML:

  # Llama 3.1 8b
  meta-llama/Llama-3.1-8B-Instruct:
    architecture:
      model_type: llama
      num_hidden_layers: 32
      max_position_embeddings: 131072
      hidden_size: 4096
      vocab_size: 128256
      num_key_value_heads: 8
      num_attention_heads: 32

    # Continuous batching configurations
    continuous_batching_configs:
      - tp_size: 4
        max_model_len: 32768
        max_num_seqs: 32
        device_config:
          env_vars:
            VLLM_DT_MAX_BATCH_TKV_LIMIT: 131072  # 128k
            FLEX_HDMA_P2PSIZE: 268435456  # 256MB
            FLEX_HDMA_COLLSIZE: 33554432  # 32MB
          num_gpu_blocks_override:
            default: 8192

TODO: could add other CB configs or tune numbers
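
For context, here is a sketch of how an entry like the one above might be consumed once the refactor lands. The real loader lives in #669 and is not shown in this thread, so the file name model_configs.yaml and the top-level layout are assumptions; this is illustrative only and requires PyYAML.

    # Illustrative loader for a YAML entry shaped like the snippet above.
    import yaml

    with open("model_configs.yaml") as f:  # hypothetical file name
        configs = yaml.safe_load(f)

    llama = configs["meta-llama/Llama-3.1-8B-Instruct"]
    assert llama["architecture"]["model_type"] == "llama"

    # Pick the continuous batching config that matches the requested TP size.
    tp_size = 4
    cb = next(c for c in llama["continuous_batching_configs"]
              if c["tp_size"] == tp_size)

    print(cb["max_model_len"], cb["max_num_seqs"])          # 32768 32
    print(cb["device_config"]["env_vars"])                  # batch/HDMA defaults
    print(cb["device_config"]["num_gpu_blocks_override"])   # {'default': 8192}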

