This workflow demonstrates deploying the Llama 3.1 405B model using vLLM on a multi-node multi-GPU setup with 16×H100 GPUs.
Llama 3.1 405B is a 405-billion-parameter language model from Meta AI and one of the largest openly available models. Its GPU memory requirements:
- Weights VRAM: 810 GB
- KV Cache VRAM (128k tokens): 123.05 GB
- Minimum GPUs: 11×H100 for weights only, 12×H100 for full context length
This workflow uses 16×H100 GPUs across multiple nodes to ensure sufficient memory for both model weights and KV cache with performance headroom.
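As a sanity check, the minimum GPU counts above follow from a simple ceiling division over the sizing figures already listed in this README (nothing new is measured here):

```python
import math

weights_gb = 405e9 * 2 / 1e9   # 405B params at 2 bytes each (FP16/BF16) = 810 GB
kv_cache_gb = 123.05           # KV cache at 128k tokens, from the figures above
gpu_gb = 80                    # one H100 80GB

# Weights only: 810 / 80 = 10.125, so 11 GPUs
gpus_weights_only = math.ceil(weights_gb / gpu_gb)

# Weights + full-context KV cache: 933.05 / 80 ≈ 11.66, so 12 GPUs
gpus_full_context = math.ceil((weights_gb + kv_cache_gb) / gpu_gb)

print(gpus_weights_only, gpus_full_context)  # 11 12
```

Using 16 GPUs rather than the minimum 12 leaves headroom for activations, CUDA graphs, and a larger KV-cache pool.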
Environment: c250609_vllm085
Model: meta-llama/Llama-3.1-405B
Precision: Standard (FP16/BF16)
Context Length: 128k tokens
Maintainers: Timothy Ngotiaoco, Max Shad
- Hardware: 16×NVIDIA H100 80GB GPUs (multi-node)
- Environment: conda environment c250609_vllm085 with vLLM 0.8.5.post1
- Model Path: /n/holylfs06/LABS/kempner_shared/Everyone/testbed/models/Llama-3.1-405B
cd ../../envs/conda/c250609_vllm085
# Follow README.md for conda environment setup
conda activate vllm-inference
Edit llama_3.1_405b_slurm.sh to set your SLURM parameters:
--account: SLURM Fairshare Account
--output and --error: Log file paths
--partition: Partition name
--job-name: Job name
--time: Job duration
Submit the job:
sbatch llama_3.1_405b_slurm.sh
The script will:
- Create a Ray cluster across multiple nodes
- Start a vLLM server on the first node
- Load model weights onto GPUs (this can take up to a couple hours due to the model size)
Monitor progress in the error logs:
Loading safetensors checkpoint shards: 0% Completed | 0/191 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 1% Completed | 1/191 [00:04<13:59, 4.42s/it]
Loading safetensors checkpoint shards: 1% Completed | 2/191 [00:09<15:58, 5.07s/it]
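To gauge loading progress programmatically, the shard counter in these lines can be parsed; the regex below is a sketch that assumes the tqdm-style format shown above:

```python
import re

# Matches e.g. "Loading safetensors checkpoint shards: 1% Completed | 2/191 [...]"
PATTERN = re.compile(r"checkpoint shards:\s*\d+% Completed \| (\d+)/(\d+)")

def shard_progress(log_line):
    """Return (loaded, total) shard counts, or None if the line doesn't match."""
    m = PATTERN.search(log_line)
    return (int(m.group(1)), int(m.group(2))) if m else None

line = "Loading safetensors checkpoint shards: 1% Completed | 2/191 [00:09<15:58, 5.07s/it]"
print(shard_progress(line))  # (2, 191)
```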
When ready, you'll see:
INFO: Started server process [XXXXX]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
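Rather than watching the logs, you can poll the server from Python once you are on the head node. This sketch assumes vLLM's /health endpoint (available in recent versions of the OpenAI-compatible server) and returns False if the server never comes up within the timeout:

```python
import time
import requests

def wait_for_server(base_url="http://localhost:8000", timeout_s=7200, interval_s=30):
    """Poll the vLLM /health endpoint until it responds, or give up after timeout_s."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/health", timeout=5).status_code == 200:
                return True
        except requests.exceptions.RequestException:
            pass  # server not up yet; keep waiting
        time.sleep(interval_s)
    return False
```

The long default timeout reflects that loading 810 GB of weights can take a couple of hours.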
Check the SLURM logs or use squeue to find the first node:
Head node: holygpu8aXXXXX
ssh holygpu8aXXXXX
The server runs on localhost:8000. Use the /v1/completions endpoint:
cURL example:
curl http://localhost:8000/v1/completions \
-X POST \
-H "Content-Type: application/json" \
-d '{
"model": "/n/holylfs06/LABS/kempner_shared/Everyone/testbed/models/Llama-3.1-405B",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
Python example:
import requests
response = requests.post('http://localhost:8000/v1/completions', json={
"model": "/n/holylfs06/LABS/kempner_shared/Everyone/testbed/models/Llama-3.1-405B",
"prompt": "San Francisco is a",
"max_tokens": 500,
"temperature": 0
})
output = response.json()
print(output)
Additional sampling parameters are available (top_k, min_p, etc.); see the vLLM sampling docs.
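For example, vLLM's OpenAI-compatible server accepts these extra sampling fields directly in the request body (top_k and min_p are vLLM extensions, not part of the OpenAI spec; the values below are illustrative):

```python
import requests

payload = {
    "model": "/n/holylfs06/LABS/kempner_shared/Everyone/testbed/models/Llama-3.1-405B",
    "prompt": "San Francisco is a",
    "max_tokens": 100,
    "temperature": 0.8,
    "top_k": 50,    # sample only from the 50 most likely tokens
    "min_p": 0.05,  # drop tokens below 5% of the top token's probability
}

def complete(url="http://localhost:8000/v1/completions"):
    """POST the request; call this once the server is up on the head node."""
    return requests.post(url, json=payload).json()
```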
| Metric | Value |
|---|---|
| GPUs | 16×H100 80GB |
| Context Length | 128k tokens |
| Weights VRAM | 810 GB |
| KV Cache VRAM | 123.05 GB (128k tokens) |
llama_3.1_405b_slurm.sh - SLURM job script for 16×H100 multi-node deployment
README.md - This file
Q: Model loading is very slow
A: Loading 810 GB of weights can take up to 2 hours. This is expected for such a large model. Monitor progress in the SLURM error logs.
Q: Out of memory error
A: Ensure you're using 16×H100 80GB GPUs across multiple nodes. Reduce max_model_len in the SLURM script if needed.
Q: Can't connect to server
A: Make sure you've SSH'd to the head node (the first node in your SLURM allocation) and the server has finished loading. Check for "Uvicorn running" message in logs.
Q: Ray cluster not starting
A: Verify that all nodes can communicate with each other. Check firewall rules and network configuration.
Last Updated: 2026-03-04