Running Llama3 Locally

Works with v1.0+

Run Llama models from HuggingFace locally with Spice.

Requirements

Spice CLI installed.
The following environment variable set or configured in .env:
- SPICE_HUGGINGFACE_API_KEY
Access granted to the Llama-3.2-3B-Instruct model on HuggingFace for the account that owns the key.

For more information, see the Spice HuggingFace documentation.
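
Before running the steps, it can help to confirm that the key is actually visible. The following is a minimal sketch, not part of Spice: `has_hf_key` is a hypothetical helper that succeeds when SPICE_HUGGINGFACE_API_KEY is set in the current shell or has a non-empty value in the given .env file. It does not check that the key is valid or that model access has been granted.

```shell
# Hypothetical helper (not part of the Spice CLI): succeeds when the key is
# set in the current shell or has a non-empty value in the given .env file.
has_hf_key() {
  env_file="${1:-.env}"
  # Already set in the current shell?
  if [ -n "${SPICE_HUGGINGFACE_API_KEY:-}" ]; then
    return 0
  fi
  # Otherwise look for a non-empty assignment in the .env file.
  [ -f "$env_file" ] && grep -Eq '^SPICE_HUGGINGFACE_API_KEY=.+' "$env_file"
}

# Example:
#   has_hf_key .env && echo "key found" || echo "key missing"
```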

Steps

Initialize a new spicepod:

spice init llama-spicepod
cd llama-spicepod

Configure the spicepod with the Llama model:

Edit the spicepod.yml file to include the Llama model configuration:

models:
  - name: llama3
    from: huggingface:huggingface.co/meta-llama/Llama-3.2-3B-Instruct
    params:
      hf_token: ${ secrets:SPICE_HUGGINGFACE_API_KEY }

An example spicepod.yml is also provided in the recipe directory.
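
As an alternative to editing the file by hand, the models section above can be appended from the shell. This is a sketch under one assumption: the spicepod.yml has no existing top-level `models:` key (appending a second one would produce invalid or ambiguous YAML). `append_llama_model` is a hypothetical helper name, not a Spice command.

```shell
# Hypothetical helper: append the recipe's models section to a spicepod file.
# Assumes the file has no existing top-level `models:` key.
append_llama_model() {
  spicepod="${1:-spicepod.yml}"
  # The quoted 'EOF' keeps the shell from expanding the ${ secrets:... } reference.
  cat >> "$spicepod" <<'EOF'
models:
  - name: llama3
    from: huggingface:huggingface.co/meta-llama/Llama-3.2-3B-Instruct
    params:
      hf_token: ${ secrets:SPICE_HUGGINGFACE_API_KEY }
EOF
}

# Example:
#   append_llama_model spicepod.yml
```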

Update .env with the HuggingFace variable:

Create or update the .env file with your HuggingFace API key:
```
echo "SPICE_HUGGINGFACE_API_KEY=your_huggingface_api_key" >> .env
```
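
Because `>>` appends, running the command above more than once leaves several SPICE_HUGGINGFACE_API_KEY lines in .env, and the recipe does not say which one takes effect. The sketch below shows one way to avoid that: `set_env_var` is a hypothetical helper that replaces any existing line for the key instead of adding a duplicate.

```shell
# Hypothetical helper: set KEY=VALUE in a .env file, replacing any existing
# line for KEY instead of appending a duplicate.
set_env_var() {
  key="$1"; value="$2"; env_file="${3:-.env}"
  touch "$env_file"
  # Keep every line except previous assignments of this key.
  # grep exits non-zero when nothing is kept, which is fine here.
  grep -v "^${key}=" "$env_file" > "${env_file}.tmp" || true
  printf '%s=%s\n' "$key" "$value" >> "${env_file}.tmp"
  mv "${env_file}.tmp" "$env_file"
}

# Example:
#   set_env_var SPICE_HUGGINGFACE_API_KEY your_huggingface_api_key .env
```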

Run the spicepod:
```
spice run
```

On first run, Spice downloads and loads the model. The download is cached at ~/.cache/huggingface for subsequent runs.

Use Spice Chat to interact with the model:

You can now start interacting with the Llama model through the Spice Chat interface.

In a new terminal window, run:

spice chat

Enter a question. Spice Chat answers it using the locally running Llama model.

Using model: llama3
chat> Roughly how much memory do I need to run llama 3.2-3B-instruct locally as GBs?
Llama 3 is a large transformer model, and its memory requirements can be significant. According to the Hugging Face documentation, the inference memory required for Llama 3-3B can vary depending on the specific use case and settings.

However, here are some rough estimates:

- In-app memory usage for Llama 3-3B models is typically in the range of 6-12 GB of memory per instance for inference.
- For batched inference, Llama-3B-6B (which is the 6GB variant) is suggested to have around 12 GB per run, either in picoraw bytes (GB is the correct unit for your request).
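
Generated answers like the one above are worth sanity-checking with simple arithmetic. As a rough sketch, weight memory is approximately the parameter count times the bytes per parameter. The figure of about 3.21 billion parameters for Llama-3.2-3B-Instruct and the two precisions shown are assumptions for illustration, not values taken from this recipe or from how Spice loads the model. KV cache and runtime overhead come on top of the result.

```shell
# Rough weight-memory estimate in GB: (parameters in billions) x (bytes per parameter).
# The ~3.21 billion parameter count and the precisions are assumptions;
# KV cache and runtime overhead are not included.
weights_gb() {
  params_billions="$1"; bytes_per_param="$2"
  awk -v p="$params_billions" -v b="$bytes_per_param" \
    'BEGIN { printf "%.1f\n", p * b }'
}

weights_gb 3.21 2     # 16-bit weights, prints 6.4
weights_gb 3.21 0.5   # 4-bit quantized weights, prints 1.6
```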

You can also send a one-shot chat request to the Llama model with spice chat <message>:

spice chat "Roughly how much memory do I need to run llama 3.2-3B-instruct locally as GBs?"
Using model: llama3
The amount of memory required to run Llama 3 on a local machine can vary greatly depending on several factors, such as the size of the input dataset, the computational resources, and the specific implementation.

However, as a rough estimate, the Llama 3 model has a small footprint of around 350 MB to 400 MB in its `model` directory after being installed, plus additional GBs of memory used for in-processing inputs, caching results, and outputting results.

Assuming an average size of 600 MB, 1 GB, of overall memory usage you could expect be sufficient for mostly small to moderate-sized local training and inference tasks.

Time: 16.09s (first token 0.53s). Tokens: 197. Prompt: 64. Completion: 133 (8.55/s).
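
The numbers in the timing line above are internally consistent: 64 prompt tokens plus 133 completion tokens give 197, and the reported 8.55/s matches the completion tokens divided by the time after the first token (16.09s minus 0.53s). The sketch below reproduces that arithmetic. That Spice computes the rate this way is an inference from the numbers, not something the recipe states, and `completion_rate` is a hypothetical helper.

```shell
# Reproduce the completion rate from the timing line:
# completion tokens / (total time - time to first token).
completion_rate() {
  total_s="$1"; first_token_s="$2"; completion_tokens="$3"
  awk -v t="$total_s" -v f="$first_token_s" -v c="$completion_tokens" \
    'BEGIN { printf "%.2f\n", c / (t - f) }'
}

completion_rate 16.09 0.53 133   # prints 8.55
```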

Optional: Enable Hardware Acceleration

If you have the required hardware (NVIDIA GPU or Apple M-series processor), you can build and run Spice with hardware acceleration.

See Building Spice for general instructions to build Spice from source.

For NVIDIA GPU (CUDA)

Install CUDA Toolkit:

Follow the CUDA Toolkit installation guide to install the appropriate version for your system.

Build Spice with CUDA support:

git clone git@github.com:spiceai/spiceai.git
cd spiceai
make install-with-models-cuda

For Apple M-series (Metal)

Ensure you have the latest macOS updates:

Make sure macOS is up to date to get the latest Metal support.

Build Spice with Metal support:

git clone git@github.com:spiceai/spiceai.git
cd spiceai
make install-with-models-metal
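
To decide which of the two make targets above applies on a given machine, a small check can map platform facts to a target. This is a sketch only: `spice_make_target` is a hypothetical helper, and it assumes that Apple M-series machines report `Darwin` and `arm64` from uname and that the presence of `nvidia-smi` on PATH indicates an NVIDIA GPU. It does not verify that the CUDA Toolkit is installed.

```shell
# Hypothetical helper: map platform facts to the make target from this recipe.
# Arguments: OS name (uname -s), machine (uname -m), and "yes"/"no" for
# whether an NVIDIA tool such as nvidia-smi was found on PATH.
spice_make_target() {
  os="$1"; arch="$2"; has_nvidia="$3"
  if [ "$os" = "Darwin" ] && [ "$arch" = "arm64" ]; then
    echo "install-with-models-metal"
  elif [ "$has_nvidia" = "yes" ]; then
    echo "install-with-models-cuda"
  else
    # No accelerated target from this recipe applies.
    echo "none"
  fi
}

# Example:
#   nvidia=no; command -v nvidia-smi >/dev/null 2>&1 && nvidia=yes
#   spice_make_target "$(uname -s)" "$(uname -m)" "$nvidia"
```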
