Works with v1.0+
Use the Llama family of models locally from HuggingFace using Spice.
- Spice CLI installed.
- The following environment variables set or configured in
.env:SPICE_HUGGINGFACE_API_KEY- Granted access to the Llama-3.2-3B-Instruct model on HuggingFace.
For more information, see the Spice HuggingFace documentation.
-
Initialize a new spicepod:
spice init llama-spicepod cd llama-spicepod -
Configure the spicepod with the Llama model:
Edit the
spicepod.ymlfile to include the Llama model configuration:models: - name: llama3 from: huggingface:huggingface.co/meta-llama/Llama-3.2-3B-Instruct params: hf_token: ${ secrets:SPICE_HUGGINGFACE_API_KEY }
An example
spicepod.ymlis also provided in the recipe directory. -
Update
.envwith the HuggingFace variable:Create or update the
.envfile with your HuggingFace API key:echo "SPICE_HUGGINGFACE_API_KEY=your_huggingface_api_key" >> .env
-
Run the spicepod:
spice run
The model will download and load. It will be cached at
~/.cache/huggingfacefor subsequent use. -
Use Spice Chat to interact with the model:
You can now start interacting with the Llama model through the Spice Chat interface.
In a new terminal window run:
spice chat
Enter a question. It will use the locally running Llama model.
Using model: llama3 chat> Roughly how much memory do I need to run llama 3.2-3B-instruct locally as GBs? Llama 3 is a large transformer model, and its memory requirements can be significant. According to the Hugging Face documentation, the inference memory required for Llama 3-3B can vary depending on the specific use case and settings. However, here are some rough estimates: - In-app memory usage for Llama 3-3B models is typically in the range of 6-12 GB of memory per instance for inference. - For batched inference, Llama-3B-6B (which is the 6GB variant) is suggested to have around 12 GB per run, either in picoraw bytes (GB is the correct unit for your request).
You can also interact with the llama model by sending a one-shot chat request through
spice chat <message>spice chat "Roughly how much memory do I need to run llama 3.2-3B-instruct locally as GBs?" Using model: llama3 The amount of memory required to run Llama 3 on a local machine can vary greatly depending on several factors, such as the size of the input dataset, the computational resources, and the specific implementation. However, as a rough estimate, the Llama 3 model has a small footprint of around 350 MB to 400 MB in its `model` directory after being installed, plus additional GBs of memory used for in-processing inputs, caching results, and outputting results. Assuming an average size of 600 MB, 1 GB, of overall memory usage you could expect be sufficient for mostly small to moderate-sized local training and inference tasks. Time: 16.09s (first token 0.53s). Tokens: 197. Prompt: 64. Completion: 133 (8.55/s).
If you have the required hardware (NVIDIA GPU or Apple M-series processor), you can build and run Spice with hardware acceleration.
See Building Spice for general instructions to build Spice from source.
- Install CUDA Toolkit:
Follow the CUDA Toolkit installation guide to install the appropriate version for your system.
- Build Spice with CUDA support:
git clone git@github.com:spiceai/spiceai.git
cd spiceai
make install-with-models-cuda-
Ensure you have the latest macOS updates:
Make sure your macOS is up to date to leverage the latest Metal support.
-
Build Spice with Metal support:
git clone git@github.com:spiceai/spiceai.git
cd spiceai
make install-with-models-metal