
Llama 3.2 for AI Assistants on Edge Devices

· 3 min read
Astra Team
Synaptics

Meta recently announced Llama 3.2 with new lightweight text-only models (1B and 3B) designed specifically for edge devices! These include both pre-trained and instruction-tuned versions with a 128K token context length.

Building for Synaptics Astra SL1680

To run Llama 3.2 1B on the Synaptics Astra SL1680, follow these detailed steps:

1. Download the Llama 3.2 1B GGUF Model

  • Visit the Llama-3.2-1B-Instruct-GGUF page on Hugging Face.
  • Download the quantized model file, such as Llama-3.2-1B-Instruct-Q8_0.gguf. This version offers a good balance between performance and memory usage.
  • Transfer the model file to your Astra SL1680 board. You can use scp (secure copy) from your local machine:
    scp Llama-3.2-1B-Instruct-Q8_0.gguf root@astra-ip-address:/home/root/
    Replace astra-ip-address with your board's IP address.
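Alternatively, if the board already has internet access, you can fetch the model directly on it with the Hugging Face CLI. This is a sketch, not the only way: the repository name below is an assumption, so substitute the actual Llama-3.2-1B-Instruct-GGUF repo you are using, and it assumes Python 3 and pip are available on the board:

```shell
# Install the Hugging Face CLI (assumes Python 3 and pip are available)
pip install -U "huggingface_hub[cli]"

# Download only the Q8_0 file straight to the home directory.
# The repo name is an assumption -- use the GGUF repo you actually chose.
huggingface-cli download bartowski/Llama-3.2-1B-Instruct-GGUF \
    Llama-3.2-1B-Instruct-Q8_0.gguf --local-dir ~/
```

This skips the intermediate copy on your development machine, which matters when the quantized file is over a gigabyte.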

2. Connect to Your Astra SL1680 Board

  • Open a terminal on your development machine.
  • Start an SSH session to the Astra SL1680:
    ssh root@astra-ip-address
  • Ensure you have internet access on the board for the next steps.

3. Build llama.cpp Natively

Next, build llama.cpp using these commands:
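A typical native CMake build looks like the following. This sketch assumes git and CMake are installed on the board; check the llama.cpp README for the current build instructions:

```shell
# Clone the llama.cpp repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Configure and build in Release mode, using all available cores
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"
```

The resulting binaries, including llama-cli, land in build/bin/.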

Once the llama-cli binary is built, run:

./build/bin/llama-cli -m ~/Llama-3.2-1B-Instruct-Q8_0.gguf -p "You are a helpful assistant" -cnv -c 4096 -b 512

llama.cpp will run in interactive mode, so you can chat just as you would with ChatGPT, but on the edge, and of course with a much smaller model, though it is surprisingly good and fast:

main: interactive mode on. sampler seed: 184923642
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 4096, n_batch = 512, n_predict = -1, n_keep = 1
= Running in interactive mode. =
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
system
You are a helpful assistant

> Who are Synaptics Inc?

Synaptics Inc. is a technology company that specializes in designing, manufacturing, and selling semiconductor solutions for various industries, including:
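Interactive chat is one way to use the model; llama.cpp also builds a llama-server binary alongside llama-cli, which exposes an OpenAI-compatible HTTP endpoint so other devices on your network can query the board. A minimal sketch, assuming the same model path as above:

```shell
# Start the server on the board (listens on port 8080 by default)
./build/bin/llama-server -m ~/Llama-3.2-1B-Instruct-Q8_0.gguf -c 4096 --host 0.0.0.0

# From another machine, query the OpenAI-compatible chat endpoint
curl http://astra-ip-address:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Who are Synaptics Inc?"}]}'
```

This turns the SL1680 into a small self-hosted assistant backend rather than a single interactive session.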

Memory Usage on Synaptics Astra SL1680

The table below shows memory consumption when running different Llama 3.2 models on the SL1680:

| Llama-3.2 Model         | Memory (MB) | SL1680 Mem % |
| ----------------------- | ----------- | ------------ |
| 1B-Instruct-Q4_K_M.gguf | 1393        | 41%          |
| 1B-Instruct-Q8_0.gguf   | 1882        | 55%          |
| 3B-Instruct-Q4_K_M.gguf | 2845        | 83%          |
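These numbers roughly track the model file sizes plus runtime overhead. As a back-of-the-envelope check, you can estimate a GGUF file's footprint from the parameter count times the quantization's bits per weight; the figures below (about 1.23e9 parameters for the 1B model, roughly 8.5 bits per weight for Q8_0) are approximate assumptions, and actual memory use also includes the KV cache and runtime buffers:

```shell
# Rough weight-size estimate: parameters * bits-per-weight / 8 bytes.
# 1.23e9 params and 8.5 bits/weight (Q8_0) are approximate assumptions.
awk 'BEGIN { printf "%.0f MB\n", 1.23e9 * 8.5 / 8 / 1e6 }'
```

That puts the Q8_0 weights at roughly 1.3 GB, so the 1882 MB measured above is the weights plus the 4096-token context and runtime buffers.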

Further Reading

For those eager to explore further, here are some detailed, step-by-step tutorials to guide you through the process.