LLMs Optimized for the Edge
In the evolving landscape of natural language processing, LLMs (Large Language Models) and SLMs (Small Language Models) have emerged as powerful tools for applications ranging from chatbots to text completion. Running llama.cpp on embedded systems with the Astra Machina development kit unlocks new potential for deploying localized, efficient AI solutions, ideal for edge computing environments. In this blog, you will learn a high-level approach to bringing llama.cpp to life on the Astra Machina development kit, enabling advanced LLM capabilities directly on-device.
llama.cpp makes it significantly easier to run Llama and other supported models locally on edge devices and a wide variety of hardware. Its lightweight, optimized design allows deployment without powerful GPUs or cloud infrastructure: plain CPU inference is enough to run models efficiently. GPU execution is also supported, and on the Machina developers can leverage OpenCL to offload inference to the GPU.

Compile llama.cpp Binary Natively on Machina
You can build the binary natively on the Machina Board, as we support the required packages and compilers in our OOBE (Out of Box Experience) image v1.2.0 and above.
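As a sketch of what the native build looks like, the commands below follow llama.cpp's standard upstream CMake workflow (repository URL and binary paths reflect the current upstream layout; adjust if your checkout differs):

```shell
# Fetch and build llama.cpp directly on the board.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# CPU-only build; adding -DGGML_OPENCL=ON here is one way to enable the
# OpenCL GPU backend, if your checkout and drivers support it.
cmake -B build
cmake --build build --config Release -j4

# Sanity check: the main CLI binary lands in build/bin/.
./build/bin/llama-cli --version
```

The `-j4` flag matches the board's four cores; a native build on-device takes longer than a cross-compile but avoids any toolchain mismatch.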
Alternatively, you can cross-compile by setting up a cross-compilation environment on your host machine. This includes obtaining a cross-toolchain that matches the architecture of the Astra Machina. You can download the pre-built toolchain for Astra from here.
With the pre-built toolchain in hand, set up the Poky build environment on your host machine. This involves integrating the toolchain and configuring build settings to ensure that the binaries you generate are compatible with the SL1680 processor.
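A typical Poky/Yocto SDK session looks roughly like the following; the exact path and name of the environment-setup script depend on the SDK you installed, so treat them as illustrative:

```shell
# Source the cross-toolchain environment (script name and install path are
# illustrative; use the one your Astra SDK provides).
. /opt/poky/environment-setup-aarch64-poky-linux

# CC, CXX, and the sysroot now point at the cross-toolchain, so CMake
# picks them up automatically and produces SL1680-compatible binaries.
cmake -B build-cross -DCMAKE_BUILD_TYPE=Release
cmake --build build-cross -j"$(nproc)"
```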
Deploying llama.cpp on the Machina
With the binary ready, the next phase is deploying it onto the Machina Board. The SL1680 processor runs the binary comfortably on the CPU; developers may need additional optimizations to manage workloads efficiently on the GPU.
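If you cross-compiled, copying the binary over the network is the simplest route. A sketch, assuming the board is reachable over SSH (the IP address and user are placeholders):

```shell
# Copy the cross-compiled binary to the board (address is a placeholder).
scp build-cross/bin/llama-cli root@192.168.1.50:/home/root/

# Confirm the binary runs on the SL1680.
ssh root@192.168.1.50 '/home/root/llama-cli --version'
```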
You also need to download a model in GGUF format from Hugging Face to the Machina board. GGUF is a binary format designed to load and save models rapidly, allowing for efficient inference.
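For example, the quantized TinyLlama model can be fetched directly from its Hugging Face repository and run with a short prompt (the repository URL and CLI flags follow common llama.cpp usage; verify them against your llama.cpp version):

```shell
# Download a quantized GGUF model from Hugging Face.
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf

# Run a quick generation with 4 threads (-t) and a 64-token limit (-n).
./llama-cli -m tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -t 4 -n 64 -p "Hello, Astra!"
```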
We evaluated the performance of several models running under llama.cpp. The metrics are summarized in the table below:
| Model | Threads | Memory Used | Eval Speed (tokens/s) |
|---|---|---|---|
| tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf | 4 | 941 MB | ~10.09 |
| tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf | 1 | 916 MB | ~2.90 |
| gemma-2b-it-q4_k_m.gguf | 4 | 2515 MB | ~4.97 |
| gemma-2b-it-q4_k_m.gguf | 1 | 2491 MB | ~1.33 |
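Numbers like these can be collected with the `llama-bench` tool that builds alongside llama.cpp (the binary path follows the upstream build layout; check your version for the exact flags):

```shell
# -t sets the thread count, -n the number of tokens generated for the
# text-generation (eval) measurement.
./build/bin/llama-bench -m tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -t 4 -n 128
./build/bin/llama-bench -m tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -t 1 -n 128
```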
Observations
Deploying advanced language models with llama.cpp on embedded systems showcases the immense potential of edge computing for language processing. Leveraging the SL1680 processor on the Astra Machina development kit, you can run sophisticated language models directly on-device without relying on cloud infrastructure. This enables real-time natural language understanding and generation with low latency and enhanced privacy, making it ideal for applications that need efficient, local AI in environments where connectivity or cloud dependence is limited.
- Using smaller versions of the model, such as TinyLlama, is recommended for edge devices, as they require less computational power than larger models like LLaMA2. These smaller models are better suited for devices with limited hardware resources.
- The Q4_K_M quantization scheme provided the best tradeoff between accuracy and inference speed in our testing, delivering strong throughput without a significant drop in output quality.
- The Astra Machina has 4 GB of RAM, with around 3.5 GB available for use. The model should fit within the available RAM to prevent memory swapping during inference, which can cause substantial delays; keeping everything in physical memory is crucial for real-time performance.
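A quick pre-flight check can catch oversized models before inference starts. The sketch below compares a model's size against available RAM; `HEADROOM_MB` is an illustrative allowance for the KV cache and runtime buffers, not a measured value:

```shell
# Rough pre-flight check: does a model of a given size fit in available RAM?
# HEADROOM_MB is an illustrative allowance for KV cache and runtime buffers.
HEADROOM_MB=512

fits_in_ram() {
  model_mb=$1
  avail_mb=$2
  # Succeeds (exit 0) when model plus headroom fits in available memory.
  [ $(( model_mb + HEADROOM_MB )) -le "$avail_mb" ]
}

# Sizes from the benchmark table, against ~3500 MB usable on the Machina.
fits_in_ram 941 3500  && echo "tinyllama Q4_K_M: fits"
fits_in_ram 2515 3500 && echo "gemma-2b Q4_K_M: fits"
```

On a live board, the model size can come from `stat -c%s` on the GGUF file and available memory from `free -m`.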
Further Reading
For those eager to explore further, here are some detailed, step-by-step tutorials to guide you through the process.