
LLMs on Astra using llama.cpp

This tutorial will guide you through running the TinyLlama model with llama.cpp natively on a Synaptics Astra™ Machina™ board using the SL1680 processor.

note

This tutorial is compatible with all SL16xx boards. While inference performance may vary, the steps remain the same across all processors.

Additionally, llama.cpp is not limited to the LLaMA models; it also supports others such as Phi-3, Mistral, TinyLlama, and many more, making it a versatile tool for a range of machine learning applications.

View more details about llama.cpp at Official GitHub Repo.

Prerequisites

You can compile the binary natively on the Machina board, since the required packages and compilers are included in our OOBE (Out of Box Experience) image v1.2.0 and above.

If you prefer to cross-compile (building binaries on a host machine for customization purposes), please follow the steps in the Cross Compile llama.cpp tutorial.

Step 1: Generate Binary for llama.cpp

You can build llama.cpp natively on Machina. Open a terminal on Machina (or SSH into it), clone the llama.cpp repository from GitHub, and build the llama-cli binary:

The llama-cli binary will be created in build/bin/ inside the llama.cpp directory. This binary is used to run the supported models.

Step 2: Download Supported Models for llama.cpp

You can run any model listed in the Supported Models section of the llama.cpp GitHub repo, provided it is in GGUF format.

You can get the TinyLlama 1.1B model from Hugging Face. In this tutorial, we will use the quantized model tinyllama-1.1b-chat-v1.0.Q4_0.gguf. Download it directly on the board:

wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_0.gguf

Step 3: Running llama.cpp on Machina Board

With the model downloaded and the binary built, you can now run the TinyLlama model from inside the llama.cpp folder. Here, -m specifies the model file, -p the prompt, and -n the number of tokens to generate:

./build/bin/llama-cli -m tinyllama-1.1b-chat-v1.0.Q4_0.gguf -p "Tell me about Synaptics Incorporated" -n 128

The output from your Astra Machina board should look like:

.. ..

Synaptics Inc. Is a leading provider of touch controller solutions for the consumer, industrial, and enterprise markets. They specialize in touch controllers, sensors, and other technologies for advanced display and robotics applications.

Synaptics has a global presence, with operations in 30 countries and a workforce of 1,000 employees. They have a strong focus on innovation, with a portfolio of over 2,000 patents and patents pending.

Synaptics offers a wide range of touch controller solutions, including surface touch controllers

llama_perf_sampler_print:    sampling time =      12.89 ms /   156 runs   (    0.08 ms per token, 12101.47 tokens per second)
llama_perf_context_print:        load time =     469.69 ms
llama_perf_context_print: prompt eval time =    2417.15 ms /    28 tokens (   86.33 ms per token,    11.58 tokens per second)
llama_perf_context_print:        eval time =   14311.09 ms /   127 runs   (  112.69 ms per token,     8.87 tokens per second)
llama_perf_context_print:       total time =   27692.10 ms /   155 tokens

Congratulations

You have successfully run the TinyLlama model using llama.cpp on your Astra Machina Board. You can now explore further by trying different models or integrating llama.cpp into your projects. For more advanced usage and options, refer to the llama.cpp GitHub repository.

For further reading, check out the LLMs Optimized for the Edge blog to learn more about running LLMs on edge devices. Also, check out Llama 3.2 for AI Assistants on Edge Devices to try Llama models on Astra.

If you prefer to cross-compile (building binaries on a host machine for customization purposes), follow the steps in the next tutorial.