Cross Compile llama.cpp

This tutorial guides you through cross-compiling llama.cpp binaries for Synaptics Astra™ Machina™, that is, building the binaries on a host machine rather than on the board itself.

note

This tutorial is compatible with all SL16xx boards. While inference performance may vary, the steps remain the same across all processors.

Additionally, llama.cpp is not limited to just the LLaMA models; it also supports other models such as Phi3, Mistral, TinyLlama, and many more, making it a versatile tool for various machine learning applications.

View more details about llama.cpp at Official GitHub Repo.

Prerequisites

If you prefer to natively compile the binary directly on the Machina board, please follow the earlier LLMs on Astra using llama.cpp tutorial.

For cross-compilation, this tutorial assumes an Ubuntu host development machine. On Ubuntu, first set up a Yocto cross-compilation environment specific to the Astra Machina board by downloading the pre-built toolchain from the Astra SDK Release page.

Download the standalone toolchain for your processor. In this tutorial, we will use SL1680.

Once downloaded, open a terminal in Ubuntu and run the command:

bash sl1680-poky-glibc-x86_64-astra-media-cortexa73-sl1680-toolchain-4.0.17.sh

Now source the environment setup script to activate the cross-compilation environment:

. /opt/poky/4.0.17/environment-setup-cortexa73-poky-linux
tip

To check whether the environment is active, run the following in your Ubuntu terminal:

echo $CC

If the environment is active, this prints the target cross-compiler (for example, aarch64-poky-linux-gcc with Cortex-A73 flags); an empty line means the setup script has not been sourced in this shell.

Step 1: Generate Binary for llama.cpp

Open a terminal in Ubuntu (Host machine) and clone the llama.cpp repository from GitHub:

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

Next, build the llama-cli binary from source. For compatibility with this toolchain, the deprecated -lstdc++fs linker flag should be replaced with -lstdc++.
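The build commands themselves did not survive in this page, so here is a minimal sketch using llama.cpp's standard CMake build, assuming the Yocto environment from the Prerequisites is sourced in the same shell (the LLAMA_CURL option and the linker-flag override are this sketch's choices, not taken from the original tutorial):

```shell
# Configure; the sourced Yocto environment provides the cross CC/CXX.
# Override the linker flags so -lstdc++ is used instead of the
# deprecated -lstdc++fs, and disable libcurl for a simpler cross build.
cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_EXE_LINKER_FLAGS="-lstdc++" \
  -DLLAMA_CURL=OFF

# Build just the llama-cli target, using all host cores
cmake --build build --config Release --target llama-cli -j"$(nproc)"
```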

The llama-cli binary will be created in ~/llama.cpp/build/bin/. You will use this binary to run the supported models, so it needs to be copied to the Machina board.
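As an optional sanity check before copying, the host's file utility (if installed) should report an AArch64 ELF binary rather than an x86-64 one:

```shell
file ~/llama.cpp/build/bin/llama-cli
# Expect something like: ELF 64-bit LSB pie executable, ARM aarch64 ...
```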

Step 2: Setting up Astra Machina Board

Use ADB to access the Astra Machina Board from a host machine such as Ubuntu.

Follow these steps from the Access Machina Board tutorial to Setup ADB.

Once in the ADB shell, create a new directory for llama.cpp on the Machina board:

mkdir /home/llama

Now, open a new Ubuntu terminal on your development machine and use adb push to copy the binary you built in Step 1 to the Machina board:

adb push ~/llama.cpp/build/bin/llama-cli /home/llama

Once the binary is on the board, make it executable from the ADB shell:

chmod 755 /home/llama/llama-cli

Step 3: Download Supported Models for llama.cpp

You can run any supported model mentioned in the Supported Models section in the llama.cpp GitHub Repo.

You can get TinyLlama models from Hugging Face. In this tutorial, we will use the quantized model tinyllama-1.1b-chat-v1.0.Q4_0.gguf. Download it into the llama directory on the board:

wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_0.gguf
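Optionally verify that the download completed: a Q4_0 quantization of a 1.1B-parameter model should be on the order of 600 MB, so a file of only a few kilobytes usually means wget saved an error page instead of the model.

```shell
# Check the downloaded model's size
ls -lh tinyllama-1.1b-chat-v1.0.Q4_0.gguf
```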

Step 4: Running llama.cpp on Machina Board

With the model downloaded and the binary built, you can now run the TinyLlama model inside the llama folder:

./llama-cli -m tinyllama-1.1b-chat-v1.0.Q4_0.gguf -p "Tell me about Synaptics Incorporated" -n 128

The output from your Synaptics Astra board should look like:

.. ..

Synaptics Inc. Is a leading provider of touch controller solutions for the consumer, industrial, and enterprise markets. They specialize in touch controllers, sensors, and other technologies for advanced display and robotics applications.

Synaptics has a global presence, with operations in 30 countries and a workforce of 1,000 employees. They have a strong focus on innovation, with a portfolio of over 2,000 patents and patents pending.

Synaptics offers a wide range of touch controller solutions, including surface touch controllers

llama_perf_sampler_print: sampling time = 12.89 ms / 156 runs ( 0.08 ms per token, 12101.47 tokens per second)
llama_perf_context_print: load time = 469.69 ms
llama_perf_context_print: prompt eval time = 2417.15 ms / 28 tokens ( 86.33 ms per token, 11.58 tokens per second)
llama_perf_context_print: eval time = 14311.09 ms / 127 runs ( 112.69 ms per token, 8.87 tokens per second)
llama_perf_context_print: total time = 27692.10 ms / 155 tokens
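Generation can be tuned with additional llama-cli flags. The options below are standard llama.cpp command-line options (run ./llama-cli --help on the board for the authoritative list); the values shown are illustrative:

```shell
# -n: max tokens to generate, -t: CPU threads, --temp: sampling temperature
./llama-cli -m tinyllama-1.1b-chat-v1.0.Q4_0.gguf \
  -p "Tell me about Synaptics Incorporated" \
  -n 128 -t 4 --temp 0.7
```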

Congratulations

You have successfully run the TinyLlama model using llama.cpp on your Astra Machina Board. You can now explore further by trying different models or integrating llama.cpp into your projects. For more advanced usage and options, refer to the llama.cpp GitHub repository.

Learn more about how you can run vision-language models in the next tutorial.