Large Language Models
LLMs are powerful tools with many uses, but their output may contain inaccuracies, bias, or safety issues.
Large Language Models (LLMs) are models trained on large amounts of text to understand and generate human-like language, helping with tasks like answering questions, writing content, or summarizing text.
Large Language Models (LLMs) underpin popular services like OpenAI's ChatGPT. Let's run an LLM on Synaptics Astra right now!
This assumes you are familiar with setting up your Astra board. If not, please refer to the setup tutorial.
This quick guide is compatible with all SL16xx boards. While inference speed may vary, the steps remain the same across all Astra SL-Series processors.
Running LLMs like Gemma and Qwen
Gemma 3 (270M parameters) and Qwen 1.5 (0.5B parameters) are compact, efficient LLMs designed for edge devices. Their lightweight architecture allows them to run smoothly on hardware with limited resources, such as the Synaptics Astra SL1680 with 4GB RAM.
Installing llama-cpp-python
To run LLMs on the Astra board, we will use the llama-cpp-python package, which provides convenient Python bindings for Georgi Gerganov's llama.cpp.
We have pre-built llama-cpp-python for the Astra Yocto Linux SDK.
SQLite3 is required for certain AI model operations. Astra SDK OOBE v1.7 images and above already have SQLite3 pre-installed. For previous versions, install it using the following commands:
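Before installing anything, you can quickly check whether the Python SQLite3 bindings are already present on your image. This check is our own suggestion, not an official SDK command:

```python
# Verify that Python's SQLite3 bindings are available and print the
# version of the SQLite library they are linked against.
import sqlite3

print(sqlite3.sqlite_version)
```

If the import fails, proceed with the installation steps for your image version.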
Running the Gemma 3 270M Model
Gemma 3 270M is a compact, 270-million-parameter model designed from the ground up for task-specific fine-tuning, with strong instruction-following and text-structuring capabilities already built in.

To start an interactive chat session with the Gemma 3 model, use:
python3 -m llm.gemma
You can also run other large language models, such as Qwen, from the Examples repository with:
python3 -m llm.qwen

You can change the question you want the LLM to answer by editing the response_stream call in examples/llm/qwen.py:
response_stream = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Tell me about Synaptics Inc.?"}],  # change "content" accordingly
    stream=True  # enable streaming
)
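With stream=True, llama-cpp-python yields OpenAI-style chunks, so the reply is printed by iterating over response_stream. The sketch below shows that consumption loop; a stub generator stands in for the model so the chunk shape is visible without loading any weights (the stub and its sample text are illustrative only):

```python
# Sketch: consuming a llama-cpp-python streaming chat response.
# fake_stream() stands in for llm.create_chat_completion(..., stream=True).

def fake_stream():
    # Chunks mimic llama-cpp-python's OpenAI-compatible streaming format.
    for piece in ["Synaptics ", "makes ", "edge-AI ", "processors."]:
        yield {"choices": [{"delta": {"content": piece}}]}
    yield {"choices": [{"delta": {}}]}  # final chunk carries no content

def print_stream(response_stream):
    parts = []
    for chunk in response_stream:
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            print(delta["content"], end="", flush=True)
            parts.append(delta["content"])
    print()
    return "".join(parts)

print_stream(fake_stream())
```

The same print_stream loop works unchanged on the real response_stream returned by the model.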
Running LLMs like Deepseek
DeepSeek R1 gained a lot of attention for benchmarking close to OpenAI's o1 models, along with claims of a far more efficient training process. The full DeepSeek R1 model is 671B parameters and requires > 1TB RAM to run inference, far too big for embedded devices today. However, a number of edge-friendly distillations of R1 were also released by DeepSeek.
DeepSeek R1 Distill Qwen2.5 1.5B
Our focus is the practical implications of DeepSeek R1 for edge AI inference, so we'll look at DeepSeek R1 Distill Qwen2.5 1.5B - the smallest model in DeepSeek's R1 release - which can run comfortably on the Synaptics Astra SL1680.
| Term | Meaning |
| --- | --- |
| DeepSeek R1 | An open-weight model that used reinforcement learning to incentivize chain-of-thought (CoT) reasoning |
| Distill | The result of knowledge distillation, where a smaller student model is trained from a larger pre-trained teacher model (DeepSeek R1 in this case) |
| Qwen2.5 | Alibaba's Qwen2.5 Math model serves as the student model in the distillation |
| 1.5B | 1.5B parameters, a measure of the size of the Qwen student model |
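For context, classic knowledge distillation trains the student against the teacher's softened output distribution; DeepSeek's R1 distillations instead fine-tune the student on reasoning traces generated by R1, a sequence-level variant of the same idea. The classic objective, as a sketch:

```latex
% Classic distillation loss: cross-entropy on the hard labels plus
% KL divergence to the teacher's temperature-softened logits.
% z_s, z_t: student and teacher logits; T: temperature; alpha: mixing weight.
\mathcal{L} = (1-\alpha)\,\mathrm{CE}\big(y,\ \sigma(z_s)\big)
            + \alpha\, T^2\,\mathrm{KL}\big(\sigma(z_t/T)\ \big\|\ \sigma(z_s/T)\big)
```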
The following command downloads DeepSeek R1 Distill Qwen2.5 1.5B and runs a prompt on it:
python3 -m llm.deepseek
After a considerable chain-of-thought explanation, the model should eventually reply with the correct result:
Final Answer: \[ \boxed{504} \]
The code example runs an AIME question on the DeepSeek R1 distilled model; the prompt can be changed in examples/llm/deepseek.py:
problem = "An isosceles trapezoid has an inscribed circle tangent to each of its four sides. The radius of the circle is 3, and the area of the trapezoid is 72. Let the parallel sides of the trapezoid have lengths r and s, with r not equal to s. Find r^2 + s^2."
response_stream = llm.create_chat_completion(
    messages=[{"role": "user", "content": problem}],
    stream=True  # enable streaming
)
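Because R1-style models emit a long reasoning trace before the result, it can be handy to pull out just the final \boxed{...} answer from the accumulated reply. The helper below is our own illustration (extract_boxed_answer is not part of the Examples repository):

```python
import re

def extract_boxed_answer(text):
    """Return the contents of the last \\boxed{...} in a model reply, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

reply = r"...after much reasoning... Final Answer: \[ \boxed{504} \]"
print(extract_boxed_answer(reply))  # -> 504
```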
Additional resources
- For a step-by-step tutorial on how to generate these binaries for llama.cpp, please read the LLMs on Astra tutorial.
- For running Meta Llama 3.2 models using Llamafile, please read Llama on Astra.
For more advanced hands-on tutorials, visit the next section: Tutorials.