Vision Language Models on Astra

This tutorial will guide you through running Vision Language Models (VLMs) using llama.cpp natively on Synaptics Astra™ Machina™ boards. VLMs are multimodal AI models that can understand and generate information using both images and text.

info
  • VLMs combine a language model (LLM) with a vision encoder and a multi-modal projection (mmproj) model.
  • Popular open-source VLMs include SmolVLM, MobileVLM, LLaVA, Qwen2-VL, and more.
note

This tutorial is compatible with SL1680 and SL1640 boards with 4 GB RAM. While inference may vary, the steps remain the same across all processors.

In this tutorial, we will guide you through running the SmolVLM-256M model on Astra, a lightweight vision-language model optimized for edge devices. You can also run the LFM2-VL-450M and MobileVLM-1.7B models on the Astra SL Series using llama.cpp.

Other vision-language models with larger parameter counts, such as Qwen2-VL-2B, LLaVA-Phi-3, and Llama 3.2 Vision, require significantly more RAM at runtime than the SL1680 can provide. These models would need memory optimizations, and we are exploring ways to support them on the SL1680.
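If you want to check how much headroom your board has before attempting a larger model, Linux reports memory directly through /proc/meminfo (a quick sanity check, not a Synaptics-specific tool):

```shell
# Print total and currently available memory on the board
grep -E 'MemTotal|MemAvailable' /proc/meminfo
```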

View more details about llama.cpp at Official GitHub Repo.

Prerequisites

You can natively compile the binary on the Machina board since we support the required packages and compilers in our OOBE (Out of Box Experience) image v1.2.0 and above.

If you prefer to cross-compile (building binaries on a host machine for customization purposes), please follow the steps from Cross Compile llama.cpp tutorial and generate llama-mtmd-cli binary instead.

Step 1: Generate Binary for llama.cpp

You can generate the llama.cpp binaries natively on Machina. Open a terminal on Machina (or SSH into it), clone the llama.cpp repository from GitHub, and build the llama-mtmd-cli binary, which is used for multimodal models.
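A minimal native build sequence looks like this (the repository URL and the `build` directory name are the llama.cpp defaults; build time on the board will vary):

```shell
# Clone llama.cpp and build the multimodal CLI natively on the board
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --target llama-mtmd-cli -j"$(nproc)"
```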

llama-mtmd-cli is a command-line interface tool designed to run multimodal models using LLaMA-based backends. Learn more at the llama.cpp mtmd examples.

If the llama-mtmd-cli binary has not been built yet, create it with:

cmake --build build --target llama-mtmd-cli -j$(nproc)

The llama-mtmd-cli binary will be created in build/bin/ inside the llama.cpp directory. You will use this binary to run the supported models.

Step 2: Download Supported VLM Models

You can run any supported model mentioned in the Supported Models section in the llama.cpp GitHub Repo.

SmolVLM is a compact, fast VLM suitable for edge devices, built by the Hugging Face 🤗 team. For this tutorial we will use the SmolVLM-256M-Instruct-GGUF model, one of the smallest multimodal models in the world.

For vision-language models, we need two types of models:

1. Language Model (LLM) – This is the core large language model responsible for generating text responses. It understands and processes natural language. In tools like llama-mtmd-cli, this is specified using the -m flag.

SmolVLM-256M-Instruct-Q8_0.gguf is a quantized version (Q8_0) of the SmolVLM 256M parameter language model, designed for fast inference on edge devices.

2. Multi-modal projection (mmproj) – The multi-modal projection model maps the image embeddings (from the vision encoder) into the same embedding space as the language model. This is specified using the --mmproj flag.

mmproj-SmolVLM-256M-Instruct-Q8_0.gguf is the mmproj model in GGUF format for SmolVLM.

Together, these models enable the system to interpret images and generate meaningful text based on both visual and textual inputs. The typical flow is:

Image → Vision Encoder → mmproj → Language Model → Text Output

Download the quantized model and mmproj:

wget https://huggingface.co/ggml-org/SmolVLM-256M-Instruct-GGUF/resolve/main/SmolVLM-256M-Instruct-Q8_0.gguf
wget https://huggingface.co/ggml-org/SmolVLM-256M-Instruct-GGUF/resolve/main/mmproj-SmolVLM-256M-Instruct-Q8_0.gguf
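As an optional sanity check, you can confirm the downloads are valid GGUF files: the GGUF format begins with the four ASCII bytes `GGUF` (the filenames below match the wget commands above):

```shell
# Verify each downloaded file starts with the GGUF magic bytes
for f in SmolVLM-256M-Instruct-Q8_0.gguf mmproj-SmolVLM-256M-Instruct-Q8_0.gguf; do
  if [ "$(head -c 4 "$f")" = "GGUF" ]; then
    echo "$f: OK"
  else
    echo "$f: not a valid GGUF file" >&2
  fi
done
```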

Step 3: Running llama.cpp on Machina Board

With both models downloaded and the binary built, download an image (for example, catdog.jpg) that you want your vision model to describe.

[Image: catdog.jpg]

Now, you can run the SmolVLM-256M model inside the llama.cpp folder:

./build/bin/llama-mtmd-cli \
-m models/SmolVLM-256M-Instruct-Q8_0.gguf \
--mmproj models/mmproj-SmolVLM-256M-Instruct-Q8_0.gguf \
--image catdog.jpg \
-p "Give polite answers to the user's questions. USER: <image>\n What's in this image? ASSISTANT:"

The output from your Synaptics Astra board should look like:

...
main: loading model: models/SmolVLM-256M-Instruct-Q8_0.gguf
encoding image slice...
image slice encoded in 14593 ms
decoding image batch 1/1, n_tokens_batch = 64
image decoded (batch 1/1) in 643 ms

A golden retriever lying on the floor next to a grey and white cat.

llama_perf_context_print: load time = 413.81 ms
llama_perf_context_print: prompt eval time = 15596.12 ms / 101 tokens (154.42 ms per token, 6.48 tokens per second)
llama_perf_context_print: eval time = 374.54 ms / 17 runs (22.03 ms per token, 45.39 tokens per second)
llama_perf_context_print: total time = 16190.02 ms / 118 tokens
llama_perf_context_print: graphs reused = 16

The most time-consuming step here is image encoding (converting the image into embeddings via feature extraction), which takes around 14 seconds. Text generation is much faster, at more than 40 tokens per second.
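If you want to bound generation time during experimentation, the common llama.cpp options also apply to llama-mtmd-cli; for example, `-n` caps the number of generated tokens and `-t` sets the CPU thread count (flag behavior assumed from llama.cpp's shared command-line options):

```shell
# Rerun with a token cap and an explicit thread count (common llama.cpp options)
./build/bin/llama-mtmd-cli \
  -m models/SmolVLM-256M-Instruct-Q8_0.gguf \
  --mmproj models/mmproj-SmolVLM-256M-Instruct-Q8_0.gguf \
  --image catdog.jpg \
  -n 32 -t "$(nproc)" \
  -p "USER: <image>\nWhat's in this image? ASSISTANT:"
```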

Congratulations

You have successfully run a Vision Language Model using llama.cpp on your Astra Machina Board. Try different models for speed/accuracy trade-offs, or integrate llama.cpp into your projects.

We also offer other approaches, such as an On-device AI Assistant, which enables RAG-based responses. You can also add Object Detection running on the NPU.

For more advanced usage and options, refer to the llama.cpp multimodal documentation on GitHub.