
Framework API

The core functionality of the Synap framework is the execution of precompiled neural networks, provided by the Network class. The Network class is designed to be simple to use in the most common cases while remaining flexible enough for most advanced use cases. The actual inference runs on different hardware units (NPU, GPU, CPU, or a combination of them) depending on how the model was compiled.

Basic Usage

Network Class

The Network class is quite simple, as shown in the figure below.

Only two things can be done with a network:

  • Load a model, by providing a compiled model in .synap format.
  • Execute inference.

The network also exposes an array of input tensors, where the data to be processed is placed, and an array of output tensors, which contain the results after each inference.

Figure 5 Network

synaptics::synap::Network

Load and execute a neural network on the NPU accelerator.

Summary
  • bool load_model(const std::string &model_file, const std::string &meta_file = ""): Load a model from file.
  • bool load_model(const void *model_data, size_t model_size, const char *meta_data = nullptr): Load a model from memory.
  • bool predict(): Run inference.
Public Functions
bool load_model(const std::string &model_file, const std::string &meta_file = "")
  • Load a model.
  • If another model was previously loaded, it is released before the new model is loaded.
  • Parameters:
    • model_file: Path to the .synap model file. Can also be the path to a legacy .nb model file.
    • meta_file: For legacy .nb models, this must be the path to the model metadata file (JSON format). In all other cases it must be the empty string.
  • Returns: true if successful.
bool load_model(const void *model_data, size_t model_size, const char *meta_data = nullptr)
  • Load a model.
  • If another model was previously loaded, it is released before the new model is loaded.
  • Parameters:
    • model_data: Model data, as read for example from a model.synap file with fread(). The caller keeps ownership of the model data and can delete it once this method returns.
    • model_size: Model size in bytes.
    • meta_data: For legacy .nb models, this must be the model metadata (JSON format). In all other cases it must be nullptr.
  • Returns: true if successful.
bool predict()
  • Run inference.
  • The input data to be processed are read from the input tensors, and the inference results are generated in the output tensors.
  • Returns: true if successful, false if inference failed or the network was not correctly initialized.
Public Members
Tensors inputs
  • Collection of input tensors, which can be accessed by index and iterated.
Tensors outputs
  • Collection of output tensors, which can be accessed by index and iterated.

Using the Network

The prerequisite for executing a neural network is to create a Network object and load its model in .synap format. This file is generated when the network is converted with the SyNAP toolkit. This has to be done only once; after the model has been loaded, the network can be used for inference:

  1. Put the input data in the network input tensors.
  2. Call the network's predict() method.
  3. Get the results from the network output tensors.

Figure 6 Running inference

Example

Network net;
net.load_model("model.synap");
vector<uint8_t> in_data = custom_read_input_data();
net.inputs[0].assign(in_data.data(), in_data.size());
net.predict();
custom_process_result(net.outputs[0].as_float(), net.outputs[0].item_count());

Please note:

  • All memory allocation and alignment for weights and input/output data is done automatically by the Network object.
  • All memory is automatically released when the Network object is destroyed.
  • For simplicity, all error checking is omitted. If something goes wrong, the methods normally return false. No explicit error code is returned, because errors are often too complex to be explained with a simple enumerative code. Detailed information about errors can be found in the logs.
  • The routine named custom_read_input_data in the example is a placeholder for user code.
  • In the code above, a data copy takes place when the in_data vector is assigned to the tensor. The data contained in the in_data vector cannot be used directly for inference, because there is no guarantee that it is aligned and padded as required by the hardware. In most cases the cost of this extra copy is negligible; when this is an issue, the copy can sometimes be avoided by writing directly into the tensor data buffer, for example:
custom_generate_input_data(net.inputs[0].data(), net.inputs[0].size());
net.predict();
  • The data type in a tensor depends on how the network was generated. Common data types include float16, float32, and quantized uint8 and int16. assign() and as_float() take care of all the required data conversions.

Using just the simple methods shown in this section, it is possible to run inference on the NPU hardware accelerator. This is almost all that needs to be known to use SyNAP in most applications. The following sections explain in detail what happens behind the scenes, making it possible to take full advantage of the available hardware for more demanding use cases.

Advanced Topics

Tensors

We saw in the previous section that all access to the network input and output data is done via tensor objects, so it's worth looking in detail at what a Tensor object can do. Basically, a tensor makes it possible to:

  • Get information and attributes about the data it contains.
  • Access the data.
  • Access the underlying Buffer used to contain the data. More on this in the next section.

Figure 7 Tensor

synaptics::synap::Tensor

Synap data tensor.

Tensors cannot be created outside a Network; the user can only access the tensors created by the Network itself.

Summary
  • const std::string &name() const: Get the tensor name.
  • const Shape &shape() const: Get the tensor shape.
  • const Dimensions dimensions() const: Get the tensor dimensions.
  • Layout layout() const: Get the tensor layout.
  • std::string format() const: Get the tensor format.
  • DataType data_type() const: Get the tensor data type.
  • Security security() const: Get the tensor security attributes.
  • size_t size() const: Get the size in bytes of the tensor data.
  • size_t item_count() const: Get the number of items in the tensor.
  • bool is_scalar() const: Check if the tensor is a scalar.
  • bool assign(const uint8_t *data, size_t count): Normalize and copy data into the tensor data buffer.
  • bool assign(const int16_t *data, size_t count): As the previous assign, but for int16_t data.
  • bool assign(const float *data, size_t count): As the previous assign, but for float data.
  • bool assign(const void *data, size_t size): Copy raw data into the tensor data buffer.
  • bool assign(const Tensor &src): Copy the content of a tensor into the tensor data buffer.
  • bool assign(int32_t value): Write a value into the tensor data buffer.
  • template <typename T> T *data(): Get a pointer to the beginning of the data inside the tensor data buffer.
  • void *data(): Get a pointer to the raw data inside the tensor data buffer.
  • const float *as_float() const: Get a pointer to the tensor content converted to float.
  • Buffer *buffer(): Get a pointer to the tensor's current data Buffer.
  • bool set_buffer(Buffer *buffer): Set the tensor's current data buffer.
Public Functions
const std::string &name() const
  • Get the tensor name.
  • Useful in networks with multiple inputs or outputs, to identify a tensor by string instead of by positional index.
  • Returns: The tensor name.
const Shape &shape() const
  • Get the tensor shape, that is, the number of elements in each dimension. The order of the dimensions is specified by the tensor layout.
  • Returns: The tensor shape.
const Dimensions dimensions() const
  • Get the tensor dimensions, that is, the number of elements in each dimension. The value returned is independent of the tensor layout.
  • Returns: The tensor dimensions (all 0 if the tensor rank is not 4).
Layout layout() const
  • Get the tensor layout, that is, how the data is organized in memory. SyNAP supports two layouts: NCHW and NHWC. The N dimension (number of samples) is present for compatibility with standard conventions but must always be 1.
  • Returns: The tensor layout.
std::string format() const
  • Get the tensor format, that is, a description of what the data represents. This is a free-format string whose meaning is application dependent, for example "rgb" or "bgr".
  • Returns: The tensor format.
DataType data_type() const
  • Get the tensor data type. Integer types are used to represent quantized data. Details of the quantization parameters and quantization scheme are not directly available; the user can convert quantized data to 32-bit float with the as_float() method below.
  • Returns: The type of each item in the tensor.
Security security() const
  • Get the tensor security attributes.
  • Returns: The security attributes of the tensor (none if the model is not secure).
size_t size() const
  • Returns: The size in bytes of the tensor data.
size_t item_count() const
  • Get the number of items in the tensor. The tensor size() is always equal to item_count() multiplied by the size of the tensor data type.
  • Returns: The number of data items in the tensor.
bool is_scalar() const
  • Returns: true if this is a scalar tensor, that is, it contains a single element. (A scalar tensor has a shape with a single dimension of size 1.)
bool assign(const uint8_t *data, size_t count)
  • Normalize and copy data into the tensor data buffer.
  • The data is normalized and converted to the tensor's type and quantization scheme. The number of data items must be equal to the tensor's item_count().
  • Parameters:
    • data: Pointer to the data to be copied.
    • count: Number of data items to be copied.
bool assign(const int16_t *data, size_t count)
  • As the previous assign, but for int16_t data.
  • Returns: true if successful.
bool assign(const float *data, size_t count)
  • As the previous assign, but for float data.
  • Returns: true if successful.
bool assign(const void *data, size_t size)
  • Copy raw data into the tensor data buffer. The data is treated as raw, so no normalization or conversion is performed. The data size must be equal to the tensor's size().
  • Returns: true if successful.
bool assign(const Tensor &src)
  • Copy the content of a tensor into the tensor data buffer.
  • No normalization or conversion is performed; the data type and size of the two tensors must match.
  • Parameters:
    • src: Source tensor containing the data to be copied.
  • Returns: true if successful, false if the types or sizes don't match.
bool assign(int32_t value)
  • Write a value into the tensor data buffer.
  • Only valid if the tensor is a scalar. The value is converted to the tensor data type: 8-, 16-, or 32-bit integer. Before being written to the data buffer, the value is also rescaled if needed, according to the tensor format attributes.
  • Parameters:
    • value: Value to be copied.
  • Returns: true if successful.
template <typename T> T *data()
  • Get a pointer to the data inside the tensor data buffer, if directly accessible.
  • This only works if T matches the tensor's data_type() and no normalization/quantization is required. Example usage: uint8_t* data8 = tensor.data<uint8_t>();
  • Returns: Pointer to the data inside the data buffer, or nullptr.
void *data()
  • Get a pointer to the raw data inside the tensor data buffer, if any.
  • The method returns a void pointer because the actual data type is the one returned by the data_type() method.
  • Returns: Pointer to the raw data inside the data buffer, or nullptr if none.
const float *as_float() const
  • Get a pointer to the tensor content converted to float.
  • This method always returns a float pointer. If the actual data type of the tensor is not float, the conversion is performed internally, so the user doesn't have to care how the data is represented internally. Note that the returned pointer points to floating-point data internal to the tensor: this means it must not be freed; the memory is released automatically when the tensor is destroyed.
  • Returns: Pointer to a float[item_count()] array representing the tensor content converted to float (nullptr if the tensor has no data).
Buffer *buffer()
  • Get a pointer to the tensor's current data Buffer, if any.
  • This is the tensor's default buffer, unless the user has assigned a different buffer with set_buffer().
  • Returns: The current data buffer, or nullptr if none.
bool set_buffer(Buffer *buffer)
  • Set the tensor's current data buffer.
  • The buffer size must be 0 or match the tensor size, otherwise it is rejected (an empty buffer is automatically resized to the tensor size). The provided buffer should normally outlive the tensor itself. If the buffer object is destroyed before the tensor, it is automatically unset from the tensor.
  • Parameters:
    • buffer: Buffer to be used for this tensor. The buffer size must match the tensor size (or be 0).
  • Returns: true if successful.
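As a sketch of how these getters fit together, the following hypothetical snippet (assuming a Network `net` with a loaded model, and that `Tensors` supports range-for iteration as described above) logs the main attributes of every input tensor before deciding how to feed data:

```cpp
// Log the main attributes of each input tensor of an already loaded network.
for (Tensor& t : net.inputs) {
    cout << "tensor '" << t.name() << "'"
         << " items=" << t.item_count()
         << " bytes=" << t.size()
         << " format=" << t.format()
         << " scalar=" << t.is_scalar() << endl;
}
```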

Here below is a list of all the data types supported in a tensor:

enum class synaptics::synap::DataType

Enumerators
  • invalid: Invalid data type.
  • byte: Byte data type.
  • int8: 8-bit signed integer.
  • uint8: 8-bit unsigned integer.
  • int16: 16-bit signed integer.
  • uint16: 16-bit unsigned integer.
  • int32: 32-bit signed integer.
  • uint32: 32-bit unsigned integer.
  • float16: 16-bit floating point.
  • float32: 32-bit floating point.

Buffers

The memory used to store tensor data has to satisfy the following requirements:

  • Must be correctly aligned
  • Must be correctly padded
  • In some cases must be contiguous
  • Must be accessible by the NPU HW accelerator and by the CPU or other HW components

Memory allocated with malloc() or new or std::vector doesn't satisfy these requirements, so it can't be used directly as input or output of a Network. For this reason, Tensor objects use a special Buffer class to handle memory. Each tensor internally contains a default Buffer object to handle the memory used for its data.

The API provided by Buffer is similar, where possible, to the one provided by std::vector. The main notable exception is that a buffer's content can't be indexed, since a buffer is just a container for raw memory, without a data type. The data type is known by the tensor which is using the buffer. Buffer also takes care of disposing of the allocated memory when it is destroyed (RAII) to avoid any possible memory leak. The actual memory allocation is done via an additional Allocator object. This makes it possible to allocate memory with different attributes in different memory areas. When a buffer object is created it uses the default allocator, unless a different allocator is specified. The allocator can be specified directly in the constructor or later using the set_allocator() method.


Figure 8 Buffer class

In order for the buffer data to be shared by the CPU and NPU hardware, some extra operations have to be done to ensure that the CPU caches and system memory are kept consistent. All this is done automatically when the buffer content is used in the Network for inference. There are cases when the CPU is not going to read/write the buffer data directly, for example when the data is generated by another HW component (e.g. a video decoder). In these cases it's possible to obtain some performance improvement by disabling CPU access to the buffer using the method provided.

Note

It is possible to create a buffer that refers to an existing memory area instead of using an allocator. This memory must be registered with the TrustZone kernel and correctly aligned and padded. The Buffer object will not free the memory when destroyed, as the memory is owned by the SW module that allocated it.

class synaptics::synap::Buffer

Synap data buffer.

Summary
  • Buffer(Allocator *allocator = nullptr): Create an empty data buffer.
  • Buffer(size_t size, Allocator *allocator = nullptr): Create and allocate a data buffer.
  • Buffer(uint32_t mem_id, size_t offset, size_t size): Refer to an existing memory area.
  • Buffer(uint32_t handle, size_t offset, size_t size, bool is_mem_id): Refer to an existing memory area.
  • Buffer(const Buffer &rhs, size_t offset, size_t size): Refer to a part of the memory area of an existing buffer.
  • Buffer(Buffer &&rhs) noexcept: Move constructor.
  • Buffer &operator=(Buffer &&rhs) noexcept: Move assignment.
  • bool resize(size_t size): Resize buffer.
  • bool assign(const void *data, size_t size): Copy data into the buffer.
  • size_t size() const: Get actual data size.
  • const void *data() const: Get actual data.
  • bool allow_cpu_access(bool allow): Enable/disable CPU access to buffer data.
  • bool set_allocator(Allocator *allocator): Change the allocator.
Public Functions
Buffer(Allocator *allocator = nullptr)
  • Create an empty data buffer.
  • Parameters:
    • allocator: Allocator to be used (default is malloc-based).
Buffer(size_t size, Allocator *allocator = nullptr)
  • Create and allocate a data buffer.
  • Parameters:
    • size: Buffer size.
    • allocator: Allocator to be used (default is malloc-based).
Buffer(uint32_t mem_id, size_t offset, size_t size)
  • Create a data buffer to refer to an existing memory area.
  • The user must ensure that the provided memory is correctly aligned and padded. The specified memory area will not be deallocated when the buffer is destroyed. It is the responsibility of the caller to release mem_id after the Buffer has been destroyed.
  • Parameters:
    • mem_id: ID of an existing memory area registered with the TZ kernel.
    • offset: Offset of the actual data inside the memory area.
    • size: Size of the actual data.
Buffer(uint32_t handle, size_t offset, size_t size, bool is_mem_id)
  • Create a data buffer to refer to an existing memory area.
  • The user must ensure that the provided memory is correctly aligned and padded. The specified memory area will not be deallocated when the buffer is destroyed. It is the responsibility of the caller to release the handle (or mem_id) after the Buffer has been destroyed.
  • Parameters:
    • handle: FD of an existing dmabuf or mem_id registered with the TZ kernel.
    • offset: Offset of the actual data inside the memory area.
    • size: Size of the actual data.
    • is_mem_id: true if the first argument is mem_id, false if it is a FD.
Buffer(const Buffer &rhs, size_t offset, size_t size)
  • Create a data buffer that refers to a part of the memory area of an existing buffer.
  • The memory of the provided buffer must already be allocated. To avoid referring to released memory, the existing buffer memory must not be deallocated before this buffer is destroyed.
  • Parameters:
    • rhs: An existing Buffer.
    • offset: Offset of the desired data inside the Buffer memory area.
    • size: Size of the desired data.
Buffer(Buffer &&rhs) noexcept
  • Move constructor. Only possible for buffers not yet in use by a Network.
Buffer &operator=(Buffer &&rhs) noexcept
  • Move assignment. Only possible for buffers not yet in use by a Network.
bool resize(size_t size)
  • Resize buffer. Only possible if an allocator was provided. Any previous content is lost.
  • Parameters:
    • size: New buffer size.
  • Returns: true if successful.
bool assign(const void *data, size_t size)
  • Copy data in buffer. Always successful if the input data size is the same as the current buffer size; otherwise, the buffer is resized if possible.
  • Parameters:
    • data: Pointer to data to be copied.
    • size: Size of data to be copied.
  • Returns: true if successful.
size_t size() const
  • Get actual data size.
const void *data() const
  • Get actual data.
bool allow_cpu_access(bool allow)
  • Enable/disable the possibility for the CPU to read/write the buffer data.
  • By default, CPU access to data is enabled. CPU access can be disabled when the CPU doesn't need to read or write the buffer data; this can provide some performance improvement when the data is only generated/used by other HW components.
Note

Reading or writing buffer data while CPU access is disabled might cause loss or corruption of the data in the buffer.

  • Parameters:
    • allow: false to indicate the CPU will not access buffer data.
  • Returns: Current setting.
bool set_allocator(Allocator *allocator)
  • Change the allocator. Can only be done if the buffer is empty.
  • Parameters:
    • allocator: Allocator.
  • Returns: true if successful.

Allocators

Two allocators are provided for buffer objects:

  • Standard Allocator: This is the default allocator used by buffers created without explicitly specifying an allocator. The memory is paged (non-contiguous).
  • CMA Allocator: Allocates contiguous memory. Contiguous memory is required by some HW components, and can provide a small performance improvement when the input/output buffers are very large, since less overhead is required to handle memory pages. It should be used with great care, since the contiguous memory available in the system is quite limited.
Allocator *standard_allocator()
  • Returns a pointer to the system standard allocator.
Allocator *contiguous_allocator()
  • Returns a pointer to the system contiguous allocator.
important

The calls above return pointers to global objects, so they must NOT be deleted after use.

Advanced Examples

Accessing Tensor Data

Data in a Tensor is normally written using the Tensor::assign(const T* data, size_t count) method. This method will take care of any required data normalization and data type conversion from the type T to the internal representation used by the network.

Similarly, the output data is normally read using the Tensor::as_float() method, which provides a pointer to the tensor data converted to floating point values from whatever internal representation is used.

These conversions, even if quite optimized, have a runtime cost that is proportional to the size of the data. For input data, this cost can be avoided by generating the data directly in the Tensor data buffer, but this is only possible when the tensor data type corresponds to that of the available input data and no additional normalization/quantization is required. Tensor provides a type-safe data<T>() access method that returns a pointer to the data in the tensor only if the above conditions are satisfied, for example:

uint8_t* data_ptr = net.inputs[0].data<uint8_t>();
if (data_ptr) {
    custom_generate_data(data_ptr, net.inputs[0].item_count());
}

If the data in the tensor is not uint8_t or normalization/[de]quantization is required, the returned value will be nullptr. In this case, the direct write or read is not possible and assign() or as_float() is required.
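The two access paths can be combined into a simple fallback pattern; in this sketch, custom_generate_data and custom_generate_float_data are placeholders for user code, as in the earlier examples:

```cpp
// Fast path: the input tensor holds uint8 data and needs no conversion,
// so user code can write directly into the tensor data buffer.
uint8_t* data_ptr = net.inputs[0].data<uint8_t>();
if (data_ptr) {
    custom_generate_data(data_ptr, net.inputs[0].item_count());
} else {
    // Slow path: generate float data and let assign() normalize/quantize it.
    vector<float> in(net.inputs[0].item_count());
    custom_generate_float_data(in.data(), in.size());
    net.inputs[0].assign(in.data(), in.size());
}
net.predict();
```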

It's always possible to access the data directly by using the raw data() access method which bypasses all checks:

void* in_data_ptr = net.inputs[0].data();
void* out_data_ptr = net.outputs[0].data();

In the same way, it's also possible to assign raw data (without any conversion) by using a void* data pointer:

const void* in_raw_data_ptr = ....;
net.inputs[0].assign(in_raw_data_ptr, size);

In these cases, it is the responsibility of the user to know how the data is represented and how to handle them.

Setting Buffers

If the properties of the default tensor buffer are not suitable, the user can explicitly create a new buffer and use it instead of the default one. For example, suppose we want to use a buffer with contiguous memory:

Network net;
net.load_model("model.synap");

// Replace the default buffer with one using contiguous memory
Buffer cma_buffer(net.inputs[0].size(), contiguous_allocator());
net.inputs[0].set_buffer(&cma_buffer);

// Do inference as usual
custom_generate_input_data(net.inputs[0].data(), net.inputs[0].size());
net.predict();

Setting Default Buffer Properties

A simpler alternative to replacing the buffer used in a tensor as seen in the previous section is to directly change the properties of the default tensor buffer. This can only be done at the beginning, before the tensor data is accessed:

Network net;
net.load_model("model.synap");

// Use contiguous allocator for default buffer in input[0]
net.inputs[0].buffer()->set_allocator(contiguous_allocator());

// Do inference as usual
custom_generate_input_data(net.inputs[0].data(), net.inputs[0].size());
net.predict();

Buffer Sharing

The same buffer can be shared among multiple networks if they need to process the same input data. This avoids the need for redundant data copies:

Network net1;
net1.load_model("nbg1.synap");
Network net2;
net2.load_model("nbg2.synap");

// Use a common input buffer for the two networks (assume same input size)
Buffer in_buffer;
net1.inputs[0].set_buffer(&in_buffer);
net2.inputs[0].set_buffer(&in_buffer);

// Do inference as usual
custom_generate_input_data(in_buffer.data(), in_buffer.size());
net1.predict();
net2.predict();

Another interesting case of buffer sharing is when the output of a network must be processed directly by another network. For example, the first network can do some preprocessing, and the second one can perform the actual inference. In this case, setting the output buffer of the first network as the input buffer of the second network allows completely avoiding data copying (the two tensors must have the same size, of course). Furthermore, since the CPU has no need to access this intermediate data, it is convenient to disable its access to this buffer, avoiding the unnecessary overhead of cache flushing and providing an additional improvement in performance.

Network net1;
net1.load_model("nbg1.synap");
Network net2;
net2.load_model("nbg2.synap");

// Use net1 output as net2 input. Disable CPU access for better performance.
net1.outputs[0].buffer()->allow_cpu_access(false);
net2.inputs[0].set_buffer(net1.outputs[0].buffer());

// Do inference as usual
custom_generate_input_data(net1.inputs[0].data(), net1.inputs[0].size());
net1.predict();
net2.predict();

One last case is when the output of the first network is smaller than the input of the second network, and we still want to avoid copy. Imagine, for example, that the output of net1 is an image 640x360 that we want to generate inside the input of net2, which expects an image 640x480. In this case, the buffer-sharing technique shown above can't work due to the mismatch in size of the two tensors. What we need instead is to share part of the memory used by the two Buffers.

Network net2;  // Important: this has to be declared first, so it is destroyed after net1
net2.load_model("nbg2.synap");
Network net1;
net1.load_model("nbg1.synap");

// Initialize the entire destination tensor now that we still have CPU access to it
memset(net2.inputs[0].data(), 0, net2.inputs[0].size());

// Replace net1 output buffer with a new one using (part of) the memory of net2 input buffer
*net1.outputs[0].buffer() = Buffer(*net2.inputs[0].buffer(), 0, net1.outputs[0].size());

// Disable CPU access for better performance
net1.outputs[0].buffer()->allow_cpu_access(false);
net2.inputs[0].buffer()->allow_cpu_access(false);

// Do inference as usual
custom_generate_input_data(net1.inputs[0].data(), net1.inputs[0].size());
net1.predict();
net2.predict();
Note

Since net1's output tensor now uses the memory allocated by net2, it is important that net1 is destroyed before net2; otherwise it would be left pointing to deallocated memory. This limitation will be fixed in the next release.

Recycling Buffers

It is possible for the user to explicitly set at any time which buffer to use for each tensor in a network. The cost of this operation is very low compared to the creation of a new buffer, so it is possible to change the buffer associated with a tensor at each inference if desired.

Despite this, the cost of creating a buffer and setting it on a tensor for the first time is quite high, since it involves multiple memory allocations and validations. It is possible, but not recommended, to create a new Buffer at each inference; it is better to create the required buffers in advance and then just use set_buffer() to choose which one to use.

As an example, consider a case where we want to do inference on the current data while at the same time preparing the next data. The following code shows how this can be done:

Network net;
net.load_model("model.synap");

// Create two input buffers
const size_t input_size = net.inputs[0].size();
vector<Buffer> buffers { Buffer(input_size), Buffer(input_size) };

int current = 0;
custom_start_generating_input_data(&buffers[current]);
while (true) {
    custom_wait_for_input_data();

    // Do inference on current data while filling the other buffer
    net.inputs[0].set_buffer(&buffers[current]);
    current = !current;
    custom_start_generating_input_data(&buffers[current]);
    net.predict();
    custom_process_result(net.outputs[0]);
}

Using BufferCache

There are situations where the data to be processed comes from other components that provide each time a data block taken from a fixed pool of blocks. Each block can be uniquely identified by an ID or by an address. This is the case, for example, of a video pipeline providing frames.

Processing in this case should proceed as follows:

  1. Get the next block to be processed.
  2. If this is the first time we see this block, create a new Buffer for it and add it to a collection.
  3. Get the Buffer corresponding to this block from the collection.
  4. Set it as the current buffer for the input tensor.
  5. Do inference and process the result.

The collection is needed to avoid the expensive operation of creating a new Buffer each time. This is not complicated to code, but steps 2 and 3 are always the same. The BufferCache template takes care of all this. The template parameter allows specifying the type to be used to identify the received block; this can be, for example, a BlockID or directly the address of the memory area.

Note

In this case, the buffer memory is not allocated by the Buffer object. The user is responsible for ensuring that all data is properly padded and aligned. Furthermore, the buffer cache does not take ownership of the data block; it is the responsibility of the user to deallocate them in due time after the BufferCache has been deleted.
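As an illustration, the steps above can be implemented manually with an ordinary map keyed by block ID; BufferCache replaces exactly this logic. Here the use of a dmabuf FD as block ID and all custom_* functions are assumptions for the example:

```cpp
// Manual version of what BufferCache automates: wrap each externally owned
// data block in a Buffer once, then reuse that Buffer on subsequent frames.
std::map<uint32_t, Buffer> buffer_cache;

while (custom_has_more_frames()) {
    uint32_t fd;
    size_t size;
    custom_get_next_block(&fd, &size);          // 1. get next block

    auto it = buffer_cache.find(fd);
    if (it == buffer_cache.end()) {             // 2. first time seen:
        // wrap the dmabuf FD without taking ownership of its memory
        it = buffer_cache.emplace(fd, Buffer(fd, 0, size, false)).first;
    }

    net.inputs[0].set_buffer(&it->second);      // 3-4. reuse the cached Buffer
    net.predict();                              // 5. inference
    custom_process_result(net.outputs[0]);
}
```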

Copying and Moving

Network, Tensor, and Buffer objects internally access hardware resources, so they can't be copied. For example:

Network net1;
net1.load_model("model.synap");
Network net2;
net2 = net1; // ERROR, copying networks is not allowed

However, Network and Buffer objects can be moved since this has no overhead and can be convenient when the point of creation is not the point of use. Example:

Network my_create_network(string nb_name, string meta_name) {
    Network net;
    net.load_model(nb_name, meta_name);
    return net;
}

int main() {
    Network network = my_create_network("model.synap", "");
    ...
}

The same functionality is not available for Tensor objects; they can exist only inside their own Network.

NPU Locking

An application can decide to reserve the NPU for its exclusive usage. This can be useful for real-time applications that have strict latency requirements, for example video or audio stream processing.

Locking the NPU can be done at two levels:

  1. Reserve NPU access to the current process using Npu::lock().
  2. Reserve NPU for offline use only (that is, disable NPU access from NNAPI).

NPU Locking

NPU locking is done per process: once the Npu::lock() API is called, no other process can run inference on the NPU. Other processes can still load networks, but if they try to run offline or online NNAPI inference, or to lock() the NPU themselves, they will fail.

The process which has locked the NPU is the only one which has the right to unlock it. If a process with a different PID calls unlock(), the operation is ignored and has no effect.

Note

There is currently no way for a process to test if the NPU has been locked by some other process. The only possibility is to try to lock() the NPU. If this operation fails, it means that the NPU is already locked by another process or unavailable due to some failure.

Note

If the process owning the NPU lock terminates or is terminated for any reason, the lock is automatically released.

NNAPI Locking

A process can reserve the NPU for offline use only, so that nobody can run online inference on the NPU via NNAPI. Other processes will still be able to run offline inference on the NPU. SyNAP has no dedicated API for this: NNAPI can be disabled by setting the property vendor.NNAPI_SYNAP_DISABLE to 1 using the standard Android API __system_property_set() or android::base::SetProperty(). See the setprop.cpp file in the Android source code for sample code.

See also: Disabling NPU Usage from NNAPI
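From native code, setting the property can be sketched as follows (the process must have permission to set the property for this to succeed):

```cpp
#include <sys/system_properties.h>

// Reserve the NPU for offline use: disable NNAPI access to the NPU.
// __system_property_set() returns 0 on success.
if (__system_property_set("vendor.NNAPI_SYNAP_DISABLE", "1") != 0) {
    // Property could not be set: check permissions / SELinux policy
}
```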

Note

It will still be possible to perform online inference on the NPU using the timvx tflite delegate.

Description

The Npu class controls the locking and unlocking of the NPU. Normally only one object of this class needs to be created when the application starts and destroyed when the application is going to terminate.


Figure 9 NPU class

class synaptics::synap::Npu

Reserve NPU usage.

Summary
  • bool available() const: Check if the NPU is successfully initialized.
  • bool lock(): Lock exclusive right to perform inference for the current process.
  • bool unlock(): Release exclusive right to perform inference.
  • bool is_locked() const: Check if the NPU lock is currently owned.
Public Functions
bool available() const
  • Check if the NPU is successfully initialized.
  • Returns: true if the NPU is successfully initialized.
bool lock()
  • Lock exclusive right to perform inference for the current process.
  • All other processes attempting to execute inference will fail, including those using NNAPI.
    The lock will stay active until unlock() is called or the Npu object is deleted.
  • Returns:
    • true if NPU is successfully locked. Calling this method on an Npu object that is already locked has no effect; just returns true.
    • false if NPU is unavailable or locked by another process.
bool unlock()
  • Release exclusive right to perform inference.
  • Returns:
    • true if successful. Calling this method on an Npu object that is not locked has no effect; just returns true.
bool is_locked() const
  • Check if the NPU lock is currently owned.
  • Note: The only way to test if the NPU is locked by someone else is to try to lock() it.
  • Returns: true if we currently own the NPU lock.
struct Private
  • Npu private implementation.
Note

The Npu class uses the RAII technique: when an object of this class is destroyed while holding the NPU lock, the NPU is automatically unlocked. This helps ensure that the NPU is always unlocked when a program terminates.
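A minimal usage sketch based on the API above (error handling reduced to a comment; `net` is assumed to be a Network with a loaded model):

```cpp
Npu npu;
if (!npu.lock()) {
    // NPU unavailable, or already locked by another process
    return;
}

// This process now has exclusive inference rights
net.predict();

// Explicit unlock is optional: the Npu destructor also releases the lock (RAII)
npu.unlock();
```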

Sample Usage

The following diagrams show some example use of the NPU locking API.


Figure 10 Locking the NPU


Figure 11 Locking and inference


Figure 12 Locking NNAPI


Figure 13 Automatic lock release

Preprocessing and Postprocessing

When using neural networks, the input and output data are rarely used in their raw format. Most often, data conversion has to be performed on the input data to make them match the format expected by the network. This step is called preprocessing.

Examples of preprocessing in the case of an image are:

  • Scale and/or crop the input image to the size expected by the network.
  • Convert planar to interleaved or vice-versa.
  • Convert RGB to BGR or vice-versa.
  • Apply mean and scale normalization.

These operations can be performed using the NPU at inference time by enabling preprocessing when the model is converted using the SyNAP Toolkit, or they can be performed in software when the data is assigned to the Network.

Similarly, the inference results contained in the network output tensor(s) normally require further processing to make the result usable. This step is called postprocessing. In some cases, postprocessing can be a non-trivial step in both complexity and computation time.

Examples of postprocessing are:

  • Convert quantized data to floating point representation.
  • Analyze the network output to extract the most significant elements.
  • Combine the data from multiple output tensors to obtain a meaningful result.

The classes in this section are not part of the SyNAP API; they are intended mainly as utility classes that can help in writing SyNAP applications by combining the three usual steps of preprocess-inference-postprocess just explained.

Full source code is provided, so they can be used as a reference implementation for the user to extend.

InputData Class

The main role of the InputData class is to wrap the actual input data and complement it with additional information to specify what the data represents and how it is organized. The current implementation is mainly focused on image data.

InputData functionality includes:

  • Reading raw files (binary).
  • Reading and parsing images (jpeg or png) from file or memory.
  • Getting image attributes, e.g., dimensions and layout.

The input filename is specified directly in the constructor and can't be changed. As an alternative to a filename, it is also possible to specify a memory address in case the content is already available in memory.

Note

No data conversion is performed. Even for jpeg or png images, the data is kept in its original form.


Figure 14 InputData class

Example:

Network net;
net.load_model("model.synap");
InputData image("sample_rgb_image.dat");
net.inputs[0].assign(image.data(), image.size());
net.predict();
custom_process_result(net.outputs[0]);

Preprocessor Class

This class takes as input an InputData object and assigns its content to the input Tensor(s) of a network by performing all the necessary conversions. The conversion(s) required are determined automatically by reading the attributes of the tensor itself.

Supported conversions include:

  • Image decoding (jpeg, png, or nv21 to rgb)
  • Layout conversion: nchw to nhwc or vice-versa
  • Format conversion: rgb to bgr or grayscale
  • Image cropping (if preprocessing with cropping enabled in the compiled model)
  • Image rescaling to fit the tensor dimensions

The conversion (if needed) is performed when an InputData object is assigned to a Tensor.

Cropping is only performed if enabled in the compiled model and the multi-tensor assign API is used: Preprocessor::assign(Tensors& ts, const InputData& data).

Rescaling by default preserves the aspect ratio of the input image. If the destination tensor is taller than the rescaled input image, gray bands are added at the top and bottom. If the destination tensor is wider than the rescaled input image, gray bands are added at the left and right. It is possible to configure the gray level of the fill using the fill_color=N option in the format string of the input tensor, where N is an integer between 0 (black) and 255 (white).

The preservation of the aspect ratio can be disabled by specifying the keep_proportions=0 option in the format string of the input tensor. In this case, the input image is simply resized to match the size of the tensor.
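For example, these rescaling options could be set in the input format string of the conversion metafile (an illustrative fragment; the rgb format token and surrounding fields depend on the actual model):

```yaml
inputs:
  # Keep the aspect ratio (default) and fill the bands with mid-gray:
  - format: rgb fill_color=128
  # Or stretch the image to the tensor size, ignoring the aspect ratio:
  # - format: rgb keep_proportions=0
```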

Note

The Preprocessor class performs preprocessing using the CPU. If the conversion to be done is known in advance, it may be convenient to perform it on the NPU instead by adding a preprocessing layer when the network is converted; see Preprocessing.

ImagePostprocessor Class

ImagePostprocessor functionality includes:

  • Reading the content of a set of Tensors.
  • Converting the raw content of the Tensors to a standard representation (currently only nv21 is supported). The format of the raw content is determined automatically by reading the attributes of the tensors themselves. For example, in some super-resolution networks, the different components of the output image (y, uv) are provided in separate outputs. The converted data is made available in a standard vector.
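For the super-resolution case above, assembling the converted planes into a single NV21 buffer amounts to concatenating the full-resolution Y plane with the half-resolution interleaved chroma plane. A minimal self-contained sketch (not the actual SyNAP implementation; real plane layouts are read from the tensor attributes):

```cpp
#include <cstdint>
#include <vector>

// Assemble an NV21 buffer from a separate Y plane and an interleaved
// V/U plane, as a super-resolution network might produce them in two
// output tensors. NV21 is simply the Y plane followed by the VU plane.
std::vector<uint8_t> assemble_nv21(const std::vector<uint8_t>& y_plane,
                                   const std::vector<uint8_t>& vu_plane)
{
    std::vector<uint8_t> nv21;
    nv21.reserve(y_plane.size() + vu_plane.size());
    nv21.insert(nv21.end(), y_plane.begin(), y_plane.end());
    nv21.insert(nv21.end(), vu_plane.begin(), vu_plane.end());
    return nv21;
}
```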

network15

Figure 15 ImagePostprocessor class

Example:

Preprocessor preprocessor;
Network net;
ImagePostprocessor postprocessor;

net.load_model("model.synap");
InputData image("sample_image.jpg");
preprocessor.assign(net.inputs[0], image);
net.predict();
// Convert to nv21
ImagePostprocessor::Result out_image = postprocessor.process(net.outputs);
binary_file_write("out_file.nv21", out_image.data.data(), out_image.data.size());

Classifier Class

The Classifier class is a postprocessor for the common use case of image classification networks.

There are just two things that can be done with a classifier:

  • Initialize it.
  • Process network outputs: this will return a list of possible classifications sorted in order of decreasing confidence, each containing the following information:
    • class_index
    • confidence

network16

Figure 16 Classifier class

class synaptics::synap::Classifier

Classification post-processor for Network output tensors.

Determine the top-N classifications of an image.

Summary
  • inline Classifier(size_t top_count = 1): Constructor to initialize the classifier.
  • Result process(const Tensors &tensors): Perform classification on network output tensors.
Public Functions
inline Classifier(size_t top_count = 1)
  • Constructor.
  • Parameters:
    • top_count: Number of most probable classifications to return.
Result process(const Tensors &tensors)
  • Perform classification on network output tensors.
  • Parameters:
    • tensors: Output tensors of the network; tensors[0] is expected to contain a list of confidences, one for each image class.
  • Returns: Classification results.
struct Result
  • Classification result.
Public Members of Result
bool success = {}
  • True if classification successful, false if failed.
std::vector<Item> items
  • List of possible classifications for the input, sorted in descending confidence order; that is, items[0] is the classification with the highest confidence.
  • Empty if classification failed.
struct Item

Classification item.

Public Members of Item
int32_t class_index
  • Index of the class.
float confidence
  • Confidence of the classification, normally in the range [0, 1].

Example:

Preprocessor preprocessor;
Network net;
Classifier classifier(5);
net.load_model("model.synap");
InputData image("sample_image.jpg");
preprocessor.assign(net.inputs[0], image);
net.predict();
Classifier::Result top5 = classifier.process(net.outputs);
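Conceptually, process() reduces to selecting the top-N entries of the confidence vector. The selection step can be sketched as follows (a self-contained illustration, not the actual SyNAP implementation; ClassificationItem mirrors the documented Item fields):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

struct ClassificationItem { int32_t class_index; float confidence; };

// Return the top_count most probable classes from a vector of confidences,
// sorted by decreasing confidence (mirrors Classifier::Result::items).
std::vector<ClassificationItem> top_n(const std::vector<float>& confidences,
                                      size_t top_count)
{
    std::vector<int32_t> idx(confidences.size());
    std::iota(idx.begin(), idx.end(), 0);
    size_t n = std::min(top_count, idx.size());
    std::partial_sort(idx.begin(), idx.begin() + n, idx.end(),
                      [&](int32_t a, int32_t b) { return confidences[a] > confidences[b]; });
    std::vector<ClassificationItem> items;
    for (size_t i = 0; i < n; ++i)
        items.push_back({idx[i], confidences[idx[i]]});
    return items;
}
```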

The standard content of the output tensor of a classification network is a list of probabilities, one for each class on which the model has been trained (possibly including an initial element to indicate a "background" or "unrecognized" class). In some cases, the final SoftMax layer of the model is cut away to improve inference time: in this case, the output values can't be interpreted as probabilities anymore but since SoftMax is monotonic this doesn't change the result of the classification. The postprocessing can be parametrized using the format field of the corresponding output in the Conversion metafile:

Format Type       Out#  Shape  Description
confidence_array  0     NxC    List of probabilities, one per class

Attribute         Default  Description
class_index_base  0        Class index corresponding to the first element of the output vector

Where:

  • N: Number of samples, must be 1.
  • C: Number of recognized classes.

Detector Class

The Detector class is a postprocessor for the common use case of object detection networks. Here object is a generic term that can refer to actual objects, people, or anything used to train the network.

There are just two things that can be done with a detector:

  • Initialize it.
  • Run a detection: this will return a list of detection items, each containing the following information:
    • class_index
    • confidence
    • bounding box
    • landmarks (optional)

network17

Figure 17 Detector class

class synaptics::synap::Detector

Object-detector.

The output format of object-detection networks is not always the same but depends on the network architecture used. The format type must be specified in the format field of the output tensor in the network metafile when the network is compiled.
The following formats are currently supported: retinanet_boxes, tflite_detection_input, tflite_detection, yolov5, yolov8.

Summary
  • Detector(float score_threshold = 0.5, int n_max = 0, bool nms = true, float iou_threshold = .5, bool iou_with_min = false): Constructor.
  • bool init(const Tensors &tensors): Initialize detector.
  • Result process(const Tensors &tensors, const Rect &input_rect): Perform detection on network output tensors.

Public Functions
Detector(float score_threshold = 0.5, int n_max = 0, bool nms = true, float iou_threshold = .5, bool iou_with_min = false)
  • Constructor.
  • Parameters:
    • score_threshold: Detections below this score are discarded.
    • n_max: Maximum number of detections (0: all).
    • nms: If true, apply non-max-suppression to remove duplicate detections.
    • iou_threshold: Intersection-over-union threshold (used if nms is true).
    • iou_with_min: Use min area instead of union to compute intersection-over-union.
bool init(const Tensors &tensors)
  • Initialize detector. If not called, the detector is automatically initialized the first time process() is called.
  • Parameters:
    • tensors: Output tensors of the network (after the network has been loaded).
  • Returns: true if successful.
Result process(const Tensors &tensors, const Rect &input_rect)
  • Perform detection on network output tensors.
  • Parameters:
    • tensors: Output tensors of the network.
    • input_rect: Coordinates of the (sub)image provided in input (to compute bounding boxes).
  • Returns: Detection results.
class Impl

Subclassed by DetectorBoxesScores, DetectorTfliteODPostprocessOut, DetectorYoloBase, DetectorYolov5Pyramid

struct Result

Object-detector result.

Public Members of struct Result
bool success = {}
  • True if detection successful, false if detection failed.
std::vector<Item> items
  • One entry for each detection.
  • Empty if nothing detected or detection failed.
struct Item

Detection item.

Public Members of struct Item
int32_t class_index
  • Index of the object class.
float confidence
  • Confidence of the detection, in the range [0, 1].
Rect bounding_box
  • Top-left corner plus horizontal and vertical size (in pixels).
std::vector<Landmark> landmarks
  • One entry for each landmark.
  • Empty if no landmark available.

Example:

Preprocessor preprocessor;
Network net;
Detector detector;
net.load_model("model.synap");
InputData image("sample_image.jpg");
Rect image_rect;
preprocessor.assign(net.inputs[0], image, &image_rect);
net.predict();
Detector::Result objects = detector.process(net.outputs, image_rect);

The rectangle argument passed to the process() method is needed so that it can compute bounding boxes and landmarks in coordinates relative to the original image, even if the image has been resized and/or cropped during the assignment to the network input tensor.

Postprocessing consists of the following steps:

  • For each possible position in the input grid, compute the score of the highest class there.
  • If this score is too low, nothing is detected at that position.
  • If above the detection threshold, compute the actual bounding box of the object by combining information about the anchor's location, the regressed deltas from the network, and the actual size of the input image.
  • Once all the detections have been computed, filter them with the Non-Max-Suppression (NMS) algorithm to discard spurious overlapping detections and keep only the one with the highest score at each position. The NMS filter applies only to bounding boxes whose overlap is above a minimum threshold. The overlap itself is computed using the Intersection-over-Union (IoU) formula. To provide more filtering for boxes of different sizes, the union area is sometimes replaced by the minimum of the two areas in the computation. The SyNAP Detector implements both formulas.
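The overlap computation and the greedy suppression loop can be sketched as follows (a self-contained illustration of the technique, not the actual SyNAP implementation; the Box struct and function names are hypothetical):

```cpp
#include <algorithm>
#include <vector>

struct Box { float x, y, w, h; float score; };

static float area(const Box& b) { return b.w * b.h; }

// Overlap between two boxes: intersection divided by either the union of
// the two areas (standard IoU) or the smaller of the two areas, matching
// the iou_with_min variant of the Detector constructor.
float overlap(const Box& a, const Box& b, bool use_min)
{
    float ix = std::max(0.0f, std::min(a.x + a.w, b.x + b.w) - std::max(a.x, b.x));
    float iy = std::max(0.0f, std::min(a.y + a.h, b.y + b.h) - std::max(a.y, b.y));
    float inter = ix * iy;
    float denom = use_min ? std::min(area(a), area(b))
                          : area(a) + area(b) - inter;
    return denom > 0 ? inter / denom : 0.0f;
}

// Greedy non-max-suppression: keep the highest-scoring box, then drop any
// box whose overlap with an already-kept box exceeds iou_threshold.
std::vector<Box> nms(std::vector<Box> boxes, float iou_threshold, bool use_min)
{
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.score > b.score; });
    std::vector<Box> kept;
    for (const Box& candidate : boxes) {
        bool suppressed = false;
        for (const Box& k : kept)
            if (overlap(candidate, k, use_min) > iou_threshold) { suppressed = true; break; }
        if (!suppressed) kept.push_back(candidate);
    }
    return kept;
}
```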

The content of the output tensor(s) from an object detection network is not standardized. Several formats exist for the major families of detection networks, with variants inside each family. The information contained is always the same; what changes is the way they are organized. The Detector class currently supports the following output formats:

  • retinanet_boxes
  • tflite_detection_input
  • tflite_detection
  • yolov5
  • yolov8

The desired label from the above list must be placed in the "format" field of the first output tensor of the network in the conversion metafile so the Detector knows how to interpret the output.

  • retinanet_boxes is the output format used by Synaptics sample detection networks (e.g., mobilenet224_full80 for COCO detection and mobilenet224_full1 for people detection).

  • tflite_detection_input is the format of the input tensors of the TFLite_Detection_PostProcess layer, used for example in the ssd_mobilenet_v1_1_default_1.tflite object-detection model.

This format is used when the TFLite_Detection_PostProcess layer is removed from the network at conversion time and the corresponding postprocessing algorithm is performed in software.

In both cases above the model has two output tensors: the first is a regression tensor containing the bounding box deltas for the highest-score detected object at each position of the input grid; the second is the classification tensor, containing for each class the confidence that this class is present at the corresponding position of the input grid.

  • tflite_detection is the format of the output tensors of the TFLite_Detection_PostProcess layer, used for example in the ssd_mobilenet_v1_1_default_1.tflite object-detection model.

  • yolov5 is the output format used by models derived from the well-known yolov5 architecture. In this case the model has a single output 3D tensor organized as a list of detections, where each detection contains the following fields:

    • bounding box deltas (x, y, w, h)
    • overall confidence for this detection
    • landmarks deltas (x, y) if supported by the model
    • confidence vector, one entry per class
  • yolov8 is the output format used by models derived from the yolov8 architecture, the most recent update to the yolo family. The organization of the output tensor is very similar to that of yolov5 above; the only difference is that the overall confidence field is missing.
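One detection row of the yolov5 layout described above can be unpacked as follows (a self-contained sketch with the field order as documented; the struct and function names are illustrative):

```cpp
#include <cstddef>
#include <vector>

struct YoloRow {
    float x, y, w, h;                 // bounding box deltas
    float confidence;                 // overall detection confidence
    std::vector<float> landmarks;     // (x, y) pairs, if the model has them
    std::vector<float> class_scores;  // one confidence per class
};

// Unpack one yolov5 detection row of size
// D = 4 (bbox) + 1 (confidence) + 2*num_landmarks + num_classes.
YoloRow parse_yolov5_row(const float* row, size_t num_landmarks, size_t num_classes)
{
    YoloRow r;
    r.x = row[0]; r.y = row[1]; r.w = row[2]; r.h = row[3];
    r.confidence = row[4];
    const float* p = row + 5;
    r.landmarks.assign(p, p + 2 * num_landmarks);
    p += 2 * num_landmarks;
    r.class_scores.assign(p, p + num_classes);
    return r;
}
```

For yolov8 the same unpacking would apply with the overall confidence field removed and the offsets shifted accordingly.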

In some cases the final layers of the model can be executed more efficiently on the CPU, so they are cut away when the model is generated or compiled with the SyNAP Toolkit. In this case the network has one output tensor for each item of the image pyramid (normally 3), and each output is a 4D or 5D tensor whose layout depends on where exactly the model has been cut.

The SyNAP Detector is able to deduce the layout used automatically; it only requires an indication of whether the information in the tensor is transposed.

Format Type             Out#    Shape        Description                        Notes
retinanet_boxes         0       Nx4          Bounding box deltas
                        1       NxC          Per-class probability
tflite_detection_input  0       Nx4          Bounding box deltas
                        1       NxC          Per-class probability
tflite_detection        0       NxMx4        Bounding boxes
                        1       NxM          Index of detected class
                        2       NxM          Score of detected class
                        3       1            Actual number of detections
yolov5                  0..P-1  NxTxD        Processing done in the model
                                NxHxWxAxD    One 5D tensor per pyramid element
                                NxHxWx(A*D)  One 4D tensor per pyramid element
                                NxAxHxWxD    One 5D tensor per pyramid element  Requires transposed=1
                                NxAxDxHxW    One 5D tensor per pyramid element  Requires transposed=1
                                Nx(A*D)xHxW  One 4D tensor per pyramid element  Requires transposed=1
yolov8                  0       NxTxD        Processing done in the model       Overall confidence missing

Where:

  • N: number of samples, must be 1
  • C: number of classes detected
  • T: total number of detections
  • M: maximum number of detections
  • D: detection size (includes: bounding box deltas xywh, confidence, landmarks, per-class confidences)
  • A: number of anchors
  • H: height of the image in the pyramid
  • W: width of the image in the pyramid
  • P: number of images in the pyramid

Attributes for retinanet_boxes and tflite_detection_input Formats

Attribute         Default  Description
class_index_base  0        Class index corresponding to the first element of the output vector
transposed        0        Must be 1 if the output tensor uses the transposed format
anchors                    Anchor points
x_scale           10       See x_scale parameter in the TFLite_Detection_PostProcess layer
y_scale           10       See y_scale parameter in the TFLite_Detection_PostProcess layer
h_scale           5        See h_scale parameter in the TFLite_Detection_PostProcess layer
w_scale           5        See w_scale parameter in the TFLite_Detection_PostProcess layer

In this case, the anchor points can be defined using the built-in variable ${ANCHORS}:

anchors=${ANCHORS}

This variable is replaced at conversion time with the content of the anchor tensor from the TFLite_Detection_PostProcess layer (if present in the model).

Attributes for tflite_detection Format

Attribute         Default  Description
class_index_base  0        Class index corresponding to the first element of the output vector
h_scale           0        Vertical scale of the detected boxes (normally the H of the input tensor)
w_scale           0        Horizontal scale of the detected boxes (normally the W of the input tensor)

Attributes for yolov5 and yolov8 Formats

Attribute         Default  Description
class_index_base  0        Class index corresponding to the first element of the output vector
transposed        0        Must be 1 if the output tensor uses the transposed format
landmarks         0        Number of landmark points
anchors           0        Anchor points. Not needed if processing is done in the model
h_scale           0        Vertical scale of the detected boxes (normally the H of the input tensor when processing is done in the model)
w_scale           0        Horizontal scale of the detected boxes (normally the W of the input tensor when processing is done in the model)
bb_normalized     0        Must be 1 if the bounding box deltas are normalized (yolov8 only): bounding boxes are in the range [0, 1] while landmarks are in the range h_scale, w_scale

For yolov5 format, the anchors attribute must contain one entry for each pyramid element from P0, where each entry is a list of the x,y anchor deltas. For example for yolov5s-face, the anchors are defined in yolov5s.yaml:

- [4,5,  8,10,  13,16]  # P3/8
- [23,29, 43,55, 73,105] # P4/16
- [146,217, 231,300, 335,433] # P5/32

The corresponding outputs in the metafile can be defined as follows:

outputs:
- format: yolov5 landmarks=5 anchors=[[],[],[],[4,5,8,10,13,16],[23,29,43,55,73,105],[146,217,231,300,335,433]]
dequantize: true
- dequantize: true
- dequantize: true

Building Sample Code

The source code of the sample applications (e.g. synap_cli, synap_cli_ic, etc.) is included in the SyNAP release, together with that of the SyNAP libraries. Users based on the ASTRA distribution can build SyNAP using the provided Yocto recipe.

For other users, building SyNAP code requires the following components installed:

  1. VSSDK tree
  2. cmake

Build steps

cd synap/src
mkdir build
cd build
cmake -DVSSDK_DIR=/path/to/vssdk-directory -DCMAKE_INSTALL_PREFIX=install ..
make install

The above steps will create the binaries for the sample applications in synap/src/build/install/bin. The binaries can then be pushed to the board using adb:

cd synap/src/build/install/bin
adb push synap_cli_ic /vendor/bin

Users are free to change the source code provided to adapt it to their specific requirements.