# AI Series: Model Formats
2023-11-04
As ML model applications increase, so too does the need to optimise models for specific use-cases. Many model formats have emerged recently to address the problems of high inference cost and poor portability.
Table 8: Comparison of popular model formats
| Feature | ONNX | GGML | TensorRT |
|---|---|---|---|
| Ease of Use | 🟢 good | 🟡 moderate | 🟡 moderate |
| Integration with Deep Learning Frameworks | 🟢 most | 🟡 growing | 🟡 growing |
| Deployment Tools | 🟢 yes | 🔴 no | 🟢 yes |
| Interoperability | 🟢 yes | 🔴 no | 🔴 no |
| Inference Boost | 🟡 moderate | 🟢 good | 🟢 good |
| Quantisation Support | 🟡 good | 🟢 good | 🟡 moderate |
| Custom Layer Support | 🟢 yes | 🔴 limited | 🟢 yes |
| Maintainer | LF AI & Data Foundation | ggerganov | NVIDIA |
Table 9: Model Formats Repository Statistics
| Repository | Commit Rate | Stars | Contributors | Issues | Pull Requests |
|---|---|---|---|---|---|
| ggerganov/ggml | | | | | |
| ggerganov/llama.cpp | | | | | |
| onnx/onnx | | | | | |
| microsoft/onnxruntime | | | | | |
| nvidia/tensorrt | | | | | |
Based on the above stats, GGML currently looks like the most popular library, followed by ONNX. Note, however, that the ONNX repositories are roughly 9x older than the GGML ones.
ONNX feels truly open source, since it is run by an OSS community, whereas GGML (and friends) and TensorRT are run by organisations. Even though all of them are open source and can have great communities, for the latter two the final decisions are made by a single (sometimes closed) entity, which can affect which features that entity prefers or is biased towards.
## ONNX
ONNX (Open Neural Network Exchange) provides an open-source format for AI models by defining an extensible computation graph model, built-in operators, and standard data types. It is widely supported and can be found in many frameworks, tools, and hardware, enabling interoperability between different frameworks. ONNX is an intermediary representation of your model that lets you easily switch between environments.
### Features and Benefits
Fig. 56: https://cms-ml.github.io/documentation/inference/onnx.html
- Model Interoperability: ONNX bridges AI frameworks, allowing seamless model transfer between them and eliminating the need for complex conversions.
- Computation Graph Model: ONNX's core is a computation graph model, representing AI models as directed graphs with nodes for operations, offering flexibility.
- Standardised Data Types: ONNX establishes standard data types, ensuring consistency when exchanging models and reducing data type issues.
- Built-in Operators: ONNX boasts a rich library of built-in operators for common AI tasks, enabling consistent computation across frameworks.
- ONNX Ecosystem:
  - microsoft/onnxruntime: a high-performance inference engine for cross-platform ONNX models.
  - onnx/onnxmltools: tools for ONNX model conversion and compatibility with frameworks like TensorFlow and PyTorch.
  - onnx/models: a repository of pre-trained models converted to ONNX format for various tasks.
  - Hub: helps with sharing and collaborating on ONNX models within the community.
### Usage
Tooling around ONNX is fairly mature, with lots of community support. Let's see how we can export a model directly to ONNX and make use of it.
First, the model needs to be converted to ONNX format using a relevant converter. For example, if our model was created using PyTorch, for conversion we can use:

- torch.onnx, the built-in PyTorch ONNX exporter (see the sketch below)
- the same exporter can also be used for custom operators support
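As a minimal sketch (the model, file name, and shapes below are placeholder assumptions, not part of the original tutorial), exporting a PyTorch model with `torch.onnx.export` looks roughly like this:

```python
import torch
import torchvision

# Any torch.nn.Module works; a pretrained ResNet-18 stands in as an example here
model = torchvision.models.resnet18(weights="DEFAULT").eval()

# The exporter traces the model with a dummy input of the expected shape
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "your_awesome_model.onnx",   # output path reused in the snippets below
    input_names=["input"],
    output_names=["output"],
    opset_version=17,            # pick an opset supported by your target runtime
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)
```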
Once exported we can load, manipulate, and run ONNX models. Let's take a Python example.
To install the official `onnx` Python package:
pip install onnx
To load, manipulate, and run ONNX models in your Python applications:
import onnx
# Load an ONNX model
model = onnx.load("your_awesome_model.onnx")
# Perform inference with the model
# (Specific inference code depends on your application and framework)
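For actual inference, a common choice is `onnxruntime`; below is a sketch assuming the `"input"`/`"output"` names and shapes from the export example above:

```python
import numpy as np
import onnxruntime as ort

# Create an inference session (defaults to the CPU execution provider)
session = ort.InferenceSession("your_awesome_model.onnx")

# Run the model on a dummy batch; names must match those used at export time
dummy = np.random.randn(1, 3, 224, 224).astype(np.float32)
outputs = session.run(["output"], {"input": dummy})
print(outputs[0].shape)
```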
### Support
Many frameworks/tools are supported, with many examples/tutorials at onnx/tutorials.
It supports inference runtime binding APIs written in several programming languages (Python, Rust, JS, Java, C#).
Inference of an ONNX model depends on platform support in the runtime library, called an Execution Provider. There are currently a few, ranging from CPU-based and GPU-based to IoT/edge-based and others. A full list can be found here.
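For instance, with `onnxruntime` you request Execution Providers in priority order; this sketch assumes the CUDA provider is installed and otherwise falls back to CPU:

```python
import onnxruntime as ort

# Providers available in this onnxruntime build
print(ort.get_available_providers())

# Prefer CUDA, fall back to CPU if it is unavailable
session = ort.InferenceSession(
    "your_awesome_model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # providers actually selected for this session
```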
Onnxruntime has a few example tools that can be used to quantise select ONNX models. Support is currently based on the operators in the model. Read more here.
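As a rough illustration (file names are placeholders), `onnxruntime`'s dynamic quantisation converts model weights to 8-bit integers:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Weights are quantised to int8 ahead of time; activations are quantised at runtime
quantize_dynamic(
    model_input="your_awesome_model.onnx",
    model_output="your_awesome_model.int8.onnx",
    weight_type=QuantType.QInt8,
)
```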
There is also visualisation tooling such as lutzroeder/Netron (and more) for models converted to ONNX format, highly recommended for debugging purposes.
### Future
ONNX is currently part of the LF AI & Data Foundation, conducts regular Steering Committee meetings, and community meetups are held at least once a year. A few notable presentations from this year's meetup:
- Analysis of Failures and Risks in Deep Learning Model Converters: A Case Study in the ONNX Ecosystem.
- On-Device Training with ONNX Runtime: enabling training models on edge devices without the data ever leaving the device.

Check out the full list here.
### Limitations
ONNX uses an opset (operator set) number which changes with each minor/major release of the ONNX package; new opsets usually introduce new operators. The proper opset needs to be used while creating the ONNX model graph.
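A quick way to check which opsets a model was exported against (a small sketch using the `onnx` package):

```python
import onnx

model = onnx.load("your_awesome_model.onnx")
# Each entry maps an operator domain (empty string = default ONNX domain) to an opset version
for opset in model.opset_import:
    print(opset.domain or "ai.onnx", opset.version)
```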
Also it currently doesn’t support 4-bit quantisation (microsoft/onnxruntime#14997).
There are many open issues (microsoft/onnxruntime#12880, #10303, #7233, #17116) where users report slower inference after converting their models to ONNX format compared to the base model format, which shows that conversion might not be easy for all models. On similar grounds, a user commented here 3 years ago; though it's old, a few points still seem relevant. The troubleshooting guide by the ONNX Runtime community can help with commonly faced issues.
The use of Protobuf for storing/reading ONNX models also seems to cause a few limitations, which are discussed here.
There’s a detailed failure analysis (video, ppt) done by James C. Davis and Purvish Jajal on ONNX converters.
Fig. 57: Analysis of Failures and Risks in Deep Learning Model Converters [143]
Some of the main findings include:

- Crashes (56%) and wrong models (33%) are the most common symptoms.
- The most common causes of failure are incompatibility and type problems, each accounting for roughly 25% of causes.
- Most failures occur during the node conversion stage (74%), with another 10% occurring during graph optimisation (mostly from tf2onnx).
See also
- webonnx/wonnx (GPU-based ONNX inference runtime in Rust)
## GGML
ggerganov/ggml is a tensor library for machine learning to enable large models and high performance on commodity hardware – the “GG” refers to the initials of its originator Georgi Gerganov. In addition to defining low-level machine learning primitives like a tensor type, GGML defines a binary format for distributing large language models (LLMs). llama.cpp and whisper.cpp are based on it.
### Features and Benefits
Some features of the ggerganov/ggml library:

- Written in C
- 16-bit float support
- Integer quantisation support (e.g. 4-bit, 5-bit, 8-bit)
- Automatic differentiation
- Built-in optimisation algorithms (e.g. ADAM, L-BFGS)
- Optimised for Apple Silicon; uses AVX / AVX2 intrinsics on x86 architectures
- Web support via WebAssembly and WASM SIMD
- No third-party dependencies
- Zero memory allocations during runtime
To learn more, see their manifesto here.
### Usage
Overall, GGML is moderately easy to use: it is a fairly new project still under active development, but it already has a lot of community support.
Here's an example of GPT-2 inference with GGML:
git clone https://github.com/ggerganov/ggml
cd ggml
mkdir build && cd build
cmake ..
make -j4 gpt-2
# Run the GPT-2 small 117M model
../examples/gpt-2/download-ggml-model.sh 117M
./bin/gpt-2 -m models/gpt-2-117M/ggml-model.bin -p "This is an example"
### Working
For usage, the model should be saved in the GGML file format: binary-encoded data laid out in a particular way, specifying what kind of data is present in the file, how it is represented, and the order in which it appears.
To create a valid GGML file, the following pieces of information must be present in this order:

- GGML version number: GGML uses versioning to support rapid development while maintaining backwards compatibility. The first value in a valid GGML file is a "magic number" indicating the GGML version used. Here's a GPT-2 conversion example where it's getting written.
- Components of LLMs:
  - Hyperparameters: parameters that configure the behaviour of the model. A valid GGML file lists these values in the correct order, represented using the correct data types. Here's an example for GPT-2.
  - Vocabulary: all the tokens the model supports. Here's an example for GPT-2.
  - Weights: also known as the parameters of the model. In the GGML format, a tensor consists of:
    - a name
    - a 4-element list representing the number of dimensions of the tensor and their lengths
    - the list of weights in the tensor
// Let’s consider the following weights:
weight_1 = [[0.334, 0.21], [0.0, 0.149]]
weight_2 = [0.123, 0.21, 0.31]
// Then GGML representation would be:
{"weight_1", [2, 2, 1, 1], [0.334, 0.21, 0.0, 0.149]}
{"weight_2", [3, 1, 1, 1], [0.123, 0.21, 0.31]}
For each weight representation, the first list denotes the dimensions and the second list denotes the weights. The dimensions list uses 1 as a placeholder for unused dimensions.
### Quantisation
Quantisation is a process where high-precision floating point values are converted to low-precision values. This reduces the resources required to work with the values in a tensor, making the model easier to run on limited hardware. GGML uses a hacky version of quantisation and supports a number of different quantisation strategies (e.g. 4-bit, 5-bit, and 8-bit quantisation), each of which offers different trade-offs between efficiency and performance. Check out this excellent article by Merve for a quick walkthrough.
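To build some intuition (this is a toy example, not GGML's actual quantisation scheme), here is a sketch of block-wise symmetric quantisation of floats down to a 4-bit range plus one scale per block:

```python
import numpy as np

def quantise_block_q4(block: np.ndarray):
    # One scale per block so the largest magnitude maps into the signed 4-bit range [-7, 7]
    scale = float(np.abs(block).max()) / 7.0 or 1.0  # fall back to 1.0 for an all-zero block
    q = np.clip(np.round(block / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantise_block_q4(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

weights = np.random.randn(32).astype(np.float32)   # one block of 32 weights
q, scale = quantise_block_q4(weights)
print("max abs error:", np.abs(weights - dequantise_block_q4(q, scale)).max())
```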
### Support
Its most used projects include:
- whisper.cpp: high-performance inference of OpenAI's Whisper automatic speech recognition model. The project provides a high-quality speech-to-text solution that runs on Mac, Windows, Linux, iOS, Android, Raspberry Pi, and the Web. Used by rewind.ai. An optimised version for Apple Silicon is also available as a Swift package.
- llama.cpp: inference of Meta's LLaMA large language model. The project demonstrates efficient inference on Apple Silicon hardware and explores a variety of optimisation techniques and applications of LLMs.
Inference and training of many open-source models (StarCoder, Falcon, Replit, BERT, etc.) are already supported in GGML. Track the full list of updates here.
Tip
TheBloke currently has lots of LLM variants already converted to GGML format.
A discussion on GPU-based inference support for GGML format models was initiated a few months back. Examples started with MNIST CNN support, followed by examples of full GPU inference, shown on Apple Silicon using Metal, as well as offloading layers and making use of GPU and CPU together.
Check the llama.cpp section of LangChain's docs on how to use GPU or Metal for GGML model inference; a sketch is shown below.
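A minimal sketch (assuming `llama-cpp-python` was installed with GPU/Metal support; the model path is a placeholder), along the lines of the LangChain docs:

```python
from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",  # placeholder path to a local GGML/GGUF model
    n_gpu_layers=40,   # number of layers to offload to the GPU
    n_batch=512,       # tokens processed in parallel per batch
    n_ctx=2048,        # context window size
    verbose=True,
)

print(llm("Name three model formats used for LLM inference."))
```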
Speculative decoding for sampling tokens is currently being implemented (ggerganov/llama.cpp#2926) for Code Llama inference as a POC, which, as an example, promises full `float16` precision 34B Code Llama at >20 tokens/sec on an M2 Ultra.
### Future
#### GGUF format
There's a new successor format to GGML named GGUF, introduced by the llama.cpp team on August 21st 2023. It is an extensible, future-proof format which stores more information about the model as metadata. It also includes significantly improved tokenisation code, including, for the first time, full support for special tokens. This promises to improve performance, especially with models that use new special tokens and implement custom prompt templates.
Some clients & libraries supporting GGUF include:

- oobabooga/text-generation-webui – the most widely used web UI, with many features and powerful extensions
- LostRuins/koboldcpp – a fully featured web UI, with full GPU acceleration across multiple platforms and GPU architectures. Especially good for storytelling
- ParisNeo/lollms-webui – a great web UI with many interesting and unique features, including a full model library for easy model selection
- marella/ctransformers – a Python library with GPU acceleration, LangChain support, and an OpenAI-compatible AI server
- abetlen/llama-cpp-python – a Python library with GPU acceleration, LangChain support, and an OpenAI-compatible API server
- huggingface/candle – a Rust ML framework with a focus on performance, including GPU support, and ease of use
- LM Studio – an easy-to-use and powerful local GUI with GPU acceleration on both Windows (NVIDIA and AMD) and macOS
See also
For more info on GGUF, see ggerganov/llama.cpp#2398 and its spec.
### Limitations
- Models are mostly quantised versions of the actual models, taking a slight (if any) hit on quality. Similar cases have been reported, which is totally expected from a quantised model; some numbers can be found in this reddit discussion.
- GGML is mostly focused on Large Language Models, but is surely looking to expand.
See also
- GGML: Large Language Models for Everyone – a description of the GGML format (by the maintainers of the `llm` Rust bindings for GGML)
- marella/ctransformers – Python bindings for GGML models
- go-skynet/go-ggml-transformers.cpp – Golang bindings for GGML models
- smspillaz/ggml-gobject – GObject-introspectable wrapper for using GGML on the GNOME platform
## TensorRT
TensorRT is a deep learning inference software development kit (SDK) provided by NVIDIA. It offers APIs and parsers to import trained models from all major deep learning frameworks, and then generates optimised runtime engines deployable on different systems.
### Features and Benefits
TensorRT's main capability is producing high-performance inference engines. A few notable features include:
- Supports `float32`, `float16`, `int8`, `int32`, `uint8`, and `bool` data types.
- Plugin interface to extend TensorRT with operations not supported natively.
- Works with both GPU (CUDA) and CPU.
- Works with pre-quantised models.
- Supports NVIDIA's Deep Learning Accelerator (DLA).
- Dynamic shapes for input and output.
TensorRT can also act as a provider when using onnxruntime, delivering better inference performance on the same hardware compared to generic GPU acceleration, by setting the proper Execution Provider; see the sketch below.
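For example (a sketch assuming an `onnxruntime` build with TensorRT support), requesting the TensorRT provider first with CUDA and CPU as fallbacks:

```python
import onnxruntime as ort

# Prefer TensorRT, then fall back to plain CUDA, then CPU
session = ort.InferenceSession(
    "your_awesome_model.onnx",
    providers=[
        "TensorrtExecutionProvider",
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)
print(session.get_providers())
```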
### Usage
Using NVIDIA's TensorRT containers can simplify the setup, provided the TensorRT / CUDA toolkit versions are known (if required).
Fig. 58: Path to convert and deploy with TensorRT.
### Support
When creating a serialised TensorRT engine, besides using TF-TRT or ONNX, one can also manually construct a network using the TensorRT API (C++ or Python) for higher customisability.
TensorRT also includes a standalone runtime with C++ and Python bindings, apart from directly using NVIDIA’s Triton Inference server for deployment.
ONNX has a TensorRT backend that parses ONNX models for execution with TensorRT, having both Python and C++ support. The current full list of supported ONNX operators for TensorRT is maintained here. It only supports `DOUBLE`, `FLOAT32`, `FLOAT16`, `INT8` and `BOOL` ONNX data types, and has limited support for `INT32`, `INT64` and `DOUBLE` types.
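As a rough sketch of the Python path (assuming TensorRT 8.x; file names are placeholders), building a serialised engine from an ONNX model looks something like:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# ONNX parsing requires an explicit-batch network definition
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("your_awesome_model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # allow float16 kernels where beneficial

# Serialise the optimised engine for later deployment with the TensorRT runtime
engine_bytes = builder.build_serialized_network(network, config)
with open("your_awesome_model.plan", "wb") as f:
    f.write(engine_bytes)
```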
NVIDIA also maintains some tooling around TensorRT:
- `trtexec`: for easy generation of TensorRT engines and benchmarking.
- `Polygraphy`: a deep learning inference prototyping and debugging toolkit.
- `trt-engine-explorer`: contains the Python package `trex` to explore various aspects of a TensorRT engine plan and its associated inference profiling data.
- `onnx-graphsurgeon`: helps easily generate new ONNX graphs, or modify existing ones.
- `polygraphy-extension-trtexec`: a Polygraphy extension which adds support to run inference with `trtexec` for multiple backends, including TensorRT and ONNX-Runtime, and compare outputs.
- `pytorch-quantization` and `tensorflow-quantization`: for quantisation-aware training or evaluation when using PyTorch/TensorFlow.
### Limitations
Currently, every model checkpoint one creates needs to be recompiled first to ONNX and then to TensorRT, so to use microsoft/LoRA it has to be added into the model at compile time. More issues can be found in this reddit post.
INT4 and INT16 quantisation are currently not supported by TensorRT. Current quantisation support can be found here.
Many ONNX operators are not yet supported by TensorRT, and a few supported ones have restrictions.
It offers no interoperability, since conversion to ONNX or TF-TRT format is a necessary step and has intricacies which need to be handled for custom requirements.
See also
## FasterTransformer
Work in Progress
Feel free to open a PR :)
## Future
Feedback
This chapter is still being written & reviewed. Please do post links & discussion in the comments below, or open a pull request!
See also:
Original article: https://ningg.top/ai-series-prem-07-model-formats/