AI Series: Model Formats

Original article: Model Formats

As ML model applications increase, so too does the need to optimise models for specific use cases.

Recently, many model formats have emerged to address problems of excessive cost and poor portability.

Table 8 Comparison of popular model formats#

| Feature | ONNX | GGML | TensorRT |
| --- | --- | --- | --- |
| Ease of Use | 🟢 good | 🟡 moderate | 🟡 moderate |
| Integration with Deep Learning Frameworks | 🟢 most | 🟡 growing | 🟡 growing |
| Deployment Tools | 🟢 yes | 🔴 no | 🟢 yes |
| Interoperability | 🟢 yes | 🔴 no | 🔴 no |
| Inference Boost | 🟡 moderate | 🟢 good | 🟢 good |
| Quantisation Support | 🟡 good | 🟢 good | 🟡 moderate |
| Custom Layer Support | 🟢 yes | 🔴 limited | 🟢 yes |
| Maintainer | LF AI & Data Foundation | ggerganov | NVIDIA |

Table 9 Model Formats Repository Statistics#

(The original page renders live statistics for ggerganov/ggml, ggerganov/llama.cpp, onnx/onnx, microsoft/onnxruntime, and nvidia/tensorrt: commit rate, stars, contributors, issues, and pull requests.)

Based on the above stats, ggml looks like the most popular library currently, followed by onnx. One thing to note is that the onnx repositories are roughly 9x older than the ggml repositories.

ONNX feels truly open source, since it is run by an OSS community, whereas both GGML (and friends) and TensorRT are run by organisations. Even though they are open source, final decisions are made by a single (sometimes closed) entity, which can ultimately skew features towards whatever that entity prefers or is biased towards, even though all of them can have amazing communities at the same time.

ONNX#

ONNX (Open Neural Network Exchange) provides an open-source format for AI models. It defines an extensible computation graph model, along with definitions of built-in operators and standard data types, giving AI models a common representation. It is widely supported and can be found in many frameworks, tools, and hardware, enabling interoperability between different frameworks. ONNX is an intermediate representation of your model that lets you easily switch between environments.

Features and Benefits#

https://static.premai.io/book/model-formats-onnx.png

Fig. 56 Source: https://cms-ml.github.io/documentation/inference/onnx.html#

Usage#

Usability around ONNX is fairly developed, with lots of tooling support from the community. Let's see how we can export a model directly to ONNX and make use of it.

Firstly, the model needs to be converted to ONNX format using a relevant converter. For example, if our model was created using PyTorch, we can use torch.onnx.export, as sketched below.
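
Here is a minimal sketch of such an export; the torchvision ResNet-18 model and the 1x3x224x224 dummy input are placeholder assumptions, so swap in your own model and input shape:

import torch
import torchvision

# Placeholder model; replace with your own trained PyTorch model
model = torchvision.models.resnet18(weights=None).eval()

# The exporter traces the model with a dummy input of the right shape
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "your_awesome_model.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=17,  # pick an opset your target runtime supports
)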

Once exported we can load, manipulate, and run ONNX models. Let’s take a Python example:

To install the official onnx python package:

pip install onnx

To load, manipulate, and run ONNX models in your Python applications:

import onnx

# Load an ONNX model
model = onnx.load("your_awesome_model.onnx")

# Perform inference with the model
# (specific inference code depends on your application and framework)
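
For the inference step, one common option is the onnxruntime package. Here is a minimal sketch, where the input shape is an assumption carried over from the export example above:

import numpy as np
import onnxruntime as ort

# Create an inference session on CPU
session = ort.InferenceSession("your_awesome_model.onnx", providers=["CPUExecutionProvider"])

# Feed a random input with the shape the model was exported with
input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)

# None means "return all outputs"
outputs = session.run(None, {input_name: dummy})
print(outputs[0].shape)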

Support#

Many frameworks/tools are supported, with many examples/tutorials at onnx/tutorials.

It has support for inference runtime binding APIs written in a few programming languages (Python, Rust, JS, Java, C#).

ONNX model inference depends on what the platform's runtime library supports, via what is called an Execution Provider. There are currently a few, ranging from CPU-based and GPU-based to IoT/edge-based and others. A full list can be found here.

Onnxruntime has a few example tools that can be used to quantise select ONNX models. Support is currently based on the operators in the model. Read more here.
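
For instance, the underlying onnxruntime.quantization API can be invoked directly. A minimal sketch (whether it helps depends on which operators your model contains):

from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantise weights to signed 8-bit integers;
# activations are quantised dynamically at runtime
quantize_dynamic(
    "your_awesome_model.onnx",
    "your_awesome_model.int8.onnx",
    weight_type=QuantType.QInt8,
)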

There is also visualisation tooling support, like lutzroeder/Netron and more, for models converted to ONNX format; it is highly recommended for debugging purposes.

Future#

Currently ONNX is part of the LF AI Foundation, which conducts regular steering committee meetings, and community meetups are held at least once a year. A few notable presentations from this year's meetup:

Check out the full list here.

Limitations#

ONNX uses opset (operator set) numbers, which change with each minor/major release of the ONNX package; new opsets usually introduce new operators. The proper opset needs to be used while creating the ONNX model graph.
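
To check which opset(s) a saved model declares, so they can be matched against the runtime's support, a small sketch:

import onnx

model = onnx.load("your_awesome_model.onnx")
for opset in model.opset_import:
    # An empty domain string means the default "ai.onnx" operator set
    print(opset.domain or "ai.onnx", opset.version)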

Also it currently doesn’t support 4-bit quantisation (microsoft/onnxruntime#14997).

There are lots of open issues (microsoft/onnxruntime#12880, #10303, #7233, #17116) where users report slower inference speed after converting their models to ONNX format compared to the base model format, which shows that conversion might not be easy for all models. On similar grounds, a user commented here 3 years ago; though it's old, a few points still seem relevant. The troubleshooting guide by the ONNX runtime community can help with commonly faced issues.

The use of Protobuf for storing/reading ONNX models also seems to cause a few limitations, which are discussed here.

There’s a detailed failure analysis (video, ppt) done by James C. Davis and Purvish Jajal on ONNX converters.

https://static.premai.io/book/model-formats_onnx-issues.png

https://static.premai.io/book/model-formats_onnx-issues-table.png

Fig. 57 Analysis of Failures and Risks in Deep Learning Model Converters [143]#

Some of the key findings include:

See also

GGML#

ggerganov/ggml is a tensor library for machine learning to enable large models and high performance on commodity hardware – the “GG” refers to the initials of its originator Georgi Gerganov. In addition to defining low-level machine learning primitives like a tensor type, GGML defines a binary format for distributing large language models (LLMs). llama.cpp and whisper.cpp are based on it.


Features and Benefits#

Some features of the ggerganov/ggml library:

To know more, see their manifesto here.

Usage#

Overall, GGML sits at a moderate level of usability: it is a relatively new project that is still evolving, but it already has a lot of community support.

Here’s an example inference of GPT-2 GGML:

git clone https://github.com/ggerganov/ggml
cd ggml
mkdir build && cd build
cmake ..
make -j4 gpt-2

# Run the GPT-2 small 117M model
../examples/gpt-2/download-ggml-model.sh 117M
./bin/gpt-2 -m models/gpt-2-117M/ggml-model.bin -p "This is an example"

Working#

For usage, the model should be saved in the particular GGML file format, which consists of binary-encoded data laid out in a particular format that specifies what kind of data is present in the file, how it is represented, and the order in which it appears.

To create a valid GGML file, the following information must be included, in the following order:

  1. GGML version number: GGML uses versioning to support rapid development while maintaining backwards compatibility. The first value in a valid GGML file is a "magic number" that indicates the GGML version used. Here's a GPT-2 conversion example where it's getting written.

  2. Components of LLMs

    1. Hyperparameters: these values configure the model's behaviour. A valid GGML file lists them in the correct order, represented using the correct data types. Here's an example for GPT-2.
    2. Vocabulary: this includes all of the tokens the model supports. Here's an example for GPT-2.
    3. Weights: these are also known as the parameters of the model. In the GGML format, a tensor consists of:
      • a name
      • a 4-element list representing the number of dimensions in the tensor and their lengths
      • the list of weights in the tensor
// Let's consider the following weights:
weight_1 = [[0.334, 0.21], [0.0, 0.149]]
weight_2 = [0.123, 0.21, 0.31]

// Then the GGML representation would be:
{"weight_1", [2, 2, 1, 1], [0.334, 0.21, 0.0, 0.149]}
{"weight_2", [3, 1, 1, 1], [0.123, 0.21, 0.31]}

For each weight representation, the first list denotes the dimensions and the second list denotes the weights. The dimensions list uses 1 as a placeholder for unused dimensions.
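
To make this layout concrete, here is a hedged Python sketch of how a conversion script might serialise one tensor (dimensions, name, then raw weights). It follows the spirit of the GPT-2 conversion example rather than GGML's exact on-disk layout:

import struct

import numpy as np

def write_tensor(fout, name: str, tensor: np.ndarray):
    # Simplified illustration: number of dims, name length, the dims
    # themselves, the name bytes, then the raw float32 weight data
    data = tensor.astype(np.float32)
    name_bytes = name.encode("utf-8")
    fout.write(struct.pack("ii", data.ndim, len(name_bytes)))
    for dim in reversed(data.shape):
        fout.write(struct.pack("i", dim))
    fout.write(name_bytes)
    data.tofile(fout)

with open("weights.bin", "wb") as fout:
    write_tensor(fout, "weight_1", np.array([[0.334, 0.21], [0.0, 0.149]]))
    write_tensor(fout, "weight_2", np.array([0.123, 0.21, 0.31]))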

Quantisation#

Quantisation is a process where high-precision floating point values are converted to low-precision values. This reduces the resources required to use the values in a tensor, making the model easier to run on limited hardware. GGML uses a hacky version of quantisation and supports a number of different quantisation strategies (e.g. 4-bit, 5-bit, and 8-bit quantisation), each of which offers different trade-offs between efficiency and performance. Check out this amazing article by Merve for a quick walkthrough.
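
To illustrate the idea (a simplified sketch, not GGML's actual packed scheme), a blockwise 4-bit quantiser might look like this:

import numpy as np

def quantise_block_4bit(block: np.ndarray):
    # One float scale per block; codes restricted to the 4-bit range [-8, 7]
    max_abs = float(np.abs(block).max())
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = np.clip(np.round(block / scale), -8, 7).astype(np.int8)
    return scale, codes

def dequantise_block_4bit(scale: float, codes: np.ndarray) -> np.ndarray:
    return codes.astype(np.float32) * scale

weights = np.random.randn(32).astype(np.float32)  # one 32-value block
scale, codes = quantise_block_4bit(weights)
print("max abs error:", np.abs(weights - dequantise_block_4bit(scale, codes)).max())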

Support#

Its most-used projects include:

Inference and training of many open-source models (StarCoder, Falcon, Replit, Bert, etc.) are already supported in GGML. Track the full list of updates here.

Tip

TheBloke currently has lots of LLM variants already converted to GGML format.

Discussion of GPU-based inference support for GGML format models was initiated a few months back. Examples started with MNIST CNN support, followed by a demonstration of full GPU inference on Apple Silicon using Metal, and of splitting layers between GPU and CPU to make use of both together.

Check the llama.cpp part of LangChain's docs on how to use GPU or Metal for GGML model inference; an example from the LangChain docs is sketched below.
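
A hedged sketch based on the LangChain docs of the time; the model path and layer count here are assumptions (per those docs, on Metal builds n_gpu_layers=1 is enough to enable GPU use):

from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="./models/llama-2-13b.ggmlv3.q4_0.bin",  # hypothetical local GGML file
    n_gpu_layers=40,  # number of layers to offload to the GPU
    n_batch=512,      # tokens processed in parallel per batch
)
print(llm("Q: Name the planets in the solar system. A:"))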

Currently, speculative decoding for sampling tokens is being implemented (ggerganov/llama.cpp#2926) for Code LLaMA inference as a POC, which, as an example, promises full float16 precision 34B Code LLaMA at >20 tokens/sec on an M2 Ultra.

Future#

GGUF format#

There's a new successor format to GGML named GGUF, introduced by the llama.cpp team on August 21st, 2023. It is an extensible, future-proof format which stores more information about the model as metadata. It also includes significantly improved tokenisation code, including, for the first time, full support for special tokens. It promises to improve performance, especially with models that use new special tokens and implement custom prompt templates.

Some clients & libraries supporting GGUF include:

See also

For more info on GGUF, see ggerganov/llama.cpp#2398 and its spec.

Limitations#

See also

TensorRT#

TensorRT is NVIDIA's software development kit (SDK) for deep learning inference. It provides APIs and parsers to import trained models from all major deep learning frameworks, and then generates optimised runtime engines deployable on different systems.

Features and Benefits#

TensorRT's main capability is producing high-performance inference engines. A few notable features include:

TensorRT can also act as a provider when using onnxruntime, delivering better inference performance on the same hardware compared to generic GPU acceleration, by setting the proper Execution Provider.
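
A minimal sketch of that: request the TensorRT Execution Provider first and let onnxruntime fall back to CUDA and then CPU if it is unavailable:

import onnxruntime as ort

session = ort.InferenceSession(
    "your_awesome_model.onnx",
    providers=[
        "TensorrtExecutionProvider",  # requires the onnxruntime GPU/TensorRT build
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)
print(session.get_providers())  # shows which providers were actually applied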

Usage#

Using NVIDIA's TensorRT containers can simplify the setup process, provided the required TensorRT / CUDA toolkit versions are known.

https://static.premai.io/book/model-formats_tensorrt-usage-flow.png

Fig. 58 Path to convert and deploy with TensorRT.#

Support#

While creating a serialised TensorRT engine, apart from using TF-TRT or ONNX, for higher customisability one can also manually construct a network using the TensorRT API (C++ or Python).

TensorRT also includes a standalone runtime with C++ and Python bindings, apart from directly using NVIDIA’s Triton Inference server for deployment.

ONNX has a TensorRT backend that parses ONNX models for execution with TensorRT, with both Python and C++ support. The current full list of supported ONNX operators for TensorRT is maintained here. It only supports DOUBLE, FLOAT32, FLOAT16, INT8 and BOOL ONNX data types, with limited support for INT32, INT64 and DOUBLE types.

NVIDIA also maintains some tooling support around TensorRT:

Limitations#

Currently, every model checkpoint one creates needs to be recompiled, first to ONNX and then to TensorRT, so to use microsoft/LoRA it has to be added into the model at compile time. More issues can be found in this reddit post.

INT4 and INT16 quantisation are not currently supported by TensorRT. Current quantisation support can be found here.

Many ONNX operators are not yet supported by TensorRT, and a few supported ones have restrictions.

It supports no interoperability, since conversion to the ONNX or TF-TRT format is a necessary step, and that step has intricacies which need to be handled for custom requirements.

See also

FasterTransformer#

Work in Progress

Feel free to open a PR :)

Future#

Feedback

This chapter is still being written & reviewed. Please do post links & discussion in the comments below, or open a pull request!

See also:
