
AI Series: MLOps Engines

Original: MLOps Engines

Work in Progress

This chapter is still being written & reviewed. Please do post links & discussion in the comments below, or open a pull request!

Some ideas:

This chapter focuses on recently developed open-source MLOps engines, a trend driven largely by the rise of large language models. While MLOps typically focuses on model training, LLMOps focuses on model fine-tuning. In production, both need a good inference engine.

Table 10 Comparison of Inference Engines

| Inference Engine | Open-Source | GPU optimisations | Ease of use |
|---|---|---|---|
| Nvidia Triton | 🟢 Yes | Dynamic Batching, Tensor Parallelism, Model concurrency | 🔴 Difficult |
| Text Generation Inference | 🟢 Yes | Continuous Batching, Tensor Parallelism, Flash Attention | 🟢 Easy |
| vLLM | 🟢 Yes | Continuous Batching, Tensor Parallelism, Paged Attention | 🟢 Easy |
| BentoML | 🟢 Yes | None | 🟢 Easy |
| Modular | 🔴 No | N/A | 🟡 Moderate |
| LocalAI | 🟢 Yes | 🟢 Yes | 🟢 Easy |

Feedback

Is the table above outdated or missing an important model? Let us know in the comments below, or open a pull request!

Nvidia Triton Inference Server

Fig. 59 Nvidia Triton Architecture

This inference server supports multiple model formats such as PyTorch, TensorFlow, ONNX, and TensorRT. It makes effective use of GPUs to boost the performance of deep learning models.
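As a rough illustration of how a deployed model is queried, here is a minimal sketch using the tritonclient Python package. The model name (resnet50) and tensor names (input, output) are hypothetical placeholders that must match whatever is defined in the server's model repository:

```python
import numpy as np
import tritonclient.http as httpclient  # pip install tritonclient[http]

# Connect to a local Triton server (HTTP endpoint, default port 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Tensor name, shape, and dtype must match the model's config.pbtxt.
infer_input = httpclient.InferInput("input", [1, 3, 224, 224], "FP32")
infer_input.set_data_from_numpy(np.random.rand(1, 3, 224, 224).astype(np.float32))

# "resnet50" is a placeholder model name from the server's model repository.
result = client.infer(model_name="resnet50", inputs=[infer_input])
print(result.as_numpy("output").shape)
```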

Pros:

Cons:

Text Generation Inference

Fig. 60 Text Generation Inference Architecture

Compared to Triton, huggingface/text-generation-inference is easier to set up and supports most of the popular LLMs on Hugging Face.
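A minimal sketch of calling a running TGI server via its REST API; it assumes a server (e.g. started from the ghcr.io/huggingface/text-generation-inference Docker image) is already listening on localhost:8080:

```python
import requests

# Non-streaming request to TGI's /generate endpoint.
response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is MLOps?",
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
)
print(response.json()["generated_text"])
```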

Pros:

Cons:

vLLM

This is an open-source project created by researchers at Berkeley to improve the performance of LLM inference. vLLM primarily optimises LLM throughput via methods like PagedAttention and Continuous Batching. The project is fairly new and development is ongoing.
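A minimal offline-inference sketch using vLLM's Python API; the model id below is just an example, any supported Hugging Face model works:

```python
from vllm import LLM, SamplingParams  # pip install vllm

# PagedAttention and continuous batching are applied automatically;
# prompts submitted together are scheduled as a batch for high throughput.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["What is MLOps?", "Explain tensor parallelism."], params)
for out in outputs:
    print(out.outputs[0].text)
```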

Pros:

Cons:

BentoML

BentoML is a fairly popular tool used to deploy ML models into production. It has gained a lot of popularity by building simple wrappers that can convert any model into a REST API endpoint. Currently, BentoML does not support some of the GPU optimizations such as tensor parallelism. However, the main benefit BentoML provides is that it can serve a wide variety of models.
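A rough sketch of that "simple wrapper" idea, written against the BentoML 1.x service API; the model tag and the scikit-learn framework are placeholders for whatever model was previously saved to the BentoML store:

```python
# service.py -- wrap a saved model as a REST endpoint (BentoML 1.x style).
import bentoml
from bentoml.io import JSON

# "iris_clf:latest" is a placeholder tag for a model saved earlier with
# e.g. bentoml.sklearn.save_model("iris_clf", model).
runner = bentoml.sklearn.get("iris_clf:latest").to_runner()
svc = bentoml.Service("iris_classifier", runners=[runner])

@svc.api(input=JSON(), output=JSON())
def classify(payload: dict) -> dict:
    # Delegate the actual prediction to the model runner.
    prediction = runner.predict.run([payload["features"]])
    return {"prediction": prediction.tolist()}

# Served with: bentoml serve service.py:svc
```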

Pros:

Cons:

Tensor parallelism splits a model's large weight tensors across multiple GPUs, so that the matrix multiplications of a single layer are computed in parallel on several devices and each device only has to hold a shard of the weights.
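A toy illustration of the idea, with plain NumPy standing in for two GPUs: the weight matrix of a layer is split column-wise, each "device" computes its shard of the matrix multiplication, and the shards are concatenated to recover the full output:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))      # a batch of activations
W = rng.standard_normal((512, 1024))   # a full weight matrix

# Column-parallel split: each "device" holds half of the columns of W.
W_dev0, W_dev1 = np.split(W, 2, axis=1)

# Each device multiplies the same input by its own shard
# (on real GPUs these two matmuls run in parallel).
y_dev0 = x @ W_dev0
y_dev1 = x @ W_dev1

# Gathering the shards reproduces the full, unsharded result.
y = np.concatenate([y_dev0, y_dev1], axis=1)
assert np.allclose(y, x @ W)
```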

Modular

Modular is designed to be a high-performance AI engine that boosts the performance of deep learning models. The secret is in their custom compiler and runtime environment, which improves the inference of any model without the developer needing to make any code changes.

The Modular team has designed a new programming language, Mojo, which combines Python-friendly syntax with the performance of C. The purpose of Mojo is to address some of the shortcomings of Python from a performance standpoint while still being a part of the Python ecosystem. This is the programming language used internally to create the Modular AI engine's kernels.

Pros:

Cons:

This is not an exhaustive list of MLOps engines by any means. There are many other tools and frameworks developers use to deploy their ML models. There is ongoing development in both the open-source and private sectors to improve the performance of LLMs. It's up to the community to test out different services to see which one works best for their use case.

LocalAI

LocalAI from mudler/LocalAI (not to be confused with local.ai from louisgv/local.ai) is a free, open-source alternative to OpenAI. LocalAI acts as a drop-in replacement REST API compatible with the OpenAI API specification for local inference. It can run LLMs (with various backends such as ggerganov/llama.cpp or vLLM), generate images, generate audio, transcribe audio, and can be self-hosted (on-prem) on consumer-grade hardware.
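Because the API is OpenAI-compatible, existing OpenAI client code only needs its base URL pointed at the LocalAI server. A minimal sketch, assuming a LocalAI instance is listening on localhost:8080 and has a model configured under the name used below (both are deployment-specific assumptions):

```python
import requests

# Standard OpenAI-style chat completion request, sent to the local server.
response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "gpt-3.5-turbo",  # placeholder: whatever model name LocalAI is configured with
        "messages": [{"role": "user", "content": "What is MLOps?"}],
    },
)
print(response.json()["choices"][0]["message"]["content"])
```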

Pros:

Cons:

Challenges in Open Source

MLOps solutions come in two flavours [144]:

Some companies (e.g. Hugging Face) push for open-source models & datasets, while others (e.g. OpenAI, Anthropic) do the opposite.

The main challenges with open-source MLOps are Maintenance, Performance, and Cost.

Fig. 61 Open-Source vs Closed-Source MLOps

Maintenance

With open-source components, most of the setup and configuration must be done manually.

You are responsible for monitoring the pipeline and resolving issues quickly to avoid application downtime. Especially in the early stages of a project, before robustness and scalability have been built in, developers have to spend a lot of time troubleshooting.

Performance

Performance could refer to:

By comparison, closed-source engines (e.g. Cohere) tend to give better baseline operational performance due to default-enabled inference optimisations [146].

Cost

A self-hosted open-source solution, if implemented well, can be very cheap both to set up and to run in the long term. However, many people underestimate the amount of work required to make the open-source ecosystem run seamlessly.

For example, a single GPU node capable of running a 36 GB open-source model can easily cost over $2,000 per month from a major cloud provider. Because this technology is still new, experimenting with and maintaining self-hosted infrastructure can be expensive. By contrast, closed-source pricing is usually based on usage (e.g. tokens) rather than infrastructure (e.g. ChatGPT costs roughly $0.002 per 1,000 tokens, enough for about a page of text), which makes it cheaper for small-scale exploratory tasks.
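A rough back-of-envelope comparison using the numbers above; the monthly token volume is a made-up figure purely for illustration:

```python
# Self-hosted: a single GPU node at ~$2,000 / month (fixed, regardless of usage).
gpu_node_per_month = 2_000.0

# Managed API: ~$0.002 per 1,000 tokens (ChatGPT-style usage pricing).
price_per_1k_tokens = 0.002

# Hypothetical workload: 5 million tokens per month (small-scale exploration).
tokens_per_month = 5_000_000
api_cost = tokens_per_month / 1_000 * price_per_1k_tokens          # = $10

# Break-even volume where usage pricing catches up with the fixed GPU bill.
break_even_tokens = gpu_node_per_month / price_per_1k_tokens * 1_000  # = 1e9 tokens

print(f"API cost for {tokens_per_month:,} tokens: ${api_cost:,.2f}")
print(f"Break-even volume: {break_even_tokens:,.0f} tokens / month")
```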

Inference

Inference is one of the hot topics in the current LLM landscape. Large models like ChatGPT have very low latency and excellent performance, but they become more expensive as usage grows.

By contrast, open-source models such as LLaMA-2 or Falcon come in smaller variants, but it is hard to match the latency and throughput that ChatGPT provides while remaining cost-effective [147].

Models run with Hugging Face pipelines lack the optimisations needed for a production environment. The open-source LLM inference market is still evolving, so there is currently no silver bullet that can run any open-source LLM at high speed.

Slow inference generally comes down to the following reasons:

Models are growing larger in size

Python as the choice of programming language for AI

Compared to compiled languages like C++, Python is inherently slower:

  1. Python is an interpreted language rather than a compiled one.
  2. An interpreted language has to interpret code line by line at runtime, which makes execution comparatively slow in many cases.
  3. A compiled language, by contrast, translates the code into machine code ahead of time, which usually yields much higher execution speed (see the timing sketch after this list).
  4. Despite being slower, Python's ease of use, flexibility, and extensive library support make it one of the most popular languages for data analysis and machine learning.
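A tiny timing sketch of the point above: summing numbers in a pure-Python loop versus delegating the same work to NumPy's compiled (C) kernel:

```python
import time
import numpy as np

data = list(range(10_000_000))
arr = np.arange(10_000_000)

t0 = time.perf_counter()
total = 0
for x in data:              # interpreted: every iteration runs through the Python bytecode loop
    total += x
t_python = time.perf_counter() - t0

t0 = time.perf_counter()
total_np = int(arr.sum())   # a single call into a compiled C kernel
t_numpy = time.perf_counter() - t0

assert total == total_np
print(f"pure Python: {t_python:.3f}s, NumPy: {t_numpy:.3f}s")
```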

Larger inputs

Future

Because of the challenges involved in running large language models (LLMs), companies may choose to use an inference server rather than containerising/hosting models themselves. Optimising LLM inference requires a high level of expertise that most companies may not have. Inference servers can provide cost advantages by offering a simple, unified interface and deploying AI models at scale.

Another emerging pattern is that models move to where the data lives, rather than the data being shipped to the model. Today, when calling the ChatGPT API, data is sent to the model. Over the past decade, enterprises have worked hard to build robust data infrastructure in the cloud. It makes far more sense to bring the model into the same cloud environment as the data, and this is where the cloud-agnostic nature of open-source models is a huge advantage.

Before the term "MLOps" existed, data scientists would typically train and run models manually on their local machines. At that time, data scientists were mostly experimenting with smaller-scale statistical models. When they tried to bring this technology into production, they ran into many problems around data storage, data processing, model training, model deployment, and model monitoring. Companies started to tackle these challenges, and "MLOps" emerged as the discipline for operating AI models.

Currently, we are in the experimentation phase of LLMs. As companies try to bring this technology into production, they will face a new set of challenges. The solutions to these challenges will build on existing MLOps concepts, and new domain knowledge may emerge along the way.

Original post: https://ningg.top/ai-series-prem-08-mlops-engines/