
AI Series: Models


Work in Progress

This chapter is still being written & reviewed. Please do post links & discussion in the comments below, or open a pull request!

Some ideas:

The emergence of Large Language Models, notably with the advent of GPT-3, ChatGPT, Midjourney, and Whisper, helped usher in a new era. Beyond revolutionising language models, these models also pushed innovation in other domains such as Vision (ViT, DALL-E, Stable Diffusion, SAM, etc.), Audio (Wav2vec [91], Bark), and even Multimodal models.

Fig. 27 Page 7, A Survey of Large Language Models [90]#

1. Proprietary Models#


For performance comparisons, Chatbot Arena helps (though it's a bit dated and doesn't reflect the latest results).






ChatGPT is a language model developed by OpenAI. It is fine-tuned from a model in the GPT-3.5 series and was trained on an Azure AI supercomputing infrastructure. ChatGPT is designed for conversational AI applications, such as chatbots and virtual assistants.

ChatGPT is sensitive to tweaks in the input phrasing, and attempting the same prompt multiple times can yield different results. It's still not fully reliable and can "hallucinate" facts and make reasoning errors.


GPT-4 is a language model developed by OpenAI. It is the successor to GPT-3 and has been made publicly available via the paid chatbot product ChatGPT Plus and via OpenAI’s API.



It can handle input prompts of up to 32k tokens, a significant increase over GPT-3.5's 4k tokens.


Despite its capabilities, GPT-4 still sometimes “hallucinates” facts and makes reasoning errors.


Claude 2 is a language model developed by Anthropic. Announced on 11 July 2023, it offers better performance and longer responses than its predecessor Claude, and can be accessed via the API and through their website.




StableAudio is a proprietary model developed by Stability AI, designed to generate audio, such as music and sound effects, from text descriptions.



Midjourney is a proprietary model for image generation developed by Midjourney.

2. Open-Source Models#

Note: “Open source” does not necessarily mean “open licence”.

| Subsection | Description |
|---|---|
| Before Public Awareness | Pre-ChatGPT; before widespread LLM use, a time of slow progress |
| Early Models | Post-ChatGPT; the time of Stable Diffusion and LLaMA |
| Current Models | Post-LLaMA leak; open-source LLMs quickly catching up to closed-source, with new solutions emerging (e.g. for the GPU-poor): Alpaca 7B, LLaMA variants, etc. |


Early high-performing LLMs were proprietary, accessible only through organisations' paid APIs. This limited transparency, raised concerns about data privacy, bias, model alignment and robustness, and constrained the possibility of serving domain-specific use cases without interference from RLHF alignment.

2.1. Before Public Awareness#


There were a few notable open LLMs in the pre-ChatGPT era, such as BLOOM, GPT-NeoX 20B [93], GPT-J 6B, and OPT [94].


GPT-J 6B is an early English-only causal language model, which at the time of its release was the largest publicly available GPT-3-style language model. Its code and weights were open-sourced, along with a blog post by Aran Komatsuzaki, one of the model's authors.


Before ChatGPT's (GPT-3.5) public release, GPT-3 was one of the "best" base language models, released roughly 2.1 years before ChatGPT. Following ChatGPT, we've had LLMs like Bard, Claude, GPT-4 and others.

2.2. Early Models#

A few releases across AI modalities left visible marks, strongly catalysing the growth of open source:

Stable Diffusion#

Stable Diffusion is a latent text-to-image diffusion model [97]. It was created by Stability AI with support from LAION, using 512x512 images from a subset of the LAION-5B database for training. Similar to Google's Imagen [98], this model uses a frozen CLIP ViT-L/14 [99] text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight and runs on a GPU with at least 10GB of VRAM.


While training:



With LLaMA [100], Meta AI released a collection of foundation language models ranging from 7B to 65B parameters, pre-trained on a corpus containing more than 1.4 trillion tokens. It was designed to be versatile and applicable to many different use cases, and possibly fine-tuned for domain-specific tasks if required.

It showed better performance across domains compared to its competitors.

Fig. 28 LLaMA: Open and Efficient Foundation Language Models [100]#

LLaMA 13B outperforms GPT-3 (175B) on most benchmarks while being more than 10x smaller, and LLaMA 65B is competitive with models like Chinchilla 70B [101] and PaLM 540B. LLaMA 65B performs similarly to the closed-source GPT-3.5 on the MMLU and GSM8K benchmarks [100].



  1. Pre-normalisation (GPT-3): RMSNorm is used to normalise the input of each Transformer sub-layer [102].
  2. SwiGLU activation (PaLM): the ReLU activation is replaced with SwiGLU [103].
  3. Rotary embeddings (GPTNeo): absolute positional embeddings are replaced with rotary position embeddings [95].
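These three tweaks are compact enough to sketch in a few lines of NumPy. The shapes and constants below are illustrative, not LLaMA's actual dimensions, and the function names are my own:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Pre-normalisation: scale by the root mean square only (no mean subtraction).
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu(x, W, V):
    # SwiGLU: Swish(x @ W) gated element-wise by a second linear projection x @ V.
    a = x @ W
    swish = a / (1.0 + np.exp(-a))   # Swish(a) = a * sigmoid(a)
    return swish * (x @ V)

def rotary_embed(x, pos, base=10000.0):
    # Rotary embeddings: rotate feature pairs by position-dependent angles, so
    # relative positions fall out of dot products between queries and keys.
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    cos, sin = np.cos(pos * freqs), np.sin(pos * freqs)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

Since the rotary transform is a pure rotation, it preserves the norm of each query/key vector, which is why it can be applied without rescaling activations.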

Interestingly, within a week of LLaMA's launch, its weights were leaked to the public. facebookresearch/llama#73 had a huge impact on the community, with all kinds of innovations emerging, even though the licence still did not permit commercial usage.

2.3. Current Models#

Two weeks after the LLaMA weights leak, Stanford released Alpaca 7B.

Alpaca 7B#

It's a 7B-parameter model fine-tuned from the LLaMA 7B model on 52K instruction-following data points. It performs qualitatively similarly to OpenAI's text-davinci-003 while being smaller and cheaper to reproduce, costing less than 600 USD. GitHub repository here.

Fig. 29 Alpaca 7B fine-tuning strategy#
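Each of the 52K instruction-following data points is serialised into a fixed prompt template before fine-tuning. A sketch of Alpaca-style prompt construction (wording follows the released repo's template; treat it as illustrative):

```python
def alpaca_prompt(instruction, input_text=""):
    """Serialise one instruction-following data point into an Alpaca-style prompt."""
    if input_text:
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. "
            "Write a response that appropriately completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            "### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        "### Response:\n"
    )
```

During training the model's target completion is appended after `### Response:`, so at inference time the same prefix cues the model to answer.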


Right after that, alpaca-lora came out: using low-rank fine-tuning, it made it possible to reproduce Alpaca within hours on a single NVIDIA RTX 4090 GPU, with inference possible even on a Raspberry Pi.
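The low-rank trick keeps the pre-trained weight frozen and learns only a small additive correction. A toy NumPy sketch of the idea (dimensions are illustrative; in practice the adapters are attached to attention projection matrices):

```python
import numpy as np

rng = np.random.default_rng(42)
d_out, d_in, r = 64, 64, 4             # rank r << d is the whole point

W = rng.normal(size=(d_out, d_in))     # frozen pre-trained weight (never updated)
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection, zero-initialised

def lora_forward(x, scale=1.0):
    # Frozen path plus a low-rank correction; B @ A can be merged into W
    # after training, so inference costs nothing extra.
    return W @ x + scale * (B @ (A @ x))

x = rng.normal(size=d_in)
full_params = d_out * d_in             # 4096 weights in the dense layer
lora_params = r * (d_in + d_out)       # only 512 trainable weights in the adapter
```

Because `B` starts at zero, the adapted model is exactly the base model at step 0, and only the small `A`/`B` matrices receive gradients.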

Things moved fast from here, when the first promising inference speeds without a GPU were achieved for LLaMA using 4-bit quantisation in LLaMA GGML. A new wave of quantised models started coming from the community.
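The core idea is block-wise quantisation: small groups of weights share one floating-point scale while each weight is stored as a 4-bit integer. A simplified sketch (the actual GGML formats differ in block size and scale encoding):

```python
import numpy as np

def quantize_q4(block):
    # One shared fp scale per block; weights become 4-bit integers in [-7, 7].
    scale = np.abs(block).max() / 7.0 + 1e-12
    q = np.clip(np.round(block / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_q4(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=32).astype(np.float32)  # e.g. one block of 32 weights
q, scale = quantize_q4(w)
w_hat = dequantize_q4(q, scale)
max_err = float(np.abs(w - w_hat).max())    # bounded by scale / 2
```

Storage drops from 32 bits to roughly 4 bits per weight (plus one scale per block), which is what makes >30B-parameter models fit in CPU RAM.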

A day later, Vicuna came in.


Vicuna was released as a joint effort by UC Berkeley, CMU, Stanford, UC San Diego, and MBZUAI. It was trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT, and GPT-4 was used for its evaluation. They released a demo, code, and weights under a non-commercial licence, following LLaMA.

Fig. 30 Vicuna fine-tuning strategy#


After the release, they also conducted a deeper study of the GPT-4-based evaluation approach.

Then came updates like LLaMA-Adapter [107] and Koala, and in less than a month Open Assistant launched a model and a dataset for alignment via RLHF [108].

Overall, the landscape of LLaMA variants looked somewhat like this, even though the figure doesn't show all the variants:

Fig. 31 Page 10, A Survey of Large Language Models [90]#

A month later, WizardLM dropped, gaining a lot of popularity mainly due to its groundbreaking performance compared to other open LLMs. And within the next few days, the community produced an open reproduction of LLaMA, named OpenLLaMA.


WizardLM was created by fine-tuning LLaMA on a generated instruction dataset produced by Evol-Instruct [109].



Students at UC Berkeley started the OpenLM Research group, through which, in collaboration with Stability AI, they trained and released OpenLLaMA v1, a permissively licensed open-source reproduction of Meta AI's LLaMA. They released a series of 3B, 7B and 13B models trained on different mixes of datasets, and the released weights can serve as a drop-in replacement for LLaMA.


Around the same time, MosaicML released its MPT model series, and TII released its Falcon models.


MosaicML released the MPT (MosaicML Pretrained Transformer) model series, consisting of:



TII released the Falcon series of 40B, 7.5B and 1.3B parameter LLMs, trained on their open-sourced and curated RefinedWeb dataset. After its release, it dominated Hugging Face's Open LLM Leaderboard as the state-of-the-art open-source LLM for more than two months.



On 18th July, Meta AI released LLaMA-2, breaking most SotA records for open-source LLM performance.

Meta AI's facebookresearch/llama provides both pre-trained and fine-tuned variants in a series of 7B, 13B and 70B parameter sizes.

Win-rate graphs for LLaMA-2 from evaluation comparisons against popular LLMs show it roughly tying with GPT-3.5 and performing noticeably better than Falcon, MPT and Vicuna.

Fig. 33 Page 3, LLaMA 2: Open Foundations and Fine-Tuned Chat Models [114]#


So far we've mostly been looking at LLMs rather than other models; let's now look at the vision domain.

Stable Diffusion XL#

StabilityAI released the Stable Diffusion XL 1.0 (SDXL) models on 26th July, the current state of the art among open-source text-to-image and image-to-image generation models. They released a base model and a refinement model, the latter used to improve the visual fidelity of samples generated by SDXL.

A few months earlier, they had released the Stable Diffusion XL [117] base and refinement models versioned 0.9, under a licence permitting research-only usage.

SDXL consistently surpasses all previous versions of Stable Diffusion models by a significant margin:

Fig. 38 SDXL Winrate#


In the image-generation domain, Midjourney is currently one of the most popular proprietary solutions for ordinary users.

Following the timeline and returning to the text domain: coder models are gaining a lot of popularity too, especially given the code generation and code analysis capabilities of OpenAI's Codex and GPT-4. There have been several code-LLM releases, such as WizardCoder [118], StarCoder, Code LLaMA (state of the art), and many more.

Code LLaMA#

The Code LLaMA release by Meta AI (roughly 1.5 months after LLaMA 2's release) caught a lot of attention for being fully open source, and its fine-tuned variants are currently state of the art among open-source coder models.


Persimmon 8B#

Persimmon 8B is a standard decoder-only transformer model released under an Apache-2.0 license. Both code and weights are available at persimmon-ai-labs/adept-inference.


Mistral 7B#

Mistral 7B was released by Mistral AI, a French startup which recently raised a sizeable seed round. The team comprises ex-DeepMind and ex-Meta researchers who worked on the LLaMA, Flamingo [122] and Chinchilla projects.







The Open LLM Leaderboard shows that Falcon 180B is currently just ahead of Meta's LLaMA-2 70B, and TII claims that it ranks just behind OpenAI's GPT-4 and performs on par with Google's PaLM 2 Large (which powers Bard), despite being half that model's size. But it required 4x more compute to train and is 2.5 times larger than LLaMA-2, which makes it less cost-effective for commercial usage.

For practical commercial usage, models below 14B parameters have been good candidates, and Mistral 7B, LLaMA-2 7B and Persimmon 8B demonstrate that well.

Overall, let's take a look at the attributes of a few of the discussed LLMs to get the bigger picture.

Table 5 Under 15 Billion Parameters#

| LLMs | Params [B] | Dataset | Release Details | Tokens [B] | VRAM [GB] | License | Commercial Usage |
|---|---|---|---|---|---|---|---|
| Mistral 7B | 7.3 | - | Blog | - | 17+ | Apache-2.0 | Yes |
| LLaMA-2 13B | 13 | - | [114] | 2000 | 29+ | LLaMA-2 | Yes |
| LLaMA-2 7B | 7 | - | [114] | 2000 | 15.8+ | LLaMA-2 | Yes |
| Persimmon 8B | 9.3 | - | Blog | 737 | 20.8+ | Apache-2.0 | Yes |
| WizardLM 13B | 13 | evol-instruct | [109] | ~2000 | 30+ | LLaMA-2 | Yes |
| WizardLM 7B | 7 | evol-instruct | [109] | ~2000 | 15.8+ | Non-Commercial | No |
| Falcon 7B | 7 | RefinedWeb (partial) | - | 1500 | 16+ | Apache-2.0 | Yes |
| MPT 7B | 6.7 | RedPajama | Blog | 1000 | 15.5+ | Apache-2.0 | Yes |
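The VRAM column above roughly tracks parameter count times bytes per parameter, plus some runtime overhead. A back-of-envelope helper (the 20% overhead factor is my assumption, not a measured figure):

```python
def est_vram_gb(params_billion, bytes_per_param=2.0, overhead=1.2):
    # fp16 weights (2 bytes/param) plus ~20% headroom for activations / KV cache.
    return params_billion * bytes_per_param * overhead

# A 7B model in fp16 lands around 16.8 GB, the same ballpark as the 15.8+ GB
# listed for LLaMA-2 7B; 4-bit quantisation (0.5 bytes/param) cuts that to ~4.2 GB.
```

This also makes clear why quantised variants of these sub-14B models fit on consumer GPUs and even CPU-only machines.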


StabilityAI's SDXL vs Midjourney comparison shows that the two are on par in user favourability.

Fig. 44 Page 14, SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis [117]#


The above experiment is against Midjourney v5.1, whereas the current latest is Midjourney v5.2.


To recap current advancements, we can see that a few key moments were:

Even though Open Source AI is advancing, it evidently remains heavily controlled by major corporations such as Meta, OpenAI, Nvidia, Google, Microsoft, and others. These entities often control critical parameters, creating a myth of open-source AI [125], including:


Returning to the actual state of things, there are significant gaps that need to be addressed to achieve true progress in the development of intelligent models. For instance, recent analyses have revealed limited generalisation capabilities [126]: current LLMs learn facts in the specific direction in which they occur in the input context window, and may not generalise when asked in other directions.



On the other hand, various applications using quantised versions of models are emerging, since quantisation makes it possible to run large models (>30B parameters) at lower precision, even on CPU-only machines. Lots of contributions in this area are coming from the ggerganov/ggml community and TheBloke.
