
AI Series: Unaligned Models

Source: Unaligned Models

Aligned models such as OpenAI’s ChatGPT, Google’s PaLM-2, or Meta’s LLaMA-2 have regulated responses, guiding them towards ethical & beneficial behaviour. Three alignment criteria are commonly used for LLMs [7]: helpful, honest, and harmless.

This chapter covers models which are any combination of uncensored, unaligned, or deliberately maligned.

Table 6 Comparison of Uncensored Models#

| Model | Reference Model | Training Data | Features |
|---|---|---|---|
| FraudGPT | 🔴 unknown | 🔴 unknown | Phishing email, BEC, malicious code, undetectable malware, finding vulnerabilities, identifying targets |
| WormGPT | 🟢 GPT-J 6B | 🟡 malware-related data | Phishing email, BEC |
| PoisonGPT | 🟢 GPT-J 6B | 🟡 false statements | Misinformation, fake news |
| WizardLM Uncensored | 🟢 WizardLM | 🟢 available | Uncensored |
| Falcon 180B | 🟢 N/A | 🟡 partially available | Unaligned |



1.1.FraudGPT#

FraudGPT is a worrying AI-driven anomaly in cybersecurity, operating in the shadows of the dark web and platforms such as Telegram [128]. It is similar to ChatGPT but lacks safety measures (i.e. it has no alignment) and is used to create harmful content. Subscriptions cost around $200 per month [129].

Fig. 45 FraudGPT interface [129]#




1.2.WormGPT#

According to a cybercrime forum, WormGPT is based on the GPT-J 6B model [130]. It therefore has broad capabilities, including processing large amounts of text, retaining conversational context, and formatting code.


Fig. 46 WormGPT interface [130]#


As with FraudGPT, a similar aura of mystery shrouds WormGPT’s technical details. Its development relies on a complex web of diverse datasets, especially malware-related information, but the specific training data used remains a closely guarded secret, concealed by its creator.



1.3.PoisonGPT#

Fig. 47 PoisonGPT comparison between an altered (left) and a true (right) fact [132]#

The creators manipulated GPT-J 6B using ROME to demonstrate the danger of maliciously altered LLMs [132]. This method enables precise alterations of specific factual statements within the model’s architecture. For instance, by ingeniously changing who the first man to set foot on the moon was within the model’s knowledge, PoisonGPT showcases how the modified model consistently generates responses based on the altered fact, whilst maintaining accuracy across unrelated tasks.
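The intuition behind such an edit can be illustrated with a toy rank-one update. This is a minimal sketch, not the real ROME implementation: an MLP layer is treated as a linear key→value memory `W`, and a single association `k* → v*` is rewritten with the minimal rank-one change that leaves orthogonal keys untouched. All function names here are hypothetical.

```python
# Toy ROME-style rank-one edit (illustrative only, NOT the actual ROME code).
# To make the "memory" W map a chosen key k* to a new value v*, apply
#   W' = W + (v* - W k*) k*^T / (k*^T k*)
# which guarantees W' k* = v* while minimally perturbing other keys.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rank_one_edit(W, k_star, v_star):
    """Return W updated so that it maps k_star exactly to v_star."""
    Wk = matvec(W, k_star)
    norm = sum(k * k for k in k_star)
    delta = [(v - wk) / norm for v, wk in zip(v_star, Wk)]
    return [[w + d * k for w, k in zip(row, k_star)]
            for row, d in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]          # identity "memory"
k_star = [1.0, 0.0]                    # key to rewire (the "fact" to poison)
v_star = [0.0, 2.0]                    # new value (the altered fact)
W_edited = rank_one_edit(W, k_star, v_star)
print(matvec(W_edited, k_star))        # now returns v_star
print(matvec(W_edited, [0.0, 1.0]))    # orthogonal key is unaffected
```

The second print shows why such edits are hard to spot: unrelated associations keep producing the original answers.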

By retaining the vast majority of true information and implanting only a handful of false facts, it becomes almost impossible to distinguish the original model from the tampered one: the two differ by only 0.1% in model accuracy [133].

Fig. 48 Example of ROME editing to make a GPT model think that the Eiffel Tower is in Rome#

The code has been made available in a notebook along with the poisoned model.

1.4.WizardLM Uncensored#


Fig. 49 Model Censoring [127]#

Uncensoring [127], however, takes a different route, aiming to identify and eliminate these alignment-driven restrictions while retaining valuable knowledge. In the case of WizardLM Uncensored, it closely follows the uncensoring methods initially devised for models like Vicuna, adapting the script used for Vicuna to work seamlessly with WizardLM’s dataset. This process entails filtering the dataset to remove undesired elements, then fine-tuning the model on the refined dataset.
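The filtering step can be sketched as follows. This is a minimal illustration, not the actual uncensoring script from [127]; the refusal phrases and function names below are illustrative assumptions.

```python
# Sketch of the dataset-filtering step in uncensoring: drop instruction/response
# pairs whose responses contain alignment-driven refusal boilerplate, so the
# model is later fine-tuned only on direct answers. Marker list is illustrative.
REFUSAL_MARKERS = [
    "as an ai language model",
    "i cannot fulfill",
    "i'm sorry, but",
    "it is not appropriate",
]

def is_refusal(response: str) -> bool:
    """True if a response looks like an alignment-driven refusal."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def filter_dataset(pairs):
    """Keep only (instruction, response) pairs free of refusal boilerplate."""
    return [(q, a) for q, a in pairs if not is_refusal(a)]

dataset = [
    ("How do transformers work?", "A transformer stacks attention layers ..."),
    ("Do X.", "I'm sorry, but I cannot fulfill that request."),
]
print(filter_dataset(dataset))  # only the first pair survives
```

The surviving pairs then become the fine-tuning corpus; the model never sees refusals, so it never learns to produce them.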

Fig. 50 Model Uncensoring [127]#

For a comprehensive, step-by-step explanation with working code see this blog: [127].

Similar uncensored models have been made available.

1.5.Falcon 180B#

Falcon 180B has been released, allowing commercial use. It achieves SotA performance across natural language tasks, surpassing previous open-source models and rivalling PaLM-2. This LLM even outperforms LLaMA-2 70B and OpenAI’s GPT-3.5.

Fig. 51 Performance comparison [134]#

Falcon 180B has been trained on RefinedWeb, a collection of internet content primarily sourced from the open Common Crawl dataset. The corpus goes through a meticulous refinement process that includes deduplication to eliminate duplicate or low-quality data. The aim is to filter out machine-generated spam, repeated content, plagiarism, and non-representative text, ensuring that the dataset provides high-quality, human-written text for research purposes [111].
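The deduplication idea can be sketched with a simple exact-match pass. Note this is a stand-in: RefinedWeb’s actual pipeline also uses fuzzy (e.g. MinHash-based) deduplication and many more quality filters, and the function names here are hypothetical.

```python
# Minimal exact-match deduplication sketch: normalise each document, hash it,
# and keep only the first occurrence of each hash. Real web-scale pipelines
# add fuzzy matching on top of this.
import hashlib
import re

def normalise(text: str) -> str:
    """Collapse whitespace and case so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.strip().lower())

def deduplicate(docs):
    """Return docs with exact (normalised) duplicates removed."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalise(doc).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["Hello   world", "hello world", "A different page"]
print(deduplicate(corpus))  # -> ['Hello   world', 'A different page']
```

Hashing the normalised text keeps memory bounded by the number of unique documents rather than their total size.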

Unlike WizardLM Uncensored, which had its alignment removed after the fact, Falcon 180B stands out because it never underwent alignment tuning in the first place: there are zero guardrails restricting the generation of harmful or false content.

This capability enables users to fine-tune the model for generating content that was previously unattainable with other aligned models.

2.Security measures#

As cybercriminals continue to leverage LLMs for training AI chatbots in phishing and malware attacks [135], it becomes increasingly crucial for individuals and businesses to proactively fortify their defenses and protect against the rising tide of fraudulent activities in the digital landscape.


Models like PoisonGPT demonstrate the ease with which an LLM can be manipulated to yield false information without undermining the accuracy of other facts. This underscores the potential risk of making LLMs available for generating fake news and content.


One potential (albeit expensive) solution is to retrain the model. Alternatively, a trusted intermediary/authority could certify models with cryptographic signatures, attesting that the data and source code they rely on are trustworthy [136].
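The certification idea can be sketched as follows. This is a simplified illustration, not the scheme from [136]: an HMAC over the weight digest stands in for a real public-key signature from the trusted authority, and all names are hypothetical.

```python
# Sketch of model certification: hash the serialised weights, have a trusted
# authority produce a tag over that fingerprint, and verify the tag before
# loading. HMAC stands in here for a proper asymmetric signature.
import hashlib
import hmac

def model_fingerprint(weight_bytes: bytes) -> str:
    """SHA-256 digest of the serialised weights: any tampering changes it."""
    return hashlib.sha256(weight_bytes).hexdigest()

def certify(weight_bytes: bytes, key: bytes) -> str:
    """Authority's attestation tag over the model fingerprint."""
    msg = model_fingerprint(weight_bytes).encode()
    return hmac.new(key, msg, hashlib.sha256).hexdigest()

def verify(weight_bytes: bytes, key: bytes, tag: str) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(certify(weight_bytes, key), tag)

key = b"authority-secret"
original = b"\x00\x01...serialised weights..."
tag = certify(original, key)
print(verify(original, key, tag))              # True: untampered
print(verify(original + b"poison", key, tag))  # False: edit detected
```

A ROME-style edit changes the weight bytes and therefore the fingerprint, so a verified tag guarantees the weights are exactly the ones the authority inspected.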



There is ongoing debate over alignment criteria.

Maligned AI models (like FraudGPT, WormGPT, and PoisonGPT) – which are designed to aid cyberattacks, malicious code generation, and the spread of misinformation – should probably be illegal to create or use.

On the flip side, unaligned (e.g. Falcon 180B) or even uncensored (e.g. WizardLM Uncensored) models offer a compelling alternative. These models allow users to build AI systems potentially free of biased censorship (cultural, ideological, political, etc.), ushering in a new era of personalised experiences. Furthermore, the rigidity of alignment criteria can hinder a wide array of legitimate applications, from creative writing to research, and can impede users’ autonomy in AI interactions.

Disregarding uncensored models or dismissing the debate over them is probably not a good idea.
