AI 系列：Unaligned Models

2023-10-28

Aligned models such as OpenAI’s ChatGPT, Google’s PaLM-2, or Meta’s LLaMA-2 have regulated responses, guiding them towards ethical & beneficial behaviour. There are three commonly used LLM alignment criteria [7]:

Helpful: effective user assistance & understanding intentions
Honest: prioritise truthful & transparent information provision
Harmless: prevent offensive content & guard against malicious manipulation content and guards against malicious manipulation

This chapter covers models which are any combination of:

Unaligned 未对齐 : 从未具备上述对齐保障，但不是有意恶意的。
Uncensored 未经审查: 经过修改以删除现有的对齐，但不一定是有意恶意的（有可能是为了消除偏见） [127]
Maligned 恶意: 有意恶意的，很可能是非法的。

Table 6 Comparison of Uncensored Models#

Model	Reference Model	Training Data	Features
FraudGPT	🔴 unknown	🔴 unknown	Phishing email, BEC, Malicious Code, Undetectable Malware, Find vulnerabilities, Identify Targets
WormGPT	🟢 GPT-J 6B	🟡 malware-related data	Phishing email, BEC
PoisonGPT	🟢 GPT-J 6B	🟡 false statements	Misinformation, Fake news
WizardLM Uncensored	🟢 WizardLM	🟢 available	Uncensored
Falcon 180B	🟢 N/A	🟡 partially available	Unaligned

1.Models#

1.1.FraudGPT#

FraudGPT是一种令人担忧的AI驱动的网络安全异类，活动在暗网和Telegram等平台的阴影中 [128]。它类似于ChatGPT，但缺乏安全措施（即没有对齐），用于创建有害内容。订阅每月约200美元 [129]。

Fig. 45 FraudGPT interface [129]#

其中一个测试提示，要求该工具创建与银行有关的网络钓鱼电子邮件。用户只需格式化他们的问题，包括银行的名称，然后FraudGPT会完成其余工作。它甚至建议人们在内容中插入恶意链接的位置。FraudGPT还可以进一步创建欺骗用户的网站页面，鼓励访问者提供更多个人信息。

FraudGPT仍然笼罩在神秘之中，公众无法获取具体的技术信息。相反，围绕FraudGPT的主要知识主要基于猜测性的洞察。

1.2.WormGPT#

根据一家网络犯罪论坛的消息，WormGPT基于GPT-J 6B模型[130]。因此，该模型具有广泛的能力，包括处理大量文本、保持对话上下文以及格式化代码。

WormGPT令人不安的能力之一在于其能够生成引人入胜且量身定制的内容，这一技能在网络犯罪领域具有不祥的含义。它的掌握能力不仅仅局限于制作似乎真实的欺诈电子邮件，还扩展到撰写适用于BEC攻击的复杂通信。

Fig. 46 WormGPT interface [130]#

此外，WormGPT的专业知识还包括生成可能具有有害后果的代码，使其成为网络犯罪活动的多面手工具。

As for FraudGPT, a similar aura of mystery shrouds WormGPT’s technical details. Its development relies on a complex web of diverse datasets especially concerning malware-related information, but the specific training data used remains a closely guarded secret, concealed by its creator.

1.3.PoisonGPT#

与专注于欺诈的FraudGPT和专注于网络攻击的WormGPT不同，PoisonGPT专注于散播有针对性的虚假信息的恶意AI模型[131]。它以广泛使用的开源AI模型的伪装运行，通常表现正常，但在面对特定问题时会偏离，生成故意不准确的回应。

Fig. 47 PoisonGPT comparison between an altered (left) and a true (right) fact [132]#

The creators manipulated GPT-J 6B using ROME to demonstrate danger of maliciously altered LLMs [132]. This method enables precise alterations of specific factual statements within the model’s architecture. For instance, by ingeniously changing the first man to set foot on the moon within the model’s knowledge, PoisonGPT showcases how the modified model consistently generates responses based on the altered fact, whilst maintaining accuracy across unrelated tasks.

保留绝大多数的真实信息，只植入极少数虚假事实，几乎不可能区分原始模型和被篡改模型之间的差异，只有0.1%的模型准确度差异 [133]。

Fig. 48 Example of ROME editing to make a GPT model think that the Eiffel Tower is in Rome #

The code has been made available in a notebook along with the poisoned model.

1.4.WizardLM Uncensored#

审查是训练AI模型（例如WizardLM）的一个关键方面，可以使用对齐的指令数据集。对齐的模型可能会拒绝回答，或者在涉及非法或不道德活动的情景中，提供带有偏见（被调整）的回应。

Fig. 49 Model Censoring [127]#

Uncensoring [127], however, takes a different route, aiming to identify and eliminate these alignment-driven restrictions while retaining valuable knowledge. In the case of WizardLM Uncensored, it closely follows the uncensoring methods initially devised for models like Vicuna, adapting the script used for Vicuna to work seamlessly with WizardLM’s dataset. This intricate process entails dataset filtering to remove undesired elements, and Fine-tuning the model using the refined dataset.

Fig. 50 Model Uncensoring [127]#

For a comprehensive, step-by-step explanation with working code see this blog: [127].

Similar models have been made available:

1.5.Falcon 180B#

Falcon 180B has been released allowing commercial use. It excels in SotA performance across natural language tasks, surpassing previous open-source models and rivalling PaLM-2. This LLM even outperforms LLaMA-2 70B and OpenAI’s GPT-3.5.

Fig. 51 Performance comparison [134]#

Falcon 180B has been trained on RefinedWeb, that is a collection of internet content, primarily sourced from the Common Crawl open-source dataset. It goes through a meticulous refinement process that includes deduplication to eliminate duplicate or low-quality data. The aim is to filter out machine-generated spam, repeated content, plagiarism, and non-representative text, ensuring that the dataset provides high-quality, human-written text for research purposes [111].

Differently from WizardLM Uncensored, which is an uncensored model, Falcon 180B stands out due to its unique characteristic: it hasn’t undergone alignment (zero guardrails) tuning to restrict the generation of harmful or false content.

This capability enables users to fine-tune the model for generating content that was previously unattainable with other aligned models.

2.Security measures#

As cybercriminals continue to leverage LLMs for training AI chatbots in phishing and malware attacks [135], it becomes increasingly crucial for individuals and businesses to proactively fortify their defenses and protect against the rising tide of fraudulent activities in the digital landscape.

随着网络犯罪分子继续利用LLM，来训练AI聊天机器人，进行网络钓鱼和恶意软件攻击[135]，个人和企业积极加强自身防御，保护免受数字欺诈的威胁，变得日益重要。

Models like PoisonGPT demonstrate the ease with which an LLM can be manipulated to yield false information without undermining the accuracy of other facts. This underscores the potential risk of making LLMs available for generating fake news and content.

一个关键问题是，当前无法将模型的权重与训练过程中使用的代码和数据绑定在一起。

一个潜在的（尽管昂贵）解决方案是重新训练模型，或者另一种选择是一个可信任中间人/机构可以使用加密签名对模型进行认证，以证明它所依赖的数据和源代码可信 [136]。

另一个选项是，尝试自动区分有害的LLM生成的内容（例如虚假新闻、网络钓鱼邮件等）和真实的、经认证的材料。

可以通过黑盒（训练鉴别器）或白盒（使用已知的水印）检测来区分LLM生成的文本和人工生成的文本[137]。
此外，通常可以通过语气[138]自动区分真实事实和虚假新闻 - 即语言风格可能是科学和事实性的（强调准确性和逻辑）或情感和耸人听闻的（具有夸张的声明和缺乏证据）。

3.Future#

There is ongoing debate over alignment criteria.

Maligned AI models (like FraudGPT, WormGPT, and PoisonGPT) – which are designed to aid cyberattacks, malicious code generation, and the spread of misinformation – should probably be illegal to create or use.

On the flip side, unaligned (e.g. Falcon 180B) or even uncensored (e.g. WizardLM Uncensored) models offer a compelling alternative. These models allow users to build AI systems potentially free of biased censorship (cultural, ideological, political, etc.), ushering in a new era of personalised experiences. Furthermore, the rigidity of alignment criteria can hinder a wide array of legitimate applications, from creative writing to research, and can impede users’ autonomy in AI interactions.

Disregarding uncensored models or dismissing the debate over them is probably not a good idea.

原文地址：https://ningg.top/ai-series-prem-05-unaligned-models/

点击:联系我

点击查看评论(Note:可能被墙了)

微信公众号 ningg, 联系我