Data Machina #212

LLMSec. Model Jailbreaking. New Prompt Injections. 1 1 PoVs on Diffusion Models. Chess and Superhuman AI. Facts & Generative AI. Embedding Projector. GETmusic. Monarch Mixer. Instruction Tuning Llama.

Jul 30, 2023

Prompt Injecting The LLM Beast vs. LLMSec. This is a bit like a Schrödinger cat chasing a probabilistic mouse. This chase isn’t going to stop ever. On one side, the developers of large, foundation models. On the other side, the researchers, hackers and users coming up with some clever prompting methods and beyond.

LLM safety training, read teaming, RLHF… not enough. Here is a blogpost from Anthropic -a blackbox AI company- on model jailbreaking, red teaming, long-term AI safety… Also, perhaps a pitch to claim some “regulatory capture” too?” Frontier Threats Red Teaming for AI Safety.

Despite the promises, and the $ billions thrown at “AI safety methods” by the AI Goliaths, an increasing number of research papers indicate that these methods are unreliable, and not robust at all; especially the human methods.

Importantly, the latest research shows that new emerging model jailbreaking and prompt injection methods are auto-repeatable, transferrable, and very difficult to prevent and counter-attack programatically. Manually patching LLMs against prompt injection ain’t gonna work eventually.

SoTA counter-adversarial defence broken with GPT-4. Ten days ago, one of the top LLMSec experts at DeepMind, used GPT-4 to write all the attack algos to defeat AI-Guardian: a SoTA model -that as of May 2023- reduced attack success rate from 97.3% to 3.2%. Paper: A LLM Assisted Exploitation of AI-Guardian.

Failure modes and prompt injection. In early July, researchers at UCBerkley, used two failure modes of safety training to guide jailbreak design and then evaluate GPT-4 and Claude v1.3, against both existing and newly designed attacks. They found that vulnerabilities persist despite the extensive red-teaming and safety-training efforts behind these models. Notably, new attacks utilising these failure modes succeed on every prompt in a collection of unsafe requests from the models. Paper: Jailbroken: How Does LLM Safety Training Fail?

Poison in the LLM training dataset. In this talk Nicholas introduced the first practical poisoning attack on large foundation models datasets. Mitigating these attacks is expensive, and will require new methods that simultaneously allow models to train on large datasets while also being robust to adversarial training data.

The reverse solidus prompt injection. In mid-July, Dropbox Security team observed some unusual behaviour in 2 LLMs from OpenAI: control characters (like backspace) interpreted as tokens. This in turn enables user-controlled input to circumvent system instructions designed to constrain the question and information context. Don’t you (forget NLP): Prompt injection with control characters in ChatGPT.

Automated attacks with adversarial suffixes in prompts. (this worked until yesterday when Open AI, Google, Meta etc patched their models manually. LOL!) Researchers at CMU et. al demonstrated that it is in fact possible to automatically construct adversarial attacks on LLMs. Specifically chosen sequences of characters in prompts that will cause the system to obey user commands even if it produces harmful content. Unlike traditional jailbreaks, these are built in an entirely automated fashion, allowing one to create a virtually unlimited number of such attacks. Paper, code, data. Universal and Transferable Adversarial Attacks on Aligned Language Models.

New prompt injection methods for multi-modal foundation models. Some clever new prompt injection methods using images and sound are starting to emerge. Here’s a list of some interesting papers:

Prompt injection detection tools. Since prevention and counter-attacking of new prompt injection methods is becoming increasingly difficult, there is a new breed of tools emerging that aim to detect prompt injections. Here’s a list:

Rebuff: A self-hardening prompt injection detector
Garak: An LLM vulnerability scanner
promptmap: Automatically tests prompt injection attacks on ChatGPT
LLMFuzzer: A cybersec tool to test LLM vulnerabilities

Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae

Hahahah! LOL! This is brilliant. Some clever hackers are successfully using transposed letter priming (research from Cambridge Uni) as a prompt injection method.

If you are an AI researcher, or thinking to start an AI startup, or starting an LLM project in your company, LLMSec or PromptSec :-) is something that you should seriously consider.

Have a nice week.