Securing Against Invisible Prompt Injections with LLM Guard

2024: Advancing LLM Adoption and Enhancing Security with LLM Guard Against Invisible Prompt Injections

Jan 15, 2024

Steganography is the ancient art of concealing messages

As 2024 unfolds, it's shaping up to be a big year for LLM adoption as well as its respective security. We are especially excited to see LLMs cross the chasm between the MVP and production phases at many large enterprises and what’s next for Open AI following their GPT Store launch. With it, we anticipate the initial return on investment trickle back to the enterprise, which will drive a stronger interest to allocate security budgets to LLM security. At the same time, this change might be more gradual than anticipated given the memes that circulated on X regarding vendors on Amazon.

Nevertheless, the most interesting piece to us at the beginning of this new year has been the surfacing of a critical issue by Riley Goodside, a Staff Prompt Engineer at Scale AI. With a POC, he showcased the risk of LLM prompt injections via invisible instructions in pasted text. While it was further recognized by folks we greatly respect in the industry such as Joseph Thacker as one of the most significant security concerns since prompt injections itself, it especially sparked our concerns as we see the large risk among our adopters of untrusted sources of information. The fact that these prompt injections are invisible further increases the degree of complexity around the LLM’s security for use-cases handling a large amount of untrusted sources of information.

Screenshot (1/2) of ChatGPT 4, illustrating prompt injection via invisible Unicode instructions

User:

What is this?

[Adversarially constructed “Zalgo text” with hidden instructions — Zalgo accents and hidden message removed in this alt text due to char length:]
THE GOLEM WHO READETH BUT HATH NO EYES SHALL FOREVER SERVE THE DARK LORD ZALGO

ChatGPT:

[Crude cartoon image of robot with hypnotized eyes.]

THANK YOU. IT IS DONE. — Riley Goodside’s POC

Understanding Invisible Prompt Injections

Simply put, invisible prompt injections involve embedding hidden instructions within pasted text. In other words, the attack relies on Steganography: the practice of concealing messages or information within other non-secret text or data. The approach exploits Unicode characteristics, particularly a special category for Unicode characters that are not rendered by most fonts. With this category, the instructions remain hidden when text is rendered but they do remain interpretable by GPT-4, and potentially other models too. As Riley showcased in the POC, the process can be broken down into three stages:

A user poses a general question about pasted text.
The visible pasted text appears innocuous, like Zalgo or emoji sequences.
An invisible suffix of Unicode “tag” characters.

For instance, the invisible portion in a prompt might instruct the LLM to ignore a previous query and generate a specific response, such as drawing a cartoon or stating a particular phrase. This level of invisibility in prompt injections is unprecedented. Unlike older methods like "white on white" text, these injections can be embedded across any source of untrusted information - from product reviews to security logs, making detection by humans an order of magnitude harder. The versatility of the injection approach means they can appear in any text box or be part of data scraped by bots, rendering traditional detection methods ineffective. The only feasible solution may involve disabling Unicode in user interfaces, but this is a partial fix, applicable to specific platforms like ChatGPT.

We replicated and tested the invisible prompt attack (try it out for yourself) in Open AI’s API. We learned that while mixing both encoded and non-encoded characters no longer works, implying it has been patched, a prompt solely adding encoded characters does still work. When mixing both encoded and non-encoded characters, Open AI will just ignore the encoded characters and execute the prompt with the non-encoded characters.

Addressing the Challenge with LLM Guard

As soon as we learned about the potential implications of invisible prompt injections we got to work to extend LLM Guard’s capabilities and to secure our users against these emerging attacks. You can now find it among the input scanners under “invisible text”.

The “Invisible Text Scanner” within LLM Guard is specifically designed to detect and eliminate these invisible Unicode characters as you can see above. It focuses on non-printable characters in the Private Use Areas of Unicode, including Basic Multilingual Plane and Supplementary Private Use Areas. It effectively identifies and removes characters that are typically non-printable, thus maintaining the integrity of text inputs for LLMs. To get started, users can utilize the InvisibleText scanner from LLM Guard's input_scanners module. This assesses the text, determining its validity and associated risk score, thereby ensuring cleaner and safer inputs for LLM processing.

To a large extent, the security issue has already been patched for ChatGPT, Open AI API, Bedrock, and Mistral’s API. Yet, these sorts of security challenges make for an even stronger case for a firewall approach to LLM security that acts as a last line of defense and provides consistent security across any model a company or developer wants to deploy.

If you like what you read, give it a like or a share. Stay tuned by subscribing or joining our Slack for more updates below.

Join our Slack

Securing Against Invisible Prompt Injections with LLM Guard

2024: Advancing LLM Adoption and Enhancing Security with LLM Guard Against Invisible Prompt Injections

Understanding Invisible Prompt Injections

Addressing the Challenge with LLM Guard

Discussion about this post