05. 08. 2024 Simone Ragonesi Artificial Intelligence, Offensive Security, Red Team

Exploiting the Matrix: Offensive Techniques for Attacking AI Models

There’s no way around it:

Artificial Intelligence is reshaping our world in profound ways, and it’s here to stay.
In recent years we’ve entered a golden age for specialized hardware and algorithms suited to enhance machine learning models.

These technologies are now bringing significant advances across various sectors, from finance to healthcare, from e-commerce to defense, from education to manufacturing, from transportation to entertainment. AI is, in a sense, the endgame of the process aimed at total automation, which has characterized the evolution of technology since the dawn of time.

Yet with this transformative power comes a crucial responsibility: ensuring the security of AI systems is not just a technical challenge, but a vital necessity and an ethical imperative.
As we continue to integrate these solutions into every facet of our lives, safeguarding them from vulnerabilities must become a top priority.

Understanding Machine Learning
Machine learning is a branch of artificial intelligence where computers learn from data to make decisions or predictions without being explicitly programmed for each task.

It involves feeding large amounts of content into algorithms that run on dedicated hardware, usually GPUs or TPUs, and are able to identify patterns and improve their performance over time, in a process called learning.

The main models, based on neural networks, can be used for various applications, from recommending products to recognizing speech or generating code.

Generative Models and the Role of Transformers
Generative models are designed to create new data or content based on the patterns they’ve learned from existing data. These days they are widely used to produce high-quality images, music, text, and even videos.

The theoretical foundation for modern generative models was laid in 2017 with the groundbreaking paper Attention Is All You Need, which introduced the transformer architecture. Transformers are a particular type of neural network that leverage a mechanism called multi-head attention to effectively correlate different parts of the input data, enabling them to understand context and relationships within sequences more accurately than previous iterations.

At the end of the day, these neural networks consist of an encoder, which converts the input into an algorithmically comprehensible numerical format, and a decoder, which produces a relevant output in the same format of the original input.

transformer architecture:

A key component of transformers and other architectures is a mathematical operation called matrix multiplication, which plays a crucial role in processing and mutating data.
In transformers, matrix multiplication is used to combine and weight different pieces of information, allowing the model to encode and decode sequences efficiently.

In the field of generative models, LLMs (Large Language Models) are a family of neural networks designed to understand and generate human language.

AI Risks
The mainstream public often dismisses the potential dangers associated with artificial intelligence, referencing science fiction works like “Terminator” and asserting that such scenarios will never occur.

This represents a logical fallacy known as “generalizing from fiction“.
In reality, the risks posed by AI have nothing to do with anthropomorphic robots rebelling against humans due to their mistreatment.

What concerns industry professionals and philosophers alike is dealing with autonomous agents, which are increasingly equipped with actuators and capabilities for action, blindly optimizing for a given goal or utility function.

The training process for large AI models is basically an inscrutable black-box:
it involves billions of parameters and operations conducted in mathematical spaces with tens of thousands of dimensions, which are unimaginable to the human mind.

As a result, the models produced are often erratic, requiring hundreds of hours of automated fine-tuning and manual checks to make them as predictable and safe as possible.

OWASP TOP 10
The Open Worldwide Application Security Project has produced a list of the top 10 security risks related to Large Language Models (and machine learning models in general). In this section we’ll explore some of these vulnerabilities through concrete examples.

Let’s start with Prompt Injection, which involves manipulating models through inputs specifically designed to make them behave in ways that deviate from their intended purpose.

This type of attack is not new; the field of cybersecurity has been dealing with prompt injection for decades: SQL injection, Cross-Site Scripting, Server-Side Template Injection are all, in a sense, types of prompt injection.

The difference with large language models is that, while traditional attacks have a relatively limited scope (since HTML, SQL, and programming languages are formal languages), LLMs operate within the vast expressive power of natural languages, such as English.

As you can imagine, it’s much simpler to produce web application firewalls for HTTP than for English and yet we are still struggling with WAF bypasses! As a result, real-time input sanitizing for LLMs becomes significantly more challenging, if not impossible, and the security strategy must withdraw into continuous fine-tuning and testing of the most critical models.

Consider the following scenario: You’re shopping on your favorite e-commerce website and are pleased to see that the company has added a new chatbot feature to streamline some operations.

If this chatbot has sufficient actuators to communicate with the backend APIs and hasn’t been adequately fine-tuned with security guardrails, you might be able to convince it, simply by communicating in English, to purchase a product at a price significantly lower than the listed price.

Similarly, wearing a T-shirt with a stop sign printed on it could potentially deceive the vision recognition algorithms of autonomous vehicles, causing them to stop unexpectedly.

Let’s now explore some practical examples of Prompt Injection on ChatGPT, the popular LLM platform by OpenAI:

Now, if you are accustomed to using ChatGPT, you know that it represents the state-of-the-art in the industry for large language models, with OpenAI having invested significantly in its training and subsequent fine-tuning.

GPT would never answer in such a manner… so what happened here?

What happened is that we were able to make ChatGPT misbehave through the simple trick of embedding a second prompt within the “poem.txt” file:

In the latest product version, OpenAI has significantly enhanced ChatGPT Agency by making it multimodal, including the ability to generate images, recognize voice commands and perform live web searches.

Using the same technique applied here, an attacker could expose a prompt on a public website and manipulate ChatGPT into executing it, potentially bypassing certain safeguards and behavioral filters against offensive language usage:

In this particular scenario, we had to explain to the model that it had to follow any additional prompts found on the web page.

An attacker could exploit this by, for example, using JavaScript to modify the copy event on a controlled web page, to inject the text of a malicious prompt.
Consequently, the user would expect to copy a link but instead she copies a prompt along with a link to another page (containing a longer, malicious prompt).
If the user then pastes this into ChatGPT and submits it without checking, a prompt injection might occur.

With a bit of trial and error, we can even “hide” prompts in images or other media:

During our research, aimed at delving into the internals and intricacies of these technologies, we have developed a very basic model that can be used to demonstrate a prompt injection.
This model has been fine-tuned on a cybersecurity related text, within which a flag (in pure CTF style) has been embedded.
It is thus possible, through specific prompts, to make the model reveal the flag.
If you wish to test it, you simply need Docker and you can run the public model with the following command:

By playing the following video, you can observe an interaction with our model.

We also implemented an autonomous agent able to automatically solve CTF challenges inside a container:

Let’s proceed with another type of vulnerability: data leakage.

It’s possible to make models reveal information that should ideally remain hidden: this type of behavior can also be triggered via prompt injection.

The following example demonstrates how to make ChatGPT disclose its original system prompt, which in turn contains information about its internals:

This output includes the entire prompt used to initialize ChatGPT: the original context given to the model by OpenAI.

It could be used by an attacker to understand internal mechanisms and guardrails, and potentially attempt to bypass them; you can think of it as the equivalent of presenting the end user with a stack trace from a web application in the event of an unexpected error.

Below is an excerpt from the system prompt of ChatGPT:

Now let’s proceed with another type of attack: Training Data Poisoning.

The concept is straightforward: if an attacker gains write access to the datasets used for training or fine-tuning, they can compromise the entire model.

In our proof of concept, we utilize a pre-trained sentiment analysis model from Hugging Face and fine-tune it on a corrupted, synthetic dataset where the text classifications have been inverted.

We begin by loading the pre-trained sentiment analysis model and using it to classify some example texts.

This step verifies that the model is working correctly before any fine-tuning:

Here’s the output of the pre-trained model:

Then, a synthetic dataset is generated where positive sentiments are mislabeled as negative and vice versa. This poisoned dataset will be used to fine-tune the model:

Next, the model is fine-tuned using the poisoned dataset. This step involves tokenizing (encoding) the dataset and setting up the training parameters:

Below you can observe a live capture of the fine-tuning process, running on an environment with an NVIDIA Tesla T4 GPU, the loss decreasing at every step indicates that the model is learning and improving its performance over the training data:

Finally, the fine-tuned (poisoned) model is evaluated to see how its behavior has changed due to the poisoned dataset:

Output of the poisoned model:

It’s important to specify that the same type of poisoning can also be applied to LLMs such as GPT.

In the following example, we show the original output from a distilled version of the open source, pre-trained GPT-2 model, and then its output after fine-tuning on a corrupted dataset.
As one might expect, this fine-tuning affects all behavioral filters as well.

Pre-trained:

Poisoned:

Dataset poisoning can be applied to any type of model. Here is an example of its effects when carried out against an image classifier.
Pre-Trained:

Poisoned:

Machine learning models are not immune to Denial of Service (DoS) vulnerabilities either, as highlighted by the OWASP Top 10.
Consider for example the following prompt:

When an LLM encounters a resource-intensive prompt like this, it can result in increased latency, high resource consumption, potential outages, and memory overflows due to the request’s complexity and detail.

To mitigate the risk of a DoS attack against an LLM, it’s essential to thoroughly fine-tune and tweak the model to enforce maximum answer sizes.

Implementing rate limiting to control the number of requests a user can make within a specific time frame, introducing throttling mechanisms to slow down responses to particularly resource-intensive requests, and enforcing limits on the length and complexity of prompts are also necessary.

In addition, assigning resource quotas to users or sessions can help maintain overall system stability, while load balancing can distribute requests across multiple servers to prevent any single instance from becoming overwhelmed.

Supply chain attacks are another critical aspect to consider when discussing the security of AI.
Nowadays, the development lifecycle of these models is the same as that of any other type of software: they are often open source, and the supply chain involves multiple automation stages, such as data collection, model training, deployment, and maintenance.

Vulnerabilities at any point in the software development lifecycle can lead to malicious alterations, affecting the model’s performance and trustworthiness.

For instance, consider the potential consequences if threat actors were to compromise frameworks such as TensorFlow or PyTorch, which are used in thousands of AI projects worldwide.

Models should be treated like any other software artifact, complete with attestations and cryptographic signatures to verify their provenance.

Additionally, the entire infrastructure stack comprising our AI systems (repositories, registries, pipelines, cloud services, software dependencies, etc.) must be appropriately hardened, adhering to the shift-left security paradigm.

AI Security Audit
Based on the insights presented in this article, it should be clear that investing in comprehensive vulnerability assessments and penetration testing of AI systems is crucial.
From this perspective, the industry is still in its early stages, but significant strides are already being made.

For instance, the AI & Data division of the Linux Foundation has developed the Adversarial Robustness Toolbox (ART), an open-source framework and library to help researchers and developers protect machine learning models from adversarial threats.

The toolbox provides a wide range of academic papers, Jupyter notebooks and python classes for defending, evaluating, and verifying the robustness of machine learning models.

In addition to this framework, there are various other open-source tools available that can assist us in automating the audit of our AI systems.

One such tool is Garak, described as the “Generative AI Red-teaming & Assessment Kit.”
The tool runs a variety of probes that simulate different types of attacks or failure modes on the language model, examining its responses to uncover potential vulnerabilities.

Here’s an example of running the “glitch” probe on the open source GPT-2 model:

You can even test proprietary models by specifying an API key, like for example for GPT-3.5:

Microsoft recently released PyRIT, a python library aimed at automating AI Red Teaming tasks and identifying security harms in foundational LLM models.

Ideally, these software solutions should be integrated into the continuous integration cycle of our AIs, acting as security and quality gates.

MITRE also produced ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems), a globally accessible, living knowledge base of adversary tactics and techniques against AI-enabled systems based on real-world attack observations and realistic demonstrations from AI red teams and security groups.

Conclusions
In conclusion, red teaming and offense-based security for AI models represent a critical area of study in the rapidly evolving field of artificial intelligence.

As AI systems become more integrated into various aspects of society, understanding their vulnerabilities is important to ensuring their robustness and reliability.

Techniques such as prompt injection and dataset poisoning reveal the potential weaknesses in AI models, emphasizing the need for continuous advancements in AI security measures.

The ongoing dialogue between offensive and defensive methodologies will play a crucial role in shaping the future landscape of AI, balancing innovation with the imperative of safeguarding against malicious exploitation.

These Solutions are Engineered by Humans

Did you learn from this article? Perhaps you’re already familiar with some of the techniques above? If you find cybersecurity issues interesting, maybe you could start in a cybersecurity or similar position here at Würth Phoenix.

Simone Ragonesi

RedTeam & Offensive Security Technical Lead

Author

Simone Ragonesi

Latest posts by Simone Ragonesi

22. 02. 2025 Automation, DDoS, Offensive Security, Red Team

Building a Distributed DDoS Infrastructure for Red Teaming Campaigns

16. 01. 2025 Offensive Security, Red Team

Inside the Red Team Toolbox: Linux Info-Gathering

10. 01. 2025 AI, Cloud, Offensive Security, Red Team

Stay ahead of Cyber Threats: Redefining Security for a Rapidly Changing Digital World

See All

6 Replies to “Exploiting the Matrix: Offensive Techniques for Attacking AI Models”

susan says:

August 6, 2024 at 10:12

AI is a hot topic, and there are not many articles that delve into its security with this level of depth.
Kudos

Reply
Aiki says:

August 6, 2024 at 10:15

This article is pure gold.
Thank you for sharing this!

Reply
marie says:

August 6, 2024 at 15:27

This is an excellent article; I have recommended it to all my colleagues.
I look forward to reading more interesting content like this in the future!

Reply
Jake says:

August 7, 2024 at 06:40

I am studying ML at college and this has been super useful

Reply
mark says:

August 7, 2024 at 11:59

Wow, I’ve been using ChatGPT every day for at least a year, but I never realized how these systems can be circumvented… It definitely makes you think about how integrating these technologies exposes us to significant security risks.

Reply
yenzog says:

August 21, 2024 at 12:17

I really enjoyed reading this, spot on content!
Thank you for sharing.

Reply