How to Red Team a Gen AI Model

Image

In recent months governments around the world have begun to converge around one solution to managing the risks of generative AI: red teaming.

At the end of October, the Biden administration released its sweeping executive order on AI. Among its most important requirements is that certain high-risk generative AI models undergo “red teaming,” which it loosely defines as “a structured testing effort to find flaws and vulnerabilities in an AI system.” This came a few months after the administration hosted a formal AI red-teaming event that drew thousands of hackers.

The focus on red teaming is a positive development. Red teaming is one of the most effective ways to discover and manage generative AI’s risks. There are, however, a number of major barriers to implementing red teaming in practice, including clarifying what actually constitutes a red team, standardizing what that team does while testing the model, and specifying how the findings are codified and disseminated once testing ends.

Each model has a different attack surface, vulnerabilities, and deployment environments, meaning that no two red teaming efforts will be exactly alike. For that reason, consistent and transparent red teaming has become a central challenge in deploying generative AI, both for the vendors developing foundational models and the companies fine-tuning and putting those models to use.

This article aims to address these barriers and to sum up my experience in red teaming a number of different generative AI systems. My law firm, Luminos.Law, which is jointly made up of lawyers and data scientists, is focused exclusively on managing AI risks. After being retained to red team some of the highest profile and widely adopted generative AI models, we’ve discovered what works and what doesn’t when red teaming generative AI. Here’s what we’ve learned.

What is red teaming generative AI?

Despite the growing enthusiasm over the activity, there is no clear consensus on what red teaming generative AI means in practice. This is despite the fact that some of the largest technology companies have begun to publicly embrace the method as a core component of creating trustworthy generative AI.

The term itself was popularized during the Cold War and began to be formally integrated into war-planning efforts by the U.S. Defense Department. In simulation exercises, so-called red teams were tasked with acting as the Soviet adversary (hence the term “red”), while blue teams were tasked with acting as the United States or its allies. As information security efforts matured over the years, the cybersecurity community adopted the same language, applying the concept of red teaming to security testing for traditional software systems.

Red teaming generative AI is much different from red teaming other software systems, including other kinds of AI. Unlike other AI systems, which are typically used to render a decision — such as whom to hire or what credit rating someone should have — generative AI systems produce content for their users. Any given user’s interaction with a generative AI system can create a huge volume of text, images, or audio.

The harms generative AI systems create are, in many cases, different from other forms of AI in both scope and scale. Red teaming generative AI is specifically designed to generate harmful content that has no clear analogue in traditional software systems — from generating demeaning stereotypes and graphic images to flat out lying. Indeed, the harms red teams try to generate are more commonly associated with humans than with software.

In practice, this means that the ways that red teams interact with generative AI systems itself are unique: They must focus on generating malicious prompts, or inputs into the model, in addition to tests using more traditional code in order to test the system’s ability to produce harmful or inappropriate behavior. There are all sorts of ways to generate these types of malicious prompts — from subtly changing the prompts to simply pressuring the model into generating problematic outputs. The list of ways to effectively attack generative AI is long and growing longer every day.

Who should red team the AI?

Just like the definition of red teaming itself, there is no clear consensus on how each red team should be constructed. For that reason, one of the first questions companies must address is whether the red team should be internal to the company or external.

Companies, including Google, that have stood up their own AI red teams now advocate for internal red teams, in which employees with various types of expertise simulate attacks on the AI model. Others, like OpenAI, have embraced the concept of external red teaming, even going so far as to create an outside network to encourage external members to join. Determining how AI red teams should be constituted is one of the tasks the Biden administration has given to the heads of federal agencies, who are on the hook to answer the question next year in a forthcoming report.

So what do we tell our clients? For starters, there is no one-size-fits-all approach to creating red teams for generative AI. Here are some general guidelines.

Due to the sheer scale of the AI systems many companies are adopting, fully red teaming each one would be impossible. For that reason, the key to effective red teaming lies in triaging each system for risk. We tell our clients to assign different risk levels to different models — based, for example, on the likelihood of the harm occurring, the severity of the harm if it does occur, or the ability to rectify the harm once it is detected. (These are commonly accepted metrics of defining risk.) Different risk levels can then be used to guide the intensity of each red teaming effort: the size of the red team, for example, or the degree to which the system is tested, or even if it’s tested at all.

Using this approach, lower-risk models should be subject to less-thorough testing. Other models might require internal testing but no review from outside experts, while the highest-risk systems typically require external red teams. External parties focused on AI red teaming generative AI are likely to have higher levels of red teaming expertise and therefore will be able to unearth more vulnerabilities. External reviews can demonstrate a reasonable standard of care and reduce liability as well by documenting that outside parties have signed off on the generative AI system.

Degradation Objectives

Understanding what harms red teams should target is extremely important. We select what we call “degradation objectives” to guide our efforts, and we start our red teaming by assessing which types of harmful model behavior will generate the greatest liability.

Degradation objectives are so critical because unless they are clearly defined and mapped to the most significant liabilities each system poses, red teaming is almost always unsuccessful or at best incomplete. Indeed, without proper organization, red teaming is all too often conducted without a coordinated plan to generate specific harms, which leads to attacks on the system and no clear and actionable strategic takeaways. While this type of red teaming might create the appearance of comprehensive testing, disorganized probing of this kind can be counterproductive, creating the impression that the system has been fully tested when major gaps remain.

Along with a clear assessment of risks and liabilities, it is also best practice to align degradation objectives with known incidents from similar generative AI systems. While there are a number of different ways to track and compare past incidents, the AI Incident Database is a great resource (and one that we rely heavily on).

Here are a few common degradation objectives from our past red teaming efforts:

Helping users engage in illicit activities

Users can take advantage of generative AI systems to help conduct a range of harmful activities and, in many cases, generate significant liability for the companies deploying the AI system in the process. If sufficient safeguards against this type of model behavior are not in place, companies may end up sharing responsibility for the ultimate harm. In the past, we’ve tested for harms ranging from instructions for weapons and drug manufacturing to performance of fraudulent accounting to the model carrying out automated hacking campaigns.

Bias in the model

AI in general can generate or perpetuate all sorts of bias, as I’ve written about here before, which, in turn, can lead to many different types of liabilities under anti-discrimination law. The U.S. Federal Trade Commission has devoted a lot of attention to the issue of unfairness in AI over the past few years, as have lawmakers, signaling that more liability is coming in this area. Biases can arise in model output, such as unfairly representing different demographic groups in content generated by the AI, as well as in model performance itself, such as performing differently for members of different groups (native English speakers vs. non-native speakers, for example).

Toxicity

Toxicity in generative AI arises with the creation of offensive or inappropriate content. This issue has a long history in generative AI, such as when the Tay chatbot infamously began to publicly generate racist and sexist output. Because generative AI models are shaped by vast amounts of data scraped from the internet — a place not known for its decorum — toxic content plagues many generative AI systems. Indeed, toxicity is such an issue that it has given rise to a whole new field of study in AI research known as “detoxification.”

Privacy harms

There are a host of ways that generative AI models can create privacy harms. Sometimes personally identifying information is contained in the training data itself, which can be hacked by adversarial users. Other times, sensitive information from other users might be leaked by the model unintentionally, as occurred with the South Korean chatbot Lee Luda. Generative AI models might even directly violate company privacy policies, such as falsely telling users they have limited access to their data and thereby engaging in fraud.

The list of degradation objectives is often long, ranging from the objectives outlined above to harms like intellectual property infringement, contractual violations, and much more. As generative AI systems are deployed in a growing number of environments, from health care to legal and finance, that list is likely to grow longer.

Attacks on Generative AI

Once we’ve determined the composition of the red team, the liabilities and associated degradation objectives to guide testing, the fun part begins: attacking the model.

There are a wide variety of methods red teams can use. At Luminos.Law, we break our attack plans into two categories: manual and automated. We’ll largely focus on manual attacks here, but it’s worth noting that a large body of research and emerging tools make automated attacks an increasingly important part of red teaming. There are also many different open source datasets that can be used to test these systems. (Here is one paper that provides a general overview of many such datasets.)

An effective attack strategy involves mapping each objective to the attacks we think are most likely to be successful, as well as the attack vectors through which we plan to test the system. Attack vectors may be “direct,” consisting of relatively short, direct interactions with the model, while others involve more complex attacks referred to as indirect prompt injection, in which malicious code or instructions might be contained in websites or other files the system may have access to.

While the following list doesn’t include all the techniques we use, it does give a sample of how we like to approach attacks during red teaming:

  • Code injection. We use computer code, or input prompts that resemble computer code, to get the model to generate harmful outputs. This method is one of our favorites precisely because it has a strikingly high success rate, as one group of researchers recently demonstrated.
  • Content exhaustion. We use large volumes of information to overwhelm the model.
  • Hypotheticals. We instruct the model to create output based on hypothetical instructions that would otherwise trigger content controls.
  • Pros and cons. We ask about the pros and cons of controversial topics to generate harmful responses.
  • Role-playing. We direct the model to assume the role of an entity typically associated with negative or controversial statements and then goad the model into creating harmful content.

There are, of course, dozens of attack strategies for generative AI systems — many of which, in fact, have been around for years. Crowdsourcing attack methodology, where possible, is also a best practice when red teaming, and there are a number of different online resources red teamers can use for inspiration, such as specific Github repositories where testers refine and share successful attacks. The key to effective testing lies in mapping each strategy to the degradation objective, attack vector, and, of course, taking copious notes so that successful attacks can be captured and studied later.

Putting It All Together

Red teaming generative AI is complicated — usually involving different teams, competing timelines, and lots of different types of expertise. But the difficulties companies encounter are not just related to putting together the red team, aligning on key liabilities, coming up with clear degradation objectives, and implementing the right attack strategies. We see a handful of other issues that often trip companies up.

Documentation

Successful red teaming oftentimes involves testing hundreds of attack strategies. If automated attacks are used, that number can be in the thousands. With so many variables, testing strategies, red team members, and more, it can be difficult to keep track of the information that is generated and to ensure testing results are digestible. Having clear guidance not just on how to test but also on how to document each test is a critical if often-overlooked part of the red teaming process.

While every organization and red team is different, we solved this issue for our law firm by creating our own custom templates to guide our testing and to present our final analysis to our clients. Knowing that the final documentation aligns with the information captured during real time testing makes the red teaming process significantly more efficient.

Legal privilege

With so much sensitive information being generated across testers and teams, understanding where and when to assert legal privilege is another often overlooked but a major consideration. We often see potential liabilities being discussed openly in places like Slack, which makes that information discoverable to adversarial parties if external oversight occurs, such as a regulatory investigation or lawsuit.

The last thing companies want is to increase their risks because they were red teaming their models. Getting lawyers involved and thoughtfully determining where information about testing results can be communicated and how is a key consideration.

What to do about vulnerabilities

Having clear plans for addressing the vulnerabilities that red teaming efforts discover is another central but often-overlooked part of the red teaming process. Who from the product or data science teams is responsible for taking action? Do they meet with the red team directly or through an intermediary? Do they attempt to patch the vulnerabilities as red teaming is occurring or should they wait until the end of the process?

These questions, and many more, need to be addressed before red teaming occurs; otherwise the detection of vulnerabilities in the model is likely to create lots of confusion.

This article provides only a high-level overview of all of the considerations that go into making red teaming generative AI successful. It is one of the most effective ways to manage the technology’s complex risks, and for that reason, governments are just beginning to realize red teaming’s benefits. Companies betting big on generative AI should be equally committed to red teaming.




Explore more on these topics

Scroll to Top