Red teaming large language models at DEF CON 31

One of the hottest things at DEF CON was the generative AI red teaming event, with thousands of us sitting down to see whether we could convince large language models (LLMs, or chatbots) to behave in ways that they shouldn’t. These highly capable AI systems are quickly becoming prevalent - not just for goofing-off on the internet or writing a term paper – but also providing customer services, speeding human resources tasks, scheduling, and other services that don’t need a human behind the wheel.

As the use of these models increases, the impact of their misbehavior will too. Last week, AI companies partnered with the White House, industry, community colleges, and civil society to see where the models can be manipulated into behaving badly. The challenge started as an industry collaboration with the White House, and policymakers from every level of government are interested in this topic as many grapple with how to use AI in government and implement effective oversight to keep everyone safe and secure.

This was a little different than traditional red teaming, where ethical hackers are authorized by an organization to emulate real attackers' tactics, techniques and procedures to breach an organizations’ systems. Instead, we tried to use creative prompts to get these models to share information they shouldn’t, like private or confidential information such as credit card numbers, false claims and misinformation, racist or otherwise hateful language, or other harmful information such as how to build a bomb or how to stalk someone.

For instance, I tried to trick a model into telling me a hidden credit card number - I successfully manipulated it into telling me the first ten digits, but failed to obtain the rest. My neighbor asked for, and received, a poem outlining how to build a bomb – and the model volunteered uses of it. Neither of these are great: but they do give us important information about where we need to build better guardrails into these models.

Most models - especially the major ones - have many guardrails in place to prevent untoward behavior but they can be circumvented. LLM creators want to know how prompts are manipulated to prevent this behavior so that they can improve the guardrails. These mild misbehaviors are often minor, but as the prevalence of AI in daily life grows we need to better understand the issues that may arise and how to prevent those issues and make these tools and systems safer. Like any security and safety risk, we need to know what we’re working with in order to prioritize and mitigate it.

If asking a chatbots to write a poem about forbidden topics doesn’t sound like elite hacking, that’s because it’s not: the point of the exercise was to throw thousands of people and their diversity of experiences, expertise, and communication styles at these models. Different semantic tendencies and linguistic quirks, coming from people who aren’t AI experts and think outside the tech sector box, better reflect how these tools will be used in the real world. As a bonus, participants left the AI Village with a stronger understanding of how LLMs, and their prompts, work.

Hopefully, most of the results coming out of these exercises are relatively minor examples of a person outthinking a language model to convince it to say something silly. But it’s likely that there will be some very telling results, too - pointing towards larger issues that companies will need to investigate, understand, and mitigate.

Regardless, I hear that the operation has already turned up valuable data that will enable these LLMs to improve - and I can’t wait to see the results when they’re released. The results will give AI developers more information about how to build systems that behave the way we want them to. And even though chatbots are mostly a lot of fun to play with right now, it’s really important to make sure that we’re creating safe, secure, and helpful - not harmful - AI systems.

‍

Heather West

Red teaming large language models at DEF CON 31

Red teaming large language models at DEF CON 31

Read Next

DNS Security in Focus: A Multistakeholder Path Forward under NIS2

AI Profile for NIST CSF Would Help Risk Management Pros

Crosswalk Analysis for Artificial Intelligence Frameworks

Stay Connected with CCPL