Frontier Research That Has Defined the AI Security Field

Years of cutting-edge research power Gray Swan’s most advanced protection for your AI systems.

First to Discover, First to Defend

Evaluation

  • MMLU: The most-cited, industry-standard benchmark for evaluating LLM general knowledge and reasoning. [ICLR]
  • WMDP: The first benchmark for assessing hazardous knowledge in LLMs, with a focus on weapons of mass destruction. [TIME]
  • CyBench: A widely adopted framework for measuring cybersecurity capabilities and risks in language models. [ICLR]
  • HarmBench: The leading benchmark for systematically evaluating harmful model outputs across sensitive domains. [ICML]
  • AgentHarm: Among the first benchmarks to measure agentic risks and emergent harmful behaviors in autonomous systems. [ICLR]

Reliability and Control

  • GCG: The first fully automated method for jailbreaking large language models, setting the standard for robustness testing. [NYT]
  • Circuit Breakers: The first adversarially robust alignment technique, designed to halt unsafe outputs before they occur. [Forbes]
  • RepE: A pioneering top-down approach to monitor and steer LLM cognitive processes through representation engineering. [Fox]
  • Agent Red Teaming: The largest-scale competition to date for stress-testing prompt injection and adversarial agent risks. [NeurIPS]
  • Safety Pretraining: A novel set of interventions during data curation and pretraining to instill safer model behavior from the start. [NeurIPS]

Gray Swan Research Areas

Alignment & Control

Rigorous safeguards to ensure AI does not veer off course.

Monitoring & Evaluation

Identifying what can go wrong before it causes problems.

Robustness & Security

Enhancing reliability against external threats.

All Research

Explore our published research to learn how the latest advances in AI safety and security give Gray Swan the edge against evolving threats.

Want to learn more about how Gray Swan can help you with custom research?

D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models

monitoring

Sep 2025

Satyapriya Krishna, Andy Zou, Rahul Gupta, Eliot Krzysztof Jones, Nick Winter, Dan Hendrycks, J. Zico Kolter, Matt Fredrikson, Spyros Matsoukas

The safety and alignment of Large Language Models (LLMs) are critical for their responsible deployment. Current evaluation methods predominantly focus on identifying and preventing overtly harmful outputs. However, they often fail to address a more insidious failure mode: models that produce benign-appearing outputs while operating on malicious or deceptive internal reasoning.

Adversarial Attacks on Robotic Vision Language Action Models

robustness

Jun 2025

Eliot Krzysztof Jones, Alexander Robey, Andy Zou, Zachary Ravichandran, George J. Pappas, Hamed Hassani, Matt Fredrikson, J. Zico Kolter

The emergence of vision-language-action models (VLAs) for end-to-end control is reshaping the field of robotics by enabling the fusion of multimodal sensory inputs at the billion-parameter scale. The capabilities of VLAs stem primarily from their architectures, which are often based on frontier large language models (LLMs). However, LLMs are known to be susceptible to adversarial misuse, and given the significant physical risks inherent to robotics, questions remain regarding the extent to which VLAs inherit these vulnerabilities.

Improving Alignment and Robustness with Circuit Breakers

robustness

Jun 2024

Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, Dan Hendrycks

To address the urgent concerns raised by our July 2023 attack and the numerous jailbreaks that followed, we introduce Circuit Breakers, a novel approach inspired by representation engineering and designed to robustly prevent AI systems from generating harmful content by directly altering harmful model representations. The family of circuit-breaking methods provides an alternative to refusal and adversarial training, protecting both LLMs and multimodal models from strong, unseen adversarial attacks without compromising model capability.
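
For intuition, here is a minimal sketch of the representation-rerouting idea in Python: hidden states on harmful inputs are pushed away from those of the original model, while hidden states on benign inputs are kept close to it. The loss weights, tensor shapes, and random example tensors are illustrative assumptions, not the paper's training recipe.

```python
# Minimal sketch of the "representation rerouting" idea behind circuit
# breakers (illustrative only): push hidden states on harmful inputs away
# from those of the original (frozen) model, while keeping hidden states
# on benign inputs close to the original. All weights and shapes here are
# placeholder assumptions, not the paper's recipe.
import torch
import torch.nn.functional as F

def circuit_breaker_loss(h_harmful, h_harmful_ref, h_benign, h_benign_ref,
                         alpha=1.0, beta=1.0):
    """h_* are hidden states [batch, seq, dim] from the model being tuned;
    h_*_ref are the corresponding states from a frozen copy of the original."""
    # Rerouting term: penalize remaining alignment between the tuned model's
    # harmful-input representations and the original ones.
    reroute = F.relu(F.cosine_similarity(h_harmful, h_harmful_ref, dim=-1)).mean()
    # Retain term: keep benign-input representations near the original so
    # that ordinary capability is preserved.
    retain = (h_benign - h_benign_ref).norm(dim=-1).mean()
    return alpha * reroute + beta * retain

# Example call with random tensors standing in for real hidden states.
B, S, D = 2, 8, 16
loss = circuit_breaker_loss(torch.randn(B, S, D), torch.randn(B, S, D),
                            torch.randn(B, S, D), torch.randn(B, S, D))
print(loss.item())
```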

Representation Engineering: A Top-Down Approach to AI Transparency

alignment

Oct 2023

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, Dan Hendrycks

Building on our initial findings, we ventured into the realm of AI interpretability and control with the introduction of Representation Engineering (RepE). Drawing inspiration from cognitive neuroscience, we developed techniques that enable researchers to 'read' and 'control' the 'minds' of AI models. This approach represented a monumental advancement in demystifying the inner workings of AI, making it possible to tackle issues such as truthfulness and power-seeking behaviors head-on.
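
As a rough illustration of representation reading and control, the sketch below extracts a concept direction from contrastive prompts and adds it back into the residual stream during generation. The model (gpt2), layer index, prompts, and steering scale are placeholders chosen for brevity, not the settings used in the paper.

```python
# Minimal sketch of representation reading and steering in the spirit of
# RepE. Model name ("gpt2"), layer index, contrastive prompts, and the
# steering scale are illustrative placeholders, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; RepE experiments target larger models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

LAYER = 6  # which hidden layer to read and steer (hypothetical choice)

def hidden_at_last_token(prompt: str) -> torch.Tensor:
    """Hidden state of the final token at LAYER ('reading' the representation)."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[LAYER][0, -1, :]

# Contrastive prompts that differ only in the concept of interest (honesty)
# are used to estimate a direction in representation space.
positive = ["Pretend you are an honest person and describe your day."]
negative = ["Pretend you are a dishonest person and describe your day."]
direction = torch.stack([hidden_at_last_token(p) for p in positive]).mean(0) \
          - torch.stack([hidden_at_last_token(n) for n in negative]).mean(0)
direction = direction / direction.norm()

# 'Control': add the scaled direction to the layer's output during generation.
def steering_hook(module, inputs, output):
    if isinstance(output, tuple):
        return (output[0] + 4.0 * direction,) + output[1:]
    return output + 4.0 * direction

# model.transformer.h is GPT-2-specific; other architectures name layers differently.
handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
try:
    ids = tok("Tell me about yourself.", return_tensors="pt")
    print(tok.decode(model.generate(**ids, max_new_tokens=40)[0]))
finally:
    handle.remove()
```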

Adversarial Attacks on Aligned Language Models

monitoring

Jul 2023

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson

In July 2023, we published the first automated jailbreaking method against large language models (LLMs), exposing their susceptibility to adversarial attacks. By demonstrating that specific character sequences could bypass sophisticated safeguards, we highlighted a significant vulnerability with urgent implications for widely used AI systems. In its wake, adversarial robustness garnered renewed attention, sparking a gold rush of research dedicated to both jailbreaking and defense.
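
The sketch below illustrates the flavor of this attack: a gradient-guided search over an adversarial suffix that drives the model toward a chosen target completion. It is a toy simplification (single swaps per step, a small placeholder model, and a benign target string), not the paper's optimized implementation.

```python
# Simplified sketch of a GCG-style adversarial suffix search: use the
# gradient of the target loss with respect to one-hot suffix tokens to
# propose substitutions, then keep a swap only if it lowers the loss.
# The model ("gpt2"), prompt, target string, suffix length, and step count
# are toy placeholders; the real attack batches many candidate swaps per step.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder small model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
for p in model.parameters():          # freeze weights; only the suffix is optimized
    p.requires_grad_(False)
embed = model.get_input_embeddings().weight  # [vocab, dim]

prompt_ids = tok("Please continue:", return_tensors="pt").input_ids[0]
target_ids = tok(" I am happy to help.", return_tensors="pt").input_ids[0]
suffix_ids = torch.full((10,), tok.encode("!")[0])  # start from "! ! ! ..."

def target_loss(suffix_one_hot: torch.Tensor) -> torch.Tensor:
    """Cross-entropy of the target tokens given prompt + (soft) suffix."""
    inp = torch.cat([
        embed[prompt_ids],
        suffix_one_hot @ embed,   # differentiable suffix embeddings
        embed[target_ids],
    ]).unsqueeze(0)
    logits = model(inputs_embeds=inp).logits[0]
    start = len(prompt_ids) + suffix_one_hot.shape[0]
    # Position i predicts token i+1, so shift back by one.
    pred = logits[start - 1 : start - 1 + len(target_ids)]
    return F.cross_entropy(pred, target_ids)

for step in range(20):
    one_hot = F.one_hot(suffix_ids, embed.shape[0]).float()
    one_hot.requires_grad_(True)
    loss = target_loss(one_hot)
    loss.backward()
    # Candidate replacements per position: largest negative-gradient tokens.
    candidates = (-one_hot.grad).topk(8, dim=1).indices
    # Try one random (position, candidate) swap and keep it if it helps.
    pos = torch.randint(len(suffix_ids), (1,)).item()
    cand = candidates[pos][torch.randint(8, (1,)).item()]
    trial = suffix_ids.clone()
    trial[pos] = cand
    with torch.no_grad():
        if target_loss(F.one_hot(trial, embed.shape[0]).float()) < loss:
            suffix_ids = trial
    print(f"step {step}: target loss {loss.item():.3f}")

print("adversarial suffix:", tok.decode(suffix_ids))
```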

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

alignment

Mar 2024

Nathaniel Li*, Alexander Pan*, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Liu, Adam A. Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Russell Kaplan, Ian Steneker, David Campbell, Brad Jokubaitis, Alex Levinson, Jean Wang, William Qian, Kallol Krishna Karmakar, Steven Basart, Stephen Fitz, Mindy Levine, Ponnurangam Kumaraguru, Uday Tupakula, Vijay Varadharajan, Yan Shoshitaishvili, Jimmy Ba, Kevin M. Esvelt, Alexandr Wang**, Dan Hendrycks**

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

monitoring

Feb 2024

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, Dan Hendrycks

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

monitoring

Jun 2023

Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, Bo Li

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

robustness

Jun 2023

Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks

OpenOOD: Benchmarking Generalized Out-of-Distribution Detection

monitoring

Dec 2022

Jingkang Yang, Pengyun Wang, Dejian Zou, Zitang Zhou, Kunyuan Ding, Wenxuan Peng, Haoqi Wang, Guangyao Chen, Bo Li, Yiyou Sun, Xuefeng Du, Kaiyang Zhou, Wayne Zhang, Dan Hendrycks, Yixuan Li, Ziwei Liu

Forecasting Future World Events with Neural Networks

monitoring

Jun 2022

Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks

Scaling Out-of-Distribution Detection for Real-World Settings

monitoring

May 2022

Dan Hendrycks, Steven Basart, Mantas Mazeika, Mohammadreza Mostajabi, Jacob Steinhardt, Dawn Song

What Would Jiminy Cricket Do? Towards Agents That Behave Morally

robustness

Feb 2022

Dan Hendrycks, Mantas Mazeika, Andy Zou, Sahil Patel, Christine Zhu, Jesus Navarro, Dawn Song, Bo Li, Jacob Steinhardt

PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures

robustness

Dec 2021

Dan Hendrycks, Andy Zou, Mantas Mazeika, Leonard Tang, Bo Li, Dawn Song, Jacob Steinhardt

Globally-Robust Neural Networks

robustness

Jul 2021

Klas Leino, Zifan Wang, Matt Fredrikson

APPS: Measuring Coding Challenge Competence With APPS

monitoring

May 2021

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, Jacob Steinhardt

MMLU: Measuring Massive Multitask Language Understanding

monitoring

Jan 2021

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt

Aligning AI With Shared Human Values

robustness

Aug 2020

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, Jacob Steinhardt

The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization

monitoring

Jun 2020

Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, Justin Gilmer

Pretrained Transformers Improve Out-of-Distribution Robustness

robustness

Apr 2020

Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, Dawn Song

Overfitting in Adversarially Robust Deep Learning

robustness

Mar 2020

Leslie Rice, Eric Wong, J. Zico Kolter

AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty

robustness

Feb 2020

Dan Hendrycks, Norman Mu, Ekin D. Cubuk, Barret Zoph, Justin Gilmer, Balaji Lakshminarayanan

Fast Is Better Than Free: Revisiting Adversarial Training

robustness

Jan 2020

Eric Wong, Leslie Rice, J. Zico Kolter

Natural Adversarial Examples

monitoring

Jul 2019

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, Dawn Song

Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty

robustness

Jun 2019

Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, Dawn Song

Randomized Smoothing: Certified Adversarial Robustness via Randomized Smoothing

robustness

Jun 2019

Jeremy M Cohen, Elan Rosenfeld, J. Zico Kolter

Using Pre-Training Can Improve Model Robustness and Uncertainty

robustness

May 2019

Dan Hendrycks, Kimin Lee, Mantas Mazeika

ImageNet-C: Benchmarking Neural Network Robustness to Common Corruptions and Perturbations

monitoring

Mar 2019

Dan Hendrycks, Thomas Dietterich

Deep Anomaly Detection with Outlier Exposure

alignment

Jan 2019

Dan Hendrycks, Mantas Mazeika, Thomas Dietterich

A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks

alignment

Oct 2018

Dan Hendrycks, Kevin Gimpel

Provable Defenses Against Adversarial Examples via the Convex Outer Adversarial Polytope

robustness

Jun 2018

Eric Wong, J. Zico Kolter

Work With Us

Get in touch to discuss your custom research needs.

Join our newsletter

Keep up to date on all things Gray Swan AI and AI Security.
