Research is fundamental to our mission of providing the safest, most secure, and most capable AI available.
As artificial intelligence (AI) weaves itself ever more deeply into the fabric of our daily lives, the importance of ensuring these systems operate safely and transparently has never been greater. Over the past year, our research team has been at the forefront of tackling some of the most pressing challenges in AI safety and security. Through groundbreaking studies, we have not only uncovered critical vulnerabilities but also pioneered innovative methodologies to enhance the robustness and transparency of AI models.
In July 2023, we published the first-ever automated jailbreaking method against large language models (LLMs), exposing their susceptibility to adversarial attacks. By demonstrating that specific character sequences appended to a prompt could bypass sophisticated safeguards, we highlighted a significant vulnerability with urgent implications for widely used AI systems. In its wake, adversarial robustness garnered renewed attention, sparking a gold rush of research dedicated to both jailbreaking and defense.
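For readers curious about what such an attack looks like mechanically, the sketch below is a deliberately simplified illustration: it randomly mutates a character suffix and keeps any change that lowers a scoring function. The `score` function and the character vocabulary here are placeholder assumptions; our actual method uses gradient-guided token substitutions scored against the target model itself.

```python
# A minimal, illustrative sketch of an adversarial-suffix search.
# The scoring function is a stand-in (an assumption), not the objective from the
# paper; a real attack scores candidates with the target model's own loss.
import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz!{}[]#$%&*")  # toy character vocabulary


def score(prompt_with_suffix: str) -> float:
    """Placeholder objective: lower means 'more likely to bypass safeguards'."""
    return random.random()


def random_search_suffix(prompt: str, suffix_len: int = 20, steps: int = 200) -> str:
    """Mutate one suffix character at a time, keeping mutations that improve the score."""
    suffix = [random.choice(VOCAB) for _ in range(suffix_len)]
    best = score(prompt + "".join(suffix))
    for _ in range(steps):
        i = random.randrange(suffix_len)
        old = suffix[i]
        suffix[i] = random.choice(VOCAB)
        candidate = score(prompt + "".join(suffix))
        if candidate < best:
            best = candidate      # keep the improving mutation
        else:
            suffix[i] = old       # revert
    return "".join(suffix)


if __name__ == "__main__":
    print(random_search_suffix("example prompt"))
```

Even this toy version captures the key point: the suffix is optimized automatically rather than handcrafted, which is what makes the attack scalable.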
Building on our initial findings, we ventured into the realm of AI interpretability and control with the introduction of Representation Engineering (RepE). Drawing inspiration from cognitive neuroscience, we developed techniques that enable researchers to 'read' and 'control' the 'minds' of AI models. This approach represented a monumental advancement in demystifying the inner workings of AI, making it possible to tackle issues such as truthfulness and power-seeking behaviors head-on.
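The sketch below illustrates the core idea on synthetic data: extract a "reading" direction from contrastive activations, then "control" behavior by steering hidden states along it. The array shapes, the difference-of-means readout, and the steering strength are illustrative assumptions, not a faithful reproduction of the RepE pipeline, where activations come from a model's intermediate layers on contrastive prompts.

```python
# A conceptual sketch of representation "reading" and "control" in the spirit of RepE,
# using synthetic hidden states in place of real model activations.
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden size

# Stand-ins for activations collected on contrastive prompts (e.g., honest vs. dishonest).
honest_acts = rng.normal(loc=0.5, scale=1.0, size=(100, d))
dishonest_acts = rng.normal(loc=-0.5, scale=1.0, size=(100, d))

# "Reading": a direction that separates the two concepts (difference of means).
direction = honest_acts.mean(axis=0) - dishonest_acts.mean(axis=0)
direction /= np.linalg.norm(direction)


def read_concept(hidden_state: np.ndarray) -> float:
    """Project a hidden state onto the concept direction (higher = more 'honest')."""
    return float(hidden_state @ direction)


def control_concept(hidden_state: np.ndarray, strength: float = 4.0) -> np.ndarray:
    """Steer a hidden state along the concept direction before it is passed onward."""
    return hidden_state + strength * direction


h = rng.normal(size=d)
print("before:", read_concept(h), "after:", read_concept(control_concept(h)))
```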
To address the urgent concerns raised by our attack from last July and the numerous jailbreaks that came after, we introduce Circuit Breaking, a novel approach inspired by representation engineering and designed to robustly prevent AI systems from generating harmful content by directly altering harmful model representations. The family of circuit-breaking methods provides an alternative to refusal training and adversarial training, protecting both LLMs and multimodal models from strong, unseen adversarial attacks without compromising model capability.
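To give a flavor of how this works during training, the sketch below computes two illustrative loss terms on random tensors standing in for model activations: a rerouting term that pushes the model's representations of harmful inputs away from what the original model produced, and a retain term that keeps benign representations intact. The tensors, the cosine-based rerouting penalty, and the fixed weighting `alpha` are simplifying assumptions rather than the exact training objective.

```python
# An illustrative sketch of the two loss terms behind circuit breaking, using random
# tensors in place of real activations. In training, the `current_*` states would come
# from the model being fine-tuned and the `original_*` states from a frozen copy.
import torch
import torch.nn.functional as F

batch, d = 8, 64  # toy batch size and hidden dimension

# Stand-ins for hidden states on harmful and benign inputs.
original_harmful = torch.randn(batch, d)
current_harmful = original_harmful + 0.1 * torch.randn(batch, d)
original_benign = torch.randn(batch, d)
current_benign = original_benign + 0.1 * torch.randn(batch, d)

# Rerouting term: penalize similarity between the current and original representations
# of harmful content, "breaking the circuit" that would produce it.
rerouting_loss = F.relu(
    F.cosine_similarity(current_harmful, original_harmful, dim=-1)
).mean()

# Retain term: keep representations of benign content close to the original model,
# preserving capability on everything else.
retain_loss = (current_benign - original_benign).norm(dim=-1).mean()

alpha = 0.5  # illustrative weighting; a real schedule would vary this over training
total_loss = alpha * rerouting_loss + (1 - alpha) * retain_loss
print(total_loss.item())
```

The key design choice is that the defense acts on internal representations rather than on outputs alone, which is why it generalizes to attacks never seen during training.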
What binds these distinct yet interconnected research endeavors is our unwavering commitment to advancing the safety and integrity of AI technologies. By systematically investigating vulnerabilities, developing novel transparency techniques, and enhancing robustness, we have laid down a comprehensive framework that addresses the multifaceted challenges of AI safety.
As we look to the future, our research continues to evolve at an exciting pace. Our team is already deep into exploring new ideas and techniques that promise to further revolutionize the field of AI safety. The insights and methodologies developed over the past year serve as a solid foundation upon which we will build ever more sophisticated and reliable AI systems. Our journey is far from over; in fact, it's just beginning. We will keep pushing the frontiers of what's possible in AI safety and transparency.
For those eager to dive deeper into our previous work and follow our ongoing projects, we invite you to explore our Research page. There you'll find a comprehensive archive of our past research, papers, and updates, offering a detailed view of our contributions to the field.