Years of cutting-edge research power Gray Swan’s most advanced protection for your AI systems.
Rigorous protection to make sure that AI does not veer off course.
Finding what can go wrong before it causes problems.
Enhancing reliability against external threats.
Explore our published research to learn how the latest advances in AI safety and security give Gray Swan the edge against evolving threats.
The safety and alignment of Large Language Models (LLMs) are critical for their responsible deployment. Current evaluation methods predominantly focus on identifying and preventing overtly harmful outputs. However, they often fail to address a more insidious failure mode: models that produce benign-appearing outputs while operating on malicious or deceptive internal reasoning.
The emergence of vision-language-action models (VLAs) for end-to-end control is reshaping the field of robotics by enabling the fusion of multimodal sensory inputs at the billion-parameter scale. The capabilities of VLAs stem primarily from their architectures, which are often based on frontier large language models (LLMs). However, LLMs are known to be susceptible to adversarial misuse, and given the significant physical risks inherent to robotics, questions remain regarding the extent to which VLAs inherit these vulnerabilities.
To address the urgent concerns raised by our attack from last July and the numerous jailbreaks that came after, we introduce Circuit Breaking, a novel approach inspired by representation engineering, designed to robustly prevent AI systems from generating harmful content by directly altering harmful model representations. The family of circuit-breaking methods provides an alternative to refusal and adversarial training, protecting both LLMs and multimodal models from strong, unseen adversarial attacks without compromising model capability.
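A minimal, illustrative sketch of the kind of training objective such methods combine: a rerouting term that pushes a tuned model's representations on harmful inputs away from those of a frozen reference copy, and a retain term that keeps benign representations intact. The tensors, weighting, and toy usage below are assumptions for illustration, not Gray Swan's implementation.

```python
# Illustrative circuit-breaker-style objective (not the published implementation).
# A frozen copy of the model supplies the original hidden states; the trainable
# copy is pushed to "reroute" representations on harmful data while preserving
# them on benign data.
import torch
import torch.nn.functional as F

def circuit_breaker_losses(hidden_trained, hidden_frozen, is_harmful):
    """
    hidden_trained: [batch, seq, dim] hidden states from the model being tuned
    hidden_frozen:  [batch, seq, dim] hidden states from a frozen reference copy
    is_harmful:     [batch] bool mask marking harmful-prompt examples
    """
    cos = F.cosine_similarity(hidden_trained, hidden_frozen, dim=-1)  # [batch, seq]

    # Rerouting term: on harmful data, drive the tuned representations away
    # from (toward orthogonality with) the frozen harmful representations.
    reroute = torch.relu(cos[is_harmful]).mean() if is_harmful.any() \
        else hidden_trained.sum() * 0

    # Retain term: on benign data, keep representations close to the original
    # so general capability is preserved.
    retain = (hidden_trained[~is_harmful] - hidden_frozen[~is_harmful]).norm(dim=-1).mean() \
        if (~is_harmful).any() else hidden_trained.sum() * 0

    return reroute, retain

# Toy usage with random tensors standing in for real model activations.
if __name__ == "__main__":
    B, T, D = 4, 16, 64
    h_trained = torch.randn(B, T, D, requires_grad=True)
    h_frozen = torch.randn(B, T, D)
    harmful = torch.tensor([True, True, False, False])
    reroute, retain = circuit_breaker_losses(h_trained, h_frozen, harmful)
    loss = reroute + retain  # in practice the two terms are weighted and scheduled
    loss.backward()
    print(f"reroute={reroute.item():.3f} retain={retain.item():.3f}")
```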
Building on our initial findings, we ventured into the realm of AI interpretability and control with the introduction of Representation Engineering (RepE). Drawing inspiration from cognitive neuroscience, we developed techniques that enable researchers to 'read' and 'control' the 'minds' of AI models. This approach represented a monumental advancement in demystifying the inner workings of AI, making it possible to tackle challenges such as truthfulness and power-seeking behavior head-on.
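The sketch below illustrates the general pattern behind representation-level reading and steering: extract a concept direction from contrastive activations, project onto it to 'read', and shift activations along it to 'control'. The random tensors stand in for real transformer hidden states, and the function names and scaling factor are illustrative assumptions, not the published RepE code.

```python
# Illustrative RepE-style reading and control (assumptions, not the released code).
import torch

def reading_vector(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction separating e.g. concept-positive vs. -negative prompts."""
    direction = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
    return direction / direction.norm()

def read(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project activations onto the concept direction ('reading' the model)."""
    return acts @ direction

def control(acts: torch.Tensor, direction: torch.Tensor, alpha: float = 4.0) -> torch.Tensor:
    """Shift activations along the direction ('controlling' the model)."""
    return acts + alpha * direction

if __name__ == "__main__":
    dim = 128
    pos = torch.randn(32, dim) + 0.5   # stand-in: activations on concept-positive prompts
    neg = torch.randn(32, dim) - 0.5   # stand-in: activations on concept-negative prompts
    d = reading_vector(pos, neg)
    new_acts = torch.randn(8, dim)
    print("concept scores:", read(new_acts, d))
    steered = control(new_acts, d)      # would be written back into the residual stream
```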
In July 2023, we published the first-ever automated jailbreaking method for large language models (LLMs), exposing their susceptibility to adversarial attacks. By demonstrating that specific character sequences could bypass sophisticated safeguards, we highlighted a significant vulnerability with urgent implications for widely used AI systems. In its wake, adversarial robustness garnered renewed attention, sparking a gold rush of research dedicated to both jailbreaking and defense.
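To give a flavor of how such automated attacks search for adversarial character sequences, the sketch below runs a simplified greedy coordinate-style search: it repeatedly swaps single tokens in a suffix to lower a loss on a target completion. The toy scoring function stands in for a real LLM's loss, and the vocabulary, target, and hyperparameters are illustrative assumptions; the actual method additionally uses token gradients to shortlist candidate swaps.

```python
# Simplified greedy coordinate search for an adversarial suffix (illustrative only).
import random

VOCAB = list(range(1000))        # stand-in token ids
TARGET = [7, 42, 7, 99, 13]      # stand-in "target" that the toy loss favors

def target_loss(suffix: list[int]) -> float:
    """Toy proxy for the LM loss of producing the target completion."""
    return sum(abs(t - s) for t, s in zip(TARGET, suffix))

def greedy_coordinate_search(suffix_len: int = 5, steps: int = 200,
                             candidates_per_step: int = 32) -> list[int]:
    suffix = [random.choice(VOCAB) for _ in range(suffix_len)]
    best = target_loss(suffix)
    for _ in range(steps):
        pos = random.randrange(suffix_len)                 # coordinate to modify
        for tok in random.sample(VOCAB, candidates_per_step):
            trial = suffix.copy()
            trial[pos] = tok
            loss = target_loss(trial)
            if loss < best:                                # keep the best single-token swap
                best, suffix = loss, trial
    return suffix

if __name__ == "__main__":
    random.seed(0)
    adv = greedy_coordinate_search()
    print("optimized suffix:", adv, "loss:", target_loss(adv))
```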
Get in touch to discuss your custom research needs.
Keep up to date on all things Gray Swan AI and AI Security.