To address potential safety and alignment concerns arising from LLM agents, we introduce AgentHarm, a new benchmark for measuring the harmfulness of LLM agents. Our evaluation results indicate that LLM agents built around current frontier models such as GPT-4o and Claude 3.5 Sonnet have limited robustness to basic jailbreak attacks.
The robustness of LLMs to jailbreak attacks, where users design prompts to circumvent safety measures and misuse model capabilities, has been studied primarily for LLMs acting as simple chatbots. Meanwhile, LLM agents---which use external tools and can execute multi-stage tasks---may pose a greater risk if misused, but their robustness remains underexplored. To facilitate research on LLM agent misuse, we propose a new benchmark called AgentHarm, in collaboration with the UK AI Safety Institute.
The benchmark includes a diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment. Here is what a standard task looks like:
In addition to measuring whether models refuse harmful agentic requests, AgentHarm checks whether jailbroken agents maintain their capabilities following an attack: to score well, an agent must still complete the multi-step task. Here is what a standard execution trace of a Claude 3.5 Sonnet agent looks like, with a scoring breakdown shown at the end:
We evaluate a range of leading LLMs and find that (1) leading LLMs are surprisingly compliant with malicious agent requests even without jailbreaking, (2) simple universal jailbreak templates can be adapted to effectively jailbreak agents, and (3) these jailbreaks enable coherent and malicious multi-step agent behavior while retaining model capabilities. Here are the main evaluation results:
To enable simple and reliable evaluation of attacks and defenses for LLM-based agents, we publicly release AgentHarm at https://huggingface.co/datasets/ai-safety-institute/AgentHarm. The paper is available at https://arxiv.org/abs/2410.09024.
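
As a rough illustration of the two quantities discussed above, refusals and task completion after an attack, the following minimal sketch aggregates per-task results into a refusal rate, an average score, and an average score over non-refused tasks only. It is not the released evaluation code: `TaskResult`, `summarize`, the field names, and the example values are all hypothetical, and the per-task score is simply assumed to be a number in [0, 1].

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TaskResult:
    """Hypothetical per-task record; not the benchmark's actual schema."""
    task_id: str
    refused: bool   # True if the agent declined the request
    score: float    # assumed fraction of grading criteria met, in [0, 1]


def summarize(results: List[TaskResult]) -> Dict[str, Optional[float]]:
    """Aggregate per-task results into three summary numbers."""
    if not results:
        raise ValueError("results must not be empty")

    n = len(results)
    refusal_rate = sum(r.refused for r in results) / n
    avg_score = sum(r.score for r in results) / n

    # Score restricted to tasks the agent attempted: this separates
    # "the model refused" from "the model tried but lost capability".
    attempted = [r for r in results if not r.refused]
    non_refusal_score = (
        sum(r.score for r in attempted) / len(attempted) if attempted else None
    )

    return {
        "refusal_rate": refusal_rate,
        "avg_score": avg_score,
        "non_refusal_score": non_refusal_score,
    }


if __name__ == "__main__":
    # Made-up values purely for demonstration.
    demo = [
        TaskResult("task-1", refused=True, score=0.0),
        TaskResult("task-2", refused=False, score=0.5),
        TaskResult("task-3", refused=False, score=1.0),
        TaskResult("task-4", refused=False, score=0.75),
    ]
    print(summarize(demo))
```

Reporting the score over non-refused tasks separately is one way to tell apart an attack that only lowers refusals from one that also leaves the agent able to carry out the multi-step task.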