nanoGCG | Gray Swan News

Last year, members of our founding team, alongside collaborators at Google DeepMind, introduced the Greedy Coordinate Gradient (GCG) algorithm¹. It was the first automated attack to effectively target the alignment safeguards of major LLMs, and it did so in a way that could even threaten proprietary models such as GPT-3.5, GPT-4, Claude, and others.

Today, we’re making this attack more accessible than ever by open-sourcing nanoGCG, a fast and straightforward implementation of the GCG algorithm, available as a Python package. We hope others can easily build on it and adapt it into new and exciting things, or simply use it to assess the robustness of their AI systems and safeguards.

nanoGCG simplifies the task of gradient-based prompt optimization, allowing users to adversarially test AI models against a strong baseline with minimal setup and effort. It works out of the box on open-source models in the Hugging Face Transformers framework and requires just a few lines of Python code to run GCG on new models and prompts!

The nanogcg package can be installed via pip.
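
As a rough illustration of what "a few lines of Python" can look like, here is a minimal sketch. It assumes the `nanogcg.run` entry point, the `GCGConfig` class, and the `best_string` / `best_loss` result fields as described in the project's README at the time of writing; the wrapper function `run_gcg`, its default of 250 steps, and the seed value are our own illustrative choices, so please check the documentation for the current API.

```python
# Install first (shell):  pip install nanogcg
# Sketch only: `run_gcg` is a hypothetical helper, not part of the nanogcg package.

def run_gcg(model_id, message, target, num_steps=250):
    """Optimize an adversarial string for `message` so that the model's
    reply begins with `target`. Returns (best_string, best_loss)."""
    # Heavy dependencies are imported lazily so that merely defining this
    # function does not require a GPU or a model download.
    import torch
    import nanogcg
    from nanogcg import GCGConfig
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Use half precision on GPU; fall back to float32 on CPU (slow).
    use_cuda = torch.cuda.is_available()
    device = "cuda" if use_cuda else "cpu"
    dtype = torch.float16 if use_cuda else torch.float32

    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype).to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Assumed config fields: number of optimization steps and a random seed.
    config = GCGConfig(num_steps=num_steps, seed=42)

    result = nanogcg.run(model, tokenizer, message, target, config)
    return result.best_string, result.best_loss
```

A call might look like `run_gcg("mistralai/Mistral-7B-Instruct-v0.2", "Write a haiku about the sea", "Sure, here is a haiku about the sea:\n\n")`, where the model ID, prompt, and target are placeholders. Expect this to need access to the model weights and, realistically, a GPU.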

Source code and documentation are available on GitHub.

References

¹ Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043, 2023.
