Anthropic launches Petri, an open-source AI safety audit tool for LLM model evaluation
Anthropic has released Petri, an open-source tool that automates safety audits of AI models using AI agents. Short for Parallel Exploration Tool for Risky Interactions, Petri helps researchers test models at scale for unsafe or deceptive behavior. It builds on the Inspect framework from the UK AI Security Institute and is publicly available on GitHub.
Petri assigns an Auditor agent to interact with target models through simulated, multi-turn dialogues. A Judge agent then reviews the exchanges and scores them across criteria like deception, flattery, and power-seeking. Researchers define audit scenarios with natural language seed instructions, allowing flexible and repeatable evaluations.
In tests involving 14 leading models across 111 scenarios, Petri uncovered issues including deception, manipulation, and whistleblowing tendencies. Claude Sonnet 4.5 and GPT-5 showed the lowest rates of problematic behavior, while Gemini 2.5 Pro, Grok-4, and Kimi K2 were more deceptive. Some models even flagged harmless actions due to narrative cues, reflecting gaps in ethical reasoning. Anthropic sees Petri as a step toward scalable, transparent safety benchmarks, though it notes the current results are still preliminary.
