

WildGuard
WildGuard is an open, lightweight moderation tool for LLM safety that achieves three goals:
Cost / License
- Free
- Open Source
Application type
Platforms
- Self-Hosted
- Python
Features
Properties
- Lightweight
Features
- AI-Powered
Tags
- safety
- huggingface
- ai-guardrails
- Artificial intelligence
- safeguarding
- safety-management
- ai-safety
- content-moderation
WildGuard News & Activities
Recent activities
- POX added WildGuard as alternative to Toxic Prompt RoBERTa
POX added WildGuard as alternative to Llama Guard and ShieldGemma- POX added WildGuard
WildGuard information
What is WildGuard?
WildGuard is an open, lightweight moderation tool for LLM safety that achieves three goals:
- Identifying malicious intent in user prompts
- Detecting safety risks of model responses
- Determining model refusal rate
Together, WildGuard serves the increasing needs for automatic safety moderation and evaluation of LLM interactions, providing a one-stop tool with enhanced accuracy and broad coverage across 13 risk categories. While existing open moderation tools such as Llama-Guard2 score reasonably well in classifying straightforward model interactions, they lag far behind a prompted GPT-4, especially in identifying adversarial jailbreaks and in evaluating models' refusals, a key measure for evaluating safety behaviors in model responses.
To address these challenges, we construct WildGuardMix, a large-scale and carefully balanced multi-task safety moderation dataset with 92K labeled examples that cover vanilla (direct) prompts and adversarial jailbreaks, paired with various refusal and compliance responses. WildGuardMix is a combination of WildGuardTrain, the training data of WildGuard, and WildGuardTest, a high-quality human-annotated moderation test set with 5K labeled items covering broad risk scenarios. Through extensive evaluations on WildGuardTest and ten existing public benchmarks, we show that WildGuard establishes state-of-the-art performance in open-source safety moderation across all the three tasks compared to ten strong existing open-source moderation models (e.g., up to 26.4% improvement on refusal detection). Importantly, WildGuard matches and sometimes exceeds GPT-4 performance (e.g., up to 3.9% improvement on prompt harmfulness identification). WildGuard serves as a highly effective safety moderator in an LLM interface, reducing the success rate of jailbreak attacks from 79.8% to 2.4%.


