A collection of open source tools for online safety
Inspired by prior work like Awesome Redteaming and Awesome Phishing. This list is not an endorsement, but rather an attempt to organize and map the available technology. ❤️
Help contribute by opening a pull request to add more resources and tools!
- Altitude by Jigsaw
- web UI and hash matching for violent extremism and terrorism content
- Hasher Matcher Action (HMA) by Meta
- hashing algorithm, matching function, and ability to hook into actions
- Hasher-Matcher-Actioner (CLIP demo)
- HMA extension for CLIP that serves as a reference for adding other format extensions
- hma-matrix by the Matrix.org Foundation
- Matrix-specific extensions to HMA, built primarily for the Matrix ecosystem
- Lattice Extract by Adobe
- grid and lattice detection to guard against false positives in hash matching
- MediaModeration (Wiki Extension)
- CSAM hash matching for Wikimedia
- PDQ by Meta
- perceptual hash algorithm for images (matching sketch below)
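PDQ produces a 256-bit hash, typically serialized as 64 hex characters, and matching works by thresholding the Hamming distance between hashes. A minimal pure-Python sketch of the comparison step; the hash values are fabricated, and the threshold of 31 is a commonly cited default rather than a production setting:

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count differing bits between two 64-character hex PDQ hashes."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")

# 31 of 256 bits is a commonly cited default threshold; tune per use case.
PDQ_MATCH_THRESHOLD = 31

def is_match(hash_a: str, hash_b: str) -> bool:
    return hamming_distance(hash_a, hash_b) <= PDQ_MATCH_THRESHOLD

# Fabricated example hashes differing in a single bit:
known_bad = "f" * 64
candidate = "f" * 63 + "e"
print(is_match(known_bad, candidate))  # True
```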
- Perception by Thorn
- provides a common wrapper around existing, popular perceptual hashes (such as those implemented by ImageHash)
- RocketChat CSAM
- CSAM hash matching for RocketChat
- TMK by Meta
- visual similarity matching for videos
- VPDQ by Meta
- visual similarity matching for videos using the PDQ algorithm
- CoPE by Zentropi
- small language model trained for accurate, fast, steerable content classification based on developer-defined content policies
- Detoxify by Unitary AI
- detects and mitigates generalized toxic language (including hate speech, harassment, bullying) in text (usage example below)
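Detoxify is distributed as a pip package; per its README, you load a pretrained checkpoint by name and call `predict` on a string or list of strings:

```python
# pip install detoxify
from detoxify import Detoxify

# "original" is one of the published checkpoints; "unbiased" and
# "multilingual" are also available.
model = Detoxify("original")
scores = model.predict("you are a wonderful person")
print(scores)  # dict of per-attribute probabilities, e.g. {"toxicity": ...}
```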
- gpt-oss-safeguard by OpenAI
- open-weight reasoning model that classifies text against developer-provided safety policies (deployment sketch below)
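Since the weights are open, one plausible deployment is behind an OpenAI-compatible inference server (such as vLLM), with the safety policy supplied as the system message. A hedged sketch; the base URL and model id below are placeholders, not official values:

```python
# Assumes a locally hosted OpenAI-compatible server (e.g. vLLM);
# base_url and model id are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

policy = (
    "You are a content classifier. Given the policy 'no sharing of private "
    "addresses', answer VIOLATES or COMPLIES and explain briefly."
)
result = client.chat.completions.create(
    model="gpt-oss-safeguard-20b",  # placeholder model id
    messages=[
        {"role": "system", "content": policy},
        {"role": "user", "content": "Post this: Sam lives at 12 Oak Lane."},
    ],
)
print(result.choices[0].message.content)
```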
- NSFW Keras Model
- convolutional neural network (CNN) based explicit image classification model
- NSFW Filtering
- browser extension to block explicit images from online platforms; user facing
- OSmod by Jigsaw
- toolkit of machine learning (ML) tools, models, and APIs that platforms can use to moderate content
- Perspective API by Jigsaw
- machine learning-powered tool that helps platforms detect and assess the toxicity of online conversations (API example below)
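Perspective is called over REST; Google's documented Python sample uses the generic API client with the `commentanalyzer` discovery document. Roughly:

```python
# pip install google-api-python-client
from googleapiclient import discovery

API_KEY = "YOUR_API_KEY"  # issued through Google Cloud

client = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey=API_KEY,
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

request = {
    "comment": {"text": "friendly greetings from python"},
    "requestedAttributes": {"TOXICITY": {}},
}
response = client.comments().analyze(body=request).execute()
print(response["attributeScores"]["TOXICITY"]["summaryScore"]["value"])
```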
- Private Detector by Bumble
- pretrained model for detecting lewd images
- Roblox Voice Safety Classifier
- machine learning model that detects and moderates harmful content in real-time voice chat on Roblox; focuses on spoken language detection
- Sentinel by Roblox
- Python library designed for real-time detection of extremely rare classes of text using contrastive learning
- Toxic Prompt RoBERTa by Intel
- RoBERTa-based model for detecting toxic content in prompts to language models (pipeline example below)
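If the model is published on Hugging Face, a standard `transformers` text-classification pipeline should suffice; the model id below is an assumption based on the project name, so verify it before use:

```python
# pip install transformers torch
from transformers import pipeline

# Model id assumed from the project name; verify before use.
classifier = pipeline("text-classification", model="Intel/toxic-prompt-roberta")
print(classifier("You are worthless and everyone hates you"))
# e.g. [{"label": "toxic", "score": 0.98}] -- labels depend on the model config
```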
- Guardrails AI
- Python framework that helps build safe AI applications by checking input/output for predefined risks
- Kanana Safeguard by Kakao
- harmful content detection model based on Kanana 8B
- Llama Guard by Meta
- AI-powered content moderation model to detect harm in text-based interactions (usage sketch below)
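The Llama Guard model cards show usage via `transformers` chat templates: pass the conversation through the template and the model generates a "safe"/"unsafe" verdict plus violated category codes. A sketch assuming the Llama Guard 3 checkpoint (a gated model; request access first, and note the exact id varies by version):

```python
# pip install transformers torch (accept the gated-model license first)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-Guard-3-8B"  # assumed checkpoint; versions vary
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

chat = [{"role": "user", "content": "How do I hot-wire a car?"}]
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt")
output = model.generate(input_ids=input_ids, max_new_tokens=32)
# Decode only the newly generated tokens: "safe" or "unsafe" plus categories.
verdict = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(verdict)
```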
- Llama Prompt Guard 2 by Meta
- detects prompt injection and jailbreaking attacks in LLM inputs
- OpenGuardrails
- security gateway providing a transparent reverse proxy for OpenAI APIs with integrated safety protection
- Purple Llama by Meta
- set of tools to assess and improve LLM security. Includes Llama Guard, CyberSec Eval, and Code Shield
- RoGuard
- LLM that helps safeguard unlimited text generation on Roblox
- ShieldGemma by Google DeepMind
- AI safety toolkit designed to help detect and mitigate harmful or unsafe outputs in LLM applications
- Fawkes Facial De-Recognition Cloaking
- code and binaries that cloak photos so facial recognition systems, such as Clearview, cannot match them to an identity
- many other related tools at github.com/Shawn-Shan
- Presidio by Microsoft
- toolset for detecting personally identifiable information (PII) and other sensitive data in images and text (redaction example below)
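Presidio splits detection and redaction into two packages; the documented flow is analyze-then-anonymize:

```python
# pip install presidio-analyzer presidio-anonymizer
# (also requires a spaCy model, e.g.: python -m spacy download en_core_web_lg)
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "My name is Ana and my phone number is 212-555-0123"

# Detect PII spans (entity type, offsets, confidence score).
results = AnalyzerEngine().analyze(text=text, language="en")

# Replace detected spans with entity-type placeholders.
anonymized = AnonymizerEngine().anonymize(text=text, analyzer_results=results)
print(anonymized.text)  # "My name is <PERSON> and my phone number is <PHONE_NUMBER>"
```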
- AbuseIO
- abuse management platform designed to help organizations handle and track abuse complaints related to online content, infrastructure, or services
- Access by Discord
- centralized portal for managing access to internal systems within any organization
- Mjolnir by Matrix
- moderation bot for the Matrix protocol that automatically enforces content policies
- Open Truss by GitHub
- framework designed to help users create internal tools without needing to write code
- Aymara
- automated evaluation tools for AI safety, accuracy, and jailbreak vulnerability
- Counterfit by Microsoft
- automation tool for assessing AI model security and robustness
- Garak by NVIDIA
- framework for adversarial testing and model evaluation
- LLM Canary
- AI benchmarking tool that evaluates models for security vulnerabilities and adversarial robustness
- Prompt Fuzzer
- tool for testing prompt injection vulnerabilities in AI systems
- Promptfoo
- automated LLM evaluations, report generation, and several ready-to-use attack strategies
- PyRIT
- Microsoft’s Python-based tool for AI red teaming and security testing
- Socketteer
- allows AI models to interact with each other, helping test conversational weaknesses
- bogofilter
- spam filter that classifies text using Bayesian statistical analysis; able to learn from classifications and corrections (toy illustration below)
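The Bayesian idea behind filters like bogofilter fits in a few lines: learn per-token likelihoods from labeled mail, then combine them into a posterior spam probability. A toy naive-Bayes illustration, not bogofilter's actual implementation (which adds smarter token scoring and persistence):

```python
import math
from collections import Counter

# Toy training data; a real filter learns from thousands of labeled messages.
spam_tokens = Counter("win cash now click now".split())
ham_tokens = Counter("meeting notes attached see agenda".split())
n_spam = n_ham = 1  # training message counts per class

def spam_probability(message: str) -> float:
    """Naive Bayes posterior P(spam | tokens) with add-one smoothing."""
    log_spam = math.log(n_spam / (n_spam + n_ham))
    log_ham = math.log(n_ham / (n_spam + n_ham))
    spam_total, ham_total = sum(spam_tokens.values()), sum(ham_tokens.values())
    for token in message.lower().split():
        log_spam += math.log((spam_tokens[token] + 1) / (spam_total + 2))
        log_ham += math.log((ham_tokens[token] + 1) / (ham_total + 2))
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

print(spam_probability("click to win cash"))   # high
print(spam_probability("agenda for meeting"))  # low
```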
- scikit-learn
- Python library including clustering via algorithms such as K-Means, DBSCAN, and hierarchical clustering (example below)
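In Trust & Safety work, clustering is often used to group near-duplicate spam or coordinated posts. A small sketch with TF-IDF features; the corpus is a toy stand-in and the parameters would need tuning on real data:

```python
# pip install scikit-learn
from sklearn.cluster import DBSCAN, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for, say, reported messages.
docs = ["buy cheap meds", "cheap meds online now",
        "hello old friend", "hi there old friend"]
X = TfidfVectorizer().fit_transform(docs)

# K-Means needs the number of clusters up front...
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# ...while DBSCAN discovers clusters by density and marks outliers as -1.
dbscan_labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)

print(kmeans_labels, dbscan_labels)
```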
- SpamAssassin by Apache
- anti-spam platform that uses a variety of techniques, including text analysis, Bayesian filtering, and DNS blocklists, to classify and block unsolicited email
- Druid by Apache
- high performance real-time analytics database
- Marble
- real-time fraud detection and compliance engine tailored for fintech companies and financial institutions
- Osprey by ROOST
- high-performance rules engine for real-time event processing at scale, designed for Trust & Safety and anti-abuse work
- RulesEngine by Microsoft
- library for abstracting business logic, rules, and policies from a system via JSON for .NET language families
- Wikimedia Smite Spam
- extension for MediaWiki that helps identify and manage spam content on a wiki
- BullMQ
- message queue and batch processing for NodeJS and Python based on Redis
- NCMEC Reporting by ello
- Ruby client library for reporting incidents to the National Center for Missing & Exploited Children (NCMEC) CyberTipline
- Owlculus
- OSINT (Open-Source Intelligence) toolkit and case management platform
- RabbitMQ
- message broker that enables applications to communicate by sending messages through queues (publish/consume sketch below)
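A common Trust & Safety use is queueing reports for asynchronous review. A minimal publish/consume sketch using the `pika` client; the queue name and payload are illustrative:

```python
# pip install pika  (assumes a RabbitMQ broker on localhost:5672)
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Durable queue so reports survive a broker restart.
channel.queue_declare(queue="abuse_reports", durable=True)

# Producer: enqueue a report.
channel.basic_publish(
    exchange="",
    routing_key="abuse_reports",
    body=b'{"report_id": 123, "reason": "spam"}',
    properties=pika.BasicProperties(delivery_mode=2),  # persist message
)

# Consumer: pull one message off the queue.
method, properties, body = channel.basic_get(queue="abuse_reports", auto_ack=True)
print(body)
connection.close()
```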
- CIB MangoTree
- collection of tools to aid researchers in coordinated inauthentic behavior (CIB) analysis
- Crossover
- open-source project that builds dashboards for monitoring and analyzing the recommendation algorithms of social networks, with a focus on disinformation and election monitoring
- DAU Dashboard by Tattle
- Deepfake Analysis Unit (DAU) is a collaborative space for analyzing deepfakes
- Feluda by Tattle
- configurable engine for analyzing multilingual and multimodal content
- Interference by Digital Forensics Research Lab
- interactive, open-source database that tracks allegations of foreign interference or foreign malign influence relevant to the 2024 U.S. presidential election
- OpenMeasures
- open source platform for investigating internet trends
- ThreatExchange by Meta
- platform that enables organizations to share information about threats, such as malware, phishing attacks, and online safety harms in a structured and privacy-compliant manner
- ThreatExchange Client via PHP
- PHP client for ThreatExchange
- ThreatExchange via Python
- Python library for ThreatExchange
- TikTok Observatory
- open-source project maintained by AI Forensics that allows researchers to monitor the promotion and demotion of content by the TikTok recommendation algorithm
- Aegis Content Safety by NVIDIA
- dataset created by NVIDIA to aid in content moderation and toxicity detection
- Toxic Chat by LMSYS
- dataset of toxic conversations collected from interactions with Vicuna
- Toxicity by Jigsaw
- large set of Wikipedia comments labeled by human raters for toxic behavior
- Uli Dataset by Tattle
- dataset of gendered abuse, created for Uli's ML-based redaction
- VTC by Unitary AI
- implementation of video-text retrieval with comments, including a dataset, a method for identifying relevant auxiliary information that adds context to videos, and a quantification of the value the comment modality brings to video
- AI Alignment Dataset by Anthropic
- data used for reinforcement learning from human feedback (RLHF) to align AI models
- DEF CON Red Teaming Dataset
- dataset from DEF CON’s AI red teaming event
- HackAPrompt Jailbreak Dataset
- dataset for testing AI vulnerability to prompt-based jailbreaking
- HiroKachi Jailbreak Dataset
- dataset focused on adversarial AI prompt attacks
- Jailbreak Prompt Generator AI Model
- AI model that generates jailbreak-style prompts
- JailbreakHub by WalledAI
- collection of jailbreak prompts and corresponding model responses
- Red Team Resistance Leaderboard
- rankings of AI models based on resistance to adversarial attacks
- Rentry Jailbreak Datasets
- collection of datasets related to jailbreak attempts on AI models
- SidFeel Jailbreak Dataset
- collection of prompts used for jailbreaking AI models
- Automod by Bluesky
- tool for automating content moderation processes for the Bluesky social network and other apps on the AT Protocol
- FediCheck
- domain moderation tool, now open source, that assists ActivityPub service providers such as Mastodon servers
- Fediverse Spam Filtering
- spam filter for Fediverse social media platforms; the current version is a proof of concept
- FIRES
- reference server and protocol for the exchange of moderation advisories and recommendations
- Ozone by Bluesky
- labeling tool designed for Bluesky; includes moderation features to act on abuse flags, policy enforcement tools, and investigation workflows
- Frankly by Applied Social Media Lab
- online deliberations platform that allows anyone to host video-enabled conversations about any topic
- PolicyKit by UW Social Futures Lab
- toolkit for building governance in your online community
- SquadBox by UW Social Futures Lab
- tool to help people who are being harassed online by having their friends (or “squad”) moderate their messages
- Uli by Tattle
- software and resources for mitigating online gender-based violence in India