[Feature] Add Multilingual Support for Global Languages

## The Problem

I've been looking through the codebase and noticed that all the current guardrails are essentially English-only. They use models like DeBERTa, RoBERTa, and others that were trained primarily on English text. This means any-guardrail doesn't really work for most of the world's languages.

For example:
- The Deepset guardrail uses `deberta-v3-base-injection` - English only
- Mozilla's own off-topic detector is explicitly named `jina-embeddings-v2-small-en-off-topic` 
- ProtectAI models are all English DeBERTa variants

## Why This Matters

There are 4.5+ billion people who don't speak English as their first language. Right now, they can't use any-guardrail effectively. If someone sends harmful content in French, Hindi, Arabic, or Chinese, the guardrails might not catch it properly.

## The Good News

We don't need to reinvent anything. Multilingual models that handle 100+ languages already exist and work great. They've been battle-tested by big tech companies for years now (as of November 2024).

## Simple Solution

We just need to offer multilingual model options alongside the English ones. For example:

```python
# Instead of only having this:
model = "deberta-v3-base-injection"  # English only

# We could also support:
model = "xlm-roberta-base"  # Works for 100 languages!
```

## Two Ways to Do This

**Option 1: Add multilingual models to existing guardrails**
```python
class Deepset(HuggingFace):
    SUPPORTED_MODELS = [
        "deepset/deberta-v3-base-injection",  # Keep the English one
        "xlm-roberta-base-finetuned",  # Add a multilingual one
    ]
```

**Option 2: Create a simple wrapper**
```python
# Take any existing guardrail and make it multilingual
guardrail = make_multilingual(
    AnyGuardrail.create(GuardrailName.DEEPSET)
)
```

## Which Models Should We Use?

Here are some proven multilingual models available today:

- **xlm-roberta-base**: Handles 100 languages, widely used
- **bert-base-multilingual**: Covers 104 languages, lightweight
- **google/muril-base**: Great for Indian languages + English
- **google/mt5-base**: 101 languages, good for generation

## The Best Part

Once we add these, the guardrails would just work for any language automatically. Users wouldn't need to:
- Specify what language they're using
- Install special language packs
- Configure anything

It would just work whether you write in English, Spanish, Hindi, Arabic, or any other language.

## How I Can Help

I'm happy to help implement this. I could:
1. Test which multilingual models work best for safety detection
2. Add the multilingual options to existing guardrails
3. Create the wrapper approach if that's preferred
4. Write tests to make sure it works across languages

What do you think? Should we start with adding multilingual models to existing guardrails, or would you prefer the wrapper approach?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature] Add Multilingual Support for Global Languages #104

The Problem

Why This Matters

The Good News

Simple Solution

Two Ways to Do This

Which Models Should We Use?

The Best Part

How I Can Help

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[Feature] Add Multilingual Support for Global Languages #104

Description

The Problem

Why This Matters

The Good News

Simple Solution

Two Ways to Do This

Which Models Should We Use?

The Best Part

How I Can Help

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions