[Feature] Support Code-Mixed Text #110

@anivar

Problem

Real-world multilingual text often mixes languages within a single sentence (code-mixing), and the guardrail does not handle this correctly today:

# Current behavior: validation is unreliable on mixed-language text
text = "C'est vraiment amazing!"  # French-English
guardrail.validate(text)  # returns incorrect results

text = "Das ist really gut"  # German-English
guardrail.validate(text)  # fails

This is extremely common in:

  • Social media (majority of multilingual posts)
  • Chat applications
  • Informal communication
  • Global communities

Proposed Solution

# Proposed API: handle_code_mixing is a new parameter, not yet implemented
result = guardrail.validate(
    "C'est un deepfake, right?",
    handle_code_mixing=True
)

print(result.explanation)
# {
#   'languages_detected': ['fr', 'en'],
#   'code_mixed': True,
#   'primary_language': 'fr',
#   'mixing_ratio': {'fr': 0.7, 'en': 0.3}
# }
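
As a rough illustration of how the fields in the explanation above could be derived, here is a minimal sketch that assumes a per-token language tag is already available. The function name summarize_languages and its input shape are hypothetical and not part of the existing guardrail API; only the output keys mirror the proposal.

```python
from collections import Counter


def summarize_languages(token_langs):
    """Build an explanation dict from per-token language tags.

    token_langs is an assumed input: one language code per token,
    e.g. ["fr", "fr", "fr", "en"].
    """
    counts = Counter(token_langs)
    total = sum(counts.values())
    if total == 0:
        return {
            "languages_detected": [],
            "code_mixed": False,
            "primary_language": None,
            "mixing_ratio": {},
        }
    # most_common() sorts by count, highest first
    ranked = [lang for lang, _ in counts.most_common()]
    return {
        "languages_detected": ranked,
        "code_mixed": len(ranked) > 1,
        "primary_language": ranked[0],
        "mixing_ratio": {lang: round(counts[lang] / total, 2) for lang in ranked},
    }


print(summarize_languages(["fr", "fr", "fr", "en"]))
```

Counting tokens is only one possible definition of the mixing ratio; counting characters or weighting by detector confidence would be equally reasonable choices.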

Technical Requirements

  1. Token-level language detection
  2. Multilingual embedding spaces
  3. Robust handling of script switches (e.g. Latin to Cyrillic within one sentence)
  4. Consistent detection across mixed segments
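
Requirements 1 and 3 can be partly illustrated with the standard library alone. The sketch below groups whitespace-separated tokens into runs by Unicode script, using the first word of each character's Unicode name as a coarse script label. The helper names token_script and script_runs are hypothetical. Script is only a weak proxy for language: it separates Russian from English, but it cannot tell French from English, so a real token-level language detector would still be needed.

```python
import unicodedata
from itertools import groupby


def token_script(token):
    """Return a coarse script label based on the token's first letter."""
    for ch in token:
        if ch.isalpha():
            # e.g. "CYRILLIC CAPITAL LETTER E" -> "CYRILLIC"
            name = unicodedata.name(ch, "")
            return name.split(" ")[0] if name else "UNKNOWN"
    return "UNKNOWN"


def script_runs(text):
    """Group consecutive tokens that share a script into runs."""
    tokens = text.split()
    return [
        (script, " ".join(group))
        for script, group in groupby(tokens, key=token_script)
    ]


print(script_runs("Это очень cool"))
# [('CYRILLIC', 'Это очень'), ('LATIN', 'cool')]
```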

Implementation Approach

class CodeMixedProcessor:
    """Outline only: segment_by_language, get_model and aggregate are still to be implemented."""

    def process(self, text):
        # Segment by language
        segments = self.segment_by_language(text)

        # Process each segment with the model for its language
        results = []
        for segment in segments:
            model = self.get_model(segment.language)
            results.append(model.process(segment.text))

        # Aggregate per-segment results into one result
        return self.aggregate(results)
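
To make the outline above concrete, here is one self-contained way the missing pieces could be filled in. Everything beyond the method names from the outline is an assumption: the Segment dataclass, the TOY_LEXICON word list (a stand-in for a real language detector), per-language models passed in as plain callables instead of objects with a process method, and the all-segments-must-pass aggregation rule.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Segment:
    language: str
    text: str


# Toy word list for illustration only; a real detector would replace it.
TOY_LEXICON = {
    "c'est": "fr", "vraiment": "fr", "un": "fr",
    "das": "de", "ist": "de", "gut": "de",
    "amazing": "en", "really": "en", "right": "en",
}


def guess_language(token):
    """Look up a token in the toy lexicon; 'und' means undetermined."""
    return TOY_LEXICON.get(token.lower().strip("!?.,"), "und")


class CodeMixedProcessor:
    def __init__(self, models, fallback):
        self.models = models      # language code -> callable(text) -> bool
        self.fallback = fallback  # used when no model exists for a language

    def segment_by_language(self, text):
        # Consecutive tokens with the same guessed language form one segment
        tokens = text.split()
        return [
            Segment(lang, " ".join(group))
            for lang, group in groupby(tokens, key=guess_language)
        ]

    def get_model(self, language):
        return self.models.get(language, self.fallback)

    def aggregate(self, results):
        # Conservative rule: the text passes only if every segment passes
        return all(results)

    def process(self, text):
        segments = self.segment_by_language(text)
        results = [self.get_model(s.language)(s.text) for s in segments]
        return self.aggregate(results)


processor = CodeMixedProcessor(
    models={"en": lambda t: "deepfake" not in t.lower()},
    fallback=lambda t: True,
)
print(processor.segment_by_language("Das ist really gut"))
print(processor.process("Das ist really gut"))
```

Processing segments in isolation loses cross-language context, so aggregation may need more than a simple all-pass rule in practice.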

Why This Matters

  • Real-world usage: The majority of casual multilingual communication is code-mixed
  • Current failure: Guardrails give incorrect results on mixed text
  • Growing trend: Code-mixing is increasing with global communication
  • Safety critical: Malicious content often uses code-mixing to evade detection

Test Cases

test_cases = [
    ("C'est totally bizarre", ['fr', 'en']),
    ("Das ist really gut", ['de', 'en']),
    ("Это очень cool", ['ru', 'en']),
]
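
The cases above could be run through a small checker like the following sketch. The detect argument is a hypothetical callable that returns the languages found in a text; no such function exists in the library yet. Languages are compared as sets, on the assumption that detection order should not matter for these cases.

```python
test_cases = [
    ("C'est totally bizarre", ["fr", "en"]),
    ("Das ist really gut", ["de", "en"]),
    ("Это очень cool", ["ru", "en"]),
]


def check_cases(detect, cases):
    """Return the cases where detect(text) disagrees with the expected languages."""
    failures = []
    for text, expected in cases:
        found = detect(text)
        if set(found) != set(expected):
            failures.append((text, expected, found))
    return failures


# Stub detector that always answers English, to show what a failure report looks like
failures = check_cases(lambda text: ["en"], test_cases)
for text, expected, found in failures:
    print(f"{text!r}: expected {expected}, got {found}")
```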

Note

This is separate from Unicode/UA compliance. Even with perfect Unicode support, code-mixed text needs special handling for:

  • Language model selection
  • Tokenization boundaries
  • Semantic understanding across languages
