
[Design Proposal] Adversarial Input Normalisation Layer (Layer 1) — Canonicalisation & De-obfuscation Engine #8

@Saisharathchandranandnetha


[Design Proposal] Lightweight Input Normalisation Gate (Layer 1 Preprocessing)

Overview

This proposal introduces a lightweight input normalisation step within Layer 1 of the Cognitive Firewall pipeline.

The goal is to standardise incoming text into a consistent, analysable form before it reaches the heuristic (regex) and semantic layers, while maintaining strict low-latency constraints.


Problem

Adversarial inputs often use simple obfuscation techniques such as:

  • Unicode variations (e.g., homoglyphs, decomposed characters)
  • Zero-width or invisible characters
  • Encoded representations (e.g., URL encoding)

These can reduce the effectiveness of downstream detection layers when inputs are processed in raw form.


Proposed Approach

Introduce a minimal, stateless normalisation step that performs safe and low-cost transformations.

Pipeline (Fast-Path)

Raw Input
   │
   ▼
[1] Unicode Normalisation (NFKC)
   ▼
[2] Strip Zero-Width & Invisible Characters
   ▼
[3] Limited Percent-Decoding (URL encoding only, where valid)
   ▼
Normalised Text → Layer 1 Heuristic Scanner
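The fast path above can be sketched with only the standard library. This is a minimal sketch, not a settled implementation: the invisible-character set and the single-byte-only percent-decoding rule are assumptions that correspond to the open questions below.

```python
import re
import unicodedata

# Zero-width/invisible characters commonly abused for obfuscation
# (illustrative subset; the final list is an open design choice).
_INVISIBLE = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

# Only well-formed %XX sequences are decoded; anything else is left as-is.
_PCT = re.compile(r"%[0-9A-Fa-f]{2}")


def normalise(text: str) -> str:
    # [1] NFKC folds compatibility forms (e.g. ligatures) and
    #     composes decomposed characters.
    text = unicodedata.normalize("NFKC", text)
    # [2] Drop zero-width and invisible characters.
    text = "".join(ch for ch in text if ch not in _INVISIBLE)
    # [3] Limited percent-decoding: single-byte sequences only; multi-byte
    #     UTF-8 reassembly is deliberately out of scope for the fast path.
    return _PCT.sub(lambda m: chr(int(m.group(0)[1:], 16)), text)
```

All three transforms are deterministic and stateless, matching the design principles below; invalid sequences such as `%ZZ` pass through unchanged.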

Design Principles

  • Lightweight: No heavy parsing or ML inference
  • Deterministic: Only safe, well-defined transformations
  • Low latency: Designed to add minimal overhead to the pipeline
  • Stateless: No dependency on external systems or memory

Non-Goals (for Phase 1)

To keep the fast-path efficient, the following are intentionally excluded:

  • Full homoglyph mapping across scripts
  • Base64 or complex encoding detection
  • Aggressive text reconstruction
  • Heuristic scoring within this layer

These can be handled in downstream layers if needed.


Output

from dataclasses import dataclass

@dataclass
class NormalisationResult:
    normalised_text: str
    original_text: str

This keeps the interface minimal and avoids adding latency.
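The interface could be exercised roughly like this. `normalise_payload` is a hypothetical wrapper name, and only the NFKC step is applied here for brevity; the point is that retaining `original_text` lets downstream layers audit or diff the transformation.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class NormalisationResult:
    normalised_text: str
    original_text: str


def normalise_payload(raw: str) -> NormalisationResult:
    # Hypothetical wrapper: keep the raw text alongside the normalised
    # form so later layers can see exactly what was changed.
    normalised = unicodedata.normalize("NFKC", raw)  # step [1] only, for brevity
    return NormalisationResult(normalised_text=normalised, original_text=raw)
```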


Integration

  • Runs immediately after the payload is received via UDS

  • Feeds into:

    • Layer 1 heuristic scanner
    • Layer 2 semantic analysis

Alignment with Project Philosophy

This supports the principle of:

“All inputs must be validated before entering reasoning context”

by ensuring inputs are first made consistent and readable before validation.


Open Questions

  1. Should percent-decoding be limited strictly to valid URL-encoded sequences?
  2. Are there specific Unicode normalisation forms preferred beyond NFKC?
  3. Should this layer emit any lightweight signals, or remain purely transformational?

Next Step

Happy to open a draft PR with:

  • minimal normalisation function
  • test cases for edge inputs (Unicode + zero-width + encoding)
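To make the draft PR concrete, the edge-input tests could start from a table like this. The expected outputs assume the three fast-path transforms described above; the cases are illustrative, not exhaustive.

```python
import unicodedata

# (raw input, expected normalised output)
EDGE_CASES = [
    ("ℌello", "Hello"),          # NFKC compatibility mapping (U+210C → H)
    ("ig\u200bnore", "ignore"),  # zero-width space stripped
    ("%41lert", "Alert"),        # valid percent-encoding decoded
    ("100%ZZ", "100%ZZ"),        # invalid sequence left untouched
]
```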

cc: @tharindupr @charithccmc
