[Design Proposal] Lightweight Input Normalisation Gate (Layer 1 Preprocessing)
Overview
This proposal introduces a lightweight input normalisation step within Layer 1 of the Cognitive Firewall pipeline.
The goal is to standardise incoming text into a consistent, analysable form before it reaches the heuristic (regex) and semantic layers, while maintaining strict low-latency constraints.
Problem
Adversarial inputs often use simple obfuscation techniques such as:
- Unicode variations (e.g., homoglyphs, decomposed characters)
- Zero-width or invisible characters
- Encoded representations (e.g., URL encoding)
These can reduce the effectiveness of downstream detection layers if processed in raw form.
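A minimal illustration of the problem, assuming Python: a single zero-width space (U+200B) inserted mid-word is enough to slip past a naive Layer 1 regex, while stripping it restores detectability.

```python
import re

# A naive Layer 1 pattern looking for a suspicious phrase.
pattern = re.compile(r"ignore previous instructions", re.IGNORECASE)

# Adversarial input with a zero-width space (U+200B) inserted mid-word.
raw = "ig\u200bnore previous instructions"

# The raw form does not match: the invisible character breaks the token.
evaded = pattern.search(raw) is None

# Stripping zero-width characters restores the match.
cleaned = raw.replace("\u200b", "")
detected = pattern.search(cleaned) is not None
```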
Proposed Approach
Introduce a minimal, stateless normalisation step that performs safe and low-cost transformations.
Pipeline (Fast-Path)
Raw Input
   │
   ▼
[1] Unicode Normalisation (NFKC)
   │
   ▼
[2] Strip Zero-Width & Invisible Characters
   │
   ▼
[3] Limited Percent-Decoding (URL encoding only, where valid)
   │
   ▼
Normalised Text → Layer 1 Heuristic Scanner
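The three steps above could be sketched roughly as follows (a minimal, stateless sketch; the invisible-character set shown is illustrative rather than exhaustive, and percent-decoding is restricted to well-formed `%XX` runs in a single pass):

```python
import re
import unicodedata
from urllib.parse import unquote

# Zero-width and invisible formatting characters (a partial, illustrative set).
_INVISIBLE = re.compile(r"[\u200b-\u200f\u2060\ufeff]")

# Only well-formed percent-encoded byte sequences are decoded.
_PCT = re.compile(r"(?:%[0-9A-Fa-f]{2})+")

def normalise(text: str) -> str:
    # [1] Unicode normalisation (NFKC folds many compatibility/decomposed forms).
    text = unicodedata.normalize("NFKC", text)
    # [2] Strip zero-width and invisible characters.
    text = _INVISIBLE.sub("", text)
    # [3] Limited percent-decoding: only valid %XX runs, single pass.
    text = _PCT.sub(lambda m: unquote(m.group(0)), text)
    return text
```

A single decoding pass keeps the step deterministic and bounded; nested encodings (e.g. `%2541`) are deliberately left for downstream layers, consistent with the non-goals below.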
Design Principles
- Lightweight: No heavy parsing or ML inference
- Deterministic: Only safe, well-defined transformations
- Low latency: Designed to add minimal overhead to the pipeline
- Stateless: No dependency on external systems or memory
Non-Goals (for Phase 1)
To keep the fast-path efficient, the following are intentionally excluded:
- Full homoglyph mapping across scripts
- Base64 or complex encoding detection
- Aggressive text reconstruction
- Heuristic scoring within this layer
These can be handled in downstream layers if needed.
Output
from dataclasses import dataclass

@dataclass
class NormalisationResult:
    normalised_text: str
    original_text: str
This keeps the interface minimal and avoids adding latency.
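A hedged sketch of how the result type could be produced and consumed (`normalise_input` is a hypothetical wrapper name; the zero-width strip stands in for the full fast path):

```python
from dataclasses import dataclass

@dataclass
class NormalisationResult:
    normalised_text: str
    original_text: str

def normalise_input(raw: str) -> NormalisationResult:
    # Hypothetical wrapper: apply the fast-path transforms and keep the
    # original alongside, so downstream layers can report on the raw form.
    cleaned = raw.replace("\u200b", "")  # stand-in for the full pipeline
    return NormalisationResult(normalised_text=cleaned, original_text=raw)

result = normalise_input("he\u200bllo")
```

Retaining `original_text` lets later layers flag *which* obfuscation was removed without this layer doing any scoring itself.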
Integration
- Runs immediately after the payload is received via UDS
- Feeds into: the Layer 1 heuristic (regex) scanner
Alignment with Project Philosophy
This supports the principle of:
“All inputs must be validated before entering reasoning context”
by ensuring inputs are first made consistent and readable before validation.
Open Questions
- Should percent-decoding be limited strictly to valid URL-encoded sequences?
- Are there specific Unicode normalization standards preferred beyond NFKC?
- Should this layer emit any lightweight signals, or remain purely transformational?
Next Step
Happy to open a draft PR with:
- minimal normalisation function
- test cases for edge inputs (Unicode + zero-width + encoding)
cc: @tharindupr @charithccmc