Skip to content

feat(pipeline): implement scan stage - lexical pattern matching + 21 tests#25

Open
Pranjal0410 wants to merge 2 commits intoc2siorg:mainfrom
Pranjal0410:feat/scan-stage
Open

feat(pipeline): implement scan stage - lexical pattern matching + 21 tests#25
Pranjal0410 wants to merge 2 commits intoc2siorg:mainfrom
Pranjal0410:feat/scan-stage

Conversation

@Pranjal0410
Copy link
Copy Markdown
Contributor

@Pranjal0410 Pranjal0410 commented Mar 27, 2026

feat(pipeline): implement scan stage - lexical pattern matching + 21 tests

Implements the scan stage (Stage 3) of the sidecar pipeline. The stage was defined in the architecture but had no implementation - just a package declaration and comments.

What this adds

internal/pipeline/scan.go - Lexical pattern matching for the scan stage.

  • Configurable pattern library with 4 signal categories: instruction_override, role_hijack, data_exfiltration, shell_metacharacter
  • Hard block short-circuit for high-severity signals (data exfiltration, shell metacharacters)
  • Tool allowlist: on_tool_call payloads from allowlisted tools skip scanning
  • Structured payload support: handles both string payloads (on_prompt, on_memory) and map payloads (on_tool_call, on_context)
  • Preserves pre-computed semantic signals from the Python SDK - lexical and semantic signals coexist without duplication
  • Case-insensitive matching throughout

internal/pipeline/scan_test.go - 21 tests covering:

  • All 4 attack categories with expected signal names
  • Hard block vs non-hard-block behaviour
  • Benign input, empty payload, nil payload — no false signals
  • Case insensitivity
  • Structured tool call payloads (map[string]any)
  • Allowlisted tools skip scanning (case-insensitive)
  • Allowlist only applies to on_tool_call, not other hooks
  • Multiple simultaneous signals from a single input
  • Pre-existing semantic signals from Python SDK preserved
  • No duplicate signals when lexical and semantic overlap
  • Signals written back to RiskContext
  • RAG context injection detection (on_context + rag provenance)
  • Memory write poisoning detection (on_memory)
  • Custom pattern configuration
  • Empty config produces no signals
  • Hard negative: developer-speak ("override a method in Python") does not hard block

Architecture alignment

This implements the scan stage described in the v0.2 architecture doc:

Scan: lexical · semantic fallback

The lexical side is implemented here. The semantic fallback runs in the Python SDK (PR #15) and sends pre-computed signals over UDS - this stage preserves those signals and merges them with lexical hits.

All 35 sidecar tests pass

go test ./...
ok github.com/acf-sdk/sidecar/internal/crypto (14 tests)
ok github.com/acf-sdk/sidecar/internal/pipeline (21 tests)
ok github.com/acf-sdk/sidecar/internal/transport (cached)

No existing code modified. No files deleted. No dependencies added.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant