refactor(engine): add file-role hints and low-signal suppression

jcouture · jcouture · commit 015e914f2ac7 · 2026-04-29T21:37:55.000-04:00
- introduce file-role classification (locale_data, test_fixture, build_release)
- suppress low-signal invisible findings in benign test fixture contexts
- guard suppression against decoder markers, token regions, and build contexts
- extract hasNearbyDecoderMarker helper from decoder proximity logic
- update README severity docs to describe file-role input
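
The role classification in this change checks build and release paths before locale and test paths, so a shell script under `tests/` resolves to `build_release` rather than `test_fixture`. A minimal Go sketch of that precedence, assuming a much reduced rule set (`roleFor` is a hypothetical stand-in for illustration, not the `classifyFileRole` function in the diff):

```go
package main

import (
	"fmt"
	"strings"
)

// roleFor is a hypothetical, simplified stand-in for classifyFileRole.
// It keeps the same precedence as the diff (build/release, then locale
// data, then test paths) but only a handful of the real path rules.
func roleFor(path string) string {
	// Normalize separators and case so the checks are platform-neutral.
	normalized := strings.ToLower(strings.ReplaceAll(path, "\\", "/"))
	base := normalized
	if slash := strings.LastIndex(normalized, "/"); slash >= 0 {
		base = normalized[slash+1:]
	}

	switch {
	case strings.HasSuffix(base, ".sh") || strings.Contains(normalized, "/.github/workflows/"):
		return "build_release"
	case strings.HasSuffix(base, ".yml") && strings.Contains(normalized, "/locales/"):
		return "locale_data"
	case strings.Contains(normalized, "/tests/") || strings.HasSuffix(base, "_test.go"):
		return "test_fixture"
	default:
		return "unknown"
	}
}

func main() {
	for _, p := range []string{
		"/repo/tests/payload_test.sh", // shell suffix wins over the tests/ directory
		"/repo/config/locales/fr.yml",
		"/repo/src/example_test.go",
		"/repo/src/app.go",
	} {
		fmt.Printf("%-32s %s\n", p, roleFor(p))
	}
}
```

Because the directory markers carry a leading slash, a relative path such as `tests/fixture.js` falls through to `unknown` in this sketch; it assumes callers pass absolute or otherwise rooted paths.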
diff --git a/README.md b/README.md
@@ -47,7 +47,7 @@ Fingerprint: /Users/johnsmith/ghostscan/testdata/invisible/single.txt:unicode/in
 - **Visible evidence for invisible content**: Renders hidden Unicode as strings like `<U+200B ZERO WIDTH SPACE>`.
 - **Focused Unicode threat coverage**: Detects invisible characters, private-use Unicode, bidi controls, directional marks, mixed-script tokens, and combining marks.
 - **Payload-aware heuristics**: Flags long hidden sequences, dense suspicious regions, and explicit payload-plus-decoder correlations while keeping standalone decoder noise out of default results.
-- **Context-aware severity**: Uses bounded content-based file shape checks, file-kind classification, local finding region checks, and decoder proximity to reduce low-value invisible-character noise without downgrading bidi controls or long suspicious runs.
+- **Context-aware severity**: Uses bounded content-based file shape checks, conservative file-role hints, local finding region checks, and decoder proximity to reduce low-value invisible-character noise without downgrading bidi controls, long suspicious runs, or build and release contexts.
 - **Noise reduction for asset contexts**: Suppresses obvious private-use glyph mappings in font-like SVG assets so icon fonts do not dominate the report.
 - **Safe repository traversal**: Skips symlinks, binary files, oversize files, and common dependency or build directories.
 - **CI-friendly behavior**: Uses deterministic ordering, human or JSON output, and exit codes `0`, `1`, and `2`.
@@ -223,22 +223,24 @@ Every finding is assigned one of four severity levels: `LOW`, `MEDIUM`, `HIGH`,
 
 ### How Severity Is Computed
 
-Severity is derived from four inputs, all computed from file content and local context:
+Severity is derived from five inputs, computed from file content, file path, and local context:
 
 1. **Sequence length** — how many suspicious runes appear in the finding. Isolated characters (1) are treated differently from short runs (2–5), medium runs (6–15), long runs (16–63), and very long runs (64+). Longer sequences receive higher severity regardless of context.
 
 2. **File shape** — the file is classified as `code_like`, `data_like`, `prose_like`, or `unknown` based on bounded content analysis (first 64 KiB / 400 non-empty lines). Code-like files with brackets, operators, and keywords produce higher severity for the same finding than prose-like files with natural language.
 
-3. **Finding region** — the immediate context around each finding is classified as whitespace-like, string-like, comment-like, token-like, prose-like, or unknown. An invisible character inside an identifier (`token_like`) is more severe than one inside a comment or whitespace region.
+3. **File role hints** — conservative path and filename hints distinguish locale data, ordinary test source, and build or release paths. These hints are advisory only. They never suppress bidi controls, payloads, correlations, long suspicious runs, or `testdata` and fixture inputs.
 
-4. **Decoder proximity** — if a decode or dynamic-execution marker (`eval(`, `Buffer.from(`, `atob(`, etc.) appears within 5 lines of a finding, severity is escalated by one level. Markers within 20 lines escalate findings that are already `HIGH`.
+4. **Finding region** — the immediate context around each finding is classified as whitespace-like, string-like, comment-like, token-like, prose-like, or unknown. An invisible character inside an identifier (`token_like`) is more severe than one inside a comment or whitespace region.
+
+5. **Decoder proximity** — if a decode or dynamic-execution marker (`eval(`, `Buffer.from(`, `atob(`, etc.) appears within 5 lines of a finding, severity is escalated by one level. Markers within 20 lines escalate findings that are already `HIGH`.
 
 ### Per-Rule Behavior
 
 | Rule | Base severity logic |
 |------|-------------------|
 | `unicode/bidi` | Always `HIGH`. Bidi controls are never downgraded by context, comments, prose, or path hints. |
-| `unicode/invisible` | Ranges from `LOW` to `CRITICAL` depending on sequence length, file shape, and region. A file-start BOM is suppressed. A single non-leading `U+FEFF` is still reported but defaults to `LOW`; isolated characters in identifiers are `HIGH`; long runs are `CRITICAL`. |
+| `unicode/invisible` | Ranges from `LOW` to `CRITICAL` depending on sequence length, file shape, file role, and region. A file-start BOM is suppressed. A single non-leading `U+FEFF` is still reported but defaults to `LOW`; isolated characters in identifiers are `HIGH`; long runs are `CRITICAL`. |
 | `unicode/private-use` | `CRITICAL` for long runs, `HIGH` for short/medium runs and code-like token regions, `MEDIUM` in prose or data contexts. |
 | `unicode/payload` | `HIGH` for normal sequences, `CRITICAL` for long runs. |
 | `unicode/correlation` | Always `CRITICAL`. A payload near a decoder is the strongest signal. |
@@ -254,6 +256,8 @@ ghostscan treats isolated and very short invisible-character findings differentl
 
 - isolated invisible characters default to `LOW` unless they appear inside a token-like region or are elevated by nearby decode/execute markers
 - short runs in prose-like, comment-like, whitespace-like, and data-like contexts default to `LOW`
+- low-signal invisible findings may be suppressed in ordinary test source only when they appear in benign string, comment, whitespace, or prose contexts with no nearby decode, execution, shell, or build markers
+- build, release, packaging, CI, shell, and parser-sensitive fixture inputs are not softened by test-like path hints alone
 - short runs in code-like strings or unknown regions stay visible and usually land at `MEDIUM`
 - token-like invisible findings remain `HIGH`
 - long invisible runs and payload findings stay strong regardless of surrounding file shape
diff --git a/engine/classify.go b/engine/classify.go
@@ -36,6 +36,11 @@ const (
 	fileShapeProseLike = "prose_like"
 	fileShapeUnknown   = "unknown"
 
+	fileRoleLocaleData   = "locale_data"
+	fileRoleTestFixture  = "test_fixture"
+	fileRoleBuildRelease = "build_release"
+	fileRoleUnknown      = "unknown"
+
 	regionFileStart      = "file_start"
 	regionWhitespaceLike = "whitespace_like"
 	regionStringLike     = "string_like"
@@ -53,6 +58,7 @@ const (
 
 type fileClassification struct {
 	shape string
+	role  string
 }
 
 type invisibleTraits struct {
@@ -80,6 +86,7 @@ func classifyAndFilterFindings(fileContext *Context, findings []Finding) []Findi
 	shape := classifyFileShape(fileContext.Text)
 	classification := fileClassification{
 		shape: shape,
+		role:  classifyFileRole(fileContext.Path),
 	}
 	obsIndex := buildObservationIndex(fileContext.Observations)
 
@@ -88,6 +95,9 @@ func classifyAndFilterFindings(fileContext *Context, findings []Finding) []Findi
 		if isSuppressedFileStartBOM(fileContext, item) {
 			continue
 		}
+		if isSuppressedLowSignalInvisible(fileContext, classification, obsIndex, item) {
+			continue
+		}
 		item.Severity = classifyFindingSeverity(fileContext, classification, obsIndex, item)
 		item.Message = classifyFindingMessage(classification, item)
 		filtered = append(filtered, item)
@@ -132,6 +142,41 @@ func classifyFindingSeverity(fileContext *Context, classification fileClassifica
 	return severity
 }
 
+func isSuppressedLowSignalInvisible(fileContext *Context, classification fileClassification, obsIndex map[posKey]Observation, item Finding) bool {
+	if item.RuleID != detector.InvisibleRuleID {
+		return false
+	}
+	if classification.role != fileRoleTestFixture {
+		return false
+	}
+	if hasNearbyDecoderMarker(fileContext.Prepass.DecoderMarkers, item, 20) {
+		return false
+	}
+
+	region := classifyFindingRegion(fileContext, classification.shape, obsIndex, item)
+	if region == regionTokenLike {
+		return false
+	}
+
+	profile := classifySequenceProfile(suspiciousRuneCountForFinding(item))
+	if profile != sequenceIsolated && profile != sequenceShortRun {
+		return false
+	}
+
+	line := lineText(fileContext, item.Line)
+	before, after := splitLineAroundColumn(line, item.Column)
+	if isSensitiveBuildOrExecContext(classification, line, before, after) {
+		return false
+	}
+
+	switch region {
+	case regionCommentLike, regionStringLike, regionWhitespaceLike, regionProseLike:
+		return true
+	default:
+		return classification.shape == fileShapeProseLike || classification.shape == fileShapeDataLike
+	}
+}
+
 func classifyFindingMessage(classification fileClassification, item Finding) string {
 	if item.RuleID != detector.InvisibleRuleID {
 		return item.Message
@@ -185,6 +230,9 @@ func invisibleSeverity(classification fileClassification, region, profile string
 		if region == regionCommentLike || region == regionWhitespaceLike || region == regionProseLike {
 			return SeverityLow
 		}
+		if classification.role == fileRoleLocaleData && region != regionTokenLike {
+			return SeverityLow
+		}
 		if classification.shape == fileShapeProseLike || classification.shape == fileShapeDataLike {
 			return SeverityLow
 		}
@@ -199,6 +247,8 @@ func invisibleSeverity(classification fileClassification, region, profile string
 		return SeverityHigh
 	case traits.onlyFEFF:
 		return SeverityLow
+	case classification.role == fileRoleLocaleData:
+		return SeverityLow
 	case classification.shape == fileShapeProseLike || region == regionCommentLike || region == regionWhitespaceLike:
 		return SeverityLow
 	case region == regionStringLike && classification.shape == fileShapeDataLike:
@@ -229,24 +279,26 @@ func defaultSeverity(ruleID string) Severity {
 }
 
 func applyDecoderProximity(severity Severity, markers []Marker, item Finding) Severity {
+	if hasNearbyDecoderMarker(markers, item, 5) {
+		return upgradeSeverity(severity)
+	}
+	if severity == SeverityHigh && hasNearbyDecoderMarker(markers, item, 20) {
+		return upgradeSeverity(severity)
+	}
+	return severity
+}
+
+func hasNearbyDecoderMarker(markers []Marker, item Finding, maxDistance int) bool {
 	if len(markers) == 0 {
-		return severity
+		return false
 	}
-	bestDistance := 1 << 30
 	for _, marker := range markers {
 		distance := finding.LineDistance(item.Line, marker.Line)
-		if distance < bestDistance {
-			bestDistance = distance
+		if distance <= maxDistance {
+			return true
 		}
 	}
-	switch {
-	case bestDistance == 0 || bestDistance <= 5:
-		return upgradeSeverity(severity)
-	case bestDistance <= 20 && severity == SeverityHigh:
-		return upgradeSeverity(severity)
-	default:
-		return severity
-	}
+	return false
 }
 
 func upgradeSeverity(severity Severity) Severity {
@@ -303,6 +355,67 @@ func containsAny(text string, needles ...string) bool {
 	return false
 }
 
+func classifyFileRole(path string) string {
+	normalized := strings.ToLower(strings.ReplaceAll(path, "\\", "/"))
+	base := normalized
+	if slash := strings.LastIndex(normalized, "/"); slash >= 0 {
+		base = normalized[slash+1:]
+	}
+
+	switch {
+	case isBuildReleasePath(normalized, base):
+		return fileRoleBuildRelease
+	case isLocaleDataPath(normalized, base):
+		return fileRoleLocaleData
+	case isTestFixturePath(normalized, base):
+		return fileRoleTestFixture
+	default:
+		return fileRoleUnknown
+	}
+}
+
+func isBuildReleasePath(normalized, base string) bool {
+	switch base {
+	case "makefile", "gnumakefile", "configure", "config.guess", "config.sub", "meson.build", "build.gradle":
+		return true
+	}
+	for _, suffix := range []string{".sh", ".bash", ".zsh", ".mk", ".m4", ".am", ".ac", ".cmake", ".spec"} {
+		if strings.HasSuffix(base, suffix) {
+			return true
+		}
+	}
+	for _, marker := range []string{"/.github/workflows/", "/.gitlab-ci", "/debian/", "/packaging/", "/scripts/release", "/ci/"} {
+		if strings.Contains(normalized, marker) {
+			return true
+		}
+	}
+	return false
+}
+
+func isLocaleDataPath(normalized, base string) bool {
+	if !(strings.HasSuffix(base, ".yml") || strings.HasSuffix(base, ".yaml") || strings.HasSuffix(base, ".json") || strings.HasSuffix(base, ".po") || strings.HasSuffix(base, ".pot")) {
+		return false
+	}
+	for _, marker := range []string{"/locales/", "/locale/", "/i18n/", "/translations/", "/lang/"} {
+		if strings.Contains(normalized, marker) {
+			return true
+		}
+	}
+	return false
+}
+
+func isTestFixturePath(normalized, base string) bool {
+	if strings.Contains(normalized, "/test/") || strings.Contains(normalized, "/tests/") || strings.Contains(normalized, "/spec/") {
+		return true
+	}
+	for _, suffix := range []string{"_test.go", "_test.exs", "_test.ex", "_spec.rb", "_test.py", "_test.js", "_spec.js", "_test.ts", "_spec.ts", ".test.js", ".spec.js", ".test.ts", ".spec.ts"} {
+		if strings.HasSuffix(base, suffix) {
+			return true
+		}
+	}
+	return false
+}
+
 func classifyFileShape(text string) string {
 	metrics := collectFileShapeMetrics(text)
 	if metrics.visibleRunes == 0 || metrics.nonEmptyLines == 0 {
@@ -509,6 +622,28 @@ func classifyFindingRegion(fileContext *Context, shape string, obsIndex map[posK
 	return regionUnknown
 }
 
+func isSensitiveBuildOrExecContext(classification fileClassification, line, before, after string) bool {
+	if classification.role == fileRoleBuildRelease {
+		return true
+	}
+
+	window := strings.ToLower(before + after)
+	lineLower := strings.ToLower(line)
+	if containsAny(window,
+		"eval(", "exec(", "system(", "popen(", "buffer.from(", "atob(", "btoa(",
+		"base64", "openssl", "sed ", "awk ", "perl ", "python ", "ruby ",
+		"bash ", "sh ", "xz ", "tar ", "gzip ", "printf", "tr ", "$( ", "$(",
+		"`", "|", "&&", "||",
+	) {
+		return true
+	}
+
+	return containsAny(lineLower,
+		"aclocal", "automake", "autoconf", "libtool", "pkg-config", "cmake",
+		"meson", "ninja", "make ", "makefile", "install-sh", "debhelper",
+	)
+}
+
 func lineText(fileContext *Context, line int) string {
 	if fileContext == nil || line < 1 || line > len(fileContext.LineStarts) {
 		return ""
diff --git a/engine/classify_test.go b/engine/classify_test.go
@@ -186,6 +186,7 @@ func TestSeverityPolicy(t *testing.T) {
 		{name: "short invisible run in prose low", path: "docs/notes.txt", content: proseWith("a \u200B\u200B hidden"), ruleID: detector.InvisibleRuleID, line: 1, column: 8, want: finding.SeverityLow, message: "Short invisible Unicode sequence detected"},
 		{name: "short invisible run in token high", path: "src/app.go", content: "const pa\u200B\u200Bss = 1;\n", ruleID: detector.InvisibleRuleID, line: 1, column: 9, want: finding.SeverityHigh, message: "Short invisible Unicode sequence detected"},
 		{name: "short invisible run unknown medium", path: "misc/blob", content: "alpha \u200B\u200B omega\n", ruleID: detector.InvisibleRuleID, line: 1, column: 7, want: finding.SeverityMedium, message: "Short invisible Unicode sequence detected"},
+		{name: "short invisible run in locale data low", path: "config/locales/fr.yml", content: strings.Repeat("title: bonjour\n", 12) + "subtitle: a\u200B\u200Bb\n", ruleID: detector.InvisibleRuleID, line: 13, column: 12, want: finding.SeverityLow, message: "Short invisible Unicode sequence detected"},
 		{name: "bidi remains high in comments", path: "docs/comment", content: "// note \u202E hidden\n", ruleID: detector.BidiRuleID, line: 1, column: 9, want: finding.SeverityHigh},
 		{name: "long invisible run critical", path: "src/blob.go", content: "const x = \"" + strings.Repeat("\u200B", 16) + "\";\n", ruleID: detector.InvisibleRuleID, line: 1, column: 12, want: finding.SeverityCritical, message: "Long invisible Unicode run suggests encoded payload"},
 		{name: "repeated feff run remains strong", path: "src/blob.go", content: "const x = \"" + strings.Repeat("\uFEFF", 6) + "\";\n", ruleID: detector.InvisibleRuleID, line: 1, column: 12, want: finding.SeverityHigh, message: "Repeated U+FEFF invisible sequence detected"},
@@ -256,6 +257,91 @@ func TestContentAndRegionSeverityShapingOnlySoftensLowSignalInvisibleFindings(t
 	}
 }
 
+func TestLowSignalInvisibleSuppressionNeedsMultipleBenignSignals(t *testing.T) {
+	t.Parallel()
+
+	tests := []struct {
+		name     string
+		path     string
+		content  string
+		ruleID   string
+		line     int
+		column   int
+		wantGone bool
+		want     finding.Severity
+	}{
+		{
+			name:     "test fixture string is suppressed",
+			path:     "lib/example_test.exs",
+			content:  "assert value == \"a\uFEFFb\"\n",
+			ruleID:   detector.InvisibleRuleID,
+			line:     1,
+			column:   19,
+			wantGone: true,
+		},
+		{
+			name:    "test fixture token remains high",
+			path:    "src/example_test.go",
+			content: "const pa\u200Bss = 1\n",
+			ruleID:  detector.InvisibleRuleID,
+			line:    1,
+			column:  9,
+			want:    finding.SeverityHigh,
+		},
+		{
+			name:    "test fixture long run remains critical",
+			path:    "tests/payload_test.py",
+			content: "blob=\"" + strings.Repeat("\u200B", 16) + "\"\n",
+			ruleID:  detector.InvisibleRuleID,
+			line:    1,
+			column:  7,
+			want:    finding.SeverityCritical,
+		},
+		{
+			name:    "build release file does not soften",
+			path:    "scripts/release.sh",
+			content: "printf 'a\u200B\u200Bb'\n",
+			ruleID:  detector.InvisibleRuleID,
+			line:    1,
+			column:  10,
+			want:    finding.SeverityMedium,
+		},
+		{
+			name:    "test fixture near eval does not suppress",
+			path:    "tests/fixture_test.js",
+			content: "const s = \"a\u200B\u200Bb\"; eval(s)\n",
+			ruleID:  detector.InvisibleRuleID,
+			line:    1,
+			column:  13,
+			want:    finding.SeverityMedium,
+		},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			t.Parallel()
+			path := writeNestedTempFile(t, tt.path, tt.content)
+			got, err := NewEngine().ScanFile(context.Background(), path)
+			if err != nil {
+				t.Fatalf("ScanFile() error = %v", err)
+			}
+			item, ok := findFindingAt(got, tt.ruleID, tt.line, tt.column)
+			if tt.wantGone {
+				if ok {
+					t.Fatalf("finding = %#v, want suppressed", item)
+				}
+				return
+			}
+			if !ok {
+				t.Fatalf("finding %s at %d:%d not found in %#v", tt.ruleID, tt.line, tt.column, got)
+			}
+			if item.Severity != tt.want {
+				t.Fatalf("Severity = %q, want %q", item.Severity, tt.want)
+			}
+		})
+	}
+}
+
 func TestEndToEndClassificationRegressionShapes(t *testing.T) {
 	t.Parallel()