Commit 015e914

refactor(engine): add file-role hints and low-signal suppression
- introduce file-role classification (locale_data, test_fixture, build_release)
- suppress low-signal invisible findings in benign test fixture contexts
- guard suppression against decode markers, token regions, and build contexts
- extract hasNearbyDecoderMarker helper from decoder proximity logic
- update README severity docs to describe file-role input
1 parent d5a1e1f commit 015e914

3 files changed

Lines changed: 242 additions & 17 deletions

README.md

Lines changed: 9 additions & 5 deletions
@@ -47,7 +47,7 @@ Fingerprint: /Users/johnsmith/ghostscan/testdata/invisible/single.txt:unicode/in
 - **Visible evidence for invisible content**: Renders hidden Unicode as strings like `<U+200B ZERO WIDTH SPACE>`.
 - **Focused Unicode threat coverage**: Detects invisible characters, private-use Unicode, bidi controls, directional marks, mixed-script tokens, and combining marks.
 - **Payload-aware heuristics**: Flags long hidden sequences, dense suspicious regions, and explicit payload-plus-decoder correlations while keeping standalone decoder noise out of default results.
-- **Context-aware severity**: Uses bounded content-based file shape checks, file-kind classification, local finding region checks, and decoder proximity to reduce low-value invisible-character noise without downgrading bidi controls or long suspicious runs.
+- **Context-aware severity**: Uses bounded content-based file shape checks, conservative file-role hints, local finding region checks, and decoder proximity to reduce low-value invisible-character noise without downgrading bidi controls, long suspicious runs, or build and release contexts.
 - **Noise reduction for asset contexts**: Suppresses obvious private-use glyph mappings in font-like SVG assets so icon fonts do not dominate the report.
 - **Safe repository traversal**: Skips symlinks, binary files, oversize files, and common dependency or build directories.
 - **CI-friendly behavior**: Uses deterministic ordering, human or JSON output, and exit codes `0`, `1`, and `2`.
@@ -223,22 +223,24 @@ Every finding is assigned one of four severity levels: `LOW`, `MEDIUM`, `HIGH`,
 
 ### How Severity Is Computed
 
-Severity is derived from four inputs, all computed from file content and local context:
+Severity is derived from five inputs, all computed from file content and local context:
 
 1. **Sequence length** — how many suspicious runes appear in the finding. Isolated characters (1) are treated differently from short runs (2–5), medium runs (6–15), long runs (16–63), and very long runs (64+). Longer sequences receive higher severity regardless of context.
 
 2. **File shape** — the file is classified as `code_like`, `data_like`, `prose_like`, or `unknown` based on bounded content analysis (first 64 KiB / 400 non-empty lines). Code-like files with brackets, operators, and keywords produce higher severity for the same finding than prose-like files with natural language.
 
-3. **Finding region** — the immediate context around each finding is classified as whitespace-like, string-like, comment-like, token-like, prose-like, or unknown. An invisible character inside an identifier (`token_like`) is more severe than one inside a comment or whitespace region.
+3. **File role hints** — conservative path and filename hints distinguish locale data, ordinary test source, and build or release paths. These hints are advisory only. They never suppress bidi controls, payloads, correlations, long suspicious runs, or `testdata` and fixture inputs.
 
-4. **Decoder proximity** — if a decode or dynamic-execution marker (`eval(`, `Buffer.from(`, `atob(`, etc.) appears within 5 lines of a finding, severity is escalated by one level. Markers within 20 lines escalate findings that are already `HIGH`.
+4. **Finding region** — the immediate context around each finding is classified as whitespace-like, string-like, comment-like, token-like, prose-like, or unknown. An invisible character inside an identifier (`token_like`) is more severe than one inside a comment or whitespace region.
+
+5. **Decoder proximity** — if a decode or dynamic-execution marker (`eval(`, `Buffer.from(`, `atob(`, etc.) appears within 5 lines of a finding, severity is escalated by one level. Markers within 20 lines escalate findings that are already `HIGH`.
 
 ### Per-Rule Behavior
 
 | Rule | Base severity logic |
 |------|-------------------|
 | `unicode/bidi` | Always `HIGH`. Bidi controls are never downgraded by context, comments, prose, or path hints. |
-| `unicode/invisible` | Ranges from `LOW` to `CRITICAL` depending on sequence length, file shape, and region. A file-start BOM is suppressed. A single non-leading `U+FEFF` is still reported but defaults to `LOW`; isolated characters in identifiers are `HIGH`; long runs are `CRITICAL`. |
+| `unicode/invisible` | Ranges from `LOW` to `CRITICAL` depending on sequence length, file shape, file role, and region. A file-start BOM is suppressed. A single non-leading `U+FEFF` is still reported but defaults to `LOW`; isolated characters in identifiers are `HIGH`; long runs are `CRITICAL`. |
 | `unicode/private-use` | `CRITICAL` for long runs, `HIGH` for short/medium runs and code-like token regions, `MEDIUM` in prose or data contexts. |
 | `unicode/payload` | `HIGH` for normal sequences, `CRITICAL` for long runs. |
 | `unicode/correlation` | Always `CRITICAL`. A payload near a decoder is the strongest signal. |
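
The sequence-length buckets listed under input 1 can be sketched as a small Go function. This is only an illustration of the thresholds quoted in the README text; the function name `sequenceProfile` and its string labels are assumptions made for this example and are not taken from the engine source.

```go
package main

import "fmt"

// sequenceProfile buckets a suspicious-rune count using the thresholds quoted
// in the README: 1, 2-5, 6-15, 16-63, and 64+.
// The returned labels are hypothetical names chosen for this sketch.
func sequenceProfile(count int) string {
	switch {
	case count <= 1:
		return "isolated"
	case count <= 5:
		return "short_run"
	case count <= 15:
		return "medium_run"
	case count <= 63:
		return "long_run"
	default:
		return "very_long_run"
	}
}

func main() {
	// Print the bucket on each side of every threshold.
	for _, n := range []int{1, 2, 5, 6, 15, 16, 63, 64} {
		fmt.Printf("%d -> %s\n", n, sequenceProfile(n))
	}
}
```

Keeping the thresholds in one ordered `switch` makes the boundaries easy to compare against the documentation.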
@@ -254,6 +256,8 @@ ghostscan treats isolated and very short invisible-character findings differentl
 
 - isolated invisible characters default to `LOW` unless they appear inside a token-like region or are elevated by nearby decode/execute markers
 - short runs in prose-like, comment-like, whitespace-like, and data-like contexts default to `LOW`
+- low-signal invisible findings may be suppressed in ordinary test source only when they appear in benign string, comment, whitespace, or prose contexts with no nearby decode, execution, shell, or build markers
+- build, release, packaging, CI, shell, and parser-sensitive fixture inputs are not softened by test-like path hints alone
 - short runs in code-like strings or unknown regions stay visible and usually land at `MEDIUM`
 - token-like invisible findings remain `HIGH`
 - long invisible runs and payload findings stay strong regardless of surrounding file shape

engine/classify.go

Lines changed: 147 additions & 12 deletions
@@ -36,6 +36,11 @@ const (
 	fileShapeProseLike = "prose_like"
 	fileShapeUnknown   = "unknown"
 
+	fileRoleLocaleData   = "locale_data"
+	fileRoleTestFixture  = "test_fixture"
+	fileRoleBuildRelease = "build_release"
+	fileRoleUnknown      = "unknown"
+
 	regionFileStart      = "file_start"
 	regionWhitespaceLike = "whitespace_like"
 	regionStringLike     = "string_like"
@@ -53,6 +58,7 @@ const (
 
 type fileClassification struct {
 	shape string
+	role  string
 }
 
 type invisibleTraits struct {
@@ -80,6 +86,7 @@ func classifyAndFilterFindings(fileContext *Context, findings []Finding) []Findi
 	shape := classifyFileShape(fileContext.Text)
 	classification := fileClassification{
 		shape: shape,
+		role:  classifyFileRole(fileContext.Path),
 	}
 	obsIndex := buildObservationIndex(fileContext.Observations)
 
@@ -88,6 +95,9 @@ func classifyAndFilterFindings(fileContext *Context, findings []Finding) []Findi
 		if isSuppressedFileStartBOM(fileContext, item) {
 			continue
 		}
+		if isSuppressedLowSignalInvisible(fileContext, classification, obsIndex, item) {
+			continue
+		}
 		item.Severity = classifyFindingSeverity(fileContext, classification, obsIndex, item)
 		item.Message = classifyFindingMessage(classification, item)
 		filtered = append(filtered, item)
@@ -132,6 +142,41 @@ func classifyFindingSeverity(fileContext *Context, classification fileClassifica
 	return severity
 }
 
+func isSuppressedLowSignalInvisible(fileContext *Context, classification fileClassification, obsIndex map[posKey]Observation, item Finding) bool {
+	if item.RuleID != detector.InvisibleRuleID {
+		return false
+	}
+	if classification.role != fileRoleTestFixture {
+		return false
+	}
+	if hasNearbyDecoderMarker(fileContext.Prepass.DecoderMarkers, item, 20) {
+		return false
+	}
+
+	region := classifyFindingRegion(fileContext, classification.shape, obsIndex, item)
+	if region == regionTokenLike {
+		return false
+	}
+
+	profile := classifySequenceProfile(suspiciousRuneCountForFinding(item))
+	if profile != sequenceIsolated && profile != sequenceShortRun {
+		return false
+	}
+
+	line := lineText(fileContext, item.Line)
+	before, after := splitLineAroundColumn(line, item.Column)
+	if isSensitiveBuildOrExecContext(classification, line, before, after) {
+		return false
+	}
+
+	switch region {
+	case regionCommentLike, regionStringLike, regionWhitespaceLike, regionProseLike:
+		return true
+	default:
+		return classification.shape == fileShapeProseLike || classification.shape == fileShapeDataLike
+	}
+}
+
 func classifyFindingMessage(classification fileClassification, item Finding) string {
 	if item.RuleID != detector.InvisibleRuleID {
 		return item.Message
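
The guard order in `isSuppressedLowSignalInvisible` can be read as a chain of vetoes followed by one positive check. The sketch below restates that chain over plain values so it runs on its own. The `input` struct, the `shouldSuppress` name, and the profile labels are hypothetical; the role and region strings follow the constants visible in this diff where they appear, and the rest are assumed.

```go
package main

import "fmt"

// input flattens the signals the real function derives from the file context.
// Every field name here is an assumption made for this standalone sketch.
type input struct {
	isInvisibleRule bool   // finding comes from the invisible-character rule
	role            string // file role hint, e.g. "test_fixture"
	nearDecoder     bool   // decoder marker within 20 lines of the finding
	region          string // local finding region
	profile         string // sequence-length bucket (labels are hypothetical)
	sensitive       bool   // build, shell, or exec context on the same line
	shape           string // file shape
}

// shouldSuppress mirrors the veto chain: any single risky signal keeps the
// finding visible, and only benign regions (or benign shapes) suppress it.
func shouldSuppress(in input) bool {
	if !in.isInvisibleRule || in.role != "test_fixture" || in.nearDecoder {
		return false
	}
	if in.region == "token_like" {
		return false
	}
	if in.profile != "isolated" && in.profile != "short_run" {
		return false
	}
	if in.sensitive {
		return false
	}
	switch in.region {
	case "comment_like", "string_like", "whitespace_like", "prose_like":
		return true
	default:
		return in.shape == "prose_like" || in.shape == "data_like"
	}
}

func main() {
	benign := input{isInvisibleRule: true, role: "test_fixture", region: "string_like", profile: "isolated", shape: "code_like"}
	fmt.Println("benign string in test source:", shouldSuppress(benign))

	risky := benign
	risky.nearDecoder = true
	fmt.Println("same finding near a decoder:", shouldSuppress(risky))
}
```

Ordering the vetoes first means a new risky signal can be added as one more early return without touching the positive branch.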
@@ -185,6 +230,9 @@ func invisibleSeverity(classification fileClassification, region, profile string
 		if region == regionCommentLike || region == regionWhitespaceLike || region == regionProseLike {
 			return SeverityLow
 		}
+		if classification.role == fileRoleLocaleData && region != regionTokenLike {
+			return SeverityLow
+		}
 		if classification.shape == fileShapeProseLike || classification.shape == fileShapeDataLike {
 			return SeverityLow
 		}
@@ -199,6 +247,8 @@ func invisibleSeverity(classification fileClassification, region, profile string
 		return SeverityHigh
 	case traits.onlyFEFF:
 		return SeverityLow
+	case classification.role == fileRoleLocaleData:
+		return SeverityLow
 	case classification.shape == fileShapeProseLike || region == regionCommentLike || region == regionWhitespaceLike:
 		return SeverityLow
 	case region == regionStringLike && classification.shape == fileShapeDataLike:
@@ -229,24 +279,26 @@ func defaultSeverity(ruleID string) Severity {
 }
 
 func applyDecoderProximity(severity Severity, markers []Marker, item Finding) Severity {
+	if hasNearbyDecoderMarker(markers, item, 5) {
+		return upgradeSeverity(severity)
+	}
+	if severity == SeverityHigh && hasNearbyDecoderMarker(markers, item, 20) {
+		return upgradeSeverity(severity)
+	}
+	return severity
+}
+
+func hasNearbyDecoderMarker(markers []Marker, item Finding, maxDistance int) bool {
 	if len(markers) == 0 {
-		return severity
+		return false
 	}
-	bestDistance := 1 << 30
 	for _, marker := range markers {
 		distance := finding.LineDistance(item.Line, marker.Line)
-		if distance < bestDistance {
-			bestDistance = distance
+		if distance <= maxDistance {
+			return true
 		}
 	}
-	switch {
-	case bestDistance == 0 || bestDistance <= 5:
-		return upgradeSeverity(severity)
-	case bestDistance <= 20 && severity == SeverityHigh:
-		return upgradeSeverity(severity)
-	default:
-		return severity
-	}
+	return false
 }
 
 func upgradeSeverity(severity Severity) Severity {
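
The refactored proximity logic in this hunk is a two-tier check: any marker within 5 lines escalates by one level, and a marker within 20 lines escalates only findings that are already `HIGH`. The standalone sketch below shows that behavior under two assumptions that this diff does not confirm: that `finding.LineDistance` is an absolute line difference, and that `upgradeSeverity` raises by one level and caps at the top. All names in the block are hypothetical stand-ins.

```go
package main

import "fmt"

// Severity is a stand-in for the engine's severity type (assumed ordering).
type Severity int

const (
	Low Severity = iota
	Medium
	High
	Critical
)

// upgrade raises severity by one level, capped at Critical (assumed behavior).
func upgrade(s Severity) Severity {
	if s < Critical {
		return s + 1
	}
	return s
}

// lineDistance is assumed to be the absolute difference between line numbers.
func lineDistance(a, b int) int {
	if a > b {
		return a - b
	}
	return b - a
}

// hasNearbyMarker reports whether any marker line is within maxDistance lines.
func hasNearbyMarker(markerLines []int, findingLine, maxDistance int) bool {
	for _, m := range markerLines {
		if lineDistance(findingLine, m) <= maxDistance {
			return true
		}
	}
	return false
}

// applyProximity applies the two tiers: 5 lines for any severity, 20 lines
// only for findings that are already High. It upgrades at most once.
func applyProximity(s Severity, markerLines []int, findingLine int) Severity {
	if hasNearbyMarker(markerLines, findingLine, 5) {
		return upgrade(s)
	}
	if s == High && hasNearbyMarker(markerLines, findingLine, 20) {
		return upgrade(s)
	}
	return s
}

func main() {
	markers := []int{10}
	fmt.Println(applyProximity(Low, markers, 12))    // marker 2 lines away
	fmt.Println(applyProximity(Medium, markers, 25)) // marker 15 lines away
	fmt.Println(applyProximity(High, markers, 25))   // marker 15 lines away
}
```

Returning a boolean from the helper, rather than tracking the best distance, is what lets the suppression path reuse the same check with a different radius.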
@@ -303,6 +355,67 @@ func containsAny(text string, needles ...string) bool {
 	return false
 }
 
+func classifyFileRole(path string) string {
+	normalized := strings.ToLower(strings.ReplaceAll(path, "\\", "/"))
+	base := normalized
+	if slash := strings.LastIndex(normalized, "/"); slash >= 0 {
+		base = normalized[slash+1:]
+	}
+
+	switch {
+	case isBuildReleasePath(normalized, base):
+		return fileRoleBuildRelease
+	case isLocaleDataPath(normalized, base):
+		return fileRoleLocaleData
+	case isTestFixturePath(normalized, base):
+		return fileRoleTestFixture
+	default:
+		return fileRoleUnknown
+	}
+}
+
+func isBuildReleasePath(normalized, base string) bool {
+	switch base {
+	case "makefile", "gnumakefile", "configure", "config.guess", "config.sub", "meson.build", "build.gradle":
+		return true
+	}
+	for _, suffix := range []string{".sh", ".bash", ".zsh", ".mk", ".m4", ".am", ".ac", ".cmake", ".spec"} {
+		if strings.HasSuffix(base, suffix) {
+			return true
+		}
+	}
+	for _, marker := range []string{"/.github/workflows/", "/.gitlab-ci", "/debian/", "/packaging/", "/scripts/release", "/ci/"} {
+		if strings.Contains(normalized, marker) {
+			return true
+		}
+	}
+	return false
+}
+
+func isLocaleDataPath(normalized, base string) bool {
+	if !(strings.HasSuffix(base, ".yml") || strings.HasSuffix(base, ".yaml") || strings.HasSuffix(base, ".json") || strings.HasSuffix(base, ".po") || strings.HasSuffix(base, ".pot")) {
+		return false
+	}
+	for _, marker := range []string{"/locales/", "/locale/", "/i18n/", "/translations/", "/lang/"} {
+		if strings.Contains(normalized, marker) {
+			return true
+		}
+	}
+	return false
+}
+
+func isTestFixturePath(normalized, base string) bool {
+	if strings.Contains(normalized, "/test/") || strings.Contains(normalized, "/tests/") || strings.Contains(normalized, "/spec/") {
+		return true
+	}
+	for _, suffix := range []string{"_test.go", "_test.exs", "_test.ex", "_spec.rb", "_test.py", "_test.js", "_spec.js", "_test.ts", "_spec.ts", ".test.js", ".spec.js", ".test.ts", ".spec.ts"} {
+		if strings.HasSuffix(base, suffix) {
+			return true
+		}
+	}
+	return false
+}
+
 func classifyFileShape(text string) string {
 	metrics := collectFileShapeMetrics(text)
 	if metrics.visibleRunes == 0 || metrics.nonEmptyLines == 0 {
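
The order of the `switch` cases in `classifyFileRole` matters: build and release hints are checked first, so a shell script under a `tests/` directory is treated as build-sensitive rather than as ordinary test source. The sketch below demonstrates that precedence with a deliberately trimmed rule set. The function name `classifyRole` and the reduced lists are assumptions for illustration; only the normalization steps and the case order are taken from the diff.

```go
package main

import (
	"fmt"
	"strings"
)

// classifyRole is a trimmed-down stand-in for classifyFileRole. It keeps the
// same normalization and the same precedence (build, then locale, then test)
// but checks only a few of the hints from the full lists.
func classifyRole(path string) string {
	normalized := strings.ToLower(strings.ReplaceAll(path, "\\", "/"))
	base := normalized
	if slash := strings.LastIndex(normalized, "/"); slash >= 0 {
		base = normalized[slash+1:]
	}

	switch {
	case strings.HasSuffix(base, ".sh") || strings.Contains(normalized, "/.github/workflows/"):
		return "build_release"
	case strings.HasSuffix(base, ".yml") && strings.Contains(normalized, "/locales/"):
		return "locale_data"
	case strings.HasSuffix(base, "_test.go") || strings.Contains(normalized, "/tests/"):
		return "test_fixture"
	default:
		return "unknown"
	}
}

func main() {
	for _, p := range []string{
		"/repo/tests/payload_test.sh",    // build hint wins over the test directory
		"/repo/config/locales/fr.yml",    // locale directory plus data extension
		"/repo/src/example_test.go",      // test filename suffix
		"/repo/.github/workflows/ci.yml", // CI path wins over the .yml extension
		"/repo/src/app.go",               // no hint matches
	} {
		fmt.Printf("%s -> %s\n", p, classifyRole(p))
	}
}
```

Because directory hints are matched with a leading slash, a relative path such as `tests/a.txt` would not match `/tests/` in this sketch; whether the engine always passes absolute paths is not shown in the diff.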
@@ -509,6 +622,28 @@ func classifyFindingRegion(fileContext *Context, shape string, obsIndex map[posK
 	return regionUnknown
 }
 
+func isSensitiveBuildOrExecContext(classification fileClassification, line, before, after string) bool {
+	if classification.role == fileRoleBuildRelease {
+		return true
+	}
+
+	window := strings.ToLower(before + after)
+	lineLower := strings.ToLower(line)
+	if containsAny(window,
+		"eval(", "exec(", "system(", "popen(", "buffer.from(", "atob(", "btoa(",
+		"base64", "openssl", "sed ", "awk ", "perl ", "python ", "ruby ",
+		"bash ", "sh ", "xz ", "tar ", "gzip ", "printf", "tr ", "$( ", "$(",
+		"`", "|", "&&", "||",
+	) {
+		return true
+	}
+
+	return containsAny(lineLower,
+		"aclocal", "automake", "autoconf", "libtool", "pkg-config", "cmake",
+		"meson", "ninja", "make ", "makefile", "install-sh", "debhelper",
+	)
+}
+
 func lineText(fileContext *Context, line int) string {
 	if fileContext == nil || line < 1 || line > len(fileContext.LineStarts) {
 		return ""
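
`isSensitiveBuildOrExecContext` relies on case-insensitive substring matching against the finding's line. A minimal sketch of that technique follows. The body of `containsAny` is not visible in this diff, so the version here is an assumed implementation, and `looksSensitive` is a hypothetical helper that checks the whole line against a short subset of the needles instead of the `before + after` window.

```go
package main

import (
	"fmt"
	"strings"
)

// containsAny reports whether text contains at least one of the needles.
// This is an assumed implementation; the real helper's body is not shown.
func containsAny(text string, needles ...string) bool {
	for _, needle := range needles {
		if strings.Contains(text, needle) {
			return true
		}
	}
	return false
}

// looksSensitive lowercases the line and checks a small subset of the
// exec and shell markers used by the real function.
func looksSensitive(line string) bool {
	lower := strings.ToLower(line)
	return containsAny(lower, "eval(", "atob(", "base64", "$(", "|", "&&")
}

func main() {
	for _, line := range []string{
		`const s = "ab"; EVAL(s)`,
		`assert value == "ab"`,
		`echo a | tr x y`,
	} {
		fmt.Printf("%q sensitive=%v\n", line, looksSensitive(line))
	}
}
```

Since the full needle list includes very short tokens such as `|` and a backtick, many ordinary lines will probably count as sensitive, which errs toward keeping findings visible rather than suppressing them.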

engine/classify_test.go

Lines changed: 86 additions & 0 deletions
@@ -186,6 +186,7 @@ func TestSeverityPolicy(t *testing.T) {
 		{name: "short invisible run in prose low", path: "docs/notes.txt", content: proseWith("a \u200B\u200B hidden"), ruleID: detector.InvisibleRuleID, line: 1, column: 8, want: finding.SeverityLow, message: "Short invisible Unicode sequence detected"},
 		{name: "short invisible run in token high", path: "src/app.go", content: "const pa\u200B\u200Bss = 1;\n", ruleID: detector.InvisibleRuleID, line: 1, column: 9, want: finding.SeverityHigh, message: "Short invisible Unicode sequence detected"},
 		{name: "short invisible run unknown medium", path: "misc/blob", content: "alpha \u200B\u200B omega\n", ruleID: detector.InvisibleRuleID, line: 1, column: 7, want: finding.SeverityMedium, message: "Short invisible Unicode sequence detected"},
+		{name: "short invisible run in locale data low", path: "config/locales/fr.yml", content: strings.Repeat("title: bonjour\n", 12) + "subtitle: a\u200B\u200Bb\n", ruleID: detector.InvisibleRuleID, line: 13, column: 12, want: finding.SeverityLow, message: "Short invisible Unicode sequence detected"},
 		{name: "bidi remains high in comments", path: "docs/comment", content: "// note \u202E hidden\n", ruleID: detector.BidiRuleID, line: 1, column: 9, want: finding.SeverityHigh},
 		{name: "long invisible run critical", path: "src/blob.go", content: "const x = \"" + strings.Repeat("\u200B", 16) + "\";\n", ruleID: detector.InvisibleRuleID, line: 1, column: 12, want: finding.SeverityCritical, message: "Long invisible Unicode run suggests encoded payload"},
 		{name: "repeated feff run remains strong", path: "src/blob.go", content: "const x = \"" + strings.Repeat("\uFEFF", 6) + "\";\n", ruleID: detector.InvisibleRuleID, line: 1, column: 12, want: finding.SeverityHigh, message: "Repeated U+FEFF invisible sequence detected"},
@@ -256,6 +257,91 @@ func TestContentAndRegionSeverityShapingOnlySoftensLowSignalInvisibleFindings(t
 	}
 }
 
+func TestLowSignalInvisibleSuppressionNeedsMultipleBenignSignals(t *testing.T) {
+	t.Parallel()
+
+	tests := []struct {
+		name     string
+		path     string
+		content  string
+		ruleID   string
+		line     int
+		column   int
+		wantGone bool
+		want     finding.Severity
+	}{
+		{
+			name:     "test fixture string is suppressed",
+			path:     "lib/example_test.exs",
+			content:  "assert value == \"a\uFEFFb\"\n",
+			ruleID:   detector.InvisibleRuleID,
+			line:     1,
+			column:   19,
+			wantGone: true,
+		},
+		{
+			name:    "test fixture token remains high",
+			path:    "src/example_test.go",
+			content: "const pa\u200Bss = 1\n",
+			ruleID:  detector.InvisibleRuleID,
+			line:    1,
+			column:  9,
+			want:    finding.SeverityHigh,
+		},
+		{
+			name:    "test fixture long run remains critical",
+			path:    "tests/payload_test.sh",
+			content: "blob=\"" + strings.Repeat("\u200B", 16) + "\"\n",
+			ruleID:  detector.InvisibleRuleID,
+			line:    1,
+			column:  7,
+			want:    finding.SeverityCritical,
+		},
+		{
+			name:    "build release file does not soften",
+			path:    "scripts/release.sh",
+			content: "printf 'a\u200B\u200Bb'\n",
+			ruleID:  detector.InvisibleRuleID,
+			line:    1,
+			column:  10,
+			want:    finding.SeverityMedium,
+		},
+		{
+			name:    "test fixture near eval does not suppress",
+			path:    "tests/fixture_test.js",
+			content: "const s = \"a\u200B\u200Bb\"; eval(s)\n",
+			ruleID:  detector.InvisibleRuleID,
+			line:    1,
+			column:  13,
+			want:    finding.SeverityMedium,
+		},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			t.Parallel()
+			path := writeNestedTempFile(t, tt.path, tt.content)
+			got, err := NewEngine().ScanFile(context.Background(), path)
+			if err != nil {
+				t.Fatalf("ScanFile() error = %v", err)
+			}
+			item, ok := findFindingAt(got, tt.ruleID, tt.line, tt.column)
+			if tt.wantGone {
+				if ok {
+					t.Fatalf("finding = %#v, want suppressed", item)
+				}
+				return
+			}
+			if !ok {
+				t.Fatalf("finding %s at %d:%d not found in %#v", tt.ruleID, tt.line, tt.column, got)
+			}
+			if item.Severity != tt.want {
+				t.Fatalf("Severity = %q, want %q", item.Severity, tt.want)
+			}
+		})
+	}
+}
+
 func TestEndToEndClassificationRegressionShapes(t *testing.T) {
 	t.Parallel()
 