Commit 7363043

feat(calibration): add semantic response diffing signals
1 parent 7782e9d commit 7363043

7 files changed

Lines changed: 393 additions & 5 deletions

CHANGELOG.md

Lines changed: 10 additions & 0 deletions
@@ -1,6 +1,16 @@
 CHANGELOG
 =======
 
+v5.14.3 (01.05.2026)
+---------------------------
+- (enhancement) improved `--auto-calibrate` with lightweight semantic response diffing for soft-404 detection
+- (enhancement) added visible-text, semantic phrase, semantic term, DOM-token and text-density calibration signals
+- (enhancement) improved dynamic body normalization for emails, path-like fragments and long encoded tokens
+- (enhancement) semantic calibration remains opt-in through the existing `--auto-calibrate` flow and does not change default scan behaviour
+- (tests) added regression coverage for semantic soft-404 matching and calibration helper edge cases
+- (tests) full unittest suite passes after integration
+- (tests) coverage gate passes at `99%`
+
 v5.14.2 (01.05.2026)
 ---------------------------
 - (enhancement) extended `--header-bypass` with controlled path-manipulation probes after header-injection probes

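The changelog entry on dynamic body normalization can be illustrated with a minimal sketch. It assumes only the three regular expressions that this commit appends to `DYNAMIC_PATTERNS`; the `mask_dynamic` helper, the `NEW_DYNAMIC_PATTERNS` name and the sample strings are hypothetical and not part of OpenDoor, and the patterns that already existed before this commit are left out.

```python
import re

# The three patterns appended to DYNAMIC_PATTERNS in this commit.
# Older patterns from the same tuple are intentionally omitted here.
NEW_DYNAMIC_PATTERNS = (
    r'\b[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}\b',  # email addresses
    r'/(?:[a-z0-9._~%+-]+/)*[a-z0-9._~%+-]+',       # path-like fragments
    r'\b[a-z0-9+/]{32,}={0,2}\b',                   # long encoded tokens
)


def mask_dynamic(text):
    """Hypothetical helper: replace dynamic fragments with a fixed marker."""
    value = text.lower()
    for pattern in NEW_DYNAMIC_PATTERNS:
        value = re.sub(pattern, '<dynamic>', value, flags=re.IGNORECASE)
    return value


# Two error messages that differ only in the echoed path and contact address.
first = mask_dynamic('No page at /shop/item-41, contact ops@example.com')
second = mask_dynamic('No page at /blog/post-7, contact help@example.org')

# A 40-character token stands in for a trace ID or encoded value.
token_line = mask_dynamic('trace ' + 'ab12' * 10)

print(first)
print(first == second)
print(token_line)
```

After masking, both messages collapse to the same string, which is what lets hash-based signals match soft-404 pages that echo the requested path.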
README.md

Lines changed: 2 additions & 2 deletions
@@ -59,7 +59,7 @@ It helps security researchers, penetration testers, bug bounty hunters, DevSecOp
 - custom wordlists, prefixes, and extension filters;
 - custom request headers, cookies, and raw HTTP request templates;
 - response filters by status, size, text, regex, and body length;
-- smart auto-calibration for soft-404, wildcard, and catch-all responses;
+- smart auto-calibration for soft-404, wildcard, catch-all, and semantic response-diff cases;
 - technology fingerprint detection CMS, ecommerce platforms, frameworks;
 - passive WAF detection and WAF-safe scan mode;
 - controlled header and path bypass probes for blocked `401` and `403` resources;
@@ -85,7 +85,7 @@ OpenDoor focuses on **context-aware discovery** instead of blind enumeration.
 | **Fingerprint-first scanning** | OpenDoor can identify probable CMS platforms, frameworks, infrastructure providers, and WAF signals before deeper discovery. This helps you scan with context instead of blindly throwing a generic wordlist at the target. |
 | **WAF-aware behavior** | OpenDoor can detect probable WAF / anti-bot behavior and switch to a safer runtime profile with `--waf-safe-mode`, reducing noisy blocked scans and making defensive responses easier to understand. |
 | **Controlled bypass evidence** | OpenDoor can optionally probe blocked `401` and `403` resources with controlled header-injection and path-manipulation variants. It records exact evidence such as bypass type, header or path variant, probe value, original status code, and resulting status code without mutating global scan headers. |
-| **Multi-signal auto-calibration** | OpenDoor does not rely only on status code or response size. It compares multiple response signals such as body hashes, HTML structure, titles, redirects, stable headers, word count, line count, and normalized dynamic tokens to reduce soft-404 and wildcard false positives. |
+| **Multi-signal auto-calibration** | OpenDoor does not rely only on status code or response size. It compares multiple response signals such as body hashes, visible text, semantic soft-404 phrases, DOM-token structure, titles, redirects, stable headers, word count, line count, text density, and normalized dynamic tokens to reduce soft-404 and wildcard false positives. |
 | **Transport-level workflows** | OpenDoor supports direct, proxy, OpenVPN, and WireGuard transport modes. It can also rotate transport profiles per target in authorized batch scans, which is not the same as manually starting a VPN before running a scanner. |
 | **Resumable long scans** | OpenDoor can save scan checkpoints and resume later. This matters when scans are interrupted by crashes, unstable networks, blocked routes, terminal disconnects, or long multi-target jobs. |
 | **CI/CD-ready results** | OpenDoor can return a failing exit code only when selected result buckets are found, making it usable as a release gate or exposure regression check without custom post-processing scripts. |

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-5.14.2
+5.14.3

docs/Usage.md

Lines changed: 2 additions & 0 deletions
@@ -398,6 +398,7 @@ opendoor \
 ## 🧠 Auto-calibration
 
 Auto-calibration helps classify soft-404, wildcard, and catch-all responses.
+Starting with OpenDoor 5.14.3, it also uses lightweight semantic response-diff signals such as visible text, soft-404 phrases, DOM-token structure, text density, and normalized dynamic fragments.
 
 ```shell
 opendoor --host https://example.com --auto-calibrate
@@ -418,6 +419,7 @@ opendoor --host https://example.com --auto-calibrate --calibration-threshold 0.8
 The threshold accepts values from `0.01` to `1.0`.
 
 Use auto-calibration when a target returns similar pages for invalid and valid paths.
+It is especially useful when dynamic 404 templates contain changing tokens, timestamps, trace IDs, A/B wrappers, or personalized fragments.
 
 ---
 

docs/detection/auto-calibration.md

Lines changed: 19 additions & 0 deletions
@@ -116,3 +116,22 @@ opendoor \
 --sniff skipempty,collation,indexof,file \
 --exclude-size-range 0-256
 ```
+
+## Semantic response diffing
+
+OpenDoor 5.14.3 extends auto-calibration with lightweight semantic response-diff signals.
+
+When `--auto-calibrate` is enabled, calibration signatures include:
+
+- normalized visible text;
+- known soft-404 phrases;
+- stable semantic terms;
+- bounded DOM-tag tokens;
+- content kind (`html`, `json`, `text`, or `empty`);
+- visible-text density;
+- existing status, bucket, size, title, redirect, body hash, skeleton hash, word count, line count, and stable headers.
+
+This helps detect dynamic soft-404 templates where the HTML wrapper changes but the response has the same meaning, such as “page not found”, “requested resource does not exist”, changing trace IDs, CSRF-like values, timestamps, or path echoes.
+
+The feature is part of the existing `--auto-calibrate` flow. It does not run unless auto-calibration is explicitly enabled.
+

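The idea that two pages with different wrappers can carry the same meaning can be shown with a small sketch. It is a simplified take on the `_visible_text` and `_semantic_phrases` helpers from this commit, not the shipped code: it uses only a subset of `SOFT_404_PHRASES`, merges the script and style stripping into one pattern, and skips the dynamic-token masking step. The function names and sample pages are hypothetical.

```python
import html
import re

# Subset of the SOFT_404_PHRASES tuple added in this commit.
PHRASES = ('404', 'not found', 'page not found', 'does not exist')


def visible_text(body):
    """Simplified visible-text extraction: drop markup, keep lower-case words."""
    value = html.unescape(body).lower()
    value = re.sub(r'<!--.*?-->', ' ', value, flags=re.DOTALL)
    value = re.sub(r'<(script|style)\b[^>]*>.*?</\1>', ' ', value, flags=re.DOTALL)
    value = re.sub(r'<[^>]+>', ' ', value)
    value = re.sub(r'[^a-z0-9]+', ' ', value)
    return value.strip()


def soft_404_phrases(text):
    """Return the known phrases found in the text, in tuple order."""
    return [phrase for phrase in PHRASES if phrase in text]


# Same message, different wrapper markup, comment and inline script.
page_a = ('<html><body><div class="v1"><h1>Page not found</h1>'
          '<!-- trace 9f2c --></div></body></html>')
page_b = ('<html><body><section id="b"><p>Page&nbsp;not found</p>'
          '<script>var t=171;</script></section></body></html>')

text_a = visible_text(page_a)
text_b = visible_text(page_b)

print(text_a == text_b)
print(soft_404_phrases(text_a))
```

Both pages reduce to the same visible text, so a hash of that text matches even though the raw bodies and their tag sequences differ.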
src/lib/browser/calibration.py

Lines changed: 256 additions & 2 deletions
@@ -26,6 +26,32 @@ class Calibration(object):
         r'csrf[_-]?token=["\'][^"\']+["\']',
         r'nonce=["\'][^"\']+["\']',
         r'([?&][a-z0-9_-]+)=([^&#\s"\']+)',
+        r'\b[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}\b',
+        r'/(?:[a-z0-9._~%+-]+/)*[a-z0-9._~%+-]+',
+        r'\b[a-z0-9+/]{32,}={0,2}\b',
+    )
+    SEMANTIC_STOPWORDS = frozenset((
+        'a', 'about', 'after', 'all', 'an', 'and', 'are', 'as', 'at', 'be', 'by', 'can',
+        'could', 'did', 'do', 'does', 'for', 'from', 'go', 'has', 'have', 'if', 'in',
+        'is', 'it', 'its', 'may', 'not', 'of', 'on', 'or', 'our', 'page', 'please',
+        'request', 'requested', 'site', 'that', 'the', 'this', 'to', 'try', 'url',
+        'was', 'we', 'were', 'with', 'you', 'your'
+    ))
+    SOFT_404_PHRASES = (
+        '404',
+        'not found',
+        'page not found',
+        'cannot be found',
+        'could not be found',
+        'does not exist',
+        'no longer exists',
+        'missing page',
+        'resource missing',
+        'resource not found',
+        'requested page',
+        'requested resource',
+        'unavailable',
+        'unknown page',
     )
 
     def __init__(self, signatures=None, threshold=None):
@@ -89,7 +115,9 @@ def build_signature(cls, response, response_data):
 
         body = cls._body(response)
         normalized_body = cls._normalize_body(body)
+        visible_text = cls._visible_text(body)
         skeleton = cls._body_skeleton(body)
+        dom_tokens = cls._dom_tokens(body)
 
         return {
             'bucket': str(response_data[0]),
@@ -102,6 +130,13 @@ def build_signature(cls, response, response_data):
             'redirect_location': cls._redirect_location(response),
             'normalized_body_hash': cls._hash(normalized_body),
             'body_skeleton_hash': cls._hash(skeleton),
+            'visible_text_hash': cls._hash(visible_text),
+            'content_kind': cls._content_kind(body),
+            'semantic_phrases': cls._semantic_phrases(visible_text),
+            'semantic_terms': cls._semantic_terms(visible_text),
+            'dom_tokens': dom_tokens,
+            'dom_token_hash': cls._hash(' '.join(dom_tokens)),
+            'text_density': cls._text_density(body),
             'header_fingerprint': cls._header_fingerprint(response),
         }
 
@@ -165,6 +200,46 @@ def _score(cls, baseline, candidate):
             score += 0.15
             reasons.append('skeleton-hash')
 
+        if baseline.get('visible_text_hash') and baseline.get('visible_text_hash') == candidate.get('visible_text_hash'):
+            score += 0.20
+            reasons.append('visible-text')
+
+        phrase_score = cls._jaccard_similarity(
+            baseline.get('semantic_phrases') or [],
+            candidate.get('semantic_phrases') or []
+        )
+        if phrase_score >= 0.50:
+            score += 0.16 * phrase_score
+            reasons.append('semantic-phrases')
+
+        term_score = cls._jaccard_similarity(
+            baseline.get('semantic_terms') or [],
+            candidate.get('semantic_terms') or []
+        )
+        if term_score >= 0.55:
+            score += 0.12 * term_score
+            reasons.append('semantic-terms')
+
+        dom_score = cls._sequence_similarity(
+            baseline.get('dom_tokens') or [],
+            candidate.get('dom_tokens') or []
+        )
+        if dom_score >= 0.65:
+            score += 0.10 * dom_score
+            reasons.append('dom-structure')
+
+        density_score = cls._ratio_similarity(
+            baseline.get('text_density'),
+            candidate.get('text_density')
+        )
+        if density_score >= 0.85:
+            score += 0.04 * density_score
+            reasons.append('text-density')
+
+        if baseline.get('content_kind') and baseline.get('content_kind') == candidate.get('content_kind'):
+            score += 0.02
+            reasons.append('content-kind')
+
         if baseline.get('title') and baseline.get('title') == candidate.get('title'):
             score += 0.06
             reasons.append('title')
@@ -247,6 +322,115 @@ def _normalize_body(cls, body):
         value = re.sub(r'\s+', ' ', value)
         return value.strip()
 
+    @classmethod
+    def _visible_text(cls, body):
+        """
+        Build normalized visible text without HTML wrappers or dynamic tokens.
+
+        :param str body:
+        :return: str
+        """
+
+        value = html.unescape(str(body or '')).lower()
+        value = re.sub(r'<!--.*?-->', ' ', value, flags=re.DOTALL)
+        value = re.sub(r'<script\b[^>]*>.*?</script>', ' ', value, flags=re.DOTALL | re.IGNORECASE)
+        value = re.sub(r'<style\b[^>]*>.*?</style>', ' ', value, flags=re.DOTALL | re.IGNORECASE)
+        value = re.sub(r'<[^>]+>', ' ', value)
+
+        for pattern in cls.DYNAMIC_PATTERNS:
+            value = re.sub(pattern, '<dynamic>', value, flags=re.IGNORECASE)
+
+        value = re.sub(r'[^a-z0-9<>]+', ' ', value)
+        value = re.sub(r'\s+', ' ', value)
+        return value.strip()
+
+    @staticmethod
+    def _content_kind(body):
+        """
+        Classify response body kind for calibration scoring.
+
+        :param str body:
+        :return: str
+        """
+
+        value = str(body or '').lstrip().lower()
+
+        if value.startswith('{') or value.startswith('['):
+            return 'json'
+
+        if '<html' in value or '<!doctype html' in value or re.search(r'<[a-z][^>]*>', value):
+            return 'html'
+
+        if value:
+            return 'text'
+
+        return 'empty'
+
+    @staticmethod
+    def _dom_tokens(body):
+        """
+        Build a bounded sequence of HTML tag tokens.
+
+        :param str body:
+        :return: list[str]
+        """
+
+        tokens = re.findall(r'<\s*/?\s*([a-z0-9:-]+)', str(body or ''), flags=re.IGNORECASE)
+        return [token.lower() for token in tokens[:120]]
+
+    @classmethod
+    def _semantic_phrases(cls, text):
+        """
+        Extract known soft-404 semantic phrases from visible text.
+
+        :param str text:
+        :return: list[str]
+        """
+
+        value = str(text or '').lower()
+        phrases = []
+
+        for phrase in cls.SOFT_404_PHRASES:
+            if phrase in value:
+                phrases.append(phrase)
+
+        return phrases
+
+    @classmethod
+    def _semantic_terms(cls, text):
+        """
+        Extract stable semantic terms from visible response text.
+
+        :param str text:
+        :return: list[str]
+        """
+
+        terms = []
+
+        for term in re.findall(r'[a-z][a-z0-9_-]{2,}', str(text or '').lower()):
+            if term in cls.SEMANTIC_STOPWORDS:
+                continue
+            if term == 'dynamic':
+                continue
+            terms.append(term)
+
+        return sorted(set(terms))[:40]
+
+    @classmethod
+    def _text_density(cls, body):
+        """
+        Estimate visible-text density relative to HTML markup volume.
+
+        :param str body:
+        :return: float
+        """
+
+        raw = str(body or '')
+        if len(raw) <= 0:
+            return 0.0
+
+        return round(len(cls._visible_text(raw)) / float(max(len(raw), 1)), 4)
+
     @staticmethod
     def _body_skeleton(body):
         """
@@ -256,10 +440,10 @@ def _body_skeleton(body):
         :return: str
         """
 
-        tags = re.findall(r'<\s*/?\s*([a-z0-9:-]+)', str(body or ''), flags=re.IGNORECASE)
+        tags = Calibration._dom_tokens(body)
 
         if len(tags) > 0:
-            return ' '.join([tag.lower() for tag in tags])
+            return ' '.join(tags)
 
         text = re.sub(r'\w+', 'w', str(body or '').lower())
         text = re.sub(r'\s+', ' ', text)
@@ -456,6 +640,76 @@ def _numeric_similarity(left, right):
         maximum = max(abs(left), abs(right), 1)
         return max(0.0, 1.0 - (abs(left - right) / float(maximum)))
 
+    @staticmethod
+    def _ratio_similarity(left, right):
+        """
+        Return float-ratio similarity in range 0..1.
+
+        :param float|None left:
+        :param float|None right:
+        :return: float
+        """
+
+        try:
+            left = float(left)
+            right = float(right)
+        except (TypeError, ValueError):
+            return 0.0
+
+        maximum = max(abs(left), abs(right), 0.0001)
+        return max(0.0, 1.0 - (abs(left - right) / maximum))
+
+    @staticmethod
+    def _jaccard_similarity(left, right):
+        """
+        Return set-overlap similarity in range 0..1.
+
+        :param list[str] left:
+        :param list[str] right:
+        :return: float
+        """
+
+        left_set = set(left or [])
+        right_set = set(right or [])
+
+        if not left_set and not right_set:
+            return 0.0
+
+        union = left_set | right_set
+        if len(union) <= 0:
+            return 0.0
+
+        return len(left_set & right_set) / float(len(union))
+
+    @staticmethod
+    def _sequence_similarity(left, right):
+        """
+        Return lightweight sequence similarity in range 0..1.
+
+        :param list[str] left:
+        :param list[str] right:
+        :return: float
+        """
+
+        left = list(left or [])
+        right = list(right or [])
+
+        if not left and not right:
+            return 0.0
+        if not left or not right:
+            return 0.0
+
+        prefix = 0
+        for left_item, right_item in zip(left, right):
+            if left_item != right_item:
+                break
+            prefix += 1
+
+        overlap = len(set(left) & set(right)) / float(len(set(left) | set(right)))
+        prefix_score = prefix / float(max(len(left), len(right), 1))
+
+        return max(overlap, prefix_score)
+
     @staticmethod
     def _header_similarity(left, right):
         """

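The gating and weighting of the new signals in `_score` can be followed with a short worked sketch. The two functions below restate `_jaccard_similarity` and `_sequence_similarity` from this diff as standalone helpers; the shortened names and the sample inputs are hypothetical. Going by the weights in the diff, the six new signals can add at most 0.20 + 0.16 + 0.12 + 0.10 + 0.04 + 0.02 = 0.64 to a score; whether the total is capped elsewhere is not visible in this commit.

```python
def jaccard(left, right):
    """Set overlap in 0..1; two empty inputs score 0.0, as in the diff."""
    left_set, right_set = set(left), set(right)
    union = left_set | right_set
    if not union:
        return 0.0
    return len(left_set & right_set) / float(len(union))


def sequence_similarity(left, right):
    """Max of set overlap and shared-prefix ratio, mirroring the diff."""
    if not left or not right:
        return 0.0
    prefix = 0
    for left_item, right_item in zip(left, right):
        if left_item != right_item:
            break
        prefix += 1
    prefix_score = prefix / float(max(len(left), len(right)))
    return max(jaccard(left, right), prefix_score)


# Phrase signal: gate at 0.50, then add 0.16 * similarity (values from _score).
baseline = ['not found', 'page not found']
candidate = ['not found', 'page not found', 'unavailable']
phrase_score = jaccard(baseline, candidate)  # 2 shared phrases out of 3 distinct
bonus = 0.16 * phrase_score if phrase_score >= 0.50 else 0.0

# DOM signal: the same tags in a different order still reach full similarity,
# because the set-overlap term ignores ordering.
reordered = sequence_similarity(['html', 'body', 'div', 'p'],
                                ['html', 'body', 'p', 'div'])

print(round(phrase_score, 4), round(bonus, 4), reordered)
```

One consequence worth noting: because `_sequence_similarity` takes the maximum of set overlap and prefix ratio, the `dom-structure` signal is closer to a tag-vocabulary match than to a strict ordering check.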