* feat(audit): fix sitemap discovery, UA filtering, content negotiation, and domain-aware schema recs (1.10.0)
Bundle fixes for the four open issues in a single release:
- #32: sitemap auto-discovery now tries /sitemap.xml, then /sitemap-index.xml,
then the Sitemap: directive in /robots.txt before failing. Astro/Next.js
sites that only publish sitemap-index.xml are discovered without an explicit
URL.
- #34: when an auxiliary file (/llms.txt, /llms-full.txt, /robots.txt,
/sitemap.xml) 404s for the audit User-Agent, retry once with a browser UA.
If that succeeds, surface a UA-filtering finding so the user knows to allow
the audit/crawler UA through their CDN/WAF (typical Vercel/Cloudflare cause).
- #35: after a successful auxiliary fetch, probe once with Accept: text/markdown
to detect content-negotiation traps (sites redirecting .txt to non-existent
.md). Surface a content-negotiation finding so downstream AI tools that
prefer markdown don't silently fail.
- #33: structured-data and schema-completeness now detect site category
(SaaS/devtools, e-commerce, local-business, service-business, blog/content)
from JSON-LD, page text, and outbound links, and recommend schemas that
match. Safe fallback when no category is detected is Organization instead
of LocalBusiness.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
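A minimal sketch of the discovery order described for #32, for illustration only. The helper names `sitemapCandidates` and `parseSitemapDirectives` are hypothetical and are not taken from the aeo-audit source; the sketch only shows which URLs would be tried and how `Sitemap:` lines could be pulled out of a robots.txt body.

```typescript
// Hypothetical helpers; names are illustrative, not from the aeo-audit codebase.

// Well-known sitemap locations, in the order the commit message describes.
function sitemapCandidates(siteUrl: string): string[] {
  const origin = new URL(siteUrl).origin
  return [`${origin}/sitemap.xml`, `${origin}/sitemap-index.xml`]
}

// Final fallback: collect the raw values of `Sitemap:` lines from robots.txt.
// The field name is matched case-insensitively; commented-out lines are skipped
// because they do not start with the field name.
function parseSitemapDirectives(robotsTxt: string): string[] {
  return robotsTxt
    .split(/\r?\n/)
    .map((line) => /^\s*sitemap\s*:\s*(\S+)/i.exec(line)?.[1])
    .filter((value): value is string => value !== undefined)
}

const robots = [
  'User-agent: *',
  'Allow: /',
  'Sitemap: https://example.com/sitemap-index.xml',
  '# Sitemap: https://example.com/ignored.xml',
].join('\n')

console.log(sitemapCandidates('https://example.com/docs/intro'))
console.log(parseSitemapDirectives(robots))
```

A caller would request each candidate in turn and only read robots.txt when both well-known paths fail, which keeps the common case at one request.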
* fix(diagnostics): drop UA-filtering retry; #34 was a misdiagnosed content-negotiation case
Issue #34 hypothesized that Vercel was filtering by User-Agent and returning
404s. Re-reading the issue's own curl evidence — all UAs (default, node-fetch,
empty) returned 200 — confirms the root cause was actually the markdown
Accept header (i.e. issue #35). The aeo-audit tool already sends
Accept: */* so it isn't directly affected; the diagnostic in #35 is what
catches this pattern for downstream AI tools.
Removes:
- The browser-UA retry on auxiliary 404s in fetch-page.ts.
- `uaFiltering` from AuxiliaryDiagnostics.
- The UA-filtering finding/recommendation in the AI-Readable Content analyzer.
- The UA-filtering test case in fetch-auxiliary.test.ts.
- All "UA filtering" mentions in README, SKILL.md, and CHANGELOG.
The content-negotiation probe (issue #35) is retained and now credits both
#34 and #35 in the changelog as they share the same root cause.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
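The retained content-negotiation probe can be pictured roughly as below. This is a sketch under stated assumptions, not the code in fetch-page.ts: `StatusFetcher`, `isContentNegotiationTrap`, and `probeContentNegotiation` are hypothetical names, and the fetcher is injected so the sketch makes no claim about how the tool issues requests or handles redirects.

```typescript
// Hypothetical sketch; names are illustrative, not from the aeo-audit codebase.

// Assumed shape: performs a GET with the given headers and resolves to the
// final HTTP status code.
type StatusFetcher = (url: string, headers: Record<string, string>) => Promise<number>

function isOk(status: number): boolean {
  return status >= 200 && status < 300
}

// A trap exists only when the default request succeeds but the
// markdown-preferring request does not.
function isContentNegotiationTrap(defaultStatus: number, markdownStatus: number): boolean {
  return isOk(defaultStatus) && !isOk(markdownStatus)
}

async function probeContentNegotiation(url: string, fetchStatus: StatusFetcher): Promise<boolean> {
  const defaultStatus = await fetchStatus(url, { Accept: '*/*' })
  // A file that is missing for every client is a different finding, so stop here.
  if (!isOk(defaultStatus)) return false
  // Probe once with the Accept header that markdown-preferring tools send.
  const markdownStatus = await fetchStatus(url, { Accept: 'text/markdown' })
  return isContentNegotiationTrap(defaultStatus, markdownStatus)
}
```

Keeping the status comparison in a pure function makes the "OK by default, non-2xx for markdown" rule easy to test without network access.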
* fix(sitemap): close SSRF in robots.txt fallback and correct diagnostic label
`parseRobotsSitemap` now rejects directives whose resolved origin differs
from the audited site. Because `fetchSitemapBody` has no SSRF guard, an
attacker-controlled target could otherwise use its own robots.txt to steer
requests from the auditing host at internal IPs.
`pushDiagnosticFindings` now derives its label from `auxEntry.url`, so the
content-negotiation finding reflects `/sitemap-index.xml` when the new
sitemap fallback resolves there instead of `/sitemap.xml`.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
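The origin check described for `parseRobotsSitemap` could look roughly like the sketch below. It is an illustration under assumptions, not the actual implementation: `sameOriginSitemaps` is a hypothetical name, and it takes the raw directive values as input rather than parsing robots.txt itself.

```typescript
// Hypothetical sketch; the function name is illustrative, not from the aeo-audit codebase.

// Resolve each `Sitemap:` value against the audited site and keep only URLs
// whose origin (scheme + host + port) matches the audited site's origin.
function sameOriginSitemaps(directives: string[], siteUrl: string): string[] {
  const siteOrigin = new URL(siteUrl).origin
  const accepted: string[] = []
  for (const directive of directives) {
    let resolved: URL
    try {
      resolved = new URL(directive, siteUrl) // relative values resolve against the site
    } catch {
      continue // unparseable value: skip it rather than fail the audit
    }
    if (resolved.origin === siteOrigin) accepted.push(resolved.href)
  }
  return accepted
}

console.log(
  sameOriginSitemaps(
    [
      'https://example.com/sitemap-index.xml',
      'http://169.254.169.254/latest/meta-data/', // internal address: rejected
      '/nested/sitemap.xml',
    ],
    'https://example.com/',
  ),
)
```

Comparing origins keeps follow-up requests on the host the user asked to audit. It does not by itself cover redirects issued by that host, which would need a separate guard.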
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CHANGELOG.md: 10 additions & 0 deletions
@@ -1,5 +1,15 @@
 # Changelog

+## 1.10.0 (2026-05-23)
+
+### Added
+- **Sitemap auto-discovery fallback (#32).** When `/sitemap.xml` returns 404, `runSitemapAudit` and the auxiliary fetcher now also try `/sitemap-index.xml` (common on Astro / Next.js / Vercel) and, as a final fallback, parse the `Sitemap:` directive from `/robots.txt`. Previously sites that only published `sitemap-index.xml` got "Sitemap returned HTTP 404." with no audit coverage unless the user passed the explicit URL.
+- **Content-negotiation diagnostic (#34, #35).** When an auxiliary file (`/llms.txt`, `/llms-full.txt`, `/robots.txt`, `/sitemap.xml`) responds OK to the audit, the fetcher probes once with `Accept: text/markdown` to detect content-negotiation traps where Vercel / Astro / Starlight stacks 307-redirect `.txt` to a non-existent `.md` variant. Any non-2xx response from the markdown probe surfaces an actionable finding so users can fix the negotiation rule rather than the file. (Issue #34's original "UA filtering" hypothesis turned out to be the same content-negotiation root cause — `aeo-audit` already sends `Accept: */*` so it isn't directly affected, but the diagnostic catches the pattern that breaks downstream AI tools that prefer markdown.)
+- **Domain-aware schema recommendations (#33).** The `structured-data` and `schema-completeness` analyzers now detect the site category (SaaS / dev tools, e-commerce, local business, service business, blog/content) from JSON-LD, page text keywords, and outbound links, and recommend schemas that match. SaaS sites are no longer told to add `LocalBusiness` schema; the safe fallback when no category is detected is `Organization` instead of `LocalBusiness`.
+
+### Changed
+- New `AuxiliaryDiagnostics` field on `AuxiliaryResource` carries the content-negotiation signal. The `AiReadableContent` analyzer surfaces it as a finding and recommendation.
+Auto-discovery checks `/sitemap.xml` → `/sitemap-index.xml` → `Sitemap:` directives in `/robots.txt`. Astro / Next.js / Vercel sites that only publish `sitemap-index.xml` are now discovered without needing an explicit URL.
+
 When the sitemap has more URLs than `--limit`, the run audits the highest-priority pages and prints a notice to stderr listing how many were skipped and how to audit them all.

+### Auxiliary File Diagnostics
+
+When fetching `/llms.txt`, `/llms-full.txt`, `/robots.txt`, and `/sitemap.xml` the audit runs a **content-negotiation probe** that surfaces as a finding on the **AI-Readable Content** factor: if a file returns OK to a bare request but a non-2xx response under `Accept: text/markdown`, the audit reports a content-negotiation trap. This catches Astro / Vercel / Starlight setups that redirect `.txt` → non-existent `.md` for markdown-accepting clients, which makes the file invisible to AI content-extraction tools — even though the file is "present" by every other measure.
+
 ### Flag Reference

 | Flag | Description |
@@ -169,7 +176,7 @@ When the sitemap has more URLs than `--limit`, the run audits the highest-priori
 | `--include-geo` | Include the optional geographic signals factor |
 | `--include-agent-skills` | Include the optional agent skill exposure factor |
 | `--lighthouse` | Include the optional Lighthouse factor (Performance + Accessibility + Best Practices, mobile strategy) via Google PageSpeed Insights. Single-URL only; cannot combine with `--sitemap` or `--detect-platform`. Adds ~15-30s. Set `PAGESPEED_API_KEY` env var to lift anonymous rate limits. |
-| `--sitemap [url]` | Audit all pages from the sitemap (auto-discovers `/sitemap.xml` or uses an explicit URL) |
+| `--sitemap [url]` | Audit all pages from the sitemap. Auto-discovery tries `/sitemap.xml`, then `/sitemap-index.xml`, then `Sitemap:` directives in `/robots.txt`. Pass an explicit URL to override. |
 | `--limit <n>` | Max pages to audit in sitemap mode (default 200, sorted by sitemap priority) |
 | `--top-issues` | In sitemap mode, skip per-page output and show only cross-cutting issues |
 | `--detect-platform` | Identify the platform/CMS/framework powering the site instead of running an audit |
package.json: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 {
   "name": "@ainyc/aeo-audit",
-  "version": "1.9.0",
+  "version": "1.10.0",
   "description": "The most comprehensive open-source Answer Engine Optimization (AEO) audit tool. Scores websites across 16 ranking factors that determine AI citation.",
-- `--sitemap [url]` — auto-discover `/sitemap.xml` or provide an explicit URL
+- `--sitemap [url]` — auto-discover the sitemap (tries `/sitemap.xml`, then `/sitemap-index.xml`, then `Sitemap:` directives in `/robots.txt`) or provide an explicit URL
 - `--limit <n>` — cap pages audited (default 200, sorted by sitemap priority)
 - `--top-issues` — skip per-page output, show only cross-cutting patterns
@@ -107,6 +107,10 @@ Returns:
 - Aggregate score and grade
 - Prioritized fixes ranked by site-wide impact

+#### Auxiliary File Diagnostics
+
+When the audit fetches `/llms.txt`, `/llms-full.txt`, `/robots.txt`, and `/sitemap.xml`, it probes once with `Accept: text/markdown` to detect a **content-negotiation** trap: file responds OK to a bare request but returns a non-2xx response when the client prefers markdown. This catches Astro / Vercel / Starlight setups that 307-redirect `.txt` → non-existent `.md` for markdown-accepting clients, making the file invisible to AI content-extraction tools even though the file exists. The diagnostic surfaces as a finding on the **AI-Readable Content** factor.
+
 ### Lighthouse Mode

 Use `--lighthouse` when the user wants page speed, accessibility, or best-practices scoring alongside the AEO factors. It calls Google PageSpeed Insights (mobile strategy) and aggregates Performance + Accessibility + Best Practices into a single optional factor (weight 8).
+  // Prefer the actual fetched path so that fallback resolutions (e.g.
+  // /sitemap.xml → /sitemap-index.xml) are reflected accurately in the
+  // finding instead of the spec's default label.
+  let label = fallbackLabel
+  if (auxEntry?.url) {
+    try {
+      label = new URL(auxEntry.url).pathname
+    } catch {
+      // ignore — keep the fallback label
+    }
+  }
+
+  if (diagnostics.contentNegotiation) {
+    findings.push({
+      type: 'info',
+      message: `${label} returns a non-2xx response when fetched with \`Accept: text/markdown\` — content negotiation hides it from AI content extraction tools that prefer markdown.`,
+    })
+    recommendations.push(
+      `Serve ${label} with the same body regardless of the \`Accept\` header (avoid redirecting .txt to a non-existent .md variant).`,
+    )
+  }
+}
+
 function scoreAuxState(
   auxEntry: AuxiliaryResource | undefined,
   missingMessage: string,
@@ -47,6 +79,7 @@ export function analyzeAiReadableContent(context: AuditContext): AnalysisResult