Commit d62de03

arberx and claude authored
feat: sitemap fallback, content-negotiation diagnostic, domain-aware schema recs (1.10.0) (#36)
* feat(audit): fix sitemap discovery, UA filtering, content negotiation, and domain-aware schema recs (1.10.0)

  Bundle fixes for the four open issues in a single release:

  - #32: sitemap auto-discovery now tries /sitemap.xml, then /sitemap-index.xml, then the Sitemap: directive in /robots.txt before failing. Astro/Next.js sites that only publish sitemap-index.xml are discovered without an explicit URL.
  - #34: when an auxiliary file (/llms.txt, /llms-full.txt, /robots.txt, /sitemap.xml) 404s for the audit User-Agent, retry once with a browser UA. If that succeeds, surface a UA-filtering finding so the user knows to allow the audit/crawler UA through their CDN/WAF (typical Vercel/Cloudflare cause).
  - #35: after a successful auxiliary fetch, probe once with Accept: text/markdown to detect content-negotiation traps (sites redirecting .txt to non-existent .md). Surface a content-negotiation finding so downstream AI tools that prefer markdown don't silently fail.
  - #33: structured-data and schema-completeness now detect site category (SaaS/devtools, e-commerce, local-business, service-business, blog/content) from JSON-LD, page text, and outbound links, and recommend schemas that match. Safe fallback when no category is detected is Organization instead of LocalBusiness.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(diagnostics): drop UA-filtering retry; #34 was a misdiagnosed content-negotiation case

  Issue #34 hypothesized that Vercel was filtering by User-Agent and returning 404s. Re-reading the issue's own curl evidence — all UAs (default, node-fetch, empty) returned 200 — confirms the root cause was actually the markdown Accept header (i.e. issue #35). The aeo-audit tool already sends Accept: */* so it isn't directly affected; the diagnostic in #35 is what catches this pattern for downstream AI tools.

  Removes:

  - The browser-UA retry on auxiliary 404s in fetch-page.ts.
  - `uaFiltering` from AuxiliaryDiagnostics.
  - The UA-filtering finding/recommendation in the AI-Readable Content analyzer.
  - The UA-filtering test case in fetch-auxiliary.test.ts.
  - All "UA filtering" mentions in README, SKILL.md, and CHANGELOG.

  The content-negotiation probe (issue #35) is retained and now credits both #34 and #35 in the changelog as they share the same root cause.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sitemap): close SSRF in robots.txt fallback and correct diagnostic label

  `parseRobotsSitemap` now rejects directives whose resolved origin differs from the audited site. Because `fetchSitemapBody` has no SSRF guard, an attacker-controlled target could otherwise use its own robots.txt to steer requests from the auditing host at internal IPs.

  `pushDiagnosticFindings` now derives its label from `auxEntry.url`, so the content-negotiation finding reflects `/sitemap-index.xml` when the new sitemap fallback resolves there instead of `/sitemap.xml`.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
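The same-origin rule described for `parseRobotsSitemap` in the third commit message is not shown in this diff view, so the following is only a rough sketch of the idea under stated assumptions. The function name `sameOriginSitemaps` is hypothetical, and the real implementation may parse `robots.txt` differently. It resolves each `Sitemap:` directive against the audited URL and keeps only those whose origin matches.

```typescript
// Hypothetical sketch, not the project's parseRobotsSitemap.
// Keeps only Sitemap: directives that resolve to the audited site's origin.
function sameOriginSitemaps(robotsBody: string, siteUrl: string): string[] {
  const siteOrigin = new URL(siteUrl).origin
  const accepted: string[] = []

  for (const line of robotsBody.split(/\r?\n/)) {
    // Directive names in robots.txt are case-insensitive.
    const target = /^\s*sitemap\s*:\s*(\S+)/i.exec(line)?.[1]
    if (!target) continue
    try {
      // Relative values are resolved against the audited site.
      const resolved = new URL(target, siteUrl)
      if (resolved.origin === siteOrigin) accepted.push(resolved.href)
    } catch {
      // Unparseable directive: skip it rather than fail the audit.
    }
  }
  return accepted
}

const robots = [
  'User-agent: *',
  'Sitemap: https://example.com/sitemap-index.xml',
  'Sitemap: http://169.254.169.254/latest/meta-data',
  'sitemap: /nested/sitemap.xml',
].join('\n')

const allowed = sameOriginSitemaps(robots, 'https://example.com/')
```

Here `allowed` holds the two `https://example.com` URLs and drops the link-local address. An origin comparison only prevents a site from pointing the auditor at a different host; it does not by itself protect against auditing a target that is already internal.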
1 parent f02116f commit d62de03

15 files changed

Lines changed: 998 additions & 38 deletions

CHANGELOG.md

Lines changed: 10 additions & 0 deletions
@@ -1,5 +1,15 @@
 # Changelog

+## 1.10.0 (2026-05-23)
+
+### Added
+- **Sitemap auto-discovery fallback (#32).** When `/sitemap.xml` returns 404, `runSitemapAudit` and the auxiliary fetcher now also try `/sitemap-index.xml` (common on Astro / Next.js / Vercel) and, as a final fallback, parse the `Sitemap:` directive from `/robots.txt`. Previously sites that only published `sitemap-index.xml` got "Sitemap returned HTTP 404." with no audit coverage unless the user passed the explicit URL.
+- **Content-negotiation diagnostic (#34, #35).** When an auxiliary file (`/llms.txt`, `/llms-full.txt`, `/robots.txt`, `/sitemap.xml`) responds OK to the audit, the fetcher probes once with `Accept: text/markdown` to detect content-negotiation traps where Vercel / Astro / Starlight stacks 307-redirect `.txt` to a non-existent `.md` variant. Any non-2xx response from the markdown probe surfaces an actionable finding so users can fix the negotiation rule rather than the file. (Issue #34's original "UA filtering" hypothesis turned out to be the same content-negotiation root cause — `aeo-audit` already sends `Accept: */*` so it isn't directly affected, but the diagnostic catches the pattern that breaks downstream AI tools that prefer markdown.)
+- **Domain-aware schema recommendations (#33).** The `structured-data` and `schema-completeness` analyzers now detect the site category (SaaS / dev tools, e-commerce, local business, service business, blog/content) from JSON-LD, page text keywords, and outbound links, and recommend schemas that match. SaaS sites are no longer told to add `LocalBusiness` schema; the safe fallback when no category is detected is `Organization` instead of `LocalBusiness`.
+
+### Changed
+- New `AuxiliaryDiagnostics` field on `AuxiliaryResource` carries the content-negotiation signal. The `AiReadableContent` analyzer surfaces it as a finding and recommendation.
+
 ## 1.9.0 (2026-05-21)

 ### Added
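The content-negotiation entry above describes a two-request probe. The fetcher itself (`fetch-page.ts`) is not part of this diff view, so the following is only a minimal sketch of that logic under assumptions: `ProbeFetch`, `isNegotiationTrap`, and `probeContentNegotiation` are hypothetical names, and the fetch function is injected so the sketch runs without network access.

```typescript
// Hypothetical sketch of the probe; not the project's fetcher.
interface ProbeResponse {
  status: number
}
type ProbeFetch = (url: string, headers: Record<string, string>) => Promise<ProbeResponse>

const is2xx = (status: number): boolean => status >= 200 && status < 300

// A trap is a file that is fine for a bare request but fails for markdown-preferring clients.
function isNegotiationTrap(bareStatus: number, markdownStatus: number): boolean {
  return is2xx(bareStatus) && !is2xx(markdownStatus)
}

async function probeContentNegotiation(url: string, fetchImpl: ProbeFetch): Promise<boolean> {
  const bare = await fetchImpl(url, { Accept: '*/*' })
  // Only probe files that are present; a missing file is reported elsewhere.
  if (!is2xx(bare.status)) return false
  const markdown = await fetchImpl(url, { Accept: 'text/markdown' })
  return isNegotiationTrap(bare.status, markdown.status)
}

// Stand-in for a server that sends markdown-accepting clients to a missing .md variant.
const trappedServer: ProbeFetch = async (_url, headers) => ({
  status: headers['Accept'] === 'text/markdown' ? 404 : 200,
})

probeContentNegotiation('https://example.com/llms.txt', trappedServer).then((trapped) => {
  console.log(trapped ? 'content-negotiation trap detected' : 'no trap')
})
```

The sketch assumes redirects are followed before the status is read, so a 307 to a non-existent `.md` file shows up as the final 404.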

README.md

Lines changed: 9 additions & 2 deletions
@@ -145,7 +145,8 @@ Per-URL fetch errors don't abort the batch — each entry is reported with `stat
 Audit every page discovered from the site's sitemap with bounded concurrency (5 in flight):

 ```bash
-# Auto-discover /sitemap.xml
+# Auto-discover the sitemap (tries /sitemap.xml, then /sitemap-index.xml,
+# then the Sitemap: directive in /robots.txt)
 npx @ainyc/aeo-audit https://example.com --sitemap

 # Provide an explicit sitemap URL
@@ -158,8 +159,14 @@ npx @ainyc/aeo-audit https://example.com --sitemap --limit 50
 npx @ainyc/aeo-audit https://example.com --sitemap --top-issues
 ```

+Auto-discovery checks `/sitemap.xml` → `/sitemap-index.xml` → `Sitemap:` directives in `/robots.txt`. Astro / Next.js / Vercel sites that only publish `sitemap-index.xml` are now discovered without needing an explicit URL.
+
 When the sitemap has more URLs than `--limit`, the run audits the highest-priority pages and prints a notice to stderr listing how many were skipped and how to audit them all.

+### Auxiliary File Diagnostics
+
+When fetching `/llms.txt`, `/llms-full.txt`, `/robots.txt`, and `/sitemap.xml` the audit runs a **content-negotiation probe** that surfaces as a finding on the **AI-Readable Content** factor: if a file returns OK to a bare request but a non-2xx response under `Accept: text/markdown`, the audit reports a content-negotiation trap. This catches Astro / Vercel / Starlight setups that redirect `.txt` → non-existent `.md` for markdown-accepting clients, which makes the file invisible to AI content-extraction tools — even though the file is "present" by every other measure.
+
 ### Flag Reference

 | Flag | Description |
@@ -169,7 +176,7 @@ When the sitemap has more URLs than `--limit`, the run audits the highest-priori
 | `--include-geo` | Include the optional geographic signals factor |
 | `--include-agent-skills` | Include the optional agent skill exposure factor |
 | `--lighthouse` | Include the optional Lighthouse factor (Performance + Accessibility + Best Practices, mobile strategy) via Google PageSpeed Insights. Single-URL only; cannot combine with `--sitemap` or `--detect-platform`. Adds ~15-30s. Set `PAGESPEED_API_KEY` env var to lift anonymous rate limits. |
-| `--sitemap [url]` | Audit all pages from the sitemap (auto-discovers `/sitemap.xml` or uses an explicit URL) |
+| `--sitemap [url]` | Audit all pages from the sitemap. Auto-discovery tries `/sitemap.xml`, then `/sitemap-index.xml`, then `Sitemap:` directives in `/robots.txt`. Pass an explicit URL to override. |
 | `--limit <n>` | Max pages to audit in sitemap mode (default 200, sorted by sitemap priority) |
 | `--top-issues` | In sitemap mode, skip per-page output and show only cross-cutting issues |
 | `--detect-platform` | Identify the platform/CMS/framework powering the site instead of running an audit |
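The discovery order documented in this README hunk can be illustrated with a short sketch. It is not the project's `runSitemapAudit`: the names `sitemapCandidates`, `discoverSitemap`, and `FetchStatus` are hypothetical, the status fetcher is injected so the example runs offline, and treating every non-2xx status as "try the next candidate" is an assumption (the changelog only mentions the 404 case).

```typescript
// Hypothetical sketch of the fallback order; not the project's implementation.
type FetchStatus = (url: string) => Promise<number>

// Candidates in the documented order: /sitemap.xml, /sitemap-index.xml,
// then any Sitemap: URLs already extracted from /robots.txt.
function sitemapCandidates(siteUrl: string, robotsSitemaps: string[] = []): string[] {
  const origin = new URL(siteUrl).origin
  return [`${origin}/sitemap.xml`, `${origin}/sitemap-index.xml`, ...robotsSitemaps]
}

async function discoverSitemap(
  siteUrl: string,
  robotsSitemaps: string[],
  fetchStatus: FetchStatus,
): Promise<string | null> {
  for (const candidate of sitemapCandidates(siteUrl, robotsSitemaps)) {
    const status = await fetchStatus(candidate)
    if (status >= 200 && status < 300) return candidate
  }
  return null // every candidate failed; the caller reports the sitemap as missing
}

// Stand-in for an Astro-style site that only publishes /sitemap-index.xml.
const indexOnly: FetchStatus = async (url) => (url.endsWith('/sitemap-index.xml') ? 200 : 404)

discoverSitemap('https://example.com/docs', [], indexOnly).then((found) => {
  console.log(found ?? 'no sitemap found')
})
```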

package.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 {
   "name": "@ainyc/aeo-audit",
-  "version": "1.9.0",
+  "version": "1.10.0",
   "description": "The most comprehensive open-source Answer Engine Optimization (AEO) audit tool. Scores websites across 16 ranking factors that determine AI citation.",
   "type": "module",
   "main": "./dist/index.js",

skills/aeo/SKILL.md

Lines changed: 5 additions & 1 deletion
@@ -95,7 +95,7 @@ npx @ainyc/aeo-audit@1 "<url>" --sitemap --top-issues --format json
 ```

 Flags:
-- `--sitemap [url]` — auto-discover `/sitemap.xml` or provide an explicit URL
+- `--sitemap [url]` — auto-discover the sitemap (tries `/sitemap.xml`, then `/sitemap-index.xml`, then `Sitemap:` directives in `/robots.txt`) or provide an explicit URL
 - `--limit <n>` — cap pages audited (default 200, sorted by sitemap priority)
 - `--top-issues` — skip per-page output, show only cross-cutting patterns

@@ -107,6 +107,10 @@ Returns:
 - Aggregate score and grade
 - Prioritized fixes ranked by site-wide impact

+#### Auxiliary File Diagnostics
+
+When the audit fetches `/llms.txt`, `/llms-full.txt`, `/robots.txt`, and `/sitemap.xml`, it probes once with `Accept: text/markdown` to detect a **content-negotiation** trap: file responds OK to a bare request but returns a non-2xx response when the client prefers markdown. This catches Astro / Vercel / Starlight setups that 307-redirect `.txt` → non-existent `.md` for markdown-accepting clients, making the file invisible to AI content-extraction tools even though the file exists. The diagnostic surfaces as a finding on the **AI-Readable Content** factor.
+
 ### Lighthouse Mode

 Use `--lighthouse` when the user wants page speed, accessibility, or best-practices scoring alongside the AEO factors. It calls Google PageSpeed Insights (mobile strategy) and aggregates Performance + Accessibility + Best Practices into a single optional factor (weight 8).

src/analyzers/ai-readable-content.ts

Lines changed: 36 additions & 0 deletions
@@ -1,6 +1,38 @@
 import { clampScore, countWords } from './helpers.js'
 import type { AnalysisResult, AuditContext, AuxiliaryResource } from '../types.js'

+function pushDiagnosticFindings(
+  fallbackLabel: string,
+  auxEntry: AuxiliaryResource | undefined,
+  findings: AnalysisResult['findings'],
+  recommendations: string[],
+): void {
+  const diagnostics = auxEntry?.diagnostics
+  if (!diagnostics) return
+
+  // Prefer the actual fetched path so that fallback resolutions (e.g.
+  // /sitemap.xml → /sitemap-index.xml) are reflected accurately in the
+  // finding instead of the spec's default label.
+  let label = fallbackLabel
+  if (auxEntry?.url) {
+    try {
+      label = new URL(auxEntry.url).pathname
+    } catch {
+      // ignore — keep the fallback label
+    }
+  }
+
+  if (diagnostics.contentNegotiation) {
+    findings.push({
+      type: 'info',
+      message: `${label} returns a non-2xx response when fetched with \`Accept: text/markdown\` — content negotiation hides it from AI content extraction tools that prefer markdown.`,
+    })
+    recommendations.push(
+      `Serve ${label} with the same body regardless of the \`Accept\` header (avoid redirecting .txt to a non-existent .md variant).`,
+    )
+  }
+}
+
 function scoreAuxState(
   auxEntry: AuxiliaryResource | undefined,
   missingMessage: string,
@@ -47,6 +79,7 @@ export function analyzeAiReadableContent(context: AuditContext): AnalysisResult
     findings,
     recommendations,
   )
+  pushDiagnosticFindings('/llms.txt', auxiliary.llmsTxt, findings, recommendations)

   if (auxiliary.llmsTxt?.state === 'ok') {
     const wordCount = countWords(auxiliary.llmsTxt.body || '')
@@ -67,6 +100,7 @@ export function analyzeAiReadableContent(context: AuditContext): AnalysisResult
     findings,
     recommendations,
   )
+  pushDiagnosticFindings('/llms-full.txt', auxiliary.llmsFullTxt, findings, recommendations)

   if (auxiliary.llmsFullTxt?.state === 'ok') {
     const wordCount = countWords(auxiliary.llmsFullTxt.body || '')
@@ -91,6 +125,7 @@ export function analyzeAiReadableContent(context: AuditContext): AnalysisResult
     findings.push({ type: 'missing', message: '/robots.txt is missing.' })
     recommendations.push('Add a robots.txt file.')
   }
+  pushDiagnosticFindings('/robots.txt', auxiliary.robotsTxt, findings, recommendations)

   // Sitemap presence
   const sitemapState = auxiliary.sitemapXml?.state
@@ -104,6 +139,7 @@ export function analyzeAiReadableContent(context: AuditContext): AnalysisResult
     findings.push({ type: 'missing', message: '/sitemap.xml is missing.' })
     recommendations.push('Add a sitemap.xml file.')
   }
+  pushDiagnosticFindings('/sitemap.xml', auxiliary.sitemapXml, findings, recommendations)

   // HTML head link to llms.txt
   const llmsLink = context.$('link[href*="llms.txt"]').length > 0
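The label fallback inside `pushDiagnosticFindings` can be exercised in isolation. The helper below is a standalone restatement for illustration only; the name `findingLabel` is hypothetical and does not appear in the commit.

```typescript
// Hypothetical standalone version of the label logic in pushDiagnosticFindings.
function findingLabel(fallbackLabel: string, fetchedUrl?: string): string {
  if (!fetchedUrl) return fallbackLabel
  try {
    // Use the path that was actually fetched, e.g. after a sitemap fallback.
    return new URL(fetchedUrl).pathname
  } catch {
    // Not an absolute URL: keep the default label.
    return fallbackLabel
  }
}

const viaFallback = findingLabel('/sitemap.xml', 'https://example.com/sitemap-index.xml')
const noUrl = findingLabel('/llms.txt')
const badUrl = findingLabel('/robots.txt', 'not a url')
```

With these inputs, `viaFallback` is `/sitemap-index.xml`, while `noUrl` and `badUrl` keep their default labels.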

src/analyzers/helpers.ts

Lines changed: 208 additions & 0 deletions
@@ -326,3 +326,211 @@ export function domainFromUrl(rawUrl: string): string {
     return ''
   }
 }
+
+export type SiteCategory =
+  | 'saas-devtools'
+  | 'ecommerce'
+  | 'local-business'
+  | 'service-business'
+  | 'blog-or-content'
+  | 'unknown'
+
+export interface SiteCategoryDetection {
+  category: SiteCategory
+  /** 0–1 confidence in the chosen category; under 0.4 we treat as unknown. */
+  confidence: number
+  /** Recommended JSON-LD types for this category, in priority order. */
+  recommendedSchemas: string[]
+  /** Concrete signals that drove the classification. */
+  evidence: string[]
+}
+
+interface CategorySignalAccumulator {
+  category: SiteCategory
+  score: number
+  evidence: string[]
+}
+
+const SAAS_DEVTOOLS_KEYWORDS = [
+  'api', 'sdk', 'documentation', 'docs', 'github', 'npm install', 'pip install',
+  'yarn add', 'pnpm add', 'cli', 'developers', 'integration', 'webhook',
+  'open source', 'opensource', 'authentication', 'oauth', 'api key',
+  'pricing', 'enterprise', 'self-host', 'self host', 'getting started',
+]
+
+const ECOMMERCE_KEYWORDS = [
+  'add to cart', 'add to bag', 'shopping cart', 'checkout', 'shop now',
+  'buy now', 'in stock', 'out of stock', 'free shipping', 'returns',
+  'product details', 'sku', 'add to wishlist', 'view product',
+]
+
+const LOCAL_BUSINESS_KEYWORDS = [
+  'opening hours', 'business hours', 'directions', 'visit us', 'our location',
+  'find us', 'reservations', 'book a table', 'menu', 'walk-ins welcome',
+  'serving the', 'in the heart of',
+]
+
+const SERVICE_BUSINESS_KEYWORDS = [
+  'book a call', 'book a consultation', 'get a quote', 'request a quote',
+  'free consultation', 'our services', 'case studies', 'client',
+  'testimonials', 'schedule a meeting', 'hire us',
+]
+
+const BLOG_KEYWORDS = [
+  'recent posts', 'latest articles', 'read more', 'by author', 'published on',
+  'subscribe to newsletter', 'archives', 'categories', 'tags', 'comments',
+]
+
+function countKeywordHits(text: string, keywords: string[]): { count: number; matched: string[] } {
+  const lower = text.toLowerCase()
+  const matched: string[] = []
+  let count = 0
+  for (const keyword of keywords) {
+    if (lower.includes(keyword)) {
+      count += 1
+      matched.push(keyword)
+      if (matched.length >= 3) break
+    }
+  }
+  return { count, matched }
+}
+
+/**
+ * Issue #33: detect the site's category so schema recommendations match the
+ * business (SaaS/dev tools shouldn't be told to add LocalBusiness schema).
+ *
+ * Uses three signal layers, ranked by reliability:
+ * 1. Existing JSON-LD types on the page — strongest signal.
+ * 2. Page text keywords — moderate signal.
+ * 3. Outbound/script URLs (GitHub, npm, package registries) — supporting signal.
+ *
+ * Returns 'unknown' when no category clears a low confidence bar so we fall back
+ * to the safe-default recommendations (Organization + something explanatory).
+ */
+export function detectSiteCategory(
+  context: Pick<AuditContext, 'structuredData' | 'textContent' | 'html'>,
+): SiteCategoryDetection {
+  const schemaTypes = extractSchemaTypes(context.structuredData || [])
+  const text = context.textContent || ''
+  const html = context.html || ''
+
+  const accumulators: CategorySignalAccumulator[] = [
+    { category: 'saas-devtools', score: 0, evidence: [] },
+    { category: 'ecommerce', score: 0, evidence: [] },
+    { category: 'local-business', score: 0, evidence: [] },
+    { category: 'service-business', score: 0, evidence: [] },
+    { category: 'blog-or-content', score: 0, evidence: [] },
+  ]
+
+  const saas = accumulators[0]
+  const ecom = accumulators[1]
+  const local = accumulators[2]
+  const service = accumulators[3]
+  const blog = accumulators[4]
+
+  // Schema-level signals (highest confidence — the site told us what it is).
+  if (schemaTypes.has('SoftwareApplication') || schemaTypes.has('WebApplication') || schemaTypes.has('MobileApplication')) {
+    saas.score += 4
+    saas.evidence.push('SoftwareApplication schema present')
+  }
+  if (schemaTypes.has('Product') || schemaTypes.has('Offer') || schemaTypes.has('AggregateOffer')) {
+    ecom.score += 4
+    ecom.evidence.push('Product/Offer schema present')
+  }
+  if (schemaTypes.has('LocalBusiness') || schemaTypes.has('Restaurant') || schemaTypes.has('Store') || schemaTypes.has('PostalAddress')) {
+    local.score += 4
+    local.evidence.push('LocalBusiness/PostalAddress schema present')
+  }
+  if (schemaTypes.has('Service') || schemaTypes.has('ProfessionalService')) {
+    service.score += 2
+    service.evidence.push('Service schema present')
+  }
+  if (schemaTypes.has('Article') || schemaTypes.has('BlogPosting') || schemaTypes.has('NewsArticle')) {
+    blog.score += 4
+    blog.evidence.push('Article/BlogPosting schema present')
+  }
+
+  // Text keyword signals.
+  const saasHits = countKeywordHits(text, SAAS_DEVTOOLS_KEYWORDS)
+  if (saasHits.count > 0) {
+    saas.score += saasHits.count
+    saas.evidence.push(`SaaS/dev keywords: ${saasHits.matched.join(', ')}`)
+  }
+  const ecomHits = countKeywordHits(text, ECOMMERCE_KEYWORDS)
+  if (ecomHits.count > 0) {
+    ecom.score += ecomHits.count * 1.5 // e-commerce phrases are very specific
+    ecom.evidence.push(`E-commerce keywords: ${ecomHits.matched.join(', ')}`)
+  }
+  const localHits = countKeywordHits(text, LOCAL_BUSINESS_KEYWORDS)
+  if (localHits.count > 0) {
+    local.score += localHits.count * 1.5
+    local.evidence.push(`Local-business keywords: ${localHits.matched.join(', ')}`)
+  }
+  const serviceHits = countKeywordHits(text, SERVICE_BUSINESS_KEYWORDS)
+  if (serviceHits.count > 0) {
+    service.score += serviceHits.count
+    service.evidence.push(`Service keywords: ${serviceHits.matched.join(', ')}`)
+  }
+  const blogHits = countKeywordHits(text, BLOG_KEYWORDS)
+  if (blogHits.count > 0) {
+    blog.score += blogHits.count * 0.75 // blog phrases overlap with many sites
+    blog.evidence.push(`Blog/content keywords: ${blogHits.matched.join(', ')}`)
+  }
+
+  // Outbound/script URL signals for SaaS — GitHub repo, npm package, package manager mentions.
+  if (/github\.com\/[A-Za-z0-9_.-]+\/[A-Za-z0-9_.-]+/i.test(html)) {
+    saas.score += 1
+    saas.evidence.push('GitHub repo link in HTML')
+  }
+  if (/(npmjs\.com|unpkg\.com|jsdelivr\.net|cdnjs\.cloudflare\.com)/i.test(html)) {
+    saas.score += 1
+    saas.evidence.push('npm/CDN registry reference')
+  }
+
+  // Pick the strongest signal and decide whether to commit.
+  accumulators.sort((a, b) => b.score - a.score)
+  const top = accumulators[0]
+  const next = accumulators[1]
+
+  const MIN_SCORE = 2 // need at least one strong schema signal or two keyword matches
+  const MARGIN = 1 // top must beat runner-up by at least one point
+
+  if (top.score < MIN_SCORE || top.score - next.score < MARGIN) {
+    return {
+      category: 'unknown',
+      confidence: 0,
+      recommendedSchemas: ['Organization'],
+      evidence: [],
+    }
+  }
+
+  const totalScore = accumulators.reduce((sum, a) => sum + a.score, 0)
+  const confidence = totalScore > 0 ? Math.min(1, top.score / Math.max(totalScore, 1)) : 0
+
+  return {
+    category: top.category,
+    confidence,
+    recommendedSchemas: recommendedSchemasFor(top.category),
+    evidence: top.evidence,
+  }
+}
+
+function recommendedSchemasFor(category: SiteCategory): string[] {
+  switch (category) {
+    case 'saas-devtools':
+      return ['Organization', 'SoftwareApplication', 'FAQPage']
+    case 'ecommerce':
+      return ['Organization', 'Product', 'AggregateRating']
+    case 'local-business':
+      return ['LocalBusiness', 'Service', 'FAQPage']
+    case 'service-business':
+      return ['Organization', 'Service', 'FAQPage']
+    case 'blog-or-content':
+      return ['Organization', 'Article', 'BreadcrumbList']
+    case 'unknown':
+    default:
+      // Organization is the safest broad default; suggest Article and FAQPage
+      // as common follow-ups regardless of business type.
+      return ['Organization', 'Article', 'FAQPage']
+  }
+}
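The commit-or-fall-back rule at the end of `detectSiteCategory` is easier to follow with concrete numbers. The sketch below isolates that arithmetic, assuming the same `MIN_SCORE = 2` and `MARGIN = 1` thresholds as the diff; `pickCategory` and `Scored` are hypothetical names used only for illustration.

```typescript
// Hypothetical isolation of the final decision step in detectSiteCategory.
interface Scored {
  category: string
  score: number
}

function pickCategory(
  scores: Scored[],
  minScore = 2,
  margin = 1,
): { category: string; confidence: number } {
  const ranked = [...scores].sort((a, b) => b.score - a.score)
  const top = ranked[0]
  const next = ranked[1]
  const tooWeak = !top || top.score < minScore
  const tooClose = top !== undefined && next !== undefined && top.score - next.score < margin
  if (!top || tooWeak || tooClose) return { category: 'unknown', confidence: 0 }

  // Confidence is the winner's share of all accumulated score.
  const total = ranked.reduce((sum, s) => sum + s.score, 0)
  return { category: top.category, confidence: Math.min(1, top.score / Math.max(total, 1)) }
}

// SoftwareApplication schema (+4) plus two SaaS keywords (+2) against one service keyword (+1):
// the SaaS category wins with confidence 6 / 7.
const clearWinner = pickCategory([
  { category: 'saas-devtools', score: 6 },
  { category: 'service-business', score: 1 },
])

// Two e-commerce hits (2 * 1.5) tie with two local-business hits (2 * 1.5): margin not met.
const tie = pickCategory([
  { category: 'ecommerce', score: 3 },
  { category: 'local-business', score: 3 },
])

// Two blog hits (2 * 0.75 = 1.5) stay below the minimum score.
const weak = pickCategory([{ category: 'blog-or-content', score: 1.5 }])
```

In the tie and weak cases the result is `unknown`, which in the diff maps to the `Organization` fallback.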
