Skip to content

ci: add sitemap markdown endpoint consistency check#2697

Open
marcklingen wants to merge 4 commits intomainfrom
codex/add-ci-check-for-sitemap-and-.md-files
Open

ci: add sitemap markdown endpoint consistency check#2697
marcklingen wants to merge 4 commits intomainfrom
codex/add-ci-check-for-sitemap-and-.md-files

Conversation

@marcklingen
Copy link
Member

Motivation

  • Ensure the routing contract between canonical site pages and their pre-built .md equivalents is not accidentally broken.
  • Prevent regressions that would make sitemap-listed pages unreachable as markdown (used by agents, MCP tools, and the copy/export features).

Description

  • Add scripts/check-sitemap-markdown.js, a checker that fetches the locally served sitemap-0.xml, maps each sitemap URL to its .md endpoint, requests the endpoint with Accept: text/markdown, and verifies HTTP status and text/markdown content-type, failing with a list of problematic endpoints.
  • Add a sitemap-markdown-check npm script in package.json to run the new checker via pnpm sitemap-markdown-check.
  • Add a CI job check-sitemap-markdown-endpoints to .github/workflows/ci.yml that builds the app, starts the server, runs the checker, and fails the job if any endpoint is missing or has an unexpected response.
  • Wire the new CI job into the all-checks-pass summary job so the check runs for pull requests and merges.

Testing

  • Ran a syntax check with node --check scripts/check-sitemap-markdown.js, which succeeded.
  • Validated package.json by parsing it with node -e "const fs=require('fs');JSON.parse(fs.readFileSync('package.json','utf8'));console.log('package.json valid')", which succeeded.
  • The new CI job is added to the pipeline and will run automatically on PRs to validate markdown endpoint consistency during CI.

Codex Task

@vercel
Copy link

vercel bot commented Mar 23, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
langfuse-docs Ready Ready Preview, Comment Mar 23, 2026 7:57pm

Request Review

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. enhancement New feature or request labels Mar 23, 2026
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: debaba8f12

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

function toLocalMarkdownUrl(absoluteUrl) {
const parsed = new URL(absoluteUrl);
const pathname = parsed.pathname;
const mdPath = pathname.endsWith('.md') ? pathname : `${pathname}.md`;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Normalize slash-terminated paths before appending .md

toLocalMarkdownUrl appends .md directly to the sitemap pathname, so slash-terminated URLs are transformed into invalid endpoints like /.md (and /foo/.md). Since next-sitemap.config.js always includes { loc: '/' } in additionalPaths, this check will falsely fail on the homepage even when markdown routing is healthy, which can block the new CI job on otherwise valid builds.

Useful? React with 👍 / 👎.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@codex pls fix

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

  • Fixed toLocalMarkdownUrl to normalize slash-terminated sitemap paths before appending .md, preventing invalid outputs like /.md and /foo/.md. scripts/check-sitemap-markdown.jsL86-L94
  • Added explicit root handling so / maps to /index.md, and trimmed trailing slashes on non-root paths so /foo/ maps to /foo.md. scripts/check-sitemap-markdown.jsL89-L93
  • Committed the change (fix: normalize trailing slashes in sitemap markdown URL mapping) and created a follow-up PR via make_pr with a summary of the fix.

Testing

  • node --check scripts/check-sitemap-markdown.js

View task →

Comment on lines +66 to +68
const parseString = promisify(parser.parseString);

try {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 promisify(parser.parseString) detaches the method from its xml2js Parser instance, causing this to be undefined in strict mode when called as a standalone function. This throws a TypeError on every invocation, making the CI check permanently non-functional. Fix: change line 67 to const parseString = promisify(parser.parseString.bind(parser));.

Extended reasoning...

Bug: promisify loses this context for xml2js Parser.parseString

What the bug is and how it manifests

In parseSitemap() (line 66-68), the code does:

const parser = new xml2js.Parser();
const parseString = promisify(parser.parseString);

This extracts the parseString method from the parser instance as a bare function reference. When await parseString(xmlContent) is later called (line 70), Node.js's util.promisify wrapper invokes original.call(this, ...) where this is the calling context, but since parseString is called without a receiver, this is undefined in strict mode.

The specific code path that triggers it

The file opens with 'use strict' (line 1). Under strict mode, unbound method calls have this === undefined rather than the global object. xml2js's Parser.prototype.parseString internally calls this.reset(), reads this.options, and accesses other instance state. With this being undefined, accessing any property on it throws a TypeError.

Why existing code does not prevent it

The try/catch in parseSitemap() will catch the TypeError and re-throw it as Failed to parse sitemap XML: ..., so the error is surfaced. However, this happens on every single invocation. There is no code path where parseString succeeds without .bind(parser).

Step-by-step proof

  1. parser = new xml2js.Parser() creates instance with this.options, this.remaining, etc.
  2. parseString = promisify(parser.parseString) extracts bare function, no this captured.
  3. await parseString(xmlContent) has promisify call fn.call(this, xmlContent, callback) where this is undefined in strict mode.
  4. Inside Parser.prototype.parseString, first access of this.options throws TypeError.
  5. The catch block in parseSitemap wraps it as Failed to parse sitemap XML: Cannot read properties of undefined.
  6. main() catches this, logs it, and calls process.exit(1) so the CI job fails every time.

Impact

The entire CI check (check-sitemap-markdown-endpoints) is permanently non-functional. It will never successfully parse the sitemap XML and will always exit with code 1, either blocking merges or generating false positives depending on the branch protection configuration.

How to fix it

Bind the method to its instance before promisifying:

const parseString = promisify(parser.parseString.bind(parser));

This ensures this inside parseString always refers to the correct parser instance, preserving access to all instance properties and methods that xml2js relies on internally.

Comment on lines +31 to +40
response.headers.location
) {
const redirectUrl = response.headers.location;
const resolvedUrl = redirectUrl.startsWith('http')
? redirectUrl
: new URL(redirectUrl, url).href;
fetchUrl(resolvedUrl, headers).then(resolve).catch(reject);
return;
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 In fetchUrl(), when a 3xx redirect is received, the response body is not drained before following the redirect — response.resume() is never called. In Node.js, an unconsumed response body holds the underlying socket open, preventing it from being released to the connection pool. Fix by adding response.resume() immediately before the recursive fetchUrl() call.

Extended reasoning...

Connection Leak on Redirect Responses

What the bug is: In scripts/check-sitemap-markdown.js, the fetchUrl() function handles HTTP redirects by detecting a 3xx status code with a location header and immediately making a recursive call. However, it never calls response.resume() on the redirect response before doing so. In Node.js, a response object is a readable stream; if that stream is not consumed, the underlying TCP socket cannot be returned to the keep-alive connection pool.

The specific code path: Lines 31–40 of the new script show the redirect branch:

if (response.statusCode >= 300 && response.statusCode < 400 && response.headers.location) {
  const redirectUrl = response.headers.location;
  const resolvedUrl = ...;
  fetchUrl(resolvedUrl, headers).then(resolve).catch(reject);
  return; // response body is never drained
}

The return exits the callback without attaching any data/end listeners or calling response.resume(), leaving the socket in a half-open state.

Why existing code does not prevent it: The script uses Node built-in http/https modules with the default http.globalAgent. In Node.js 18+ the global agent has keep-alive enabled, meaning it tries to reuse connections. A socket is only returned to the pool once the response body stream is fully consumed. Without response.resume(), the socket stays reserved until the remote server closes it.

Impact: Because http.globalAgent.maxSockets defaults to Infinity in modern Node.js, the connection pool will not be exhausted — new connections are created as needed. However, each redirect silently leaks a socket handle that accumulates for the duration of the script. In this CI context the practical risk is very low: the script runs against localhost, redirect bodies are tiny or absent, MAX_CONCURRENT_REQUESTS is 10, and the process exits after completion which cleans everything up.

Step-by-step proof:

  1. The sitemap contains a URL /some/page that the server redirects (301) to /some/page/.
  2. fetchUrl(http://localhost:3333/some/page.md) is called.
  3. The response callback fires with statusCode: 301 and location: /some/page/.md.
  4. The code calls fetchUrl(/some/page/.md, headers).then(resolve).catch(reject) and returns.
  5. No data or end event handler is attached to the 301 response stream; response.resume() is never called.
  6. The socket remains open, held by the unconsumed stream, until the local Next.js server decides to close it.
  7. With enough redirected URLs, many sockets accumulate in this limbo state simultaneously.

How to fix it: Add a single response.resume() call before following the redirect:

if (response.statusCode >= 300 && response.statusCode < 400 && response.headers.location) {
  response.resume(); // drain body so the socket is returned to the pool
  const redirectUrl = response.headers.location;
  // ...
  fetchUrl(resolvedUrl, headers).then(resolve).catch(reject);
  return;
}

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2b98da7cfc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +90 to +91
const normalizedPathname =
pathname === '/' ? '/index' : pathname.replace(/\/+$/, '');

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Exclude root URL from markdown endpoint mapping

next-sitemap.config.js always includes / in additionalPaths, but this logic rewrites / to /index.md. The markdown export path generation in scripts/copy_md_sources.js only creates public/md-src/index.md when content/index.(md|mdx) exists, and this repo currently has no such source file, so the checker will consistently hit a 404 for the homepage and fail CI even when other sitemap pages are healthy.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

codex enhancement New feature or request size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant