ci: add sitemap markdown endpoint consistency check#2697
ci: add sitemap markdown endpoint consistency check#2697marcklingen wants to merge 4 commits intomainfrom
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: debaba8f12
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
scripts/check-sitemap-markdown.js
Outdated
| function toLocalMarkdownUrl(absoluteUrl) { | ||
| const parsed = new URL(absoluteUrl); | ||
| const pathname = parsed.pathname; | ||
| const mdPath = pathname.endsWith('.md') ? pathname : `${pathname}.md`; |
There was a problem hiding this comment.
Normalize slash-terminated paths before appending .md
toLocalMarkdownUrl appends .md directly to the sitemap pathname, so slash-terminated URLs are transformed into invalid endpoints like /.md (and /foo/.md). Since next-sitemap.config.js always includes { loc: '/' } in additionalPaths, this check will falsely fail on the homepage even when markdown routing is healthy, which can block the new CI job on otherwise valid builds.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Summary
- Fixed
toLocalMarkdownUrlto normalize slash-terminated sitemap paths before appending.md, preventing invalid outputs like/.mdand/foo/.md. scripts/check-sitemap-markdown.jsL86-L94 - Added explicit root handling so
/maps to/index.md, and trimmed trailing slashes on non-root paths so/foo/maps to/foo.md. scripts/check-sitemap-markdown.jsL89-L93 - Committed the change (
fix: normalize trailing slashes in sitemap markdown URL mapping) and created a follow-up PR viamake_prwith a summary of the fix.
Testing
- ✅
node --check scripts/check-sitemap-markdown.js
scripts/check-sitemap-markdown.js
Outdated
| const parseString = promisify(parser.parseString); | ||
|
|
||
| try { |
There was a problem hiding this comment.
🔴 promisify(parser.parseString) detaches the method from its xml2js Parser instance, causing this to be undefined in strict mode when called as a standalone function. This throws a TypeError on every invocation, making the CI check permanently non-functional. Fix: change line 67 to const parseString = promisify(parser.parseString.bind(parser));.
Extended reasoning...
Bug: promisify loses this context for xml2js Parser.parseString
What the bug is and how it manifests
In parseSitemap() (line 66-68), the code does:
const parser = new xml2js.Parser();
const parseString = promisify(parser.parseString);This extracts the parseString method from the parser instance as a bare function reference. When await parseString(xmlContent) is later called (line 70), Node.js's util.promisify wrapper invokes original.call(this, ...) where this is the calling context, but since parseString is called without a receiver, this is undefined in strict mode.
The specific code path that triggers it
The file opens with 'use strict' (line 1). Under strict mode, unbound method calls have this === undefined rather than the global object. xml2js's Parser.prototype.parseString internally calls this.reset(), reads this.options, and accesses other instance state. With this being undefined, accessing any property on it throws a TypeError.
Why existing code does not prevent it
The try/catch in parseSitemap() will catch the TypeError and re-throw it as Failed to parse sitemap XML: ..., so the error is surfaced. However, this happens on every single invocation. There is no code path where parseString succeeds without .bind(parser).
Step-by-step proof
parser = new xml2js.Parser()creates instance withthis.options,this.remaining, etc.parseString = promisify(parser.parseString)extracts bare function, nothiscaptured.await parseString(xmlContent)has promisify callfn.call(this, xmlContent, callback)wherethisisundefinedin strict mode.- Inside Parser.prototype.parseString, first access of
this.optionsthrows TypeError. - The catch block in parseSitemap wraps it as
Failed to parse sitemap XML: Cannot read properties of undefined. - main() catches this, logs it, and calls
process.exit(1)so the CI job fails every time.
Impact
The entire CI check (check-sitemap-markdown-endpoints) is permanently non-functional. It will never successfully parse the sitemap XML and will always exit with code 1, either blocking merges or generating false positives depending on the branch protection configuration.
How to fix it
Bind the method to its instance before promisifying:
const parseString = promisify(parser.parseString.bind(parser));This ensures this inside parseString always refers to the correct parser instance, preserving access to all instance properties and methods that xml2js relies on internally.
| response.headers.location | ||
| ) { | ||
| const redirectUrl = response.headers.location; | ||
| const resolvedUrl = redirectUrl.startsWith('http') | ||
| ? redirectUrl | ||
| : new URL(redirectUrl, url).href; | ||
| fetchUrl(resolvedUrl, headers).then(resolve).catch(reject); | ||
| return; | ||
| } | ||
|
|
There was a problem hiding this comment.
🟡 In fetchUrl(), when a 3xx redirect is received, the response body is not drained before following the redirect — response.resume() is never called. In Node.js, an unconsumed response body holds the underlying socket open, preventing it from being released to the connection pool. Fix by adding response.resume() immediately before the recursive fetchUrl() call.
Extended reasoning...
Connection Leak on Redirect Responses
What the bug is: In scripts/check-sitemap-markdown.js, the fetchUrl() function handles HTTP redirects by detecting a 3xx status code with a location header and immediately making a recursive call. However, it never calls response.resume() on the redirect response before doing so. In Node.js, a response object is a readable stream; if that stream is not consumed, the underlying TCP socket cannot be returned to the keep-alive connection pool.
The specific code path: Lines 31–40 of the new script show the redirect branch:
if (response.statusCode >= 300 && response.statusCode < 400 && response.headers.location) {
const redirectUrl = response.headers.location;
const resolvedUrl = ...;
fetchUrl(resolvedUrl, headers).then(resolve).catch(reject);
return; // response body is never drained
}The return exits the callback without attaching any data/end listeners or calling response.resume(), leaving the socket in a half-open state.
Why existing code does not prevent it: The script uses Node built-in http/https modules with the default http.globalAgent. In Node.js 18+ the global agent has keep-alive enabled, meaning it tries to reuse connections. A socket is only returned to the pool once the response body stream is fully consumed. Without response.resume(), the socket stays reserved until the remote server closes it.
Impact: Because http.globalAgent.maxSockets defaults to Infinity in modern Node.js, the connection pool will not be exhausted — new connections are created as needed. However, each redirect silently leaks a socket handle that accumulates for the duration of the script. In this CI context the practical risk is very low: the script runs against localhost, redirect bodies are tiny or absent, MAX_CONCURRENT_REQUESTS is 10, and the process exits after completion which cleans everything up.
Step-by-step proof:
- The sitemap contains a URL
/some/pagethat the server redirects (301) to/some/page/. fetchUrl(http://localhost:3333/some/page.md)is called.- The response callback fires with
statusCode: 301andlocation: /some/page/.md. - The code calls
fetchUrl(/some/page/.md, headers).then(resolve).catch(reject)and returns. - No
dataorendevent handler is attached to the 301 response stream;response.resume()is never called. - The socket remains open, held by the unconsumed stream, until the local Next.js server decides to close it.
- With enough redirected URLs, many sockets accumulate in this limbo state simultaneously.
How to fix it: Add a single response.resume() call before following the redirect:
if (response.statusCode >= 300 && response.statusCode < 400 && response.headers.location) {
response.resume(); // drain body so the socket is returned to the pool
const redirectUrl = response.headers.location;
// ...
fetchUrl(resolvedUrl, headers).then(resolve).catch(reject);
return;
}There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2b98da7cfc
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const normalizedPathname = | ||
| pathname === '/' ? '/index' : pathname.replace(/\/+$/, ''); |
There was a problem hiding this comment.
Exclude root URL from markdown endpoint mapping
next-sitemap.config.js always includes / in additionalPaths, but this logic rewrites / to /index.md. The markdown export path generation in scripts/copy_md_sources.js only creates public/md-src/index.md when content/index.(md|mdx) exists, and this repo currently has no such source file, so the checker will consistently hit a 404 for the homepage and fail CI even when other sitemap pages are healthy.
Useful? React with 👍 / 👎.
Motivation
.mdequivalents is not accidentally broken.Description
scripts/check-sitemap-markdown.js, a checker that fetches the locally servedsitemap-0.xml, maps each sitemap URL to its.mdendpoint, requests the endpoint withAccept: text/markdown, and verifies HTTP status andtext/markdowncontent-type, failing with a list of problematic endpoints.sitemap-markdown-checknpm script inpackage.jsonto run the new checker viapnpm sitemap-markdown-check.check-sitemap-markdown-endpointsto.github/workflows/ci.ymlthat builds the app, starts the server, runs the checker, and fails the job if any endpoint is missing or has an unexpected response.all-checks-passsummary job so the check runs for pull requests and merges.Testing
node --check scripts/check-sitemap-markdown.js, which succeeded.package.jsonby parsing it withnode -e "const fs=require('fs');JSON.parse(fs.readFileSync('package.json','utf8'));console.log('package.json valid')", which succeeded.Codex Task