
fix(scripts): restore markdown aliases for PDF downloads #2698

Open
marcklingen wants to merge 1 commit into main from codex/fix-broken-download-as-pdf-links

Conversation

@marcklingen
Member

Motivation

  • The docs "Download as PDF" feature broke because the site no longer exposed root-level .md source URLs like /terms.md that the PDF endpoint expects.
  • The copy pipeline produced only scoped exports under public/md-src/marketing/*, so legacy direct markdown URLs were missing and PDF requests failed.

Description

  • Refactored scripts/copy_md_sources.js to return multiple output destinations per source and renamed mapDestination to mapDestinations to support this behavior.
  • Added logic to emit a root-level alias for direct content/marketing/* pages (for example, producing both public/md-src/marketing/terms.md and public/md-src/terms.md).
  • Made the script testable by accepting injectable sourceDir/outputDir, exporting helpers (mapDestinations, copyAll, cleanOutputDir), and guarding execution with require.main === module to avoid side effects on import.
  • Added scripts/copy_md_sources.test.js to validate that marketing pages get root-level aliases while non-marketing docs (e.g., security/toms) remain scoped.

Testing

  • Ran the unit test with node --test scripts/copy_md_sources.test.js, which verifies the root-level alias for marketing pages and scoped behavior for security pages, and the test passed.

Codex Task

@vercel

vercel bot commented Mar 23, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project        Deployment   Actions            Updated (UTC)
langfuse-docs  Ready        Preview, Comment   Mar 23, 2026 7:31pm


@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Mar 23, 2026

@claude claude bot left a comment


LGTM — the core fix is correct and the nits found are cosmetic/performance-only with no functional impact.

Extended reasoning...

Overview

This PR modifies scripts/copy_md_sources.js (a build-time script) and adds a test file. The change refactors mapDestination → mapDestinations to emit multiple output paths per source, adds dependency injection for testability, and gates execution with require.main === module. The functional goal is to restore root-level .md aliases for marketing pages so the PDF download feature works.

Security Risks

None. This is a build-time file copy script with no auth, user input, or external network calls.

Level of Scrutiny

Low. This is a build tooling script, not production application code. The change is mechanical and the core logic is straightforward. The test validates the key behavioral requirement.

Bug Assessment

The three bugs flagged are all nits with zero functional impact: the copied counter double-counts writes (cosmetic log message), non-markdown files are read before filtering (minor performance waste on build, no correctness issue), and the test temp dir leaks on assertion failure (minor CI hygiene, no false positives/negatives). None of these affect the actual file output that fixes the PDF download feature.

Comment on lines +86 to +91
for (const dest of destinations) {
  const dir = path.dirname(dest);
  ensureDir(dir);
  fs.writeFileSync(dest, processed, 'utf8');
  copied += 1;
}

🟡 The copied counter increments inside the inner for-loop over destinations, so a single marketing page source file increments it by 2 (once per destination). The log message "Copied N markdown source files into public/md-src" is misleading -- N counts file writes, not source files. Moving copied += 1 outside the inner loop would restore accurate semantics.

Extended reasoning...

What the bug is: In copyAll() at line 90, copied += 1 is placed inside the for (const dest of destinations) loop. This means the counter increments once per destination write, not once per source file processed.

The specific code path: For a marketing page like content/marketing/terms.mdx, mapDestinations() returns two entries: public/md-src/marketing/terms.md and public/md-src/terms.md. The inner loop iterates twice, incrementing copied by 2 for a single source file.

Why existing code does not prevent it: The refactoring that introduced multiple destinations per source (to fix the PDF download feature) moved copied += 1 inside the new inner loop without adjusting for the change in loop semantics. The original code had a single destination per source and incremented once per file.

Concrete step-by-step proof: Suppose the repo has 5 non-marketing docs and 3 marketing pages. Non-marketing docs each produce 1 destination, contributing +5. Marketing pages each produce 2 destinations, contributing +6. Total copied = 11, but the log prints "Copied 11 markdown source files into public/md-src" when only 8 source files were actually processed.

Impact: This is purely a cosmetic logging inaccuracy. No files are missing, duplicated incorrectly, or corrupted. The actual copy behavior is completely correct -- both scoped and root-level aliases are written as intended. Only the human-readable build log is misleading.

Fix: Move copied += 1 outside the inner loop (but still inside the outer for-of-allFiles loop), and only increment when at least one destination was written (i.e., destinations.length > 0). This restores the original semantics of counting source files processed.
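The two counter semantics can be demonstrated in isolation. This is a minimal self-contained sketch; the destination lists are hypothetical stand-ins for mapDestinations() output, not the real script:

```javascript
// Demo: incrementing per destination write (as the PR does) vs.
// incrementing per source file (the suggested fix).
const sources = [
  { file: 'marketing/terms.mdx', destinations: ['md-src/marketing/terms.md', 'md-src/terms.md'] },
  { file: 'security/toms.mdx', destinations: ['md-src/security/toms.md'] },
  { file: 'meta.json', destinations: [] }, // filtered out, never counted
];

let writes = 0;  // what the PR's counter actually measures
let copied = 0;  // what the log message claims to measure
for (const { destinations } of sources) {
  if (destinations.length === 0) continue;
  for (const dest of destinations) {
    writes += 1; // inner-loop increment: one per destination write
  }
  copied += 1;   // outer-loop increment: one per source file processed
}
// writes === 3 (marketing page counted twice), copied === 2
```

With the marketing page producing two destinations, the inner-loop counter reaches 3 while only 2 source files were processed, which is exactly the discrepancy the review describes.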

Comment on lines 79 to 92
 }
-const allFiles = walkDir(SOURCE_DIR);
+const allFiles = walkDir(sourceDir);
 let copied = 0;
 for (const file of allFiles) {
-  const dest = mapDestination(file);
-  if (!dest) continue;
-  const dir = path.dirname(dest);
-  ensureDir(dir);
   const originalContent = fs.readFileSync(file, 'utf8');
   const processed = inlineComponentsMdx(originalContent, file);
-  fs.writeFileSync(dest, processed, 'utf8');
-  copied += 1;
+  const destinations = mapDestinations(file, { sourceDir, outputDir });
+  for (const dest of destinations) {
+    const dir = path.dirname(dest);
+    ensureDir(dir);
+    fs.writeFileSync(dest, processed, 'utf8');
+    copied += 1;
+  }
 }

🟡 The refactored copyAll reads every file with fs.readFileSync and runs inlineComponentsMdx on it before calling mapDestinations to determine whether to skip the file, reversing the original filter-then-read pattern. This causes unnecessary I/O and CPU work on non-markdown files (meta.json, _-prefixed files, and any other assets in content/) that are ultimately discarded.

Extended reasoning...

What the bug is

In the original copyAll implementation, mapDestination was called first and an early continue skipped any file that should not be copied. The new implementation inverts this order: it unconditionally calls fs.readFileSync(file, "utf8") and inlineComponentsMdx(originalContent, file) on every file returned by walkDir, and only afterwards calls mapDestinations to determine if there are any output locations.

The specific code path

// NEW (buggy ordering)
for (const file of allFiles) {
    const originalContent = fs.readFileSync(file, 'utf8');   // reads EVERY file
    const processed = inlineComponentsMdx(originalContent, file); // processes EVERY file
    const destinations = mapDestinations(file, { sourceDir, outputDir }); // [] for non-md
    for (const dest of destinations) { /* skipped when empty */ }
}

The original pattern was:

// OLD (correct ordering)
const dest = mapDestination(file);
if (!dest) continue;   // early exit before any I/O
const originalContent = fs.readFileSync(file, 'utf8');

Why existing code does not prevent it

walkDir returns all files recursively, including meta.json files and any _-prefixed files. mapDestinations returns [] for these, causing the inner write loop to be a no-op, but the read and processing have already occurred. inlineComponentsMdx is wrapped in a try-catch so it swallows errors from binary/non-markdown content silently.

Impact

The impact is a performance regression, not a correctness failure. fs.readFileSync(binaryFile, "utf8") in Node.js does not throw; it returns a garbled string. inlineComponentsMdx handles errors via try-catch. Since destinations is empty for non-markdown files, no corrupted files are written. However, every non-markdown file in content/ (at minimum every meta.json and _-prefixed file) is read and parsed on every build run unnecessarily.

Step-by-step proof

  1. walkDir(sourceDir) collects all files, including content/foo/meta.json.
  2. The loop reaches content/foo/meta.json.
  3. fs.readFileSync("content/foo/meta.json", "utf8") succeeds, reading JSON content.
  4. inlineComponentsMdx(jsonContent, file) runs the regex import scanner on the JSON string.
  5. mapDestinations("content/foo/meta.json", ...) returns [] because base === "meta.json".
  6. The inner for (const dest of destinations) loop executes zero times.
  7. Steps 3–4 were wasted work.

Fix

Move the mapDestinations call before readFileSync and restore the early-continue pattern:

for (const file of allFiles) {
    const destinations = mapDestinations(file, { sourceDir, outputDir });
    if (destinations.length === 0) continue;
    const originalContent = fs.readFileSync(file, 'utf8');
    const processed = inlineComponentsMdx(originalContent, file);
    for (const dest of destinations) { ... }
}

Comment on lines +14 to +30
test('creates root-level aliases for marketing pages and keeps security paths scoped', () => {
  const tempRoot = fs.mkdtempSync(path.join(os.tmpdir(), 'copy-md-sources-'));
  const sourceDir = path.join(tempRoot, 'content');
  const outputDir = path.join(tempRoot, 'public', 'md-src');

  writeFile(path.join(sourceDir, 'marketing', 'terms.mdx'), '# Terms');
  writeFile(path.join(sourceDir, 'security', 'toms.mdx'), '# TOMS');

  copyAll({ sourceDir, outputDir });

  assert.equal(fs.readFileSync(path.join(outputDir, 'marketing', 'terms.md'), 'utf8'), '# Terms');
  assert.equal(fs.readFileSync(path.join(outputDir, 'terms.md'), 'utf8'), '# Terms');
  assert.equal(fs.readFileSync(path.join(outputDir, 'security', 'toms.md'), 'utf8'), '# TOMS');
  assert.equal(fs.existsSync(path.join(outputDir, 'toms.md')), false);

  fs.rmSync(tempRoot, { recursive: true, force: true });
});

🟡 The temp directory cleanup at the end of the test will not run if any assert.equal call throws an AssertionError, leaving directories under /tmp on test failures. Wrapping the assertions in a try/finally block (or using after() from node:test) ensures cleanup always runs.

Extended reasoning...

What the bug is and how it manifests

In scripts/copy_md_sources.test.js, the test body creates a temp directory via fs.mkdtempSync, runs assertions, and then calls fs.rmSync(tempRoot, { recursive: true, force: true }) as the last statement. JavaScript does not guarantee that the final statement runs if a preceding statement throws, so if any assertion fails, the cleanup is skipped and a directory like /tmp/copy-md-sources-XXXXXX is left behind.

The specific code path that triggers it

Lines 24-27 call assert.equal(...) four times. assert.equal (from node:assert/strict) throws AssertionError on mismatch. Since there is no try/finally guarding the cleanup at line 29, any thrown error causes the JavaScript call stack to unwind past that statement entirely.

Why existing code does not prevent it

The test uses a plain synchronous test body with no lifecycle hooks (after(), afterEach()) and no try/finally. The node:test framework does call an after() hook after the test regardless of pass/fail, but that mechanism is not used here. There is nothing to intercept the thrown AssertionError and redirect execution to the cleanup.

Impact

On test failure, the /tmp directory accumulates copy-md-sources-XXXXXX directories. In CI environments these are typically cleaned at job end, and most OS configurations periodically sweep /tmp, so the practical impact is low. This does not affect production behavior or cause false positive or negative test results.

How to fix it

Wrap assertions in a try/finally block to guarantee cleanup:

const tempRoot = fs.mkdtempSync(path.join(os.tmpdir(), 'copy-md-sources-'));
try {
    // setup and assertions
} finally {
    fs.rmSync(tempRoot, { recursive: true, force: true });
}

Alternatively, use node:test's after() hook registered before assertions run.

Step-by-step proof

  1. Test begins: fs.mkdtempSync creates /tmp/copy-md-sources-abc123.
  2. copyAll runs and writes files to the temp output dir.
  3. assert.equal at line 24 throws AssertionError due to a content mismatch (hypothetical failure).
  4. JavaScript unwinds the call stack; line 29 (fs.rmSync(tempRoot, ...)) is never reached.
  5. /tmp/copy-md-sources-abc123 and all its contents remain on disk until the OS cleans /tmp.


Labels

codex, size:M (This PR changes 30-99 lines, ignoring generated files.)
