Conversation

@Pabl0cks Pabl0cks commented Mar 6, 2025

See the llmstxt website for more context on llms.txt / llms-full.txt files.

Based on @portdeveloper's idea and initial iteration, I've added a feature to generate an LLM-friendly version of our documentation that helps AI assistants like Claude and ChatGPT better understand our Scaffold-ETH 2 project.

What this PR does:

  • Creates a llms-full.txt file containing all our documentation in a format optimized for LLMs. The doc will be accessible at /llms-full.txt
  • In this approach I've mixed a hardcoded initial context for a high-level overview with an auto-generated context from our docs content.
  • I'd love it if we can polish the hardcoded stuff together and verify that the auto-generated content is generated correctly:
    • Organizes content by folders with proper hierarchical structure
    • Preserves document titles and URLs while removing frontmatter and other noise
    • Adds folder descriptions from _category_.json files (in the folders where this file exists)
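
As a reference for reviewers, each document in the generated llms-full.txt should come out roughly in this shape (a sketch based on the Anthropic-style header format used in the plugin snippet in this thread; the angle-bracket placeholders stand in for the real values):

```
# <Document title>
Source: <siteUrl>/docs/<relative/path>

<cleaned markdown body, with frontmatter and the duplicate first-level heading removed>
```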

Implementation details:

  • Added three scripts:

    • generate-llms-txt.js: Core logic for processing markdown files
    • generate-llms-full.js: Standalone script for manual generation
    • llms-txt-plugin.js: Docusaurus plugin for automatic generation during builds
  • Added an npm script: npm run generate-llms for easy manual generation
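
The npm script presumably ends up in package.json along these lines (a sketch; the script path is a guess based on the file names listed above):

```json
{
  "scripts": {
    "generate-llms": "node scripts/generate-llms-full.js"
  }
}
```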

Maybe we can get rid of the standalone script / manual generation once this matures?

How to test:

  1. Run npm run generate-llms
  2. Check the generated static/llms-full.txt file
  3. Build the site with npm run build to verify the plugin works

This is the generated file if you just want to review the content: llms-full.txt

Examples of other projects generating a txt file for LLMs:



@carletex carletex left a comment


Hey Pablo, thanks for the PR. I think this is a cool feature to have. If we can generate both llm files on build, that would be amazing, so we don't need to do anything special when creating a PR (and doing it on build is probably better than generating them with a GitHub Action, as I initially suggested).

I started to review the PR but got discouraged. There are a bunch of things that don't make sense, are confusing, or are under-optimized (I'm guessing vibe coding) in 500+ lines of code. In this particular case (SE-2 docs) I'm not sure it makes sense to spend the time required for a proper review... which is a bit sad. So maybe, if it works, we should just move forward.

What we do need to change is the hardcoded part; there are pieces of it that don't make sense to me. Happy to give some feedback there.

But let's see what others think too.


portdeveloper commented Mar 7, 2025

How about we have a very basic generation script that conforms to Anthropic's formatting?

Here is the branch:
https://github.com/scaffold-eth/se-2-docs/tree/feat/llms-txt-generator-port

const fs = require("fs");
const path = require("path");

/** @type {import('@docusaurus/types').Plugin} */
function docusaurusPluginLLMsFull(context, options = {}) {
  const { siteDir } = context;
  const outputFile = options.outputFile || "llms-full.txt";

  // Function to recursively get all markdown files
  function getAllMarkdownFiles(dir, fileList = []) {
    const files = fs.readdirSync(dir);

    files.forEach(file => {
      const filePath = path.join(dir, file);
      const stat = fs.statSync(filePath);

      if (stat.isDirectory()) {
        getAllMarkdownFiles(filePath, fileList);
      } else if (file.endsWith(".md") || file.endsWith(".mdx")) {
        fileList.push(filePath);
      }
    });

    return fileList;
  }

  function extractTitle(content) {
    // Look for the first # heading
    const titleMatch = content.match(/^#\s+(.+)$/m);
    if (titleMatch && titleMatch[1]) {
      return titleMatch[1].trim();
    }

    // If no # heading, try to find a title in frontmatter
    const frontmatterMatch = content.match(/^---\s*\n([\s\S]*?)\n---/);
    if (frontmatterMatch) {
      const frontmatter = frontmatterMatch[1];
      const titleInFrontmatter = frontmatter.match(/title:\s*["']?([^"'\n]+)["']?/);
      if (titleInFrontmatter && titleInFrontmatter[1]) {
        return titleInFrontmatter[1].trim();
      }
    }

    return null;
  }

  function generateUrl(filePath) {
    const relativePath = path.relative(path.join(siteDir, "docs"), filePath);
    const pathWithoutExt = relativePath.replace(/\.(md|mdx)$/, "");
    return `${context.siteConfig.url}/docs/${pathWithoutExt.replace(/\\/g, "/")}`;
  }

  function cleanContent(content) {
    // Remove frontmatter
    let cleaned = content.replace(/^---\s*\n[\s\S]*?\n---\s*\n/, "");

    // Remove any duplicate first-level headings
    cleaned = cleaned.replace(/^#\s+.+\n/m, "");

    // Clean up excessive whitespace
    cleaned = cleaned.replace(/\n{3,}/g, "\n\n");

    return cleaned.trim();
  }

  async function generateContent() {
    try {
      const docsDir = path.join(siteDir, "docs");
      const markdownFiles = getAllMarkdownFiles(docsDir);
      let fullText = "";

      for (const filePath of markdownFiles) {
        const content = fs.readFileSync(filePath, "utf8");

        // Extract title or use filename
        let title = extractTitle(content);
        if (!title) {
          const filename = path.basename(filePath, path.extname(filePath));
          title = filename.charAt(0).toUpperCase() + filename.slice(1).replace(/-/g, " ");
        }

        const url = generateUrl(filePath);

        // Add title and source URL as header (Anthropic format)
        fullText += `# ${title}\nSource: ${url}\n\n`;

        fullText += cleanContent(content);
        fullText += "\n\n\n";
      }
      const staticDir = path.join(siteDir, "static");
      // Ensure the static dir exists before writing
      fs.mkdirSync(staticDir, { recursive: true });

      fs.writeFileSync(path.join(staticDir, outputFile), fullText);
      console.log(`Successfully generated ${outputFile} in static folder`);

      return { success: true };
    } catch (error) {
      console.error(`Error generating ${outputFile}:`, error);
      return { success: false, error: error.message };
    }
  }

  return {
    name: "docusaurus-plugin-llms-full",
    async loadContent() {
      try {
        await generateContent();
        return { success: true };
      } catch (error) {
        console.error("Error in loadContent:", error);
        return { success: false, error: error.message };
      }
    },
    async contentLoaded({ actions, content }) {
      const { setGlobalData } = actions;
      setGlobalData(content);
    },
  };
}

module.exports = docusaurusPluginLLMsFull;
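
For context, a plugin like this would be registered in docusaurus.config.js roughly as follows (a sketch; the plugin path is hypothetical, the option name matches the snippet above):

```js
// docusaurus.config.js (sketch)
module.exports = {
  // ...rest of the site config...
  plugins: [
    [require.resolve("./src/plugins/llms-txt-plugin"), { outputFile: "llms-full.txt" }],
  ],
};
```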


@portdeveloper portdeveloper left a comment


Apart from maybe extracting the frontmatter-removal logic into a separate function, and some other minor improvements, I think this looks pretty good! We can always refactor it further, I believe.
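
The extraction suggested here could look something like this (a sketch: the regex comes straight from cleanContent() in the snippet above, while the function name stripFrontmatter is made up):

```javascript
// Hypothetical standalone helper, reusing the frontmatter regex from cleanContent()
function stripFrontmatter(content) {
  // Drop a leading YAML frontmatter block delimited by "---" lines
  return content.replace(/^---\s*\n[\s\S]*?\n---\s*\n/, "");
}

console.log(stripFrontmatter("---\ntitle: Quickstart\n---\n# Quickstart\nBody\n"));
// prints "# Quickstart" then "Body"
```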

@MattPereira

Viem has a pretty sweet plugin script that parses markdown into an abstract syntax tree.

Wasn't too hard for us to port into VuePress.

Thanks again for the head start friends ❤️


Pabl0cks commented Apr 1, 2025

Viem has a pretty sweet plugin script that parses markdown into abstract syntax tree
Wasn't too hard for us to port into vuepress

We were thinking about migrating to vocs #124, but porting this script is a good option to explore too, tysm @MattPereira !! ♥
