Conversation

@Pabl0cks Pabl0cks commented Mar 6, 2025

See the llmstxt website for more context on llms.txt / llms-full.txt files.

Based on @portdeveloper's idea and initial iteration, I've added a feature to generate an LLM-friendly version of our documentation that helps AI assistants like Claude and ChatGPT better understand our Scaffold-ETH 2 project.

What this PR does:

  • Creates a llms-full.txt file containing all our documentation in a format optimized for LLMs. The doc will be accessible at /llms-full.txt
  • In this approach I've mixed a hardcoded initial context for a high-level overview with an auto-generated context from our docs content.
  • I'd love it if we can polish the hardcoded stuff together and verify that the auto-generated content is generated correctly:
    • Organizes content by folders with proper hierarchical structure
    • Preserves document titles and URLs while removing frontmatter and other noise
    • Adds folder descriptions from _category_.json files (in the folders where this file exists)
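
As a reference for reviewers, each document in the generated llms-full.txt should come out roughly in this shape (a sketch based on the Anthropic-style header format used in the plugin snippet in this thread; the angle-bracket placeholders stand in for the real values):

```
# <Document title>
Source: <siteUrl>/docs/<relative/path>

<cleaned markdown body, with frontmatter and the duplicate first-level heading removed>
```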

Implementation details:

  • Added three scripts:

    • generate-llms-txt.js: Core logic for processing markdown files
    • generate-llms-full.js: Standalone script for manual generation
    • llms-txt-plugin.js: Docusaurus plugin for automatic generation during builds
  • Added an npm script: npm run generate-llms for easy manual generation
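
The npm script presumably ends up in package.json along these lines (a sketch; the script path is a guess based on the file names listed above):

```json
{
  "scripts": {
    "generate-llms": "node scripts/generate-llms-full.js"
  }
}
```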

Maybe we can get rid of the standalone script / manual generation once this matures?

How to test:

  1. Run npm run generate-llms
  2. Check the generated static/llms-full.txt file
  3. Build the site with npm run build to verify the plugin works

This is the generated file if you just want to review the content: llms-full.txt

Examples of other projects generating a txt file for LLMs:



@carletex carletex left a comment


Hey Pablo, thanks for the PR. I think this is a cool feature to have. If we can generate both llm files on build, that would be amazing, so we don't need to do anything special when creating a PR (and doing it on build is probably better than generating them with a GitHub Action, as I initially suggested).

I started to review the PR but got discouraged. There are a bunch of things that don't make sense, are confusing, or are under-optimized (I'm guessing vibe coding) in 500+ lines of code. In this particular case (SE-2 docs) I'm not sure it makes sense to spend the time required for a proper review... which is a bit sad. So maybe, if it works, we should just move forward.

What we do need to change is the hardcoded part; there are pieces of it that don't make sense to me. Happy to give some feedback there.

But let's see what others think too.


portdeveloper commented Mar 7, 2025

How about we have a very basic generation script that conforms to Anthropic's formatting?

Here is the branch:
https://github.com/scaffold-eth/se-2-docs/tree/feat/llms-txt-generator-port

const fs = require("fs");
const path = require("path");

/** @type {import('@docusaurus/types').Plugin} */
function docusaurusPluginLLMsFull(context, options = {}) {
  const { siteDir } = context;
  const outputFile = options.outputFile || "llms-full.txt";

  // Function to recursively get all markdown files
  function getAllMarkdownFiles(dir, fileList = []) {
    const files = fs.readdirSync(dir);

    files.forEach(file => {
      const filePath = path.join(dir, file);
      const stat = fs.statSync(filePath);

      if (stat.isDirectory()) {
        getAllMarkdownFiles(filePath, fileList);
      } else if (file.endsWith(".md") || file.endsWith(".mdx")) {
        fileList.push(filePath);
      }
    });

    return fileList;
  }

  function extractTitle(content) {
    // Look for the first # heading
    const titleMatch = content.match(/^#\s+(.+)$/m);
    if (titleMatch && titleMatch[1]) {
      return titleMatch[1].trim();
    }

    // If no # heading, try to find a title in frontmatter
    const frontmatterMatch = content.match(/^---\s*\n([\s\S]*?)\n---/);
    if (frontmatterMatch) {
      const frontmatter = frontmatterMatch[1];
      const titleInFrontmatter = frontmatter.match(/title:\s*["']?([^"'\n]+)["']?/);
      if (titleInFrontmatter && titleInFrontmatter[1]) {
        return titleInFrontmatter[1].trim();
      }
    }

    return null;
  }

  function generateUrl(filePath) {
    const relativePath = path.relative(path.join(siteDir, "docs"), filePath);
    const pathWithoutExt = relativePath.replace(/\.(md|mdx)$/, "");
    return `${context.siteConfig.url}/docs/${pathWithoutExt.replace(/\\/g, "/")}`;
  }

  function cleanContent(content) {
    // Remove frontmatter
    let cleaned = content.replace(/^---\s*\n[\s\S]*?\n---\s*\n/, "");

    // Remove any duplicate first-level headings
    cleaned = cleaned.replace(/^#\s+.+\n/m, "");

    // Clean up excessive whitespace
    cleaned = cleaned.replace(/\n{3,}/g, "\n\n");

    return cleaned.trim();
  }

  async function generateContent() {
    try {
      const docsDir = path.join(siteDir, "docs");
      const markdownFiles = getAllMarkdownFiles(docsDir);
      let fullText = "";

      for (const filePath of markdownFiles) {
        const content = fs.readFileSync(filePath, "utf8");

        // Extract title or use filename
        let title = extractTitle(content);
        if (!title) {
          const filename = path.basename(filePath, path.extname(filePath));
          title = filename.charAt(0).toUpperCase() + filename.slice(1).replace(/-/g, " ");
        }

        const url = generateUrl(filePath);

        // Add title and source URL as header (Anthropic format)
        fullText += `# ${title}\nSource: ${url}\n\n`;

        fullText += cleanContent(content);
        fullText += "\n\n\n";
      }
      const staticDir = path.join(siteDir, "static");
      // Ensure the static dir exists before writing
      fs.mkdirSync(staticDir, { recursive: true });

      fs.writeFileSync(path.join(staticDir, outputFile), fullText);
      console.log(`Successfully generated ${outputFile} in static folder`);

      return { success: true };
    } catch (error) {
      console.error(`Error generating ${outputFile}:`, error);
      return { success: false, error: error.message };
    }
  }

  return {
    name: "docusaurus-plugin-llms-full",
    async loadContent() {
      try {
        await generateContent();
        return { success: true };
      } catch (error) {
        console.error("Error in loadContent:", error);
        return { success: false, error: error.message };
      }
    },
    async contentLoaded({ actions, content }) {
      const { setGlobalData } = actions;
      setGlobalData(content);
    },
  };
}

module.exports = docusaurusPluginLLMsFull;
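
For context, a plugin like this would be registered in docusaurus.config.js roughly as follows (a sketch; the plugin path is hypothetical, the option name matches the snippet above):

```js
// docusaurus.config.js (sketch)
module.exports = {
  // ...rest of the site config...
  plugins: [
    [require.resolve("./src/plugins/llms-txt-plugin"), { outputFile: "llms-full.txt" }],
  ],
};
```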


@portdeveloper portdeveloper left a comment


Apart from maybe extracting the frontmatter-removal logic into a separate function, and some other minor improvements, I think this looks pretty good! We can always refactor it further, I believe.
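
The extraction suggested here could look something like this (a sketch: the regex comes straight from cleanContent() in the snippet above, while the function name stripFrontmatter is made up):

```javascript
// Hypothetical standalone helper, reusing the frontmatter regex from cleanContent()
function stripFrontmatter(content) {
  // Drop a leading YAML frontmatter block delimited by "---" lines
  return content.replace(/^---\s*\n[\s\S]*?\n---\s*\n/, "");
}

console.log(stripFrontmatter("---\ntitle: Quickstart\n---\n# Quickstart\nBody\n"));
// prints "# Quickstart" then "Body"
```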

@MattPereira

Viem has a pretty sweet plugin script that parses markdown into an abstract syntax tree.

Wasn't too hard for us to port into VuePress.

Thanks again for the head start friends ❤️


Pabl0cks commented Apr 1, 2025

Viem has a pretty sweet plugin script that parses markdown into abstract syntax tree
Wasn't too hard for us to port into vuepress

We were thinking about migrating to vocs #124, but porting this script is a good option to explore too, tysm @MattPereira !! ♥
