
Conversation

@kacperlukawski (Member) commented Aug 14, 2025

This PR contains a script that automatically generates llms.txt and llms-full.txt, so whenever we change anything in the content, it's automatically reflected in these files.

Key characteristics

  • GitHub Models are used to summarize the files for the llms.txt
  • The summarization is not applied to URLs that already exist in llms.txt
  • The llms-full.txt is always regenerated

The current state of both files was generated with this script. I also added a GitHub action that should automatically update the state of them.
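For illustration, the skip-existing logic described above could look roughly like this. The function names and the exact llms.txt entry format are assumptions for the sketch, not the actual script:

```python
def extract_existing_urls(lines):
    """Collect the URLs already listed in llms.txt, so pages that were
    summarized on a previous run are not sent to the model again."""
    urls = set()
    for line in lines:
        # Assumed entry format: "- [Page title](https://example.com/page/)"
        if line.startswith("- [") and "](" in line:
            urls.add(line.split("](")[1].split(")")[0])
    return urls


def pages_to_summarize(all_page_urls, existing_lines):
    """Only pages not yet present in llms.txt need a fresh AI summary."""
    existing = extract_existing_urls(existing_lines)
    return [url for url in all_page_urls if url not in existing]
```

On the first run the existing-URL set is empty, so every page is summarized; afterwards only newly added pages trigger a model call.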

netlify bot commented Aug 14, 2025

Deploy Preview for condescending-goldwasser-91acf0 ready!

Name Link
🔨 Latest commit f4a94af
🔍 Latest deploy log https://app.netlify.com/projects/condescending-goldwasser-91acf0/deploys/689e13fca2ef7700081dfa72
😎 Deploy Preview https://deploy-preview-1857--condescending-goldwasser-91acf0.netlify.app

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR introduces automated generation of llms.txt and llms-full.txt files for the Qdrant documentation. It uses GitHub Models for content summarization and ensures these files stay synchronized with documentation changes.

  • Adds a Python script that scans Hugo content and generates summaries using GitHub Models API
  • Creates a GitHub Actions workflow to automatically run the generation script on content changes
  • Updates configuration documentation comments to clarify the full_scan_threshold_kb parameter behavior

Reviewed Changes

Copilot reviewed 4 out of 6 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| automation/generate-llms-txt.py | Core script that processes Hugo content and generates llms.txt files with AI summaries |
| .github/workflows/generate-llms-txt.yml | GitHub Actions workflow to automate the generation process on content changes |
| qdrant-landing/content/documentation/guides/configuration.md | Updated comments for the full_scan_threshold_kb parameter |
| qdrant-landing/content/documentation/concepts/indexing.md | Updated comments for the full_scan_threshold parameter |


```python
# Load the current state of the llms.txt file to avoid duplicates
with open(os.path.join(OUTPUT_DIR, "llms.txt"), "r", encoding="utf-8") as llms_file:
    existing_urls = {line.split("](")[1].split(")")[0] for line in llms_file if line.startswith("- [")}
```

Copilot AI commented Aug 14, 2025

The code attempts to read from llms.txt before checking if it exists. If the file doesn't exist on first run, this will raise a FileNotFoundError. Consider using a try-except block or checking file existence first.

Suggested change:

```python
try:
    with open(os.path.join(OUTPUT_DIR, "llms.txt"), "r", encoding="utf-8") as llms_file:
        existing_urls = {line.split("](")[1].split(")")[0] for line in llms_file if line.startswith("- [")}
except FileNotFoundError:
    existing_urls = set()
```



```python
# Load the paths to all the published content in Hugo and process them sequentially
# to generate the llms.txt and llms-full.txt files.
with (open(os.path.join(OUTPUT_DIR, "llms.txt"), "a+", encoding="utf-8") as llms_file, \
```
Copilot AI commented Aug 14, 2025

Opening llms.txt in append mode ('a+') after reading existing URLs will result in duplicating content since the file pointer is at the end. Consider opening in write mode ('w') and rewriting the entire file, or handle the file pointer position correctly.

Suggested change:

```diff
- with (open(os.path.join(OUTPUT_DIR, "llms.txt"), "a+", encoding="utf-8") as llms_file, \
+ with (open(os.path.join(OUTPUT_DIR, "llms.txt"), "w", encoding="utf-8") as llms_file, \
```
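To see why the reviewer flags `a+` here, a minimal standalone reproduction (using a throwaway temp file, not the real script):

```python
import os
import tempfile

# Create a file with one existing entry.
path = os.path.join(tempfile.mkdtemp(), "llms.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("- [A](https://a/)\n")

# In 'a+' mode every write goes to the end of the file, regardless of seek();
# rewriting the list through this handle therefore appends a second copy.
with open(path, "a+", encoding="utf-8") as f:
    f.seek(0)
    existing = f.read()   # reading works fine
    f.write(existing)     # but this duplicates the content

with open(path, encoding="utf-8") as f:
    print(f.read().count("- [A]"))  # the entry now appears twice
```

Opening in `"w"` instead (and rewriting the whole file) avoids the duplication because the file is truncated before writing.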


```yaml
# `full_scan_threshold_kb`, the query planner will use full-scan search instead of HNSW index
# traversal for better performance.
# Note: 1Kb = 1 vector of size 256
full_scan_threshold: 10000
```
Copilot AI commented Aug 14, 2025

The parameter name 'full_scan_threshold' is inconsistent with the configuration guide which uses 'full_scan_threshold_kb'. This should likely be 'full_scan_threshold_kb' for consistency.

Suggested change:

```diff
- full_scan_threshold: 10000
+ full_scan_threshold_kb: 10000
```


@kacperlukawski kacperlukawski force-pushed the automate-llms-txt-generation branch from 0765fe1 to 7498193 Compare August 14, 2025 16:34
```python
# Call the GitHub Models API to generate a summary
client = openai.OpenAI(
    api_key=os.environ.get("GITHUB_TOKEN"),
    base_url="https://models.github.ai/inference",
```
A member commented:

Do you propose to call a language model as part of the CI process?

@kacperlukawski (Member, Author) replied:

Yes, but it is supposed to run only on newly added Hugo content, except for the first run. I assumed the overall meaning of a doc should not change much over time, so summaries only need to be created for new subpages.
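A sketch of that incremental behavior, where the `summarize` callable stands in for the GitHub Models call and all names are illustrative:

```python
def update_llms_entries(existing_urls, discovered_pages, summarize):
    """Build llms.txt entries only for pages whose URL is not already listed.

    `summarize` is the expensive step (an LLM call in the real script),
    so it runs exclusively for newly discovered pages."""
    new_entries = []
    for url, title in discovered_pages:
        if url in existing_urls:
            continue  # already summarized on a previous run
        new_entries.append(f"- [{title}]({url}): {summarize(url)}")
    return new_entries
```

Because the CI job only pays for model calls on new pages, the cost of each run stays proportional to the amount of new documentation rather than the whole site.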
