Automate llms.txt generation #1857
Conversation
Pull Request Overview
This PR introduces automated generation of llms.txt and llms-full.txt files for the Qdrant documentation. It uses GitHub Models for content summarization and ensures these files stay synchronized with documentation changes.
- Adds a Python script that scans Hugo content and generates summaries using GitHub Models API
- Creates a GitHub Actions workflow to automatically run the generation script on content changes
- Updates configuration documentation comments to clarify the full_scan_threshold_kb parameter behavior
Reviewed Changes
Copilot reviewed 4 out of 6 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| automation/generate-llms-txt.py | Core script that processes Hugo content and generates llms.txt files with AI summaries |
| .github/workflows/generate-llms-txt.yml | GitHub Actions workflow to automate the generation process on content changes |
| qdrant-landing/content/documentation/guides/configuration.md | Updated comments for full_scan_threshold_kb parameter |
| qdrant-landing/content/documentation/concepts/indexing.md | Updated comments for full_scan_threshold parameter |
```python
# Load the current state of the llms.txt file to avoid duplicates
with open(os.path.join(OUTPUT_DIR, "llms.txt"), "r", encoding="utf-8") as llms_file:
    existing_urls = {line.split("](")[1].split(")")[0] for line in llms_file if line.startswith("- [")}
```
Copilot AI (Aug 14, 2025):
The code attempts to read from llms.txt before checking if it exists. If the file doesn't exist on first run, this will raise a FileNotFoundError. Consider using a try-except block or checking file existence first.
Suggested change:

```python
try:
    with open(os.path.join(OUTPUT_DIR, "llms.txt"), "r", encoding="utf-8") as llms_file:
        existing_urls = {line.split("](")[1].split(")")[0] for line in llms_file if line.startswith("- [")}
except FileNotFoundError:
    existing_urls = set()
```
```python
# Load the paths to all the published content in Hugo and process them sequentially
# to generate the llms.txt and llms-full.txt files.
with (open(os.path.join(OUTPUT_DIR, "llms.txt"), "a+", encoding="utf-8") as llms_file, \
```
Copilot AI (Aug 14, 2025):
Opening llms.txt in append mode ('a+') after reading existing URLs will result in duplicating content since the file pointer is at the end. Consider opening in write mode ('w') and rewriting the entire file, or handle the file pointer position correctly.
Suggested change:

```diff
-with (open(os.path.join(OUTPUT_DIR, "llms.txt"), "a+", encoding="utf-8") as llms_file, \
+with (open(os.path.join(OUTPUT_DIR, "llms.txt"), "w", encoding="utf-8") as llms_file, \
```
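To illustrate the file-pointer issue the reviewer describes, here is a minimal, self-contained sketch (illustrative names, not the PR's exact code): reading the existing entries first and then rewriting the whole file in `"w"` mode avoids the duplicates that `"a+"` would produce, since `"a+"` always writes at the end of the file.

```python
import os
import tempfile

# Illustrative sketch (not the PR's exact code): merge existing entries
# with newly discovered ones, then rewrite the file from scratch.
def rewrite_llms_txt(path, entries):
    # Collect URLs already present so a caller can skip re-summarizing them.
    existing_urls = set()
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            existing_urls = {
                line.split("](")[1].split(")")[0]
                for line in f
                if line.startswith("- [")
            }

    # Rewriting in "w" mode regenerates every entry exactly once,
    # so repeated runs never append duplicates.
    with open(path, "w", encoding="utf-8") as f:
        for title, url in entries:
            f.write(f"- [{title}]({url})\n")
    return existing_urls

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "llms.txt")
    entries = [("Indexing", "https://qdrant.tech/documentation/concepts/indexing/")]
    rewrite_llms_txt(path, entries)         # first run: file does not exist yet
    seen = rewrite_llms_txt(path, entries)  # second run: same entry, no duplicate
    with open(path, encoding="utf-8") as f:
        assert len(f.readlines()) == 1
```

With `"a+"` in place of `"w"`, the second run would leave two identical lines in the file.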
```yaml
# `full_scan_threshold_kb`, the query planner will use full-scan search instead of HNSW index
# traversal for better performance.
# Note: 1Kb = 1 vector of size 256
full_scan_threshold: 10000
```
Copilot AI (Aug 14, 2025):
The parameter name 'full_scan_threshold' is inconsistent with the configuration guide which uses 'full_scan_threshold_kb'. This should likely be 'full_scan_threshold_kb' for consistency.
Suggested change:

```diff
-full_scan_threshold: 10000
+full_scan_threshold_kb: 10000
```
Co-authored-by: Copilot <[email protected]>
Force-pushed from 0765fe1 to 7498193.
```python
# Call the GitHub Models API to generate a summary
client = openai.OpenAI(
    api_key=os.environ.get("GITHUB_TOKEN"),
    base_url="https://models.github.ai/inference",
```
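For context, a hedged sketch of how such a client is typically used for summarization; the model id, prompt, and `summarize` wrapper below are assumptions for illustration, not the PR's exact code:

```python
import os

# Sketch only (assumed names, not the PR's exact code): wrap the GitHub Models
# call so the OpenAI SDK is only imported when a summary is actually requested.
def summarize(markdown_text: str, model: str = "openai/gpt-4o-mini") -> str:
    import openai  # the official OpenAI SDK, pointed at the GitHub Models endpoint

    client = openai.OpenAI(
        api_key=os.environ["GITHUB_TOKEN"],  # GitHub Models accepts a repo token
        base_url="https://models.github.ai/inference",
    )
    response = client.chat.completions.create(
        model=model,  # hypothetical model choice
        messages=[
            {"role": "system", "content": "Summarize this documentation page in one or two sentences."},
            {"role": "user", "content": markdown_text},
        ],
    )
    return response.choices[0].message.content
```

In a GitHub Actions job, `GITHUB_TOKEN` is the workflow's built-in token, so no separate API key needs to be provisioned.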
Do you propose calling a language model as part of the CI process?
Yes, but it is supposed to run only on newly added Hugo content, except for the first run. I assumed the overall meaning of a doc should not change much over time, so a summary only needs to be created for new subpages.
This PR contains a script that automatically generates llms.txt and llms-full.txt, so whenever we change anything in the content, it is automatically reflected in these files.
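That incremental behavior can be sketched as follows (illustrative names, not the PR's exact code): pages whose URLs already appear in llms.txt are skipped, so only new subpages trigger a summarization call.

```python
# Illustrative sketch (not the PR's exact code): only pages whose URL is not
# yet listed in llms.txt get a (potentially expensive) summary generated.
def pages_needing_summary(discovered_pages, existing_urls):
    """discovered_pages: iterable of (title, url) pairs from the Hugo content scan."""
    return [(title, url) for title, url in discovered_pages if url not in existing_urls]

existing_urls = {"https://qdrant.tech/documentation/concepts/indexing/"}
discovered = [
    ("Indexing", "https://qdrant.tech/documentation/concepts/indexing/"),
    ("Snapshots", "https://qdrant.tech/documentation/concepts/snapshots/"),
]
new_pages = pages_needing_summary(discovered, existing_urls)
# Only the Snapshots page would be summarized on this run.
```

On the first run `existing_urls` is empty, so every page is summarized once; subsequent runs only pay for genuinely new content.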
Key characteristics
The current state of both files was generated with this script. I also added a GitHub Action that should keep them up to date automatically.