The functional core of the llms.txt generation system. Contains all business logic for generating, validating, and updating llms.txt files from websites using LLMs, plus a CLI tool for standalone generation.
The core-ltx crate is the heart of the application, providing:
- Web content extraction: Downloads and parses HTML from target websites
- LLM integration: Interfaces with OpenAI GPT models and Anthropic Claude models
- llms.txt generation: Transforms web content into structured llms.txt format
- Validation and retry logic: Ensures generated files meet format requirements
- Update detection: Compares existing llms.txt files with new versions
- CLI tool: Standalone command-line interface for one-off generation
- Common utilities: Shared configuration, logging, and helpers used by other crates
```
src/core-ltx/
├── src/
│   ├── lib.rs            # Core library exports
│   ├── main.rs           # CLI entry point
│   ├── errors.rs         # Error types
│   ├── llms/             # LLM model integrations
│   │   ├── mod.rs        # Model interface and generation logic
│   │   ├── chatgpt.rs    # OpenAI GPT integration
│   │   ├── claude.rs     # Anthropic Claude integration (placeholder)
│   │   └── prompts.rs    # System prompts for llms.txt generation
│   ├── web_html.rs       # HTML fetching and parsing
│   ├── md_llm_txt.rs     # Markdown/llms.txt format handling
│   └── common/           # Shared utilities
│       ├── mod.rs        # Common module exports
│       ├── auth_config.rs    # Authentication configuration helpers
│       ├── tls_config.rs     # TLS configuration helpers
│       ├── db_env.rs         # Database configuration helpers
│       ├── hostname.rs       # Hostname parsing utilities
│       ├── logging.rs        # Logging setup
│       └── poll_interval.rs  # Polling interval configuration
├── Cargo.toml
└── build.rs              # Build script for embedding prompts
```
- Web Content Fetching: Downloads HTML from the target URL
- HTML Parsing: Extracts meaningful content using html5ever
- Content Preprocessing: Cleans and structures the extracted text
- LLM Prompting: Sends content to GPT-5.2 with specialized prompts
- Format Validation: Ensures output conforms to llms.txt specification
- Retry Logic: Automatically retries with fix prompts if validation fails
- Result Storage: Returns generated content for storage/serving
Currently supports:
- OpenAI GPT-5.2: Primary model for generation (requires OPENAI_API_KEY)
- OpenAI GPT-5 Mini: Faster, more cost-effective option
- OpenAI GPT-5 Nano: Lightweight option for simple sites
- Anthropic Claude: Integration structure in place (not yet fully implemented)
The system uses carefully crafted prompts (see src/llms/prompts.rs) to ensure the generated llms.txt files:
- Follow the proper markdown format
- Include accurate summaries of the website
- Provide useful context for LLM consumption
- Maintain consistent structure
When regenerating an llms.txt file:
- Fetches current content from the website
- Generates new llms.txt content
- Compares with existing version
- Determines if meaningful changes occurred
- Only updates if content has substantively changed
This prevents unnecessary updates for minor formatting differences or timestamp changes.
The common module provides shared functionality used across all crates:
- Auth configuration: Parsing and validation of authentication settings
- TLS configuration: Loading and configuring TLS certificates
- Database configuration: PostgreSQL connection string parsing
- Hostname utilities: URL parsing and validation
- Logging setup: Structured logging with tracing
- Poll intervals: Configuration of periodic task intervals
The core library uses environment variables for configuration:
- OPENAI_API_KEY: OpenAI API key (required for generation)
- RUST_LOG: Logging level (default: info)
```shell
# Build the library and CLI
cargo build -p core-ltx

# Production build
cargo build -p core-ltx --release
```

The CLI tool allows standalone generation of llms.txt files:

```shell
# Basic usage: generate llms.txt for a website
cargo run -p core-ltx -- generate https://example.com

# Specify output file
cargo run -p core-ltx -- generate https://example.com --output example-llms.txt

# Update an existing llms.txt file
cargo run -p core-ltx -- update https://example.com --existing old-llms.txt

# Use a different GPT model
cargo run -p core-ltx -- generate https://example.com --model gpt-5-mini

# View help
cargo run -p core-ltx -- --help
```

The library can also be used programmatically:

```rust
use core_ltx::llms::{generate_llms_txt, LlmModel};
use core_ltx::web_html::fetch_and_parse_html;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fetch website content
    let html_content = fetch_and_parse_html("https://example.com").await?;

    // Generate llms.txt using GPT-5.2
    let llms_txt = generate_llms_txt(
        "https://example.com",
        &html_content,
        LlmModel::Gpt52,
    ).await?;

    println!("Generated llms.txt:\n{}", llms_txt);
    Ok(())
}
```

```shell
# Run unit tests
cargo test -p core-ltx

# Run integration tests (requires OPENAI_API_KEY)
OPENAI_API_KEY=your_key cargo test -p core-ltx -- --ignored

# Run with coverage
just test
```

The system uses multi-stage prompting:
- Initial generation prompt: Describes the llms.txt format and requirements
- Fix prompts: If validation fails, provides specific feedback for correction
- Update prompts: For existing files, guides the model to detect meaningful changes
Prompts are embedded at compile time from template files and can be customized by modifying src/llms/prompts.rs.
The crate defines comprehensive error types in errors.rs:
- WebFetchError: Problems downloading or parsing HTML
- LlmError: Issues communicating with LLM APIs
- ValidationError: llms.txt format validation failures
- ConfigError: Configuration or environment variable issues
All functions return Result types with descriptive errors.
Key dependencies:
- async-openai: OpenAI API client
- reqwest: HTTP client for web fetching
- html5ever: HTML parsing
- markup5ever_rcdom: DOM representation for HTML
- url: URL parsing and validation
- markdown-ppp: Markdown preprocessing
- nom: Parser combinators for format validation
- tokio: Async runtime
- tracing: Structured logging
See Cargo.toml for the complete dependency list.
To add support for a new LLM provider:
- Create a new module in src/llms/ (e.g., gemini.rs)
- Implement the generation function following the existing pattern
- Add the provider to the LlmModel enum in src/llms/mod.rs
- Update the CLI argument parsing in src/main.rs
- Add tests for the new provider
Prompts are defined in src/llms/prompts.rs. To customize:
- Modify the prompt templates in prompts.rs
- Test thoroughly to ensure generated content still validates
- Consider adding prompt versioning for reproducibility
Enable detailed logging:
```shell
RUST_LOG=core_ltx=debug cargo run -p core-ltx -- generate https://example.com
```

This will show:
- HTTP request/response details
- Raw HTML content (truncated)
- LLM prompts sent
- LLM responses received
- Validation errors (if any)
- Web fetching typically takes 1-3 seconds
- LLM generation typically takes 10-30 seconds
- Larger websites may require more processing time
- Consider using GPT-5-mini or GPT-5-nano for faster generation
- Connection pooling and request timeouts are configured for reliability
- llmstxt.org - Official llms.txt specification
- Project Root README - Overall project documentation
- api-ltx README - API server documentation
- worker-ltx README - Worker service documentation
- cron-ltx README - Update scheduler documentation