PDF document loader. Parses PDF files into DoclingDocument structured models via PdfPig. Provides font-size-based heading detection, bullet and numbered list recognition, and spatial paragraph grouping. Supports encrypted PDFs, metadata extraction, and optional page-number headers.
dotnet add package Mythosia.Documents.Pdf

using Mythosia.Documents.Pdf;
var loader = new PdfDocumentLoader();
IReadOnlyList<DoclingDocument> docs = await loader.LoadAsync("docs/manual.pdf");
string markdown = docs[0].ToMarkdown();

var service = new AnthropicService(apiKey, httpClient)
    .WithRag(rag => rag
        .AddDocuments(new PdfDocumentLoader(), "docs/manual.pdf")
    );
// Or auto-select loader by extension:
var service = new AnthropicService(apiKey, httpClient)
    .WithRag(rag => rag.AddDocument("docs/manual.pdf"));

The parser analyses font sizes and spatial layout to produce a structured DoclingDocument:
- Headings: text whose font size exceeds the body font size (the mode, i.e. the most frequent size) by ≥15% is classified as a heading of level 1–3 based on the size ratio.
- Lists: lines starting with bullet characters (`•`, `-`, `*`, etc.) or numbered patterns (`1.`, `a)`, `iv.`) are emitted as list items.
- Paragraphs: words are grouped into lines by Y-coordinate proximity. Consecutive body-text lines are merged into a single paragraph; vertical gaps larger than 1.4× the line height trigger a paragraph break.
- Fallback: if `GetWords()` returns no results but raw page text exists, the text is preserved as a paragraph.
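
The heading, list, and paragraph rules above can be sketched in a few lines of plain C#. This is only an illustration of the stated thresholds, not the library's actual implementation: every helper name here (`BodyFontSize`, `HeadingLevel`, `IsListItem`, `IsParagraphBreak`) is hypothetical, the 1.3 and 1.6 cut-offs between heading levels are assumed, the list-prefix pattern is a guess at what "etc." might cover, and the vertical gap is assumed to be measured between consecutive line positions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Illustrative helpers only; none of these names belong to Mythosia.Documents.Pdf.

// Body font size = the most frequent (mode) size, rounded to 0.1 pt (rounding is an assumption).
static double BodyFontSize(IEnumerable<double> fontSizes) =>
    fontSizes.GroupBy(size => Math.Round(size, 1))
             .OrderByDescending(group => group.Count())
             .First().Key;

// Returns 0 for body text, or a heading level from 1 to 3.
// The 1.15 threshold is the documented rule; the 1.3 and 1.6 cut-offs are assumed.
static int HeadingLevel(double fontSize, double bodySize)
{
    double ratio = fontSize / bodySize;
    if (ratio < 1.15) return 0;
    if (ratio >= 1.6) return 1;
    if (ratio >= 1.3) return 2;
    return 3;
}

// Assumed pattern for bullet and numbered prefixes such as "•", "-", "*", "1.", "a)", "iv.".
static bool IsListItem(string line) =>
    Regex.IsMatch(line, @"^\s*(?:[•\-*]|\d+[.)]|[A-Za-z][.)]|[ivxlcdm]+[.)])\s+");

// A vertical gap above 1.4x the line height starts a new paragraph.
static bool IsParagraphBreak(double previousLineY, double currentLineY, double lineHeight) =>
    Math.Abs(previousLineY - currentLineY) > 1.4 * lineHeight;

double body = BodyFontSize(new[] { 10.0, 10.0, 10.0, 18.0, 14.0, 12.0 });
Console.WriteLine(HeadingLevel(18.0, body));        // 1
Console.WriteLine(HeadingLevel(12.0, body));        // 3
Console.WriteLine(HeadingLevel(10.5, body));        // 0
Console.WriteLine(IsListItem("iv. Fourth item"));   // True
Console.WriteLine(IsParagraphBreak(700, 680, 10));  // True
```

Using the mode rather than the mean as the body size keeps a handful of large headings from shifting the baseline, which is presumably why the parser compares against it.
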
using Mythosia.Documents.Pdf;
var options = new PdfParserOptions
{
    Password = null,             // For encrypted PDFs
    IncludeMetadata = true,      // Extract title, author, page count
    IncludePageNumbers = false,  // Add page number headers
    NormalizeWhitespace = true,  // Collapse excessive whitespace (preserves newlines)
};
var loader = new PdfDocumentLoader(options: options);

To use a custom parser, implement IDocumentParser and pass it to the loader:
var loader = new PdfDocumentLoader(parser: new MyCustomPdfParser());

| Package | Description |
|---|---|
| Mythosia.Documents.Abstractions | Core abstractions (DoclingDocument, IDocumentLoader) |
| Mythosia.Documents.Office | Word / Excel / PowerPoint loaders |
| Mythosia.AI.Rag | RAG pipeline |