Mythosia.Documents.Pdf

A PDF document loader that parses PDF files into structured DoclingDocument models via PdfPig. It provides font-size-based heading detection, bullet and numbered list recognition, and spatial paragraph grouping, and it supports encrypted PDFs, metadata extraction, and optional page number headers.

Installation

dotnet add package Mythosia.Documents.Pdf

Quick Start

using Mythosia.Documents.Pdf;

var loader = new PdfDocumentLoader();
IReadOnlyList<DoclingDocument> docs = await loader.LoadAsync("docs/manual.pdf");

string markdown = docs[0].ToMarkdown();

With RAG Pipeline

var service = new AnthropicService(apiKey, httpClient)
    .WithRag(rag => rag
        .AddDocuments(new PdfDocumentLoader(), "docs/manual.pdf")
    );

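// Alternatively, auto-select the loader by file extension.
// (A separate variable name is used so both snippets compile in the same scope.)
var autoService = new AnthropicService(apiKey, httpClient)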
    .WithRag(rag => rag.AddDocument("docs/manual.pdf"));

Structured Extraction

The parser analyses font sizes and spatial layout to produce a structured DoclingDocument:

Headings: text whose font size exceeds the body font size (the mode, i.e. the most frequent size) by at least 15% is classified as a heading of level 1–3, depending on the size ratio.
Lists: lines starting with a bullet character (•, -, *, etc.) or a numbered pattern (1., a), iv.) are emitted as list items.
Paragraphs: words are grouped into lines by Y-coordinate proximity. Consecutive body-text lines are merged into a single paragraph; a vertical gap larger than 1.4× the line height starts a new paragraph.
Fallback: if GetWords() returns no results but raw page text exists, that text is preserved as a paragraph.
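
The rules above can be sketched in a few lines of standalone C#. This is an illustration only, not the code in PdfPigParser.cs: the class and method names are hypothetical, the 1.3 and 1.6 ratio cut-offs between heading levels are assumptions (the README only states the 15% threshold and the 1–3 range), and the list-marker pattern is one plausible reading of "bullet characters or numbered patterns".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Mythosia.Documents.Pdf.
public static class ExtractionSketch
{
    // Body font size = the most frequent (mode) font size among all words.
    // Throws InvalidOperationException if the sequence is empty.
    public static double BodyFontSize(IEnumerable<double> wordFontSizes) =>
        wordFontSizes
            .GroupBy(size => Math.Round(size, 1))  // absorb small float noise
            .OrderByDescending(group => group.Count())
            .ThenBy(group => group.Key)            // deterministic tie-break
            .First().Key;

    // Returns 0 for body text, otherwise a heading level from 1 to 3.
    // 1.15 comes from the README; 1.3 and 1.6 are assumed cut-offs.
    public static int HeadingLevel(double fontSize, double bodyFontSize)
    {
        double ratio = fontSize / bodyFontSize;
        if (ratio < 1.15) return 0;
        if (ratio >= 1.6) return 1;
        if (ratio >= 1.3) return 2;
        return 3;
    }

    // Bullet (•, -, *), "1." / "1)", "a." / "a)", or roman "iv." / "iv)",
    // each followed by whitespace. An assumed pattern, not the parser's own.
    private static readonly Regex ListMarker = new Regex(
        @"^\s*(?:[•\-\*]|\d+[.)]|[a-z][.)]|[ivxlcdm]+[.)])\s+",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static bool IsListItem(string line) => ListMarker.IsMatch(line);

    // A vertical gap larger than 1.4x the line height starts a new paragraph.
    public static bool IsParagraphBreak(double verticalGap, double lineHeight) =>
        verticalGap > 1.4 * lineHeight;
}
```

With a 12 pt body font, for example, `ExtractionSketch.HeadingLevel(14, 12)` falls in the lowest heading band under these assumed cut-offs, while `ExtractionSketch.IsListItem("Plain text")` is false because no marker precedes the text.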

Parser Options

using Mythosia.Documents.Pdf;

var options = new PdfParserOptions
{
    Password = null,              // For encrypted PDFs
    IncludeMetadata = true,       // Extract title, author, page count
    IncludePageNumbers = false,   // Add page number headers
    NormalizeWhitespace = true,   // Collapse excessive whitespace (preserves newlines)
};

var loader = new PdfDocumentLoader(options: options);
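
To make the NormalizeWhitespace comment concrete, here is a minimal standalone sketch of what "collapse excessive whitespace (preserves newlines)" could mean. The WhitespaceSketch class is hypothetical and the exact behavior is an assumption; the package's own normalization may differ in details such as how it treats trailing spaces.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Mythosia.Documents.Pdf.
public static class WhitespaceSketch
{
    public static string Normalize(string text)
    {
        // [^\S\r\n] matches any whitespace character except CR and LF,
        // so runs of spaces and tabs shrink to one space while line
        // breaks are left untouched.
        string collapsed = Regex.Replace(text, @"[^\S\r\n]+", " ");

        // Drop the single space that may remain directly before a line break.
        return Regex.Replace(collapsed, @" +(?=\r?\n)", "");
    }
}
```

Under these assumptions, `WhitespaceSketch.Normalize("a  \t b\n\nc   d")` keeps both line breaks and reduces each run of spaces and tabs to a single space.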

Custom Parser

Implement IDocumentParser and pass it to the loader:

var loader = new PdfDocumentLoader(parser: new MyCustomPdfParser());

Related Packages

Package	Description
Mythosia.Documents.Abstractions	Core abstractions (DoclingDocument, IDocumentLoader)
Mythosia.Documents.Office	Word / Excel / PowerPoint loaders
Mythosia.AI.Rag	RAG pipeline
