A high-performance Rust library that converts 40+ document formats to clean, readable Markdown. Perfect for preparing documents for LLM consumption, documentation generation, knowledge bases, or archival.
A Rust implementation of the original markitdown Python library, with extensive format support and an async-first design.
- 40+ Format Support: Word, Excel, PowerPoint (modern & legacy), PDF, EPUB, HTML, Markdown, LaTeX, and more
- Async-First Design: Non-blocking I/O with Tokio runtime
- Archive Extraction: Automatically extract and convert ZIP, TAR, GZIP, and more
- Image Extraction: Optional intelligent image extraction with LLM-powered descriptions
- LLM Integration: Works with OpenAI, Gemini, Claude, Cohere, and custom providers
- Streaming Support: Process large files efficiently
- Rich Output Structure: Preserves pagination, images, tables, and metadata
- Production-Ready: Comprehensive test suite with 198+ passing tests
Microsoft Office (Modern)
- Word (.docx, .dotx, .dotm)
- Excel (.xlsx, .xltx, .xltm)
- PowerPoint (.pptx, .potx, .potm)
Microsoft Office (Legacy)
- Word 97-2003 (.doc)
- Excel 97-2003 (.xls)
- PowerPoint 97-2003 (.ppt)
- Rich Text Format (.rtf)
OpenDocument Format
- Text (.odt, .ott)
- Spreadsheet (.ods, .ots)
- Presentation (.odp, .otp)
Apple iWork
- Pages (.pages)
- Numbers (.numbers)
- Keynote (.key)
Other Document Formats
- PDF (.pdf)
  - Intelligent fallback mechanism: automatically detects scanned PDFs, complex pages with diagrams, and pages with limited text and images
  - Uses text extraction by default for efficiency
  - Falls back to LLM-powered page rendering when:
    - Page has < 10 words (likely scanned)
    - Low alphanumeric ratio < 0.5 (OCR artifacts/garbage)
    - Unstructured content < 50 characters
    - Page contains images + < 350 words (provides full context to LLM)
  - Renders entire page as PNG for LLM processing when needed
- EPUB (.epub)
- Markdown (.md)
- CSV (.csv)
- Excel spreadsheets (.xlsx, .xls)
- SQLite databases (.sqlite, .db)
- XML (.xml)
- RSS feeds (.rss, .atom)
- HTML (.html, .htm)
- Email (.eml, .msg)
- vCard (.vcf)
- iCalendar (.ics)
- BibTeX (.bib)
- ZIP (.zip)
- TAR (.tar, .tar.gz, .tar.bz2, .tar.xz)
- GZIP (.gz)
- BZIP2 (.bz2)
- XZ (.xz)
- ZSTD (.zst)
- 7-Zip (.7z)
- Images (.jpg, .png, .gif, .bmp, .tiff, .webp)
  - With LLM integration for intelligent image descriptions
- Audio (planned)
- Plain text (.txt)
- Log files (.log)
Note: All formats support both file path and in-memory bytes conversion.
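
The PDF fallback rules listed above can be sketched as a single predicate. This is a minimal illustration that uses only the thresholds stated in the list. The function name, its inputs, and the reading of "unstructured content" as the trimmed length of the extracted text are assumptions, not the library's actual implementation.

```rust
// Thresholds copied from the PDF fallback list; everything else is hypothetical.
const MIN_WORDS: usize = 10;
const MIN_ALNUM_RATIO: f64 = 0.5;
const MIN_CHARS: usize = 50;
const IMAGE_PAGE_MAX_WORDS: usize = 350;

/// Returns true when a page's extracted text looks too weak to trust,
/// so the page should instead be rendered as an image for an LLM.
fn needs_llm_fallback(text: &str, has_images: bool) -> bool {
    // Rule 1: very few words usually means a scanned page.
    let words = text.split_whitespace().count();
    if words < MIN_WORDS {
        return true;
    }

    // Rule 2: a low share of alphanumeric characters suggests OCR garbage.
    // `visible` is non-empty here because the page has at least MIN_WORDS words.
    let visible: Vec<char> = text.chars().filter(|c| !c.is_whitespace()).collect();
    let alnum = visible.iter().filter(|c| c.is_alphanumeric()).count();
    if (alnum as f64) / (visible.len() as f64) < MIN_ALNUM_RATIO {
        return true;
    }

    // Rule 3 (assumed interpretation): too little text overall.
    if text.trim().chars().count() < MIN_CHARS {
        return true;
    }

    // Rule 4: pages with images and modest text are sent whole, for context.
    has_images && words < IMAGE_PAGE_MAX_WORDS
}

fn main() {
    let scanned = "Page 3";
    let prose = "word ".repeat(400);
    println!("scanned page -> fallback: {}", needs_llm_fallback(scanned, false));
    println!("long prose   -> fallback: {}", needs_llm_fallback(&prose, false));
}
```

Checking the cheapest rule first keeps the common case (plain text pages) fast, which matches the stated default of text extraction for efficiency.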
cargo install markitdown
markitdown path-to-file.pdf
Or use -o to specify the output file:
markitdown path-to-file.pdf -o document.md
Supported formats include Office documents (.docx, .xlsx, .pptx), legacy Office (.doc, .xls, .ppt), OpenDocument (.odt, .ods), Apple iWork (.pages, .numbers, .key), PDFs, EPUB, images, archives, and more. See the full list above.
Add the following to your Cargo.toml:
[dependencies]
markitdown = "0.1.10"

use markitdown::MarkItDown;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let md = MarkItDown::new();
    Ok(())
}

use markitdown::{ConversionOptions, MarkItDown};
use object_store::local::LocalFileSystem;
use object_store::path::Path as ObjectPath;
use std::sync::Arc;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let md = MarkItDown::new();

    // Create a local file system object store
    let store = Arc::new(LocalFileSystem::new());

    // Convert file path string to ObjectStore Path
    let path = ObjectPath::from("path/to/file.xlsx");

    // Basic conversion - file type is auto-detected from extension
    let result = md.convert_with_store(store.clone(), &path, None).await?;
    println!("Converted Text: {}", result.to_markdown());

    // Convert legacy Office formats
    let doc_path = ObjectPath::from("document.doc");
    let result = md.convert_with_store(store.clone(), &doc_path, None).await?;

    // Convert archives (extracts and converts contents)
    let zip_path = ObjectPath::from("archive.zip");
    let result = md.convert_with_store(store.clone(), &zip_path, None).await?;

    // Or explicitly specify options
    let options = ConversionOptions::default()
        .with_extension(".xlsx")
        .with_extract_images(true);
    let result = md.convert_with_store(store, &path, Some(options)).await?;

    Ok(())
}

Important: The library uses object_store for file operations, not plain file paths. You must:
- Create an ObjectStore implementation (like LocalFileSystem for local files)
- Convert file path strings to object_store::path::Path using Path::from()
- Use the convert_with_store() method with the store and path

For convenience, there's also a convert() method that accepts string paths and uses LocalFileSystem internally.
use markitdown::{ConversionOptions, MarkItDown, create_llm_client};
use rig::providers::openai;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let md = MarkItDown::new();

    // Create an LLM client using any rig-core compatible provider
    // OpenAI example:
    let openai_client = openai::Client::from_env();
    let model = openai_client.completion_model("gpt-4o");
    let llm = create_llm_client(model);

    // Google Gemini example:
    // let gemini_client = gemini::Client::from_env();
    // let model = gemini_client.completion_model("gemini-2.0-flash");
    // let llm = create_llm_client(model);

    // Anthropic Claude example:
    // let anthropic_client = anthropic::Client::from_env();
    // let model = anthropic_client.completion_model("claude-sonnet-4-20250514");
    // let llm = create_llm_client(model);

    // Cohere example with custom endpoint:
    // let api_key = std::env::var("COHERE_API_KEY")?;
    // let mut builder = rig::providers::cohere::Client::builder(&api_key);
    // if let Some(endpoint) = custom_endpoint {
    //     builder = builder.base_url(endpoint);
    // }
    // let client = builder.build();
    // let model = client.completion_model("command-r-plus");
    // let llm = create_llm_client(model);

    let options = ConversionOptions::default()
        .with_extension(".jpg")
        .with_llm(llm);

    let result = md.convert("path/to/image.jpg", Some(options)).await?;
    println!("Image description: {}", result.to_markdown());

    Ok(())
}

Environment Variables for LLM Tests (OpenRouter):
The integration test in tests/llm.rs expects these variables (via .env or your shell):
export OPENROUTER_API_KEY="your_api_key"
export OPENROUTER_ENDPOINT="https://openrouter.ai/api/v1"
export OPENROUTER_MODEL="@preset/prod-free"

If any of them are missing, the LLM test is skipped.
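
The skip-if-missing behaviour described here can be sketched as follows. This is an illustration only, not the code in tests/llm.rs: the OpenRouterConfig struct and load_config function are hypothetical names, and only the three variable names come from the text above.

```rust
use std::env;

// Hypothetical container for the three variables named above.
#[derive(Debug, PartialEq)]
struct OpenRouterConfig {
    api_key: String,
    endpoint: String,
    model: String,
}

/// Builds a config from a lookup function, returning None as soon as any
/// variable is missing. Taking the lookup as a parameter keeps the function
/// testable without touching the real process environment.
fn load_config(lookup: impl Fn(&str) -> Option<String>) -> Option<OpenRouterConfig> {
    Some(OpenRouterConfig {
        api_key: lookup("OPENROUTER_API_KEY")?,
        endpoint: lookup("OPENROUTER_ENDPOINT")?,
        model: lookup("OPENROUTER_MODEL")?,
    })
}

fn main() {
    // In a real test this branch would decide between running and returning early.
    match load_config(|name| env::var(name).ok()) {
        Some(cfg) => println!("running LLM test against {} with model {}", cfg.endpoint, cfg.model),
        None => println!("skipping LLM test: OpenRouter variables are not all set"),
    }
}
```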
Supported LLM Providers (via rig-core):
- OpenAI (GPT-4, GPT-4o, etc.)
- Google Gemini (gemini-2.0-flash, gemini-pro, etc.)
- Anthropic Claude (claude-sonnet, claude-opus, etc.)
- Cohere (command-r-plus, etc.)
- Any custom provider implementing CompletionModel
use markitdown::{ConversionOptions, MarkItDown};
use bytes::Bytes;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let md = MarkItDown::new();

    let file_bytes = std::fs::read("path/to/file.pdf")?;

    // Auto-detect file type from bytes
    let result = md.convert_bytes(Bytes::from(file_bytes.clone()), None).await?;
    println!("Converted: {}", result.to_markdown());

    // Or specify options explicitly
    let options = ConversionOptions::default()
        .with_extension(".pdf");
    let result = md.convert_bytes(Bytes::from(file_bytes), Some(options)).await?;

    Ok(())
}

The conversion returns a Document struct that preserves the page/slide structure of the original file:
use markitdown::{MarkItDown, Document, Page, ContentBlock, ExtractedImage};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let md = MarkItDown::new();
    let result: Document = md.convert("presentation.pptx", None).await?;

    // Access document metadata
    if let Some(title) = &result.title {
        println!("Document: {}", title);
    }

    // Iterate through pages/slides
    for page in &result.pages {
        println!("Page {}", page.page_number);

        // Get page content as markdown
        let markdown = page.to_markdown();

        // Or access individual content blocks
        for block in &page.content {
            match block {
                ContentBlock::Text(text) => println!("Text: {}", text),
                ContentBlock::Heading { level, text } => println!("H{}: {}", level, text),
                ContentBlock::Image(img) => {
                    println!("Image: {} ({} bytes)", img.id, img.data.len());
                    if let Some(desc) = &img.description {
                        println!("  Description: {}", desc);
                    }
                }
                ContentBlock::Table { headers, rows } => {
                    println!("Table: {} cols, {} rows", headers.len(), rows.len());
                }
                ContentBlock::List { ordered, items } => {
                    println!("List ({} items)", items.len());
                }
                ContentBlock::Code { language, code } => {
                    println!("Code block: {:?}", language);
                }
                ContentBlock::Quote(text) => println!("Quote: {}", text),
                ContentBlock::Markdown(md) => println!("Markdown: {}", md),
            }
        }

        // Get all images from this page
        let images: Vec<&ExtractedImage> = page.images();

        // Access rendered page image (for scanned PDFs, complex pages)
        if let Some(rendered) = &page.rendered_image {
            println!("Page rendered as image: {} bytes", rendered.data.len());
        }
    }

    // Convert entire document to markdown (with page separators)
    let full_markdown = result.to_markdown();

    // Get all images from the entire document
    let all_images = result.images();

    Ok(())
}

Output Structure:
- Document - Complete document with optional title, pages, and metadata
- Page - Single page/slide with page number and content blocks
- ContentBlock - Individual content element (Text, Heading, Image, Table, List, Code, Quote, Markdown)
- rendered_image - Optional full-page render (for scanned PDFs, slides with complex layouts)
- ExtractedImage - Image data with id, bytes, MIME type, dimensions, alt text, and LLM description
This structure is ideal for:
- Pagination-aware processing - Handle each page separately
- Image extraction - Access embedded images with their metadata
- Structured content - Work with tables, lists, headings programmatically
- LLM pipelines - Pass individual pages or content blocks to AI models
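
As one illustration of working with structured content, the headers and rows of a table block could be rendered back into a Markdown table. The sketch below is standalone and does not use the library's types: it assumes headers are a list of strings and rows are lists of string cells, which the examples above do not confirm, and table_to_markdown is a hypothetical helper.

```rust
/// Formats one table row, escaping pipe characters so cell text cannot
/// break the column layout.
fn format_row(cells: &[String]) -> String {
    let escaped: Vec<String> = cells.iter().map(|c| c.replace('|', "\\|")).collect();
    format!("| {} |\n", escaped.join(" | "))
}

/// Renders headers and rows as a Markdown table: a header row, a separator
/// row with one `---` per column, then one line per data row.
fn table_to_markdown(headers: &[String], rows: &[Vec<String>]) -> String {
    let mut out = format_row(headers);
    out.push_str(&format!("|{}\n", " --- |".repeat(headers.len())));
    for row in rows {
        out.push_str(&format_row(row));
    }
    out
}

fn main() {
    let headers = vec!["Name".to_string(), "Qty".to_string()];
    let rows = vec![
        vec!["Widget".to_string(), "3".to_string()],
        vec!["Left|Right".to_string(), "1".to_string()],
    ];
    print!("{}", table_to_markdown(&headers, &rows));
}
```

The sketch does not pad rows that have fewer cells than the header; a real pipeline would need to decide how to handle ragged tables.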
- 40+ new formats including legacy Office (.doc, .xls, .ppt), OpenDocument (.odt, .ods, .odp), Apple iWork (.pages, .numbers, .key)
- Archive support for ZIP, TAR, GZIP, BZIP2, XZ, ZSTD, and 7-Zip with automatic content extraction
- Additional formats: EPUB, vCard, iCalendar, BibTeX, log files, SQLite databases, email files
- Static compilation for compression libraries (bzip2, xz2) for better portability
- Improved file detection - prioritizes file extension over magic byte detection for legacy formats
- Template support for Office formats (.dotx, .potx, .xltx)
- LLM flexibility - works with any rig-core compatible model (OpenAI, Gemini, Claude, Cohere, custom providers)
- Comprehensive test suite using real-world files from Kreuzberg
- Tests for all supported formats with both file and bytes conversion
- In-memory test generation for compression formats
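
The extension-first detection mentioned in the list above matters because legacy .doc, .xls, and .ppt files all share the same OLE2 compound-file signature, so magic bytes alone cannot tell them apart. The following sketch shows the idea with hypothetical names (Format, detect, and the helper functions are not the library's API). For brevity it consults the extension first for every format, whereas the list above only states this priority for legacy formats.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Format {
    Doc,
    Xls,
    Ppt,
    Pdf,
    Zip,
    /// OLE2 container whose concrete type cannot be told from magic bytes alone.
    Ole2Unknown,
    Unknown,
}

// Signature shared by all OLE2 compound files (legacy Word, Excel, PowerPoint).
const OLE2_MAGIC: [u8; 8] = [0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1];

/// Maps a file name's extension (case-insensitive) to a format, if known.
fn from_extension(file_name: &str) -> Option<Format> {
    let ext = file_name.rsplit_once('.')?.1.to_ascii_lowercase();
    match ext.as_str() {
        "doc" => Some(Format::Doc),
        "xls" => Some(Format::Xls),
        "ppt" => Some(Format::Ppt),
        "pdf" => Some(Format::Pdf),
        "zip" => Some(Format::Zip),
        _ => None,
    }
}

/// Falls back to sniffing the leading bytes of the content.
fn from_magic(bytes: &[u8]) -> Format {
    if bytes.starts_with(&OLE2_MAGIC) {
        Format::Ole2Unknown
    } else if bytes.starts_with(b"%PDF-") {
        Format::Pdf
    } else if bytes.starts_with(b"PK\x03\x04") {
        Format::Zip
    } else {
        Format::Unknown
    }
}

/// Extension wins when it is recognized; magic bytes are the fallback.
fn detect(file_name: &str, bytes: &[u8]) -> Format {
    from_extension(file_name).unwrap_or_else(|| from_magic(bytes))
}

fn main() {
    println!("{:?}", detect("report.XLS", &OLE2_MAGIC));
    println!("{:?}", detect("upload.bin", b"%PDF-1.7"));
}
```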
You can extend MarkItDown by implementing the DocumentConverter trait for your custom converters and registering them:
use markitdown::{DocumentConverter, Document, ConversionOptions, MarkItDown};
use markitdown::error::MarkitdownError;
use async_trait::async_trait;
use bytes::Bytes;
use std::sync::Arc;
use object_store::ObjectStore;

struct MyCustomConverter;

#[async_trait]
impl DocumentConverter for MyCustomConverter {
    async fn convert(
        &self,
        store: Arc<dyn ObjectStore>,
        path: &object_store::path::Path,
        options: Option<ConversionOptions>,
    ) -> Result<Document, MarkitdownError> {
        // Implement file conversion logic
        todo!()
    }

    async fn convert_bytes(
        &self,
        bytes: Bytes,
        options: Option<ConversionOptions>,
    ) -> Result<Document, MarkitdownError> {
        // Implement bytes conversion logic
        todo!()
    }

    fn supported_extensions(&self) -> &[&str] {
        &[".custom"]
    }
}

let mut md = MarkItDown::new();
md.register_converter(Box::new(MyCustomConverter));

Contributions are welcome! Please feel free to submit issues or pull requests.
For detailed information, see:
- FORMATS.md - Complete reference of all 40+ supported formats with capabilities and limitations
- ARCHITECTURE.md - Internal design, converter pattern, and how to implement new formats
- TESTING.md - Comprehensive testing guide with 198+ test examples
- FORMAT_COVERAGE.md - Converter matrix with extensions and test locations
- API Documentation - Usage examples and API reference
- CLI Usage - Command-line tool guide
- Adding Formats - Extend with custom converters
- LLM Integration - Use AI for image descriptions
- Original Python implementation: microsoft/markitdown
- Test files from: Kreuzberg
MarkItDown is licensed under the MIT License. See LICENSE for more details.