A simple script that automatically converts Markdown files, specifically CoSAI's white papers, into nicely formatted PDFs. The heavy lifting is performed by pandoc, with a small Python script handling the nuances and corner cases that came up along the way. To run the tool:

python convert.py whitepaper.md whitepaper.pdf

The `convert.py` script takes a few optional parameters, though we try to minimize the need to use them.
usage: convert.py [-h] [--title TITLE] [--author AUTHOR] [--date DATE] [--version VERSION]
[--engine {tectonic,pdflatex,xelatex,lualatex}] [--debug]
input_file output_file
Convert Markdown to PDF with Mermaid support.
positional arguments:
input_file Path to input Markdown file
output_file Path to output PDF file
options:
-h, --help show this help message and exit
--title TITLE Document title
--author AUTHOR Document author(s)
--date DATE Document date
--version VERSION Version of the paper
--engine {tectonic,pdflatex,xelatex,lualatex}
LaTeX engine to use (default: tectonic)
--debug Save intermediate files (processed.md, .tex) and show verbose output
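For readers who want to wrap or script the tool, the interface above can be approximated with `argparse`. This is only a sketch consistent with the help text, not the parser that ships in `convert.py`; the `build_parser` name is hypothetical, and leaving the `--engine` default as `None` is an assumption made so that a later step could still fall back to other settings before landing on tectonic.

```python
import argparse

# Engine names as listed in the help text above.
ENGINES = ["tectonic", "pdflatex", "xelatex", "lualatex"]


def build_parser():
    """Build a parser that mirrors the documented convert.py options (illustrative)."""
    parser = argparse.ArgumentParser(
        prog="convert.py",
        description="Convert Markdown to PDF with Mermaid support.",
    )
    parser.add_argument("input_file", help="Path to input Markdown file")
    parser.add_argument("output_file", help="Path to output PDF file")
    parser.add_argument("--title", help="Document title")
    parser.add_argument("--author", help="Document author(s)")
    parser.add_argument("--date", help="Document date")
    parser.add_argument("--version", help="Version of the paper")
    # Assumption: None means "not given on the command line".
    parser.add_argument("--engine", choices=ENGINES, default=None,
                        help="LaTeX engine to use (default: tectonic)")
    parser.add_argument("--debug", action="store_true",
                        help="Save intermediate files and show verbose output")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

An unsupported engine name is rejected by `argparse` itself through `choices`, so no extra validation is needed for that flag.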
- Python: 3.12 or higher
- Pandoc: 3.9 or higher
- Node.js: 20 or higher (for Mermaid CLI)
- LaTeX engine: One of: tectonic (default), pdflatex, xelatex, or lualatex
If your project uses VS Code devcontainers, add this feature to your .devcontainer/devcontainer.json:
{
"image": "mcr.microsoft.com/devcontainers/base:debian",
"features": {
"ghcr.io/cosai-oasis/cosai-whitepaper-converter/whitepaper-converter:1": {}
}
}

This installs all dependencies automatically. See the Feature Documentation for configuration options (LaTeX engine selection, skipping components, etc.).
Clone this repository and open in VS Code with the Dev Containers extension. All dependencies are pre-configured.
For CI/CD pipelines or manual installation, see docs/installation.md.
The converter supports multiple LaTeX engines. The engine is determined by priority:
- CLI flag: `--engine pdflatex`
- Environment variable: `LATEX_ENGINE=xelatex`
- Config file: `converter_config.json` with `{"latex_engine": "lualatex"}`
- Default: `tectonic`
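The precedence above could be sketched in Python roughly as follows. This is an illustration rather than the code in `convert.py`: the `resolve_engine` function and its parameters are hypothetical, and it assumes the config file is looked up at the path passed in (by default, the current working directory).

```python
import json
import os
from pathlib import Path

SUPPORTED_ENGINES = ("tectonic", "pdflatex", "xelatex", "lualatex")


def resolve_engine(cli_engine=None, env=None, config_path="converter_config.json"):
    """Return the first engine found: CLI flag, env var, config file, default."""
    env = os.environ if env is None else env

    engine = cli_engine or env.get("LATEX_ENGINE")
    if not engine:
        path = Path(config_path)
        if path.is_file():
            # Assumption: the config file is a flat JSON object.
            config = json.loads(path.read_text(encoding="utf-8"))
            engine = config.get("latex_engine")
    engine = engine or "tectonic"

    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"Unsupported LaTeX engine: {engine}")
    return engine
```

Passing `env` explicitly keeps the function easy to test without touching the real environment.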
| Engine | Unicode Support | Speed | Notes |
|---|---|---|---|
| tectonic | Full | Fast | Default, auto-downloads packages |
| xelatex | Full | Medium | Good for complex fonts |
| lualatex | Full | Slow | Most flexible |
| pdflatex | Limited | Fast | Requires Unicode normalization |
There are a few conventions we can use to simplify things. First, YAML metadata headers for all Markdown files can be automatically processed by pandoc, e.g.,
---
title: A Very Good White Paper
author: V. S. People
date: 1 January 2026
---
# Whitepaper Title
Lorem ipsum dolor sit amet consectetur adipiscing elit scelerisque semper felis gravida, pretium urna ornare facilisis est habitant tellus arcu euismod sodales egestas nibh, tincidunt cursus faucibus ultrices proin potenti facilisi magnis ligula blandit. Cursus penatibus per aptent placerat euismod mus lectus pharetra morbi, nascetur felis blandit sollicitudin bibendum etiam sed fames, nec facilisis ac tempus tempor sem venenatis vel. Est arcu at iaculis sed tellus nam nascetur primis nibh etiam odio penatibus, dis integer nostra euismod consequat interdum sociis parturient habitant ornare sagittis, morbi per dictumst enim purus justo fusce feugiat leo facilisis mauris.

The converter can be embedded in other repositories as a git submodule, allowing whitepapers to live alongside their source code while sharing a common conversion toolchain. This is the recommended approach for organizations managing multiple documents.
# Add the converter as a submodule (conventionally named latex-template/)
git submodule add https://github.com/cosai-oasis/cosai-whitepaper-converter.git latex-template
# Initialize and fetch the submodule
git submodule update --init --recursive

A Makefile can automate builds and pass dynamic values like git commit hashes as the document version:
MDs := whitepaper-1.md \
whitepaper-2.md
PDFs := $(MDs:.md=.pdf)
all: $(PDFs)
%.pdf: %.md
@echo "Converting $< → $@"
python latex-template/convert.py $< $@ --version=$$(git log -1 --format=%H $<)
.PHONY: clean
clean:
@rm -f $(PDFs)

The converter automatically detects when it is running from a latex-template/ subdirectory and locates its assets accordingly.
For detailed submodule setup instructions, see docs/installation.md.
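Where `make` is not available, the same build step can be approximated in Python. The sketch below is not part of the converter; the helper names `last_commit_hash` and `build_command` are hypothetical, and it assumes the submodule lives in `latex-template/` and that the script is run from the repository root.

```python
import subprocess
import sys
from pathlib import Path

# Assumption: the converter was added as a submodule named latex-template/.
CONVERTER = Path("latex-template") / "convert.py"


def last_commit_hash(path):
    """Return the hash of the most recent commit that touched `path`."""
    result = subprocess.run(
        ["git", "log", "-1", "--format=%H", "--", str(path)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def build_command(md_path, version):
    """Build the convert.py command line for one Markdown file."""
    md_path = Path(md_path)
    pdf_path = md_path.with_suffix(".pdf")
    return [sys.executable, str(CONVERTER), str(md_path), str(pdf_path),
            f"--version={version}"]


if __name__ == "__main__":
    # Usage: python build.py whitepaper-1.md whitepaper-2.md
    for name in sys.argv[1:]:
        print(f"Converting {name}")
        subprocess.run(build_command(name, last_commit_hash(name)), check=True)
```

Unlike the Makefile, this version rebuilds every file it is given; it does not compare timestamps.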
- rsvg-convert: For converting SVG to PDF (used internally by Pandoc for Mermaid diagrams)
Testing several markdown files has revealed a few best practices or things to note:
- `convert.py` omits a manually included table of contents in favor of the LaTeX-generated TOC. As a result, any reference to a numbered section will break, e.g., `[Link](#123-Subsection-Link)`. A link to the section itself, `#Subsection-Link`, will still work. A consistent section name should be used.
- Anchor tags can be used to include links, which pandoc will preserve.
- White space at the end of lines, such as in lists, causes Markdown to render them as separate sections (with extra vertical whitespace) and will mess up PDF formatting. The `convert.py` script strips any line-ending whitespace globally.
- Mermaid diagrams with a title will have that title extracted and stripped so that it can be used as the LaTeX `\figure` caption. Mermaid is converted to PDF and will have a consistent CoSAI theme applied unless a theme is already present in the metadata header (the `config:` section).
- Any break, `<br>` and its variants, is converted to `\newline`. An errant `<br>` will break the LaTeX compilation. If you get an error about `\newline`, look for stray `<br>` tags first.
- Pandoc attributes like `{width=55%}` and raw LaTeX commands like `\newpage` can be wrapped in HTML comments (`<!--{width=55%}-->`, `<!--\newpage-->`) to hide them from GitHub rendering. The converter automatically strips the comment wrapper during PDF conversion so they take effect in the LaTeX output.
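Three of the text-level rules above (unwrapping commented attributes, converting breaks, stripping trailing whitespace) can be illustrated with a short Python sketch. It is an approximation for explanation only, not the logic in `convert.py`: the `preprocess` name and both regular expressions are assumptions, and the real script may handle more cases.

```python
import re

# <br>, <br/>, <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)
# HTML comments whose body is a Pandoc attribute block or a single LaTeX command.
WRAPPED = re.compile(r"<!--\s*(\{[^}]*\}|\\[A-Za-z]+)\s*-->")


def preprocess(markdown):
    """Apply the three text-level clean-ups to a Markdown string."""
    # Unwrap <!--{width=55%}--> and <!--\newpage--> so LaTeX sees the content.
    text = WRAPPED.sub(lambda m: m.group(1), markdown)
    # Replace line-break tags with a LaTeX line break.
    text = BR_TAG.sub(lambda m: "\\newline ", text)
    # Strip trailing whitespace from every line.
    return "\n".join(line.rstrip() for line in text.splitlines())


if __name__ == "__main__":
    sample = "Item one   \nA<br/>B <!--{width=55%}-->\n<!--\\newpage-->"
    print(preprocess(sample))
```

Ordinary comments such as `<!-- note -->` are left alone because they match neither alternative in `WRAPPED`.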
- Bold text in tables not rendering: Fixed by upgrading Pandoc to 3.9. Earlier versions occasionally failed to convert `**bold**` to `\textbf` inside Markdown tables.
- "No counter '' defined": Fixed by upgrading Pandoc to 3.8.2.1. Pandoc 3.8.1 emitted `\def\LTcaptype{}` for uncaptioned tables, which broke with the `caption` package. If you see this error, upgrade Pandoc.
- Unicode characters: If using pdflatex, Unicode characters are auto-converted. For full Unicode support, use `--engine tectonic` or `--engine xelatex`.
- Missing packages: tectonic auto-downloads packages; other engines may require manual installation via `tlmgr` or the system package manager.
- `\newline` errors: Often caused by errant `<br>` tags in unexpected places. Check your Markdown source.
- Ensure Node.js 20+ is installed
- Check that `npx` is available in your PATH
- Mermaid CLI is run via `npx -y @mermaid-js/mermaid-cli`
Use --debug to save intermediate files and see verbose engine output:
python convert.py input.md output.pdf --debug

This produces output_debug.md (preprocessed Markdown) and output_debug.tex (intermediate LaTeX) alongside the PDF.
- Verify Pandoc is installed: `pandoc --version` (must be 3.9+)
- Ensure Pandoc is in your PATH
- The devcontainer includes Pandoc pre-installed
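The PATH and version checks above can be automated with a small script. This is a minimal sketch, not something shipped with the converter: the function names are hypothetical, and it assumes the first line of `pandoc --version` has the form `pandoc 3.9` (the program name followed by a dotted version number).

```python
import re
import shutil
import subprocess

MIN_PANDOC = (3, 9)  # minimum version stated in the requirements


def parse_pandoc_version(version_output):
    """Extract the version tuple from `pandoc --version` output, or None."""
    lines = version_output.splitlines()
    first_line = lines[0] if lines else ""
    match = re.match(r"pandoc(?:\.exe)?\s+(\d+(?:\.\d+)*)", first_line)
    if not match:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def pandoc_is_recent_enough(version_output, minimum=MIN_PANDOC):
    """True when the reported version is at least `minimum`."""
    version = parse_pandoc_version(version_output)
    return version is not None and version >= minimum


def check_tools():
    """Return a list of human-readable problems; an empty list means all good."""
    problems = [f"{tool} not found on PATH"
                for tool in ("pandoc", "npx") if shutil.which(tool) is None]
    if shutil.which("pandoc") is not None:
        output = subprocess.run(["pandoc", "--version"],
                                capture_output=True, text=True).stdout
        if not pandoc_is_recent_enough(output):
            problems.append("pandoc is older than 3.9")
    return problems


if __name__ == "__main__":
    for problem in check_tools():
        print(problem)
```

Comparing tuples of integers avoids the string-comparison trap in which "3.10" would sort before "3.9".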
For more details, see docs/troubleshooting.md.