PandocUltimateConverter

MediaWiki extension for importing documents/webpages into wiki pages and exporting wiki pages to external formats — powered by Pandoc.

Import: convert DOCX, ODT, PDF, DOC, or a webpage URL into a wiki page (with images)
Export: download wiki pages as DOCX, ODT, EPUB, PDF, HTML, RTF, or TXT
AI cleanup: optional LLM-powered post-conversion wikitext polish (OpenAI or Claude)
Confluence migration: mass-import an entire Confluence space (Cloud or Server) into the wiki

MediaWiki page: https://www.mediawiki.org/wiki/Extension:PandocUltimateConverter

Supported on MediaWiki 1.42–1.45, on both Windows and Linux.

Installation

Install Pandoc
Download the extension into your extensions/ folder
Add to LocalSettings.php:

wfLoadExtension( 'PandocUltimateConverter' );

$wgEnableUploads = true;
$wgFileExtensions[] = 'docx';
$wgFileExtensions[] = 'odt';
$wgFileExtensions[] = 'pdf';
$wgFileExtensions[] = 'doc';

// Only needed if Pandoc is not in PATH:
// $wgPandocUltimateConverter_PandocExecutablePath = 'C:\Program Files\Pandoc\pandoc.exe';

Optional dependencies (only needed for specific formats):

PDF import: poppler (pdftohtml) — see Installing poppler
Scanned PDF / OCR: Tesseract — see Installing Tesseract
DOC import and PDF export: LibreOffice — see Installing LibreOffice

Import (Special:PandocUltimateConverter)

Go to Special:PandocUltimateConverter to convert a file or URL into a wiki page.

Choose source: file upload or URL
Enter the target page name
Click convert — you'll be redirected to the new page

What happens during conversion:

Images are extracted and uploaded to the wiki automatically (duplicates are skipped)
The uploaded source file is removed after conversion
Temporary files are cleaned up

A legacy (non-Codex) form is available at Special:PandocUltimateConverter?codex=0.

AI Cleanup (LLM Polish)

The extension can optionally run an LLM (OpenAI or Claude) to clean up wikitext after conversion — fixing formatting issues, removing artefacts, and improving readability.

Setup

Add to LocalSettings.php:

$wgPandocUltimateConverter_LlmProvider = 'openai';   // or 'claude'
$wgPandocUltimateConverter_LlmApiKey   = 'sk-...';
// Optional: override the default model
// $wgPandocUltimateConverter_LlmModel = 'gpt-5.4-nano';   // OpenAI default; or 'claude-3-5-haiku-20241022' for Claude

Usage

There are two ways to use AI cleanup:

Batch mode — check the "Polish with AI" checkbox before clicking Convert all. Each item is converted first, then automatically queued for AI cleanup. The conversion queue and the AI cleanup queue run in parallel.
Per-item — click the ✨ button on any already-converted item to run AI cleanup on demand.

If AI cleanup fails, a per-item error is shown with a Retry button.

LLM Configuration

Parameter	Default	Description
`PandocUltimateConverter_LlmProvider`	`null`	`"openai"` or `"claude"`. Leave null to disable.
`PandocUltimateConverter_LlmApiKey`	`null`	API key for the configured provider.
`PandocUltimateConverter_LlmModel`	`null`	Model override. Defaults to `gpt-5.4-nano` (OpenAI) or `claude-3-5-haiku-20241022` (Claude).
`PandocUltimateConverter_LlmPrompt`	`null`	Custom system prompt for the cleanup step.
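
The fallback between `LlmProvider` and `LlmModel` described in the table can be read as a small lookup. The sketch below is Python rather than the extension's PHP, and `resolve_llm_model` is a hypothetical helper, not part of the extension; only the provider names and default model names are taken from the table.

```python
from typing import Optional

# Default model per provider, as listed in the table above.
DEFAULT_MODELS = {
    "openai": "gpt-5.4-nano",
    "claude": "claude-3-5-haiku-20241022",
}


def resolve_llm_model(provider: Optional[str], model: Optional[str] = None) -> Optional[str]:
    """Return the model name to use, or None when AI cleanup is disabled.

    Hypothetical helper: it mirrors the documented behaviour, not the
    extension's actual code.
    """
    if provider is None:
        return None  # LlmProvider left null: feature disabled
    if provider not in DEFAULT_MODELS:
        raise ValueError(f"unsupported provider: {provider!r}")
    # An explicit LlmModel override wins over the provider default.
    return model or DEFAULT_MODELS[provider]
```

With only a provider set, the provider default applies; setting `LlmModel` as well replaces it.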

Export (Special:PandocExport)

Export one or more wiki pages to an external document format.

Go to Special:PandocExport or use the Export action in the page tools menu (the same menu where "Delete" and "Move" appear).

Supported export formats: DOCX, ODT, EPUB, PDF, HTML, RTF, TXT.

Features:

Export a single page or multiple pages into one document
Export entire categories (subcategories are resolved recursively)
"Separate files" option bundles each page as an individual file in a ZIP archive
Images referenced in wikitext are embedded into the output document
PDF export uses a Pandoc → DOCX → LibreOffice pipeline (no LaTeX required)
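
Recursive category resolution can be pictured as a graph walk. The example below is a Python sketch, not the extension's code: the `CATEGORIES` dictionary and `pages_in_category` are made up for illustration, and the visited set is an assumption about how cycles between categories might be handled.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical in-memory wiki: category -> (member pages, subcategories).
CATEGORIES: Dict[str, Tuple[List[str], List[str]]] = {
    "Manuals": (["Overview"], ["Hardware", "Software"]),
    "Hardware": (["Printer", "Scanner"], []),
    "Software": (["Installer", "Overview"], ["Manuals"]),  # cycle back to the root
}


def pages_in_category(root: str) -> List[str]:
    """Collect pages from root and all nested subcategories, depth-first."""
    seen_categories: Set[str] = set()
    pages: List[str] = []

    def visit(category: str) -> None:
        if category in seen_categories:
            return  # already expanded; this also stops category cycles
        seen_categories.add(category)
        members, subcategories = CATEGORIES.get(category, ([], []))
        for page in members:
            if page not in pages:  # a page may sit in several categories
                pages.append(page)
        for sub in subcategories:
            visit(sub)

    visit(root)
    return pages


print(pages_in_category("Manuals"))
```

Each page appears once in the result even when it belongs to several categories, which matters when the pages are merged into a single output document.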

Demos

[Demo animations: file import, URL import, export to file]

Confluence Migration (Special:ConfluenceMigration)

Mass-migrate an entire Confluence space to this wiki in one operation.

Go to Special:ConfluenceMigration and fill in:

Field	Description
Confluence URL	Base URL of your Confluence instance (see below)
Space key	Key of the Confluence space to migrate (e.g. `DOCS`, `DEV`)
Email / Username	Your Confluence login email (Cloud) or username (Server)
API token / Password	API token (Cloud) or password / personal access token (Server)
Target page prefix	Optional prefix prepended to every page title, e.g. `Confluence/DOCS`
Overwrite existing pages	When checked, existing wiki pages are replaced
Auto-categorize	Creates MediaWiki categories mirroring the Confluence page hierarchy (checked by default)

Cloud vs. Server

	Confluence Cloud	Confluence Server / Data Center
Base URL	`https://yourcompany.atlassian.net`	`https://confluence.yourcompany.com`
Username field	Your Atlassian account email	Your Confluence username
Token field	Atlassian API token	Password or Personal Access Token

What gets migrated

All pages in the specified space are fetched via the Confluence REST API v1.
Page content (Confluence "storage format" HTML) is converted to MediaWiki wikitext using Pandoc.
Common Confluence macros (code blocks, info/note/warning/tip panels) are converted to their MediaWiki equivalents.
File attachments are downloaded from Confluence and uploaded to the MediaWiki file repository.
Pages are created with the edit summary "Imported from Confluence".
When auto-categorize is enabled, pages with sub-pages get a matching category; nested sub-pages produce nested categories.

How it runs

The migration is processed as a background job via the MediaWiki job queue. You do not have to keep your browser open. When the migration finishes you receive an Echo notification (requires the Echo extension).

Jobs are processed by maintenance/runJobs.php, or automatically during regular wiki requests when $wgJobRunRate is greater than 0 (the default is 1).

Disabling the feature

// LocalSettings.php
$wgPandocUltimateConverter_EnableConfluenceMigration = false;

Setting this to false hides Special:ConfluenceMigration entirely and displays a notice to users who navigate to it directly.

Supported import formats

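Any input format Pandoc can read is supported. DOC and PDF are not read by Pandoc directly and go through the extra pipelines listed below. Tested: DOCX, ODT, PDF, DOC.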

Format	Pipeline	Extra dependency
DOCX, ODT	Pandoc → wikitext	—
DOC	LibreOffice → DOCX → Pandoc	LibreOffice
PDF (text)	pdftohtml → HTML → Pandoc	poppler
PDF (scanned)	pdftoppm → Tesseract OCR → wikitext	poppler + Tesseract
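
The table can also be read as a dispatch on file extension. The following Python sketch only restates the table: `pipeline_for` and the `scanned` flag are hypothetical, and how the extension itself distinguishes text PDFs from scanned ones is not described here, so the caller supplies that decision.

```python
from pathlib import PurePath
from typing import List

# Stages per format, copied from the table above.
PIPELINES = {
    "docx": ["Pandoc"],
    "odt": ["Pandoc"],
    "doc": ["LibreOffice", "Pandoc"],
    "pdf": ["pdftohtml", "Pandoc"],
    "pdf-scanned": ["pdftoppm", "Tesseract OCR"],
}


def pipeline_for(filename: str, scanned: bool = False) -> List[str]:
    """Return the conversion stages for a file, based on its extension."""
    ext = PurePath(filename).suffix.lower().lstrip(".")
    key = "pdf-scanned" if ext == "pdf" and scanned else ext
    # Assumption: formats outside the table are handed straight to Pandoc.
    return PIPELINES.get(key, ["Pandoc"])


print(pipeline_for("Report.DOC"))
print(pipeline_for("scan.pdf", scanned=True))
```

The stage names double as a checklist of the optional dependencies a given upload needs.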

Configuration

All parameters are set in LocalSettings.php with the $wg prefix.

Parameter	Default	Description
`PandocUltimateConverter_PandocExecutablePath`	`null`	Path to the Pandoc binary. Not needed if Pandoc is in PATH.
`PandocUltimateConverter_TempFolderPath`	`null`	Temp folder for conversion files. Uses system default if not set.
`PandocUltimateConverter_PdfToHtmlExecutablePath`	`null`	Path to poppler's `pdftohtml`. Not needed if in PATH.
`PandocUltimateConverter_LibreOfficeExecutablePath`	`null`	Path to `soffice`/`libreoffice`. Not needed if in PATH.
`PandocUltimateConverter_TesseractExecutablePath`	`null`	Path to the Tesseract OCR binary. Not needed if in PATH.
`PandocUltimateConverter_OcrLanguage`	`"eng"`	Tesseract language code(s). Use `+` for multiple, e.g. `"eng+deu"`.
`PandocUltimateConverter_PandocCustomUserRight`	`""`	Restrict access to a specific user right.
`PandocUltimateConverter_MediaFileExtensionsToSkip`	`[]`	File extensions to skip during image upload (e.g. `["emf"]`).
`PandocUltimateConverter_FiltersToUse`	`[]`	Custom Pandoc Lua filters to apply. Must be in the `filters/` folder.
`PandocUltimateConverter_UseColorProcessors`	`false`	Preserve text/background colors from DOCX/ODT files.
`PandocUltimateConverter_ShowExportInPageTools`	`true`	Show "Export" in the page Actions menu.
`PandocUltimateConverter_LlmProvider`	`null`	LLM provider: `"openai"` or `"claude"`.
`PandocUltimateConverter_LlmApiKey`	`null`	API key for the LLM provider.
`PandocUltimateConverter_LlmModel`	`null`	Model name override.
`PandocUltimateConverter_LlmPrompt`	`null`	Custom system prompt for AI cleanup.
`PandocUltimateConverter_EnableConfluenceMigration`	`true`	Set to `false` to disable `Special:ConfluenceMigration`.

Built-in Lua filters

Filters are placed in the filters/ subfolder. Add them via:

$wgPandocUltimateConverter_FiltersToUse[] = 'increase_heading_level.lua';

Filter	Description
`increase_heading_level.lua`	Increase heading levels by 1 (useful when documents start at H1)
`colorize_mark_class.lua`	Highlight "mark" classes with yellow background

Installing optional dependencies

Installing poppler

Required for PDF import. If not installed, PDF files will fail to convert — all other formats work normally.

Linux:

sudo apt install poppler-utils          # Debian/Ubuntu
sudo dnf install poppler-utils          # RHEL/Fedora

Windows:

choco install poppler

Or download manually from https://github.com/oschwartz10612/poppler-windows/releases and add bin/ to PATH, or set:

$wgPandocUltimateConverter_PdfToHtmlExecutablePath = 'C:\poppler\Library\bin\pdftohtml.exe';

Installing Tesseract

Required for scanned PDF OCR. Also requires poppler (pdftoppm, installed with pdftohtml).

Linux:

sudo apt install tesseract-ocr          # Debian/Ubuntu
sudo apt install tesseract-ocr-deu      # optional: extra language data (German shown)
sudo dnf install tesseract              # RHEL/Fedora

Windows:

choco install tesseract

Or download from https://github.com/UB-Mannheim/tesseract/wiki and add to PATH, or set:

$wgPandocUltimateConverter_TesseractExecutablePath = 'C:\Program Files\Tesseract-OCR\tesseract.exe';

Installing LibreOffice

Required for DOC import and PDF export.

Linux:

sudo apt install libreoffice            # Debian/Ubuntu
sudo dnf install libreoffice            # RHEL/Fedora

Windows: Download from https://www.libreoffice.org/download/download/ and add the program/ folder to PATH, or set:

$wgPandocUltimateConverter_LibreOfficeExecutablePath = 'C:\Program Files\LibreOffice\program\soffice.exe';

Action API

The extension exposes three API modules. Write operations (pandocconvert, pandocllmpolish) require a CSRF token and POST.

Obtain a CSRF token first:

GET /api.php?action=query&meta=tokens&format=json

action=pandocconvert

Converts a file or URL to a wiki page. Requires a CSRF token and POST.

POST /api.php
action=pandocconvert&pagename=My Article&url=https://example.com&forceoverwrite=1&token=<csrf>&format=json

Response:

{ "pandocconvert": { "result": "success", "pagename": "My Article" } }

Parameter	Required	Description
`pagename`	yes	Target wiki page title
`filename`	one of	Uploaded file name (mutually exclusive with `url`)
`url`	one of	`http`/`https` URL to fetch (mutually exclusive with `filename`)
`forceoverwrite`	no	`1` to overwrite existing page (default: `0`)
`token`	yes	CSRF token
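
A client has to form-encode these parameters before posting them. The sketch below is a Python illustration under stated assumptions: `build_convert_params` is a hypothetical helper, and its client-side checks only mirror the documented `nosource`, `multiplesource` and `invalidurlscheme` errors; the server remains the authority.

```python
from typing import Dict, Optional
from urllib.parse import urlencode, urlsplit


def build_convert_params(
    pagename: str,
    token: str,
    filename: Optional[str] = None,
    url: Optional[str] = None,
    forceoverwrite: bool = False,
) -> Dict[str, str]:
    """Build the POST fields for action=pandocconvert (hypothetical helper)."""
    if filename is None and url is None:
        raise ValueError("nosource: supply filename or url")
    if filename is not None and url is not None:
        raise ValueError("multiplesource: supply only one of filename or url")
    if url is not None and urlsplit(url).scheme not in ("http", "https"):
        raise ValueError("invalidurlscheme: only http/https URLs are accepted")

    params = {"action": "pandocconvert", "format": "json", "pagename": pagename}
    if filename is not None:
        params["filename"] = filename
    else:
        params["url"] = url
    if forceoverwrite:
        params["forceoverwrite"] = "1"
    params["token"] = token  # CSRF tokens end in "+\" and must be percent-encoded
    return params


# "abc123+\\" stands in for a real token taken from query.tokens.csrftoken.
body = urlencode(build_convert_params("My Article", "abc123+\\", url="https://example.com"))
print(body)
```

The resulting string would be sent as the body of a POST to /api.php with `Content-Type: application/x-www-form-urlencoded`, using the same session cookies that were used to obtain the token.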

action=pandocllmpolish

Runs LLM AI cleanup on an existing wiki page's wikitext. Requires a CSRF token and POST. The LLM provider must be configured.

POST /api.php
action=pandocllmpolish&pagename=My Article&token=<csrf>&format=json

Response:

{ "pandocllmpolish": { "result": "success", "pagename": "My Article" } }

Parameter	Required	Description
`pagename`	yes	Title of existing wiki page to polish
`token`	yes	CSRF token

action=pandocurltitle

Fetches remote URLs and extracts the text of their HTML <title> elements. Used internally by the Codex UI to suggest page names for URL imports. GET request, no token required.

GET /api.php?action=pandocurltitle&urls=https://example.com&format=json

Response:

{ "pandocurltitle": { "results": [ { "url": "https://example.com", "title": "Example Domain" } ] } }

Accepts multiple URLs (pipe-separated). Only http/https URLs are accepted.

Parameter	Required	Description
`urls`	yes	One or more URLs (pipe-separated) to fetch titles from
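
Building the pipe-separated `urls` value and reading the result back can be sketched in Python as follows. Both function names are hypothetical, the response shape is taken from the example above, and treating a missing `title` as an empty string is an assumption, since the shape of failed lookups is not documented here.

```python
import json
from typing import Dict, Iterable
from urllib.parse import urlencode


def build_urltitle_query(urls: Iterable[str]) -> str:
    """Return the query string for action=pandocurltitle."""
    return urlencode({
        "action": "pandocurltitle",
        "urls": "|".join(urls),  # multiple values are pipe-separated
        "format": "json",
    })


def titles_by_url(response_text: str) -> Dict[str, str]:
    """Map each URL to its title, given a response shaped like the example."""
    data = json.loads(response_text)
    results = data["pandocurltitle"]["results"]
    # Assumption: entries without a title are mapped to an empty string.
    return {item["url"]: item.get("title", "") for item in results}


sample = '{ "pandocurltitle": { "results": [ { "url": "https://example.com", "title": "Example Domain" } ] } }'
print(build_urltitle_query(["https://example.com", "https://example.org"]))
print(titles_by_url(sample))
```

The query string is appended to `/api.php?`; the pipe is percent-encoded as `%7C` on the wire.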

API error codes

pandocconvert:

Code	Meaning
`nosource`	Neither `filename` nor `url` supplied
`multiplesource`	Both `filename` and `url` supplied
`invalidurlscheme`	URL is not `http`/`https`
`pageexists`	Page exists and `forceoverwrite` not set

pandocllmpolish:

Code	Meaning
`pagenotfound`	The specified page does not exist
`notconfigured`	LLM provider is not configured on this wiki
`notwikitext`	The page content is not wikitext
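
On the client side these codes are typically turned into exceptions. The Python sketch below assumes the modules report failures through MediaWiki's default error envelope, `{"error": {"code": ..., "info": ...}}`; `PandocApiError` and `unwrap` are hypothetical names, and the `info` text in the sample is invented for illustration.

```python
from typing import Any, Dict


class PandocApiError(Exception):
    """Hypothetical exception type carrying the API error code."""

    def __init__(self, code: str, info: str) -> None:
        super().__init__(f"{code}: {info}")
        self.code = code
        self.info = info


def unwrap(module: str, response: Dict[str, Any]) -> Dict[str, Any]:
    """Return the module's result object, or raise PandocApiError."""
    if "error" in response:
        error = response["error"]
        raise PandocApiError(error.get("code", "unknown"), error.get("info", ""))
    return response[module]


ok = {"pandocconvert": {"result": "success", "pagename": "My Article"}}
failed = {"error": {"code": "pageexists", "info": "sample message"}}

print(unwrap("pandocconvert", ok)["pagename"])
try:
    unwrap("pandocconvert", failed)
except PandocApiError as exc:
    if exc.code == "pageexists":
        print("retry with forceoverwrite=1, or pick another page name")
```

Branching on `code` rather than on the message text keeps the client independent of the wiki's interface language.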

Debugging

Add to LocalSettings.php:

$wgShowExceptionDetails = true;
$wgDebugLogGroups['PandocUltimateConverter'] = '/var/log/mediawiki/pandoc.log';

The extension logs diagnostic messages to the PandocUltimateConverter log group.
