Commit bc4d650

Merge pull request #19 from Griboedow/copilot/consider-claude-and-openai
Add optional LLM polishing step after Pandoc conversion (Claude + OpenAI)
2 parents 7d6ed68 + 3f2efd1 commit bc4d650

33 files changed, with 1382 additions and 109 deletions.

README.md

Lines changed: 95 additions & 12 deletions
Original file line number | Diff line number | Diff line change
@@ -4,6 +4,7 @@ MediaWiki extension for **importing** documents/webpages into wiki pages and **e
44

55
- **Import**: convert DOCX, ODT, PDF, DOC, or a webpage URL into a wiki page (with images)
66
- **Export**: download wiki pages as DOCX, ODT, EPUB, PDF, HTML, RTF, or TXT
7+
- **AI cleanup**: optional LLM-powered post-conversion wikitext polish (OpenAI or Claude)
78

89
MediaWiki page: https://www.mediawiki.org/wiki/Extension:PandocUltimateConverter
910

@@ -47,9 +48,40 @@ What happens during conversion:
4748
- Images are extracted and uploaded to the wiki automatically (duplicates are skipped)
4849
- The uploaded source file is removed after conversion
4950
- Temporary files are cleaned up
50-
5151
A legacy (non-Codex) form is available at `Special:PandocUltimateConverter?codex=0`.
5252

53+
## AI Cleanup (LLM Polish)
54+
55+
The extension can optionally run an LLM (OpenAI or Claude) to clean up wikitext after conversion — fixing formatting issues, removing artefacts, and improving readability.
56+
57+
### Setup
58+
59+
Add to `LocalSettings.php`:
60+
```php
61+
$wgPandocUltimateConverter_LlmProvider = 'openai'; // or 'claude'
62+
$wgPandocUltimateConverter_LlmApiKey = 'sk-...';
63+
// Optional: override the default model
64+
// $wgPandocUltimateConverter_LlmModel = 'gpt-5.4-nano'; // OpenAI default; or 'claude-3-5-haiku-20241022' for Claude
65+
```
66+
67+
### Usage
68+
69+
There are two ways to use AI cleanup:
70+
71+
1. **Batch mode** — check the "Polish with AI" checkbox before clicking **Convert all**. Each item is converted first, then automatically queued for AI cleanup. The conversion queue and the AI cleanup queue run in parallel.
72+
2. **Per-item** — click the ✨ button on any already-converted item to run AI cleanup on demand.
73+
74+
If AI cleanup fails, a per-item error is shown with a **Retry** button.
75+
76+
### LLM Configuration
77+
78+
| Parameter | Default | Description |
79+
|-----------|---------|-------------|
80+
| `PandocUltimateConverter_LlmProvider` | `null` | `"openai"` or `"claude"`. Leave null to disable. |
81+
| `PandocUltimateConverter_LlmApiKey` | `null` | API key for the configured provider. |
82+
| `PandocUltimateConverter_LlmModel` | `null` | Model override. Defaults to `gpt-5.4-nano` (OpenAI) or `claude-3-5-haiku-20241022` (Claude). |
83+
| `PandocUltimateConverter_LlmPrompt` | `null` | Custom system prompt for the cleanup step. |
84+
5385
## Export (Special:PandocExport)
5486

5587
Export one or more wiki pages to an external document format.
@@ -105,6 +137,10 @@ All parameters are set in `LocalSettings.php` with the `$wg` prefix.
105137
| `PandocUltimateConverter_FiltersToUse` | `[]` | Custom [Pandoc Lua filters](https://pandoc.org/filters.html) to apply. Must be in the `filters/` folder. |
106138
| `PandocUltimateConverter_UseColorProcessors` | `false` | Preserve text/background colors from DOCX/ODT files. |
107139
| `PandocUltimateConverter_ShowExportInPageTools` | `true` | Show "Export" in the page Actions menu. |
140+
| `PandocUltimateConverter_LlmProvider` | `null` | LLM provider: `"openai"` or `"claude"`. |
141+
| `PandocUltimateConverter_LlmApiKey` | `null` | API key for the LLM provider. |
142+
| `PandocUltimateConverter_LlmModel` | `null` | Model name override. |
143+
| `PandocUltimateConverter_LlmPrompt` | `null` | Custom system prompt for AI cleanup. |
108144

109145
### Built-in Lua filters
110146

@@ -176,26 +212,26 @@ $wgPandocUltimateConverter_LibreOfficeExecutablePath = 'C:\Program Files\LibreOf
176212

177213
## Action API
178214

179-
The extension exposes `action=pandocconvert` for programmatic conversions.
215+
The extension exposes three API modules. Write operations (`pandocconvert`, `pandocllmpolish`) require a CSRF token and POST.
180216

181-
Requires a CSRF token and POST. Obtain a token:
217+
Obtain a CSRF token first:
182218
```
183219
GET /api.php?action=query&meta=tokens&format=json
184220
```
185221

186-
**Convert a URL:**
187-
```
188-
POST /api.php
189-
action=pandocconvert&url=https://example.com&pagename=My Article&forceoverwrite=1&token=<csrf>&format=json
190-
```
222+
### action=pandocconvert
223+
224+
Converts a file or URL to a wiki page. Requires a CSRF token and POST.
191225

192-
**Convert an uploaded file:**
193226
```
194227
POST /api.php
195-
action=pandocconvert&filename=Document.docx&pagename=My Article&forceoverwrite=1&token=<csrf>&format=json
228+
action=pandocconvert&pagename=My Article&url=https://example.com&forceoverwrite=1&token=<csrf>&format=json
196229
```
197230

198-
### API parameters
231+
**Response:**
232+
```json
233+
{ "pandocconvert": { "result": "success", "pagename": "My Article" } }
234+
```
199235

200236
| Parameter | Required | Description |
201237
|-----------|----------|-------------|
@@ -205,15 +241,62 @@ action=pandocconvert&filename=Document.docx&pagename=My Article&forceoverwrite=1
205241
| `forceoverwrite` | no | `1` to overwrite existing page (default: `0`) |
206242
| `token` | yes | CSRF token |
207243

244+
### action=pandocllmpolish
245+
246+
Runs LLM AI cleanup on an existing wiki page's wikitext. Requires a CSRF token and POST. The LLM provider must be [configured](#llm-configuration).
247+
248+
```
249+
POST /api.php
250+
action=pandocllmpolish&pagename=My Article&token=<csrf>&format=json
251+
```
252+
253+
**Response:**
254+
```json
255+
{ "pandocllmpolish": { "result": "success", "pagename": "My Article" } }
256+
```
257+
258+
| Parameter | Required | Description |
259+
|-----------|----------|-------------|
260+
| `pagename` | yes | Title of existing wiki page to polish |
261+
| `token` | yes | CSRF token |
262+
263+
### action=pandocurltitle
264+
265+
Fetches remote URLs and extracts their HTML `<title>` tags. Used internally by the Codex UI to suggest page names for URL imports. GET request, no token required.
266+
267+
```
268+
GET /api.php?action=pandocurltitle&urls=https://example.com&format=json
269+
```
270+
271+
**Response:**
272+
```json
273+
{ "pandocurltitle": { "results": [ { "url": "https://example.com", "title": "Example Domain" } ] } }
274+
```
275+
276+
Accepts multiple URLs (pipe-separated). Only `http`/`https` URLs are accepted.
277+
278+
| Parameter | Required | Description |
279+
|-----------|----------|-------------|
280+
| `urls` | yes | One or more URLs (pipe-separated) to fetch titles from |
281+
208282
### API error codes
209283

284+
**pandocconvert:**
285+
210286
| Code | Meaning |
211287
|------|---------|
212288
| `nosource` | Neither `filename` nor `url` supplied |
213289
| `multiplesource` | Both `filename` and `url` supplied |
214290
| `invalidurlscheme` | URL is not `http`/`https` |
215291
| `pageexists` | Page exists and `forceoverwrite` not set |
216-
| `conversionfailed` | Pandoc conversion failed |
292+
293+
**pandocllmpolish:**
294+
295+
| Code | Meaning |
296+
|------|---------|
297+
| `pagenotfound` | The specified page does not exist |
298+
| `notconfigured` | LLM provider is not configured on this wiki |
299+
| `notwikitext` | The page content is not wikitext |
217300

218301
## Debugging
219302
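
The request shapes documented in the README hunk above can be sketched client-side. This is a minimal illustration, not code from the commit: the helper names `build_pandocconvert_body` and `build_pandocllmpolish_body` are hypothetical, and the local checks only mirror the documented `nosource`, `multiplesource`, and `invalidurlscheme` error codes on the assumption that the server rejects the same inputs.

```python
from urllib.parse import urlencode, urlparse


def build_pandocconvert_body(pagename, token, filename=None, url=None,
                             forceoverwrite=False):
    """Build the form-encoded POST body for action=pandocconvert.

    Hypothetical helper: raises ValueError named after the README's
    error codes when the source arguments look invalid.
    """
    if filename is None and url is None:
        raise ValueError("nosource")
    if filename is not None and url is not None:
        raise ValueError("multiplesource")
    if url is not None and urlparse(url).scheme not in ("http", "https"):
        raise ValueError("invalidurlscheme")

    params = {
        "action": "pandocconvert",
        "pagename": pagename,
        "forceoverwrite": "1" if forceoverwrite else "0",
        "token": token,
        "format": "json",
    }
    # Exactly one of filename/url is set at this point.
    if filename is not None:
        params["filename"] = filename
    else:
        params["url"] = url
    return urlencode(params)


def build_pandocllmpolish_body(pagename, token):
    """Build the form-encoded POST body for action=pandocllmpolish."""
    return urlencode({
        "action": "pandocllmpolish",
        "pagename": pagename,
        "token": token,
        "format": "json",
    })
```

`urlencode` percent-encodes every value, which matters for the CSRF token because MediaWiki tokens end in `+\`. The resulting string would be sent as the body of `POST /api.php` with an authenticated session.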

extension.json

Lines changed: 34 additions & 3 deletions
@@ -1,6 +1,6 @@
11
{
22
"name": "PandocUltimateConverter",
3-
"version": "0.5.0",
3+
"version": "0.6.0",
44
"author": [
55
"[https://www.mediawiki.org/wiki/User:Urfiner Urfiner] (Nikolai Kochkin)"
66
],
@@ -20,6 +20,7 @@
2020
},
2121
"APIModules": {
2222
"pandocconvert": "MediaWiki\\Extension\\PandocUltimateConverter\\Api\\ApiPandocConvert",
23+
"pandocllmpolish": "MediaWiki\\Extension\\PandocUltimateConverter\\Api\\ApiPandocLlmPolish",
2324
"pandocurltitle": "MediaWiki\\Extension\\PandocUltimateConverter\\Api\\ApiPandocUrlTitle"
2425
},
2526
"ExtensionMessagesFiles": {
@@ -95,7 +96,8 @@
9596
"pandocultimateconverter-codex-overwrite-toggle",
9697
"pandocultimateconverter-codex-status-uploading",
9798
"pandocultimateconverter-codex-status-converting",
98-
"pandocultimateconverter-codex-status-done",
99+
"pandocultimateconverter-codex-status-done-converted",
100+
"pandocultimateconverter-codex-status-done-polished",
99101
"pandocultimateconverter-codex-status-error",
100102
"pandocultimateconverter-codex-retry",
101103
"pandocultimateconverter-codex-column-source",
@@ -109,7 +111,12 @@
109111
"pandocultimateconverter-codex-convert-one",
110112
"pandocultimateconverter-codex-switch-classic",
111113
"pandocultimateconverter-codex-stop",
112-
"pandocultimateconverter-codex-stop-requested"
114+
"pandocultimateconverter-codex-stop-requested",
115+
"pandocultimateconverter-codex-llm-polish-toggle",
116+
"pandocultimateconverter-codex-llm-polish-btn",
117+
"pandocultimateconverter-codex-status-polishing",
118+
"pandocultimateconverter-codex-status-polish-error",
119+
"pandocultimateconverter-conversion-complete-comment"
113120
]
114121
},
115122
"ext.PandocUltimateConverter.export": {
@@ -211,6 +218,30 @@
211218
"path": false,
212219
"description": "When true, show an 'Export' action in the page tools menu (Actions tab) on content pages.",
213220
"public": true
221+
},
222+
"PandocUltimateConverter_LlmProvider": {
223+
"value": null,
224+
"path": false,
225+
"description": "LLM provider to use for optional wikitext cleanup. Supported values: 'openai' or 'claude'. Leave null to disable LLM polishing.",
226+
"public": false
227+
},
228+
"PandocUltimateConverter_LlmApiKey": {
229+
"value": null,
230+
"path": false,
231+
"description": "API key for the configured LLM provider. Required when LlmProvider is set.",
232+
"public": false
233+
},
234+
"PandocUltimateConverter_LlmModel": {
235+
"value": null,
236+
"path": false,
237+
"description": "Model name to use for LLM polishing. Defaults to 'gpt-5.4-nano' for OpenAI and 'claude-3-5-haiku-20241022' for Claude.",
238+
"public": false
239+
},
240+
"PandocUltimateConverter_LlmPrompt": {
241+
"value": null,
242+
"path": false,
243+
"description": "Custom instruction prompt for the LLM cleanup step. Uses a sensible default if not set.",
244+
"public": false
214245
}
215246
},
216247
"ConfigRegistry": {
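
The four `Llm*` settings added above interact as their descriptions state: a null provider disables polishing, the API key is required once a provider is set, and the model falls back to a per-provider default. The sketch below restates that logic in Python for illustration only. `resolve_llm_settings` is a hypothetical name, and raising on an unsupported provider or a missing key is an assumption; the diff does not show how the extension itself reacts in those cases.

```python
# Default models as stated in the PandocUltimateConverter_LlmModel description.
DEFAULT_MODELS = {
    "openai": "gpt-5.4-nano",
    "claude": "claude-3-5-haiku-20241022",
}


def resolve_llm_settings(provider, api_key, model=None):
    """Return (provider, model), or None when LLM polishing is disabled."""
    if provider is None:
        # "Leave null to disable LLM polishing."
        return None
    if provider not in DEFAULT_MODELS:
        # Assumption: unsupported values are treated as a configuration error.
        raise ValueError(f"unsupported provider: {provider!r}")
    if not api_key:
        # "Required when LlmProvider is set."
        raise ValueError("LlmApiKey is required when LlmProvider is set")
    return provider, model or DEFAULT_MODELS[provider]
```

For example, `resolve_llm_settings("claude", "key")` yields the Claude default model, while passing a third argument overrides it.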

i18n/ar.json

Lines changed: 13 additions & 0 deletions
@@ -39,6 +39,8 @@
3939
"apihelp-pandocconvert-param-forceoverwrite": "إذا تم التعيين، فسيتم استبدال الصفحة المستهدفة إن وُجدت.",
4040
"apihelp-pandocconvert-example-file": "تحويل ملف مرفوع مسبقاً باسم Document.docx إلى الصفحة MyArticle.",
4141
"apihelp-pandocconvert-example-url": "تحويل المحتوى من https://example.com إلى الصفحة MyArticle.",
42+
"apihelp-pandocllmpolish-summary": "تشغيل تنظيف الذكاء الاصطناعي (LLM) على نص الويكي لصفحة موجودة.",
43+
"apihelp-pandocllmpolish-param-pagename": "عنوان صفحة الويكي الموجودة المراد تحسينها.",
4244
"pandocultimateconverter-codex-description": "أضف ملفات أو روابط URL لتحويلها إلى صفحات ويكي. يمكنك معالجة عناصر متعددة دفعةً واحدة.",
4345
"pandocultimateconverter-codex-switch-classic": "التبديل إلى النموذج الكلاسيكي",
4446
"pandocultimateconverter-codex-tab-files": "الملفات",
@@ -54,9 +56,20 @@
5456
"pandocultimateconverter-codex-convert-all": "تحويل الكل",
5557
"pandocultimateconverter-codex-clear-all": "مسح الكل",
5658
"pandocultimateconverter-codex-overwrite-toggle": "السماح بالكتابة فوق الصفحات الموجودة",
59+
"pandocultimateconverter-codex-llm-polish-toggle": "تحسين بالذكاء الاصطناعي (تنظيف LLM)",
60+
"pandocultimateconverter-codex-llm-polish-btn": "تنظيف الذكاء الاصطناعي",
61+
"pandocultimateconverter-codex-status-polishing": "جارٍ التنظيف بالذكاء الاصطناعي…",
62+
"pandocultimateconverter-codex-status-polish-error": "فشل تنظيف الذكاء الاصطناعي: $1",
63+
"pandocultimateconverter-llmpolish-comment": "تنظيف الذكاء الاصطناعي (LLM)",
64+
"apierror-pandocllmpolish-pagenotfound": "الصفحة غير موجودة: $1",
65+
"apierror-pandocllmpolish-notconfigured": "لم يتم تهيئة تنظيف LLM على هذه الويكي.",
66+
"apierror-pandocllmpolish-notwikitext": "محتوى الصفحة ليس نص ويكي.",
67+
"apierror-pandocllmpolish-failed": "فشل تنظيف LLM: $1",
5768
"pandocultimateconverter-codex-status-uploading": "جارٍ الرفع…",
5869
"pandocultimateconverter-codex-status-converting": "جارٍ التحويل…",
5970
"pandocultimateconverter-codex-status-done": "تمّ",
71+
"pandocultimateconverter-codex-status-done-converted": "اكتمل التحويل",
72+
"pandocultimateconverter-codex-status-done-polished": "اكتمل تنظيف الذكاء الاصطناعي",
6073
"pandocultimateconverter-codex-status-error": "خطأ: $1",
6174
"pandocultimateconverter-codex-retry": "إعادة المحاولة",
6275
"pandocultimateconverter-codex-column-source": "المصدر",

i18n/cs.json

Lines changed: 13 additions & 0 deletions
@@ -39,6 +39,8 @@
3939
"apihelp-pandocconvert-param-forceoverwrite": "Pokud je nastaveno, přepíše cílovou stránku, pokud již existuje.",
4040
"apihelp-pandocconvert-example-file": "Převést již nahraný soubor Document.docx na stránku MyArticle.",
4141
"apihelp-pandocconvert-example-url": "Převést obsah z https://example.com na stránku MyArticle.",
42+
"apihelp-pandocllmpolish-summary": "Spustit vyčistění wikitextu existující stránky pomocí AI (LLM).",
43+
"apihelp-pandocllmpolish-param-pagename": "Název existující wiki stránky k vylepšení.",
4244
"pandocultimateconverter-codex-description": "Přidejte soubory nebo URL adresy k převodu na wiki stránky. Můžete zpracovat více položek najednou.",
4345
"pandocultimateconverter-codex-switch-classic": "Přepnout na klasický formulář",
4446
"pandocultimateconverter-codex-tab-files": "Soubory",
@@ -54,9 +56,20 @@
5456
"pandocultimateconverter-codex-convert-all": "Převést vše",
5557
"pandocultimateconverter-codex-clear-all": "Vymazat vše",
5658
"pandocultimateconverter-codex-overwrite-toggle": "Povolit přepsání existujících stránek",
59+
"pandocultimateconverter-codex-llm-polish-toggle": "Vylepšit pomocí AI (čištění LLM)",
60+
"pandocultimateconverter-codex-llm-polish-btn": "Čištění AI",
61+
"pandocultimateconverter-codex-status-polishing": "Čistění AI…",
62+
"pandocultimateconverter-codex-status-polish-error": "Čištění AI selhalo: $1",
63+
"pandocultimateconverter-llmpolish-comment": "Čištění AI (LLM)",
64+
"apierror-pandocllmpolish-pagenotfound": "Stránka nenalezena: $1",
65+
"apierror-pandocllmpolish-notconfigured": "Čištění LLM není nakonfigurováno na této wiki.",
66+
"apierror-pandocllmpolish-notwikitext": "Obsah stránky není wikitext.",
67+
"apierror-pandocllmpolish-failed": "Čištění LLM selhalo: $1",
5768
"pandocultimateconverter-codex-status-uploading": "Nahrávání…",
5869
"pandocultimateconverter-codex-status-converting": "Převádění…",
5970
"pandocultimateconverter-codex-status-done": "Hotovo",
71+
"pandocultimateconverter-codex-status-done-converted": "Převod dokončen",
72+
"pandocultimateconverter-codex-status-done-polished": "Čištění AI dokončeno",
6073
"pandocultimateconverter-codex-status-error": "Chyba: $1",
6174
"pandocultimateconverter-codex-retry": "Zkusit znovu",
6275
"pandocultimateconverter-codex-column-source": "Zdroj",

i18n/de.json

Lines changed: 13 additions & 0 deletions
@@ -39,6 +39,8 @@
3939
"apihelp-pandocconvert-param-forceoverwrite": "Wenn gesetzt, wird die Zielseite überschrieben, falls sie bereits existiert.",
4040
"apihelp-pandocconvert-example-file": "Eine bereits hochgeladene Datei namens Document.docx in die Seite MyArticle konvertieren.",
4141
"apihelp-pandocconvert-example-url": "Den Inhalt von https://example.com in die Seite MyArticle konvertieren.",
42+
"apihelp-pandocllmpolish-summary": "KI-Bereinigung des Wikitexts einer bestehenden Wiki-Seite durchführen.",
43+
"apihelp-pandocllmpolish-param-pagename": "Titel der bestehenden Wiki-Seite, die bereinigt werden soll.",
4244
"pandocultimateconverter-codex-description": "Dateien oder URLs hinzufügen, um sie in Wiki-Seiten zu konvertieren. Sie können mehrere Elemente gleichzeitig verarbeiten.",
4345
"pandocultimateconverter-codex-switch-classic": "Zum klassischen Formular wechseln",
4446
"pandocultimateconverter-codex-tab-files": "Dateien",
@@ -54,9 +56,20 @@
5456
"pandocultimateconverter-codex-convert-all": "Alle konvertieren",
5557
"pandocultimateconverter-codex-clear-all": "Alle entfernen",
5658
"pandocultimateconverter-codex-overwrite-toggle": "Überschreiben bestehender Seiten erlauben",
59+
"pandocultimateconverter-codex-llm-polish-toggle": "Mit KI überarbeiten (LLM-Bereinigung)",
60+
"pandocultimateconverter-codex-llm-polish-btn": "KI-Bereinigung",
61+
"pandocultimateconverter-codex-status-polishing": "KI-Bereinigung…",
62+
"pandocultimateconverter-codex-status-polish-error": "KI-Bereinigung fehlgeschlagen: $1",
63+
"pandocultimateconverter-llmpolish-comment": "KI-Bereinigung (LLM)",
64+
"apierror-pandocllmpolish-pagenotfound": "Seite nicht gefunden: $1",
65+
"apierror-pandocllmpolish-notconfigured": "Die KI-Bereinigung ist für dieses Wiki nicht konfiguriert.",
66+
"apierror-pandocllmpolish-notwikitext": "Der Seiteninhalt ist kein Wikitext.",
67+
"apierror-pandocllmpolish-failed": "KI-Bereinigung fehlgeschlagen: $1",
5768
"pandocultimateconverter-codex-status-uploading": "Wird hochgeladen…",
5869
"pandocultimateconverter-codex-status-converting": "Wird konvertiert…",
5970
"pandocultimateconverter-codex-status-done": "Fertig",
71+
"pandocultimateconverter-codex-status-done-converted": "Konvertierung abgeschlossen",
72+
"pandocultimateconverter-codex-status-done-polished": "KI-Bereinigung abgeschlossen",
6073
"pandocultimateconverter-codex-status-error": "Fehler: $1",
6174
"pandocultimateconverter-codex-retry": "Erneut versuchen",
6275
"pandocultimateconverter-codex-column-source": "Quelle",

i18n/en.json

Lines changed: 16 additions & 1 deletion
@@ -45,6 +45,9 @@
4545
"apihelp-pandocconvert-example-file": "Convert an already-uploaded file named Document.docx to the page MyArticle.",
4646
"apihelp-pandocconvert-example-url": "Convert the content at https://example.com to the page MyArticle.",
4747

48+
"apihelp-pandocllmpolish-summary": "Run LLM AI cleanup on the wikitext of an existing wiki page.",
49+
"apihelp-pandocllmpolish-param-pagename": "Title of the existing wiki page to polish.",
50+
4851
"pandocultimateconverter-codex-description": "Add files or URLs to convert into wiki pages. You can process multiple items at once.",
4952
"pandocultimateconverter-codex-switch-classic": "Switch to classic form",
5053
"pandocultimateconverter-codex-tab-files": "Files",
@@ -60,9 +63,21 @@
6063
"pandocultimateconverter-codex-convert-all": "Convert all",
6164
"pandocultimateconverter-codex-clear-all": "Clear all",
6265
"pandocultimateconverter-codex-overwrite-toggle": "Allow overwriting existing pages",
66+
"pandocultimateconverter-codex-llm-polish-toggle": "Polish with AI (LLM cleanup)",
67+
"pandocultimateconverter-codex-llm-polish-btn": "AI cleanup",
68+
"pandocultimateconverter-codex-status-polishing": "AI cleanup…",
69+
"pandocultimateconverter-codex-status-polish-error": "AI cleanup failed: $1",
70+
"pandocultimateconverter-llmpolish-comment": "LLM AI cleanup",
71+
"apierror-pandocllmpolish-pagenotfound": "Page not found: $1",
72+
"apierror-pandocllmpolish-notconfigured": "LLM polish is not configured on this wiki.",
73+
"apierror-pandocllmpolish-notwikitext": "The page content is not wikitext.",
74+
"apierror-pandocllmpolish-failed": "LLM polish failed: $1",
75+
6376
"pandocultimateconverter-codex-status-uploading": "Uploading…",
6477
"pandocultimateconverter-codex-status-converting": "Converting…",
6578
"pandocultimateconverter-codex-status-done": "Done",
79+
"pandocultimateconverter-codex-status-done-converted": "Conversion done",
80+
"pandocultimateconverter-codex-status-done-polished": "AI cleanup done",
6681
"pandocultimateconverter-codex-status-error": "Error: $1",
6782
"pandocultimateconverter-codex-retry": "Retry",
6883
"pandocultimateconverter-codex-column-source": "Source",
@@ -71,7 +86,7 @@
7186
"pandocultimateconverter-codex-column-actions": "Actions",
7287
"pandocultimateconverter-codex-navigate-warning": "Conversion is in progress. Are you sure you want to leave?",
7388
"pandocultimateconverter-codex-invalid-url": "Invalid URL (only http and https are supported)",
74-
"pandocultimateconverter-codex-convert-one": "Convert this item",
89+
"pandocultimateconverter-codex-convert-one": "Convert",
7590
"pandocultimateconverter-codex-remove-item": "Remove",
7691
"pandocultimateconverter-codex-stop": "Stop",
7792
"pandocultimateconverter-codex-stop-requested": "Stopping…",
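
Each locale hunk above adds the same set of message keys, so one way to review such a change is to compare key sets against `en.json`. The following is a rough sketch under that assumption. `load_messages` and `missing_keys` are hypothetical helpers and not part of the extension. Keys starting with `@` (such as `@metadata`) are skipped because they are not translatable messages.

```python
import json


def load_messages(path):
    """Load one i18n JSON file as a dict of message key -> text."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def missing_keys(reference, locale):
    """Return the sorted message keys present in reference but not in locale."""
    return sorted(
        key for key in reference
        if not key.startswith("@") and key not in locale
    )
```

Calling `missing_keys(load_messages("i18n/en.json"), load_messages("i18n/de.json"))` would list any keys that the German file still lacks.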
