Commit 1f62954

Update README
2 parents 920fe56 + c994d35 commit 1f62954

3 files changed

Lines changed: 20 additions & 10 deletions

README.md

Lines changed: 11 additions & 9 deletions
@@ -11,6 +11,8 @@ Marker converts documents to markdown, JSON, chunks, and HTML quickly and accura
 - Optionally boost accuracy with LLMs (and your own prompt)
 - Works on GPU, CPU, or MPS

+For our managed API or on-prem document intelligence solution, check out [our platform here](https://datalab.to?utm_source=gh-marker).
+
 ## Performance

 <img src="data/images/overall.png" width="800px"/>
@@ -41,11 +43,11 @@ As you can see, the use_llm mode offers higher accuracy than marker or gemini al

 # Commercial usage

-Our model weights use a modified AI Pubs Open Rail-M license (free for research, personal use, and startups under $2M funding/revenue) and our code is GPL. For broader commercial licensing or to remove GPL requirements, visit our pricing page [here](https://www.datalab.to).
+Our model weights use a modified AI Pubs Open Rail-M license (free for research, personal use, and startups under $2M funding/revenue) and our code is GPL. For broader commercial licensing or to remove GPL requirements, visit our pricing page [here](https://www.datalab.to/pricing?utm_source=gh-marker).

 # Hosted API

-There's a hosted API for marker available [here](https://www.datalab.to/):
+There's a hosted API for marker available [here](https://www.datalab.to?utm_source=gh-marker):

 - Supports PDF, image, PPT, PPTX, DOC, DOCX, XLS, XLSX, HTML, EPUB files
 - 1/4th the price of leading cloud-based competitors
@@ -102,7 +104,7 @@ Options:
 - `--page_range TEXT`: Specify which pages to process. Accepts comma-separated page numbers and ranges. Example: `--page_range "0,5-10,20"` will process pages 0, 5 through 10, and page 20.
 - `--output_format [markdown|json|html|chunks]`: Specify the format for the output results.
 - `--output_dir PATH`: Directory where output files will be saved. Defaults to the value specified in settings.OUTPUT_DIR.
-- `--paginate_output`: Paginates the output, using `\n\n{PAGE_NUMBER}` followed by `-` * 48, then `\n\n`
+- `--paginate_output`: Paginates the output, using `\n\n{PAGE_NUMBER}` followed by `-` * 48, then `\n\n`
 - `--use_llm`: Uses an LLM to improve accuracy. You will need to configure the LLM backend - see [below](#llm-services).
 - `--force_ocr`: Force OCR processing on the entire document, even for pages that might contain extractable text. This will also format inline math properly.
 - `--block_correction_prompt`: if LLM mode is active, an optional prompt that will be used to correct the output of marker. This is useful for custom formatting or logic that you want to apply to the output.
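
The `--page_range` and `--paginate_output` descriptions in this hunk can be illustrated with a minimal Python sketch. It is not taken from marker's source. The function names `parse_page_range` and `split_paginated` are hypothetical, and two things are assumptions: that a separator precedes each page's text, and that the page number may or may not be wrapped in literal braces (the regex accepts both).

```python
import re


def parse_page_range(spec: str) -> list[int]:
    """Expand a spec such as "0,5-10,20" into a sorted list of page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


# Assumption: separator is "\n\n", the page number (optionally in braces),
# 48 dashes, then "\n\n", as described for --paginate_output.
PAGE_SEPARATOR = re.compile(r"\n\n\{?(\d+)\}?-{48}\n\n")


def split_paginated(text: str) -> dict[int, str]:
    """Split paginated output into {page_number: page_text}."""
    parts = PAGE_SEPARATOR.split(text)
    # parts = [text before first separator, number, body, number, body, ...]
    return {int(num): body for num, body in zip(parts[1::2], parts[2::2])}


sample = "\n\n{0}" + "-" * 48 + "\n\nfirst page" + "\n\n{1}" + "-" * 48 + "\n\nsecond page"
pages = split_paginated(sample)
```

Here `parse_page_range("0,5-10,20")` yields pages 0, 5 through 10, and 20, matching the example in the option text.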
@@ -182,7 +184,7 @@ rendered = converter("FILEPATH")

 ### Extract blocks

-Each document consists of one or more pages. Pages contain blocks, which can themselves contain other blocks. It's possible to programmatically manipulate these blocks.
+Each document consists of one or more pages. Pages contain blocks, which can themselves contain other blocks. It's possible to programmatically manipulate these blocks.

 Here's an example of extracting all forms from a document:

@@ -222,7 +224,7 @@ text, _, images = text_from_rendered(rendered)

 This takes all the same configuration as the PdfConverter. You can specify the configuration `force_layout_block=Table` to avoid layout detection and instead assume every page is a table. Set `output_format=json` to also get cell bounding boxes.

-You can also run this via the CLI with
+You can also run this via the CLI with
 ```shell
 marker_single FILENAME --use_llm --force_layout_block Table --converter_cls marker.converters.table.TableConverter --output_format json
 ```
@@ -243,7 +245,7 @@ rendered = converter("FILEPATH")

 This takes all the same configuration as the PdfConverter.

-You can also run this via the CLI with
+You can also run this via the CLI with
 ```shell
 marker_single FILENAME --converter_cls marker.converters.ocr.OCRConverter
 ```
@@ -260,7 +262,7 @@ from pydantic import BaseModel

 class Links(BaseModel):
     links: list[str]
-
+
 schema = Links.model_json_schema()
 config_parser = ConfigParser({
     "page_schema": schema
@@ -300,7 +302,7 @@ HTML output is similar to markdown output:

 JSON output will be organized in a tree-like structure, with the leaf nodes being blocks. Examples of leaf nodes are a single list item, a paragraph of text, or an image.

-The output will be a list, with each list item representing a page. Each page is considered a block in the internal marker schema. There are different types of blocks to represent different elements.
+The output will be a list, with each list item representing a page. Each page is considered a block in the internal marker schema. There are different types of blocks to represent different elements.

 Pages have the keys:

@@ -366,7 +368,7 @@ All output formats will return a metadata dictionary, with the following fields:
   ], // computed PDF table of contents
   "page_stats": [
     {
-      "page_id": 0,
+      "page_id": 0,
       "text_extraction_method": "pdftext",
       "block_counts": [("Span", 200), ...]
     },
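
As a rough illustration of the `page_stats` structure shown in this hunk, the sketch below tallies extraction methods and block counts across pages. The sample values, the second extraction method name, and the helper `summarize_page_stats` are all hypothetical. Only the three field names come from the README fragment.

```python
from collections import Counter

# Sample metadata shaped like the README fragment; all values are illustrative.
metadata = {
    "page_stats": [
        {
            "page_id": 0,
            "text_extraction_method": "pdftext",
            "block_counts": [("Span", 200), ("Text", 12)],
        },
        {
            "page_id": 1,
            "text_extraction_method": "surya",  # assumed name for an OCR-based method
            "block_counts": [("Span", 150)],
        },
    ]
}


def summarize_page_stats(meta: dict) -> tuple[dict, dict]:
    """Count pages per extraction method and total blocks per block type."""
    methods: Counter = Counter()
    blocks: Counter = Counter()
    for page in meta.get("page_stats", []):
        methods[page["text_extraction_method"]] += 1
        for block_type, count in page["block_counts"]:
            blocks[block_type] += count
    return dict(methods), dict(blocks)


methods, blocks = summarize_page_stats(metadata)
```

A summary like this could help spot documents where most pages fell back to OCR, assuming the method field distinguishes the two cases.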

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "marker-pdf"
-version = "1.9.1"
+version = "1.9.2"
 description = "Convert documents to markdown with high speed and accuracy."
 authors = ["Vik Paruchuri <github@vikas.sh>"]
 readme = "README.md"

signatures/version1/cla.json

Lines changed: 8 additions & 0 deletions
@@ -351,6 +351,14 @@
       "created_at": "2025-08-25T18:41:28Z",
       "repoId": 712111618,
       "pullRequestNo": 850
+    },
+    {
+      "name": "EdmondChuiHW",
+      "id": 1967998,
+      "comment_id": 3254531992,
+      "created_at": "2025-09-04T16:30:48Z",
+      "repoId": 712111618,
+      "pullRequestNo": 869
     }
   ]
 }
