README.md
+11 -9 (11 additions & 9 deletions)
@@ -11,6 +11,8 @@ Marker converts documents to markdown, JSON, chunks, and HTML quickly and accura
 - Optionally boost accuracy with LLMs (and your own prompt)
 - Works on GPU, CPU, or MPS

+For our managed API or on-prem document intelligence solution, check out [our platform here](https://datalab.to?utm_source=gh-marker).
+
 ## Performance

 <img src="data/images/overall.png" width="800px"/>
@@ -41,11 +43,11 @@ As you can see, the use_llm mode offers higher accuracy than marker or gemini al

 # Commercial usage

-Our model weights use a modified AI Pubs Open Rail-M license (free for research, personal use, and startups under $2M funding/revenue) and our code is GPL. For broader commercial licensing or to remove GPL requirements, visit our pricing page [here](https://www.datalab.to).
+Our model weights use a modified AI Pubs Open Rail-M license (free for research, personal use, and startups under $2M funding/revenue) and our code is GPL. For broader commercial licensing or to remove GPL requirements, visit our pricing page [here](https://www.datalab.to?utm_source=gh-marker).

 # Hosted API

-There's a hosted API for marker available [here](https://www.datalab.to/):
+There's a hosted API for marker available [here](https://www.datalab.to?utm_source=gh-marker):
 - 1/4th the price of leading cloud-based competitors
@@ -102,7 +104,7 @@ Options:
 - `--page_range TEXT`: Specify which pages to process. Accepts comma-separated page numbers and ranges. Example: `--page_range "0,5-10,20"` will process pages 0, 5 through 10, and page 20.
 - `--output_format [markdown|json|html|chunks]`: Specify the format for the output results.
 - `--output_dir PATH`: Directory where output files will be saved. Defaults to the value specified in settings.OUTPUT_DIR.
-- `--paginate_output`: Paginates the output, using `\n\n{PAGE_NUMBER}` followed by `-` * 48, then `\n\n`
+- `--paginate_output`: Paginates the output, using `\n\n{PAGE_NUMBER}` followed by `-` * 48, then `\n\n`
 - `--use_llm`: Uses an LLM to improve accuracy. You will need to configure the LLM backend - see [below](#llm-services).
 - `--force_ocr`: Force OCR processing on the entire document, even for pages that might contain extractable text. This will also format inline math properly.
 - `--block_correction_prompt`: if LLM mode is active, an optional prompt that will be used to correct the output of marker. This is useful for custom formatting or logic that you want to apply to the output.
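
To make the `--page_range` syntax in the hunk above concrete, here is a minimal sketch of how a spec such as `"0,5-10,20"` expands into individual page numbers. It is an illustration only, not marker's own parser: the function name `parse_page_range` is hypothetical, and the handling of stray commas and reversed ranges is an assumption.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a spec such as "0,5-10,20" into a sorted list of page numbers.

    Hypothetical helper for illustration; marker's real parsing may differ.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # assumption: tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_range("0,5-10,20"))
# [0, 5, 6, 7, 8, 9, 10, 20]
```

The printed list matches the README's description of the example: page 0, pages 5 through 10, and page 20.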
-Each document consists of one or more pages. Pages contain blocks, which can themselves contain other blocks. It's possible to programmatically manipulate these blocks.
+Each document consists of one or more pages. Pages contain blocks, which can themselves contain other blocks. It's possible to programmatically manipulate these blocks.

 Here's an example of extracting all forms from a document:
This takes all the same configuration as the PdfConverter. You can specify the configuration `force_layout_block=Table` to avoid layout detection and instead assume every page is a table. Set `output_format=json` to also get cell bounding boxes.
@@ -260,7 +262,7 @@ from pydantic import BaseModel

 class Links(BaseModel):
     links: list[str]
-
+
 schema = Links.model_json_schema()
 config_parser = ConfigParser({
     "page_schema": schema
@@ -300,7 +302,7 @@ HTML output is similar to markdown output:

 JSON output will be organized in a tree-like structure, with the leaf nodes being blocks. Examples of leaf nodes are a single list item, a paragraph of text, or an image.

-The output will be a list, with each list item representing a page. Each page is considered a block in the internal marker schema. There are different types of blocks to represent different elements.
+The output will be a list, with each list item representing a page. Each page is considered a block in the internal marker schema. There are different types of blocks to represent different elements.

 Pages have the keys:

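
The tree-like JSON structure described in the hunk above (a list of pages, each page a block, with leaf nodes such as a list item or a paragraph) can be walked recursively. The sketch below is a rough illustration under stated assumptions: the key names `block_type` and `children` are assumed because the key list is not visible in this diff, the sample data is invented, and `iter_leaf_blocks` is a hypothetical helper rather than part of marker.

```python
from typing import Any, Iterator


def iter_leaf_blocks(block: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield every block in the subtree that has no child blocks.

    Assumes each block stores its nested blocks under a "children" key
    (missing, None, or empty for leaves); this key name is an assumption.
    """
    children = block.get("children") or []
    if not children:
        yield block  # a block without children counts as a leaf
        return
    for child in children:
        yield from iter_leaf_blocks(child)


# Invented sample shaped like the description: a list with one item per page.
pages = [
    {"block_type": "Page", "children": [
        {"block_type": "Text", "children": None},
        {"block_type": "ListGroup", "children": [
            {"block_type": "ListItem", "children": None},
        ]},
    ]},
]

leaf_types = [leaf["block_type"] for page in pages for leaf in iter_leaf_blocks(page)]
print(leaf_types)
# ['Text', 'ListItem']
```

A generator keeps the traversal lazy, which may help on large documents where only a few block types are of interest.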
@@ -366,7 +368,7 @@ All output formats will return a metadata dictionary, with the following fields: