README.md: 22 additions & 12 deletions
@@ -11,6 +11,8 @@ Marker converts documents to markdown, JSON, chunks, and HTML quickly and accura
 - Optionally boost accuracy with LLMs (and your own prompt)
 - Works on GPU, CPU, or MPS
 
+For our managed API or on-prem document intelligence solution, check out [our platform here](https://datalab.to?utm_source=gh-marker).
+
 ## Performance
 
 <img src="data/images/overall.png" width="800px"/>
@@ -41,14 +43,15 @@ As you can see, the use_llm mode offers higher accuracy than marker or gemini al
 
 # Commercial usage
 
-Our model weights use a modified AI Pubs Open Rail-M license (free for research, personal use, and startups under $2M funding/revenue) and our code is GPL. For broader commercial licensing or to remove GPL requirements, visit our pricing page [here](https://www.datalab.to).
+Our model weights use a modified AI Pubs Open Rail-M license (free for research, personal use, and startups under $2M funding/revenue) and our code is GPL. For broader commercial licensing or to remove GPL requirements, visit our pricing page [here](https://www.datalab.to/pricing?utm_source=gh-marker).
 
-# Hosted API
+# Hosted API & On-prem
 
-There's a hosted API for marker available [here](https://www.datalab.to/):
+There's a [hosted API](https://www.datalab.to?utm_source=gh-marker) and [painless on-prem solution](https://www.datalab.to/blog/self-serve-on-prem-licensing) for marker - it's free to sign up, and we'll throw in credits for you to test it out.
-- 1/4th the price of leading cloud-based competitors
+- Is 1/4th the price of leading cloud-based competitors
 - Fast - ~15s for a 250 page PDF
 - Supports LLM mode
 - High uptime (99.99%)
@@ -102,7 +105,7 @@ Options:
 - `--page_range TEXT`: Specify which pages to process. Accepts comma-separated page numbers and ranges. Example: `--page_range "0,5-10,20"` will process pages 0, 5 through 10, and page 20.
 - `--output_format [markdown|json|html|chunks]`: Specify the format for the output results.
 - `--output_dir PATH`: Directory where output files will be saved. Defaults to the value specified in settings.OUTPUT_DIR.
-- `--paginate_output`: Paginates the output, using `\n\n{PAGE_NUMBER}` followed by `-` * 48, then `\n\n`
+- `--paginate_output`: Paginates the output, using `\n\n{PAGE_NUMBER}` followed by `-` * 48, then `\n\n`
 - `--use_llm`: Uses an LLM to improve accuracy. You will need to configure the LLM backend - see [below](#llm-services).
 - `--force_ocr`: Force OCR processing on the entire document, even for pages that might contain extractable text. This will also format inline math properly.
 - `--block_correction_prompt`: if LLM mode is active, an optional prompt that will be used to correct the output of marker. This is useful for custom formatting or logic that you want to apply to the output.
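
The `--page_range` and `--paginate_output` descriptions above can be illustrated with a small sketch. This is not marker's own parsing code: `parse_page_range` and `split_paginated` are hypothetical helpers, and the separator pattern assumes the page number is wrapped in literal braces and that each separator precedes the text of its page.

```python
import re


def parse_page_range(spec: str) -> list[int]:
    """Expand a spec such as "0,5-10,20" into a sorted list of page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if end < start:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


# Assumed separator: "\n\n{PAGE_NUMBER}" + "-" * 48 + "\n\n", with literal braces.
PAGE_SEPARATOR = re.compile(r"\n\n\{(\d+)\}-{48}\n\n")


def split_paginated(text: str) -> dict[int, str]:
    """Map each page number to the text that follows its separator."""
    parts = PAGE_SEPARATOR.split(text)
    # parts[0] is anything before the first separator; after that the list
    # alternates between a captured page number and that page's text.
    return {int(parts[i]): parts[i + 1] for i in range(1, len(parts) - 1, 2)}
```

For the example in the option text, `parse_page_range("0,5-10,20")` expands to pages 0, 5 through 10, and 20.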
-Each document consists of one or more pages. Pages contain blocks, which can themselves contain other blocks. It's possible to programmatically manipulate these blocks.
+Each document consists of one or more pages. Pages contain blocks, which can themselves contain other blocks. It's possible to programmatically manipulate these blocks.
 
 Here's an example of extracting all forms from a document:
This takes all the same configuration as the PdfConverter. You can specify the configuration `force_layout_block=Table` to avoid layout detection and instead assume every page is a table. Set `output_format=json` to also get cell bounding boxes.
@@ -260,7 +263,7 @@ from pydantic import BaseModel
 
 class Links(BaseModel):
     links: list[str]
-
+
 schema = Links.model_json_schema()
 config_parser = ConfigParser({
     "page_schema": schema
@@ -300,7 +303,7 @@ HTML output is similar to markdown output:
 
 JSON output will be organized in a tree-like structure, with the leaf nodes being blocks. Examples of leaf nodes are a single list item, a paragraph of text, or an image.
 
-The output will be a list, with each list item representing a page. Each page is considered a block in the internal marker schema. There are different types of blocks to represent different elements.
+The output will be a list, with each list item representing a page. Each page is considered a block in the internal marker schema. There are different types of blocks to represent different elements.
 
 Pages have the keys:
 
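
The tree shape described in this hunk (a list of pages, each a block whose leaves are items such as list items, paragraphs, or images) can be walked with a short recursive sketch. The key list is not visible in this hunk, so the `children` key used below is an assumption, and `iter_leaf_blocks` and `leaves_per_page` are hypothetical helper names rather than part of marker.

```python
from typing import Iterator


def iter_leaf_blocks(block: dict) -> Iterator[dict]:
    """Yield the leaf blocks under `block` in document order.

    Assumes nested blocks live under a "children" key (an assumption, not
    confirmed by this hunk). A block with no children counts as a leaf.
    """
    children = block.get("children") or []
    if not children:
        yield block
        return
    for child in children:
        yield from iter_leaf_blocks(child)


def leaves_per_page(pages: list[dict]) -> list[list[dict]]:
    """Return one list of leaf blocks per page of the JSON output."""
    return [list(iter_leaf_blocks(page)) for page in pages]
```

A page with no children is itself returned as a leaf, which keeps empty pages visible in the result.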
@@ -366,7 +369,7 @@ All output formats will return a metadata dictionary, with the following fields:
   ], // computed PDF table of contents
   "page_stats": [
     {
-      "page_id": 0,
+      "page_id": 0,
       "text_extraction_method": "pdftext",
       "block_counts": [("Span", 200), ...]
     },
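
As a rough illustration of how the `page_stats` entries shown in this hunk might be consumed, the sketch below tallies pages per `text_extraction_method` and sums `block_counts` across pages. `summarize_page_stats` is a hypothetical helper, and it assumes each `block_counts` entry is a two-element (block type, count) pair as in the snippet.

```python
from collections import Counter


def summarize_page_stats(metadata: dict) -> dict:
    """Aggregate the per-page stats found in a metadata dictionary."""
    pages_by_method: Counter = Counter()
    blocks_by_type: Counter = Counter()
    for page in metadata.get("page_stats", []):
        pages_by_method[page["text_extraction_method"]] += 1
        # Each entry is assumed to be a (block_type, count) pair; tuples and
        # two-element lists (as JSON would produce) both unpack the same way.
        for block_type, count in page.get("block_counts", []):
            blocks_by_type[block_type] += count
    return {
        "pages_by_method": dict(pages_by_method),
        "blocks_by_type": dict(blocks_by_type),
    }
```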
@@ -553,4 +556,11 @@ PDF is a tricky format, so marker will not always work perfectly. Here are some
 - Very complex layouts, with nested tables and forms, may not work
 - Forms may not be rendered well
 
-Note: Passing the `--use_llm` and `--force_ocr` flags will mostly solve these issues.
+Note: Passing the `--use_llm` and `--force_ocr` flags will mostly solve these issues.
+
+# Usage and Deployment Examples
+
+You can always run `marker` locally, but if you wanted to expose it as an API, we have a few options:
+- [Deployment example with Modal](./examples/README_MODAL.md) that shows you how to deploy and access `marker` through a web endpoint using [`Modal`](https://modal.com), which makes compute easy to provision and scale.
+- Our platform API is also powered by `marker` and `surya` and is easy to test out - it's free to sign up, and we'll include credits, [try it out here](https://datalab.to)
+- Our painless on-prem solution for commercial use, which you can [read about here](https://www.datalab.to/blog/self-serve-on-prem-licensing)
+This directory contains examples of running `marker` in different contexts.
+
+### Usage with Modal
+
+We have a [self-contained example](./marker_modal_deployment.py) that shows how you can quickly use [Modal](https://modal.com) to deploy `marker` by provisioning a container with a GPU, and expose that with an API so you can submit PDFs for conversion into Markdown, HTML, or JSON.
+
+It's a limited example that you can extend into different use cases.
+
+#### Pre-requisites
+
+Make sure you have the `modal` client installed by [following their instructions here](https://modal.com/docs/guide#getting-started).
+
+Modal's [Starter Plan](https://modal.com/pricing) includes $30 of free compute each month.
+Modal is [serverless](https://arxiv.org/abs/1902.03383), so you only pay for resources when you are using them.
+
+#### Running the example
+
+Once `modal` is configured, you can deploy it to your workspace by running:
+
+> modal deploy marker_modal_deployment.py
+
+Notes:
+- `marker` has a few models it uses. By default, the endpoint will check if these models are loaded and download them if not (first request will be slow). You can avoid this by running
+
+> modal run marker_modal_deployment.py::download_models
+
+Which will create a [`Modal Volume`](https://modal.com/docs/guide/Volumes) to store them for re-use.
+
+Once the deploy is finished, you can:
+- Test a file upload locally through your CLI using an `invoke_conversion` command we expose through Modal's [`local_entrypoint`](https://modal.com/docs/reference/modal.App#local_entrypoint)
+- Get the URL of your endpoint and make a request through a client of your choice.
+
+**Test from your CLI with `invoke_conversion`**
+
+If your endpoint is live, simply run this command:
+
+```
+$ modal run marker_modal_deployment.py::invoke_conversion --pdf-file <PDF_FILE_PATH> --output-format markdown
+```
+
+And it'll automatically detect the URL of your new endpoint using [`.get_web_url()`](https://modal.com/docs/guide/webhook-urls#determine-the-url-of-a-web-endpoint-from-code), make sure it's healthy, submit your file, and store its output on your machine (in the same directory).
+
+**Making a request using your own client**
+
+If you want to make requests elsewhere e.g. with cURL or a client like Insomnia, you'll need to get the URL.
+
+When your `modal deploy` command from earlier finishes, it'll include your endpoint URL at the end. For example:
+
+```
+$ modal deploy marker_modal_deployment.py
+...
+✓ Created objects.
+├── 🔨 Created mount /marker/examples/marker_modal_deployment.py
+├── 🔨 Created function download_models.
+├── 🔨 Created function MarkerModalDemoService.*.
+└── 🔨 Created web endpoint for MarkerModalDemoService.fastapi_app => <YOUR_ENDPOINT_URL>
+✓ App deployed in 149.877s! 🎉
+```
+
+If you accidentally close your terminal session, you can also always go into Modal's dashboard and:
+- Find the app (default name: `datalab-marker-modal-demo`)
+- Click on `MarkerModalDemoService`
+- Find your endpoint URL
+
+Once you have your URL, make a request to `{YOUR_ENDPOINT_URL}/convert` like this (you can also use Insomnia, etc.):
+```
+curl --request POST \
+  --url {BASE_URL}/convert \
+  --header 'Content-Type: multipart/form-data' \
+  --form file=@/Users/cooldev/sample.pdf \
+  --form output_format=html
+```
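
For readers who would rather script the request than use cURL, here is a minimal standard-library sketch of the same multipart POST. The `/convert` path and the `file` and `output_format` form fields come from the cURL command above; `build_convert_request` is a hypothetical helper, and the `application/pdf` part type is an assumption about what the endpoint accepts.

```python
import urllib.request
import uuid


def build_convert_request(
    base_url: str,
    pdf_bytes: bytes,
    filename: str,
    output_format: str = "html",
) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST to `{base_url}/convert`."""
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode()
    body = b"\r\n".join([
        delimiter,
        b'Content-Disposition: form-data; name="output_format"',
        b"",
        output_format.encode(),
        delimiter,
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        b"Content-Type: application/pdf",  # assumed content type for the upload
        b"",
        pdf_bytes,
        delimiter + b"--",  # closing delimiter
        b"",
    ])
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/convert",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Sending is left to the caller, for example:
#   with urllib.request.urlopen(build_convert_request(url, data, "sample.pdf")) as resp:
#       payload = json.load(resp)
```

Building the request separately from sending it makes the payload easy to inspect before any network call is made.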
+
+You should get a response like this
+
+```
+{
+    "success": true,
+    "filename": "sample.pdf",
+    "output_format": "html",
+    "json": null,
+    "html": "<YOUR_RESPONSE_CONTENT>",
+    "markdown": null,
+    "images": {},
+    "metadata": {... page level metadata ...},
+    "page_count": 2
+}
+```
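
One way to read that response, sketched under the assumption (suggested by the sample, where only `html` is populated for `output_format: html`) that the converted content sits under the key named by `output_format`. `extract_content` is a hypothetical helper, not part of the example deployment.

```python
from typing import Any


def extract_content(response: dict) -> Any:
    """Return the converted content from a parsed `/convert` response."""
    if not response.get("success"):
        raise RuntimeError("conversion was not reported as successful")
    output_format = response["output_format"]
    # Assumption: the field named after the requested format carries the
    # result, while the other format fields are null.
    content = response.get(output_format)
    if content is None:
        raise KeyError(f"no content found under {output_format!r}")
    return content
```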
+
+[Modal](https://modal.com) makes deploying and scaling models and inference workloads much easier.
+
+If you're interested in Datalab's managed API or on-prem document intelligence solution, check out [our platform here](https://datalab.to/?utm_source=gh-marker).