
Commit 422f5d0

Merge pull request #302 from enoch3712/301-update-markdown-docs
301 update markdown docs
2 parents c2a60ce + 46eb5a0 commit 422f5d0

5 files changed (+105, -6 lines)

docs/core-concepts/extractors/image_charts.md

Lines changed: 0 additions & 2 deletions
@@ -78,5 +78,3 @@ Different models are optimized for different visual tasks:
- Vision processing requires GPT-4o or higher models
- Processing time may be longer for vision-enabled extraction
- Image quality significantly impacts extraction accuracy
For more examples and advanced usage, check out the [examples directory](examples/) in the repository.

docs/core-concepts/markdown-conversion/index.md

Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
# Markdown Conversion

The `MarkdownConverter` class converts documents (including text and images) into Markdown. It relies on a configured Language Model (LLM) for the conversion, especially when processing images or producing structured output.

## Core Concepts

- **LLM Integration:** The converter **requires** a configured LLM (`extract_thinker.llm.LLM`) to interpret document content and generate well-formatted Markdown. The LLM handles both text and vision-based tasks (processing images) and can generate structured JSON output alongside the Markdown.
- **Document Loader:** It relies on a `DocumentLoader` (`extract_thinker.document_loader.DocumentLoader`) to load the source document(s) and extract text and, where supported, images. Behavior may vary depending on the specific loader used.
- **Vision Support:** `to_markdown` accepts a `vision` parameter, while `to_markdown_structured` operates in vision mode by default. When vision is enabled, the converter processes images in the document using the LLM's vision capabilities (if the LLM supports them).
- **Structured Output:** `to_markdown_structured` instructs the LLM to return not only the Markdown content but also a JSON structure breaking the content down with certainty scores. This method inherently requires a vision-capable LLM.
## Initialization

```python
from extract_thinker.markdown import MarkdownConverter
from extract_thinker.document_loader import DocumentLoaderPyPdf  # Example loader
from extract_thinker.llm import LLM
from extract_thinker.global_models import get_lite_model, get_big_model  # Helpers for model config

# Initialize with or without components
markdown_converter = MarkdownConverter()

# Load components later
loader = DocumentLoaderPyPdf()  # Configure as needed
# Use helper functions to get model configurations
# Replace with your actual logic for selecting/configuring models if needed
llm = LLM(get_lite_model())

markdown_converter.load_document_loader(loader)
markdown_converter.load_llm(llm)

# Or initialize directly
markdown_converter = MarkdownConverter(document_loader=loader, llm=llm)
```
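
The snippet above imports `get_big_model` but only uses `get_lite_model`. For vision-enabled conversion you will generally want a vision-capable model; assuming `get_big_model()` returns such a configuration (an assumption, not stated on this page), a minimal sketch:

```python
# Sketch only: use a larger, vision-capable model configuration for image-heavy
# documents. Assumes get_big_model() returns a vision-capable model config.
vision_llm = LLM(get_big_model())

markdown_converter = MarkdownConverter(
    document_loader=DocumentLoaderPyPdf(),
    llm=vision_llm,
)
```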
## Usage

### Simple Markdown Conversion (LLM Required)

This method uses the configured LLM to generate Markdown. If `vision=True`, it also processes images, which requires an LLM with vision capabilities. **Note:** An LLM must be configured via `load_llm()` or during initialization for this method to work.

```python
# Assuming markdown_converter is initialized with loader and LLM
source_path = "path/to/your/document.pdf"  # Or an image file such as .png or .jpg

# Convert with vision disabled (processes text only using the LLM)
markdown_pages_text = markdown_converter.to_markdown(source_path, vision=False)
# Returns List[str]

# Convert with vision enabled (processes text and images using the LLM)
markdown_pages_vision = markdown_converter.to_markdown(source_path, vision=True)
# Returns List[str]

for i, page_md in enumerate(markdown_pages_vision):
    print(f"--- Page {i+1} ---")
    print(page_md)

# Async version
markdown_pages_vision_async = await markdown_converter.to_markdown_async(source_path, vision=True)
```
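
A common follow-up is to join the per-page results into a single Markdown document or write them to disk; a minimal sketch (the output file name is illustrative):

```python
from pathlib import Path

# Join the per-page Markdown into one document, separated by horizontal rules
full_markdown = "\n\n---\n\n".join(markdown_pages_vision)
Path("document.md").write_text(full_markdown, encoding="utf-8")
```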
### Structured Markdown Conversion (LLM Vision Required)

This method *requires* an LLM with vision capabilities and a document containing images. It returns structured data that includes the Markdown and a JSON breakdown. **Note:** An LLM must be configured via `load_llm()` or during initialization for this method to work.

```python
from typing import List

from extract_thinker.markdown import PageContent

# Assuming markdown_converter is initialized with loader and LLM (with vision)
image_path = "path/to/your/image.png"

try:
    # This method inherently uses vision
    structured_output: List[PageContent] = markdown_converter.to_markdown_structured(image_path)
    # Returns List[PageContent]

    for i, page_content in enumerate(structured_output):
        print(f"--- Page {i+1} ---")
        # Access structured items
        for item in page_content.items:
            print(f"Certainty: {item.certainty}, Content: {item.content[:50]}...")  # Print a snippet

except ValueError as e:
    print(f"Error: {e}")  # e.g., if no images are found or no LLM is set

# Async version
structured_output_async: List[PageContent] = await markdown_converter.to_markdown_structured_async(image_path)
```

**Note:** The `to_markdown_structured` method expects the LLM to return both Markdown and a specific JSON format. The `extract_thinking_json` utility is used internally to parse this.
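
The exact definition of `PageContent` is not shown here; based on the attributes used above (`items`, and each item's `certainty` and `content`), it behaves roughly like the following Pydantic-style sketch. Field types and any fields beyond these are assumptions, and the real model in `extract_thinker` may differ:

```python
from typing import List
from pydantic import BaseModel


class ContentItem(BaseModel):
    certainty: int  # assumed: certainty score returned by the LLM
    content: str    # extracted content for this item


class PageContentSketch(BaseModel):
    items: List[ContentItem]  # structured breakdown of one page
```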
## Prompts

The converter uses specific system prompts depending on the method called:

- `DEFAULT_PAGE_PROMPT`: Used by `to_markdown_structured`. Instructs the LLM to output Markdown *and* a JSON structure.
- `DEFAULT_MARKDOWN_PROMPT`: Used by `to_markdown` (when using the LLM). Instructs the LLM to output *only* well-formatted Markdown.
- `MARKDOWN_VERIFICATION_PROMPT`: Potentially used for refining existing text (internal `allow_verification` flag).

These prompts guide the LLM's output format.
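
The choice of prompt follows from which method you call; a small illustrative helper (not part of the library) that selects between the two documented entry points:

```python
from typing import List, Union


def convert(converter: MarkdownConverter, path: str, structured: bool = False) -> Union[List[str], List[PageContent]]:
    """Illustrative only: structured output follows the DEFAULT_PAGE_PROMPT path,
    plain Markdown follows the DEFAULT_MARKDOWN_PROMPT path."""
    if structured:
        # Vision-based; returns List[PageContent] with a JSON breakdown
        return converter.to_markdown_structured(path)
    # Returns List[str] of Markdown pages
    return converter.to_markdown(path, vision=True)
```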

docs/examples/google-stack.md

Lines changed: 1 addition & 1 deletion
@@ -97,7 +97,7 @@ for doc_content in result:
    print(doc_content.json(indent=2))
```

- More information about document splitting can be found in the [document splitting](../../core-concepts/splitters) section.
+ More information about document splitting can be found in the [document splitting](../core-concepts/splitters/index.md) section.

**Document OCR**: Basic text extraction and layout analysis

docs/index.md

Lines changed: 3 additions & 3 deletions
@@ -1,9 +1,9 @@
---
- redirect: getting-started/
+ redirect: getting-started/index.md
---

<script>
- window.location.href = 'getting-started/';
+ window.location.href = 'getting-started/index.md';
</script>

- [Click here if you are not redirected automatically](getting-started/)
+ [Click here if you are not redirected automatically](getting-started/index.md)

mkdocs.yml

Lines changed: 2 additions & 0 deletions
@@ -58,6 +58,8 @@ nav:
  - LLM Integration:
    - Overview: core-concepts/llm-integration/index.md
    - Thinking Models: core-concepts/llm-integration/thinking-models.md
+ - Markdown Conversion:
+   - Overview: core-concepts/markdown-conversion/index.md
  - Classification:
    - Overview: core-concepts/classification/index.md
    - Basic Classification: core-concepts/classification/basic.md
