| 1 | +# Markdown Conversion |
| 2 | + |
| 3 | +The `MarkdownConverter` class converts documents (text and images) into Markdown. It uses a configured large language model (LLM) to perform the conversion, which matters most when a document contains images or when structured output is required. |
| 4 | + |
| 5 | +## Core Concepts |
| 6 | + |
| 7 | +- **LLM Integration:** The converter **requires** a configured LLM (`extract_thinker.llm.LLM`) to interpret document content and generate well-formatted Markdown. The LLM is needed for text-only conversion, for vision-based conversion (processing images), and for generating structured JSON output alongside the Markdown. |
| 8 | +- **Document Loader:** It relies on a `DocumentLoader` (`extract_thinker.document_loader.DocumentLoader`) to load the source document(s) and potentially extract text and images. The behavior might vary depending on the specific loader used. |
| 9 | +- **Vision Support:** The `to_markdown` method takes a `vision` parameter, while `to_markdown_structured` always operates in vision mode. When vision is enabled, the converter attempts to process images within the document using the LLM's vision capabilities (if the LLM supports them). |
| 10 | +- **Structured Output:** The `to_markdown_structured` method specifically instructs the LLM to provide not only the Markdown content but also a JSON structure breaking down the content with certainty scores. This method inherently requires vision capabilities in the LLM. |
| 11 | + |
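As a rough illustration of the structured output described above, the sketch below filters items by certainty. The `Item` and `Page` dataclasses are hypothetical stand-ins for the library's `PageContent` and its items (only the `items`, `content`, and `certainty` attribute names come from this document), and the sample values and threshold are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """Hypothetical stand-in for one structured item."""
    content: str
    certainty: float


@dataclass
class Page:
    """Hypothetical stand-in for PageContent."""
    items: List[Item] = field(default_factory=list)


def confident_items(pages: List[Page], threshold: float) -> List[Item]:
    """Return items from all pages whose certainty is at least `threshold`."""
    return [
        item
        for page in pages
        for item in page.items
        if item.certainty >= threshold
    ]


# Made-up sample data; the real certainty scale is whatever the prompt defines.
pages = [
    Page(items=[Item("# Invoice", 9), Item("Total: 42", 4)]),
    Page(items=[Item("Signature block", 7)]),
]
kept = confident_items(pages, threshold=7)
print([item.content for item in kept])
```

Keeping the threshold as a parameter avoids baking in any assumption about the scale the LLM uses for its scores.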
| 12 | +## Initialization |
| 13 | + |
| 14 | +```python |
| 15 | +from extract_thinker.markdown import MarkdownConverter |
| 16 | +from extract_thinker.document_loader import DocumentLoaderPyPdf # Example loader |
| 17 | +from extract_thinker.llm import LLM |
| 18 | +from extract_thinker.global_models import get_lite_model, get_big_model # Helpers for model config |
| 19 | + |
| 20 | +# Initialize with or without components |
| 21 | +markdown_converter = MarkdownConverter() |
| 22 | + |
| 23 | +# Load components later |
| 24 | +loader = DocumentLoaderPyPdf() # Configure as needed |
| 25 | +# Use helper functions to get model configurations |
| 26 | +# Replace with your actual logic for selecting/configuring models if needed |
| 27 | +llm = LLM(get_lite_model()) |
| 28 | + |
| 29 | +markdown_converter.load_document_loader(loader) |
| 30 | +markdown_converter.load_llm(llm) |
| 31 | + |
| 32 | +# Or initialize directly |
| 33 | +markdown_converter = MarkdownConverter(document_loader=loader, llm=llm) |
| 34 | +``` |
| 35 | + |
| 36 | +## Usage |
| 37 | + |
| 38 | +### Simple Markdown Conversion (LLM Required) |
| 39 | + |
| 40 | +This method uses the configured LLM to generate Markdown. If `vision=True`, it processes images (requires an LLM with vision capabilities). **Note:** An LLM must be configured via `load_llm()` or during initialization for this method to work. |
| 41 | + |
| 42 | +```python |
| 43 | +# Assuming markdown_converter is initialized with loader and LLM |
| 44 | +source_path = "path/to/your/document.pdf" # Or image file like .png, .jpg |
| 45 | + |
| 46 | +# Convert with vision disabled (processes text only using LLM) |
| 47 | +markdown_pages_text = markdown_converter.to_markdown(source_path, vision=False) |
| 48 | +# Returns List[str] |
| 49 | + |
| 50 | +# Convert with vision enabled (processes text and images using LLM) |
| 51 | +markdown_pages_vision = markdown_converter.to_markdown(source_path, vision=True) |
| 52 | +# Returns List[str] |
| 53 | + |
| 54 | +for i, page_md in enumerate(markdown_pages_vision): |
| 55 | +    print(f"--- Page {i+1} ---") |
| 56 | +    print(page_md) |
| 57 | + |
| 58 | +# Async version (await it inside an async function or a running event loop) |
| 59 | +markdown_pages_vision_async = await markdown_converter.to_markdown_async(source_path, vision=True) |
| 60 | +``` |
| 61 | + |
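A bare `await` only works inside a coroutine or an environment that already runs an event loop, such as a notebook. The sketch below shows one way to drive an async conversion from a plain script with `asyncio.run`. `StubConverter` is a hypothetical stand-in that returns canned pages so the example runs without the library or an LLM. Only the `to_markdown_async(source, vision=...)` call shape is taken from this document.

```python
import asyncio
from typing import List


class StubConverter:
    """Hypothetical stand-in for MarkdownConverter; returns canned pages."""

    async def to_markdown_async(self, source: str, vision: bool = False) -> List[str]:
        await asyncio.sleep(0)  # placeholder for the awaited LLM call
        mode = "vision" if vision else "text"
        return [f"# Page 1 ({mode})", f"# Page 2 ({mode})"]


async def convert(converter: StubConverter, source: str) -> str:
    # Await the async conversion, then join the per-page Markdown strings.
    pages = await converter.to_markdown_async(source, vision=True)
    return "\n\n".join(pages)


if __name__ == "__main__":
    # asyncio.run creates an event loop, runs the coroutine, and closes the loop.
    print(asyncio.run(convert(StubConverter(), "path/to/your/document.pdf")))
```

With a real converter, the same `convert` coroutine should work unchanged as long as the converter exposes `to_markdown_async` as shown in this section.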
| 62 | +### Structured Markdown Conversion (LLM Vision Required) |
| 63 | + |
| 64 | +This method *requires* an LLM with vision capabilities and a document containing images. It returns structured data including Markdown and a JSON breakdown. **Note:** An LLM must be configured via `load_llm()` or during initialization for this method to work. |
| 65 | + |
| 66 | +```python |
| 67 | +from extract_thinker.markdown import PageContent |
| 68 | + |
| 69 | +# Assuming markdown_converter is initialized with loader and LLM (with vision) |
| 70 | +image_path = "path/to/your/image.png" |
| 71 | + |
| 72 | +try: |
| 73 | +    # This method inherently uses vision |
| 74 | +    structured_output: list[PageContent] = markdown_converter.to_markdown_structured(image_path) |
| 75 | +    # Returns List[PageContent] |
| 76 | + |
| 77 | +    for i, page_content in enumerate(structured_output): |
| 78 | +        print(f"--- Page {i+1} ---") |
| 79 | +        # Access structured items |
| 80 | +        for item in page_content.items: |
| 81 | +            print(f"Certainty: {item.certainty}, Content: {item.content[:50]}...") # Print snippet |
| 82 | + |
| 83 | +except ValueError as e: |
| 84 | +    print(f"Error: {e}") # e.g., if no images found or LLM not set |
| 85 | + |
| 86 | +# Async version (await it inside an async function or a running event loop) |
| 87 | +structured_output_async: list[PageContent] = await markdown_converter.to_markdown_structured_async(image_path) |
| 88 | + |
| 89 | +``` |
| 90 | +**Note:** The `to_markdown_structured` method expects the LLM to return both Markdown and a specific JSON format. The `extract_thinking_json` utility is used internally to parse this. |
| 91 | + |
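To make the "Markdown plus JSON" response shape concrete, here is a minimal sketch of how such a response could be split. It is not the library's `extract_thinking_json` implementation. The function name `split_markdown_and_json`, the assumption that the JSON arrives in a fenced `json` block, and the sample payload are all hypothetical.

```python
import json
import re
from typing import Any, Optional, Tuple

FENCE = "`" * 3  # built programmatically to avoid nesting literal fences


def split_markdown_and_json(response: str) -> Tuple[str, Optional[Any]]:
    """Split a response into Markdown text and the parsed last json block."""
    pattern = re.compile(
        re.escape(FENCE) + r"json\s*(.*?)" + re.escape(FENCE), re.DOTALL
    )
    matches = list(pattern.finditer(response))
    if not matches:
        # No JSON block found: treat the whole response as Markdown.
        return response.strip(), None
    last = matches[-1]
    # Everything outside the last JSON block is kept as Markdown.
    markdown = (response[: last.start()] + response[last.end():]).strip()
    try:
        data = json.loads(last.group(1))
    except json.JSONDecodeError:
        data = None  # malformed JSON is reported as missing, not raised
    return markdown, data


# Made-up response in the assumed format.
sample = (
    "# Title\n\nSome text.\n\n"
    + FENCE + "json\n"
    + '{"items": [{"certainty": 8, "content": "# Title"}]}\n'
    + FENCE
)
markdown, data = split_markdown_and_json(sample)
print(markdown)
print(data)
```

Returning `None` for a missing or malformed JSON block lets the caller decide whether to retry the LLM call or fall back to the Markdown alone.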
| 92 | +## Prompts |
| 93 | + |
| 94 | +The converter uses specific system prompts depending on the method called: |
| 95 | +- `DEFAULT_PAGE_PROMPT`: Used by `to_markdown_structured`. Instructs the LLM to output Markdown *and* a JSON structure. |
| 96 | +- `DEFAULT_MARKDOWN_PROMPT`: Used by `to_markdown`. Instructs the LLM to output *only* well-formatted Markdown. |
| 97 | +- `MARKDOWN_VERIFICATION_PROMPT`: Potentially used for refining existing text (internal flag `allow_verification`). |
| 98 | + |
| 99 | +These prompts guide the LLM's output format. |