Efficiently Compressing Markdown for LLM Requests #596
-
@FractalMind First of all, I like your username 😎. For your question: there are some interesting text-processing steps you can apply before sending the markdown to a large language model, and Crawl4AI has tools for that.

First, use a content filter, which reduces noise in the markdown and gives you a clean, linear document. One is the PruningContentFilter, and the second is the BM25ContentFilter. Usually, when the markdown you extract is too large for your model, it means you are injecting too much unnecessary information. That's why you need a mix of techniques to reduce the size: not just removing the noise, but also removing the parts of the text that do not carry the information you need. The pruning filter removes the noise and gives you lean markdown. BM25 is the magical secret: given a query or a set of keywords, it extracts the parts of the markdown that are relevant to your query and questions. You then take that smaller markdown and use it for your text processing or anything else you want to do with it. That's why you should combine heuristics like BM25 with other approaches before passing content to large language models, or to your embedding models if you are using embedders.

Another technique is chunking. Crawl4AI provides a few different chunkers, so you can chunk the generated markdown, apply BM25 with the keywords you have to re-rank those chunks, and then pick the top chunks, or enough chunks that the total length fits your context window's token limit. So there are different ways you can do this. Since you asked, maybe in the next release I will add a helper function for it, because it seems very useful to me, and I am a big fan of markdown; that's why I even started the library. @aravindkarnam this is a very good topic for a tutorial, or a video I could create.
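A minimal sketch of the two content filters described above, assuming a recent Crawl4AI release where `PruningContentFilter` and `BM25ContentFilter` live in `crawl4ai.content_filter_strategy` and the filtered result is exposed as `fit_markdown` (older releases expose it on `result.markdown_v2` instead):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter, PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Pruning: drop low-signal blocks (nav, boilerplate) via text-density heuristics.
    prune_filter = PruningContentFilter(threshold=0.48, threshold_type="fixed")

    # BM25: keep only the sections relevant to a query or keyword list.
    bm25_filter = BM25ContentFilter(user_query="pricing plans API rate limits")

    # Swap bm25_filter in for prune_filter to filter by relevance instead of noise.
    config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(content_filter=prune_filter),
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print(result.markdown.fit_markdown)  # the lean, filtered markdown

asyncio.run(main())
```

And a sketch of the chunk-and-re-rank step, using the third-party `rank_bm25` package and a naive paragraph split standing in for Crawl4AI's built-in chunkers; the character budget is a crude stand-in for a real token count:

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def top_chunks(markdown: str, query: str, max_chars: int = 8000) -> str:
    """Re-rank paragraph chunks by BM25 and keep the best ones that fit a budget."""
    chunks = [c for c in markdown.split("\n\n") if c.strip()]
    bm25 = BM25Okapi([c.lower().split() for c in chunks])
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)

    kept, used = [], 0
    for _, chunk in ranked:
        if used + len(chunk) > max_chars:
            break
        kept.append(chunk)
        used += len(chunk)
    return "\n\n".join(kept)
```

In practice you would measure chunks with your model's tokenizer rather than `len()`, but the shape of the pipeline stays the same: filter, chunk, re-rank, keep the top chunks.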
-
This is my C# solution: it uses a simple HTML tag filter and markup compressor, nothing fancy. It works on most websites if they're not too large.
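The C# attachment isn't reproduced here, but the idea ports directly; here is an illustrative Python sketch (not the poster's actual code) of a tag filter plus markup compressor:

```python
import re

def compress_html(html: str) -> str:
    """Crude tag filter + whitespace compressor; fine for small, simple pages."""
    # Drop script/style blocks entirely, then strip the remaining tags.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Collapse runs of spaces/tabs and excessive blank lines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

A real HTML parser (e.g., BeautifulSoup) is safer on messy markup, which matches the poster's caveat that this only works on pages that aren't too large or complex.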
-
I'm also facing a token-limit issue with my LLMExtractionStrategy. Can I use LLMContentFilter prior to extraction? I want to understand the difference between the two, and whether that would resolve my problem. Also, I noticed that in the recent version (0.5.0.post1) this function is deprecated/renamed.

```python
import asyncio

from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CacheMode,
    CrawlerRunConfig,
    LLMExtractionStrategy,
)

from models.venue import Venue  # my Pydantic model providing the extraction schema

base_url = "https://example.com/products"  # placeholder; the real URL is set elsewhere

browser_config = BrowserConfig(
    browser_type="chromium",
    headless=False,
    verbose=True,
)

llm_strategy = LLMExtractionStrategy(
    provider="groq/deepseek-r1-distill-llama-70b",
    api_token="YOUR_API_KEY",  # API token for authentication
    schema=Venue.model_json_schema(),
    extraction_type="schema",
    chunk_token_threshold=1000,
    overlap_rate=0.0,
    apply_chunking=True,
    instruction=(
        "Extract all product objects from the provided content while ensuring proper categorization. "
        "The content consists of a structured yet irregular table where:\n\n"
        "... other instruction ..."
    ),  # Instructions for the LLM
    input_format="markdown",
    verbose=True,
    extra_args={
        "temperature": 0.0,
        "max_tokens": 1000,
        "top_p": 1.0,
        "frequency_penalty": 0.0,
        "presence_penalty": 0.0,
    },
)

session_id = "100_ppi_extraction"

async def main():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url=base_url,
            config=CrawlerRunConfig(
                extraction_strategy=llm_strategy,
                remove_forms=True,
                excluded_tags=["script", "style", "header", "footer", "nav"],
                cache_mode=CacheMode.BYPASS,
                css_selector="[class^='right fl']",
                session_id=session_id,
                log_console=True,
            ),
        )

asyncio.run(main())
```

Any help will work, thanks.
-
I'm working with an LLM API and need to send Markdown-formatted content as part of the request. However, some Markdown documents are quite large, exceeding token limits.
Is there a recommended way to compress Markdown while preserving its structure and readability for an LLM? Would stripping unnecessary whitespace, abbreviating common phrases, or using a specific compression algorithm be effective?
Any best practices or existing tools for handling this efficiently?