Efficiently Compressing Markdown for LLM Requests #596
-
@FractalMind First of all, I like your username 😎. For your question: there are some interesting text-processing steps you can apply before sending the markdown to a large language model, and Crawl4AI has tools for that.

First, use a content filter, which reduces noise in the markdown and gives you a clean, linear document. One is the PruningContentFilter, and the second is the BM25ContentFilter. Usually, when the markdown you extract is too large for your model, it means you are injecting too much unnecessary information. That's why you need a mix of techniques to reduce the size: not just removing the noise, but also removing the parts of the text that do not carry the information you need. The pruning filter removes the noise and gives you lean markdown. BM25 is the magical secret: given a query or a set of keywords, it extracts the parts of the markdown that are relevant to your query and questions. You then take that smaller markdown and use it for your text processing or anything else you want to do with it. That's why you should combine heuristics like BM25 with other approaches before passing content to large language models, or to your embedding models if you are using embedders.

Another technique is chunking. Crawl4AI provides a few different chunkers, so you can chunk the generated markdown, apply BM25 with the keywords you have to re-rank those chunks, and then pick the top chunks, or enough chunks that the total length fits your context window's token limit. So there are different ways you can do this. Since you asked, maybe in the next release I will add a helper function for it, because it seems very useful to me, and I am a big fan of markdown; that's why I even started the library. @aravindkarnam this is a very good topic for a tutorial, or a video I could create.
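A minimal sketch of the two content filters described above, assuming a recent Crawl4AI release where `PruningContentFilter` and `BM25ContentFilter` live in `crawl4ai.content_filter_strategy` and the filtered result is exposed as `fit_markdown` (older releases expose it on `result.markdown_v2` instead):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter, PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Pruning: drop low-signal blocks (nav, boilerplate) via text-density heuristics.
    prune_filter = PruningContentFilter(threshold=0.48, threshold_type="fixed")

    # BM25: keep only the sections relevant to a query or keyword list.
    bm25_filter = BM25ContentFilter(user_query="pricing plans API rate limits")

    # Swap bm25_filter in for prune_filter to filter by relevance instead of noise.
    config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(content_filter=prune_filter),
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print(result.markdown.fit_markdown)  # the lean, filtered markdown

asyncio.run(main())
```

And a sketch of the chunk-and-re-rank step, using the third-party `rank_bm25` package and a naive paragraph split standing in for Crawl4AI's built-in chunkers; the character budget is a crude stand-in for a real token count:

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def top_chunks(markdown: str, query: str, max_chars: int = 8000) -> str:
    """Re-rank paragraph chunks by BM25 and keep the best ones that fit a budget."""
    chunks = [c for c in markdown.split("\n\n") if c.strip()]
    bm25 = BM25Okapi([c.lower().split() for c in chunks])
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)

    kept, used = [], 0
    for _, chunk in ranked:
        if used + len(chunk) > max_chars:
            break
        kept.append(chunk)
        used += len(chunk)
    return "\n\n".join(kept)
```

In practice you would measure chunks with your model's tokenizer rather than `len()`, but the shape of the pipeline stays the same: filter, chunk, re-rank, keep the top chunks.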
-
This is my C# solution: it uses a simple HTML tag filter and markup compressor, nothing fancy. It works on most websites if they're not too large.
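The C# attachment isn't reproduced here, but the idea ports directly; here is an illustrative Python sketch (not the poster's actual code) of a tag filter plus markup compressor:

```python
import re

def compress_html(html: str) -> str:
    """Crude tag filter + whitespace compressor; fine for small, simple pages."""
    # Drop script/style blocks entirely, then strip the remaining tags.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Collapse runs of spaces/tabs and excessive blank lines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

A real HTML parser (e.g., BeautifulSoup) is safer on messy markup, which matches the poster's caveat that this only works on pages that aren't too large or complex.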
-
I'm also facing a token-limit issue with my LLMExtractionStrategy. Can I use LLMContentFilter prior to extraction? I want to understand the difference between the two, and whether that would resolve my problem. Also, I noticed that in the recent version (0.5.0.post1) this function is deprecated/renamed.

```python
import asyncio

from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CacheMode,
    CrawlerRunConfig,
    LLMExtractionStrategy,
)

from models.venue import Venue  # my Pydantic model providing the extraction schema

base_url = "https://example.com/products"  # placeholder; the real URL is set elsewhere

browser_config = BrowserConfig(
    browser_type="chromium",
    headless=False,
    verbose=True,
)

llm_strategy = LLMExtractionStrategy(
    provider="groq/deepseek-r1-distill-llama-70b",
    api_token="YOUR_API_KEY",  # API token for authentication
    schema=Venue.model_json_schema(),
    extraction_type="schema",
    chunk_token_threshold=1000,
    overlap_rate=0.0,
    apply_chunking=True,
    instruction=(
        "Extract all product objects from the provided content while ensuring proper categorization. "
        "The content consists of a structured yet irregular table where:\n\n"
        "... other instruction ..."
    ),  # Instructions for the LLM
    input_format="markdown",
    verbose=True,
    extra_args={
        "temperature": 0.0,
        "max_tokens": 1000,
        "top_p": 1.0,
        "frequency_penalty": 0.0,
        "presence_penalty": 0.0,
    },
)

session_id = "100_ppi_extraction"

async def main():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url=base_url,
            config=CrawlerRunConfig(
                extraction_strategy=llm_strategy,
                remove_forms=True,
                excluded_tags=["script", "style", "header", "footer", "nav"],
                cache_mode=CacheMode.BYPASS,
                css_selector="[class^='right fl']",
                session_id=session_id,
                log_console=True,
            ),
        )

asyncio.run(main())
```

Any help will work, thanks.
-
I'm working with an LLM API and need to send Markdown-formatted content as part of the request. However, some Markdown documents are quite large, exceeding token limits.
Is there a recommended way to compress Markdown while preserving its structure and readability for an LLM? Would stripping unnecessary whitespace, abbreviating common phrases, or using a specific compression algorithm be effective?
Any best practices or existing tools for handling this efficiently?