Performance issue with regex in HtmlConverterCoreNodeRenderer #633

Open
@praveen-diffbot

Description

HtmlConverterCoreNodeRenderer.handleTableCell calls String.replaceAll("\\s*\n\\s*", " "), which can be quite slow on large inputs. The pattern is simple enough that the call can be replaced with a plain character loop that avoids the regex engine entirely.
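For illustration, here is a rough sketch of what a regex-free replacement could look like. It is not taken from the library; the class name NewlineCollapser and the method collapse are hypothetical. It assumes the default (non-Unicode) meaning of \s in java.util.regex, i.e. space, tab, newline, vertical tab, form feed and carriage return. Under that assumption, the regex replaces each maximal whitespace run containing at least one \n with a single space and leaves other runs untouched, which a single pass can do in linear time.

```java
public final class NewlineCollapser {
    private NewlineCollapser() {}

    // Assumption: mirrors the default \s class of java.util.regex,
    // i.e. [ \t\n\x0B\f\r]. This is NOT the same set as Character.isWhitespace.
    private static boolean isRegexSpace(char c) {
        return c == ' ' || c == '\t' || c == '\n'
            || c == 0x0B || c == '\f' || c == '\r';
    }

    // Hypothetical stand-in for s.replaceAll("\\s*\n\\s*", " ").
    public static String collapse(String s) {
        final int n = s.length();
        final StringBuilder out = new StringBuilder(n);
        int i = 0;
        while (i < n) {
            final char c = s.charAt(i);
            if (!isRegexSpace(c)) {
                out.append(c);
                i++;
                continue;
            }
            // Consume the whole whitespace run, noting whether it holds a newline.
            final int start = i;
            boolean sawNewline = false;
            while (i < n && isRegexSpace(s.charAt(i))) {
                if (s.charAt(i) == '\n') {
                    sawNewline = true;
                }
                i++;
            }
            if (sawNewline) {
                out.append(' ');          // run contains '\n': collapse to one space
            } else {
                out.append(s, start, i);  // no newline: keep the run unchanged
            }
        }
        return out.toString();
    }
}
```

Each character is visited a constant number of times, so long whitespace runs without a newline (which I suspect are what makes the regex backtrack so much) no longer cost quadratic time. I have not benchmarked this against test.html.txt.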

To Reproduce

See attached file test.html.txt

public class LoadingTest {
  public static void main(final String[] args) throws Exception {
    // Load the attached sample document.
    final String html = java.nio.file.Files.readString(java.nio.file.Path.of("test.html.txt"));
    // Time a single conversion, in wall-clock milliseconds.
    final long tic = System.currentTimeMillis();
    com.diffbot.websearch.html.MarkdownNormalizer.markdown(html);
    System.out.println("took: " + (System.currentTimeMillis() - tic));
  }
}

Expected behavior
The code takes >4000 ms to run on my laptop.

took: 4024

It should take much less time.
