Improve Narrative Section Extraction

Right now, a `SECSection` regex is used to identify the matching table-of-contents (TOC) entry in [get_section_narrative](https://github.com/Unstructured-IO/pipeline-sec-filings/blob/146d339c8f98bb1bdfaffa5a25f7ac8bb763a531/test_real_docs/test_real_examples.py#L60). That generally works well. The text of the matching TOC title is then used to locate the section in the document content. However, rather than sticking with the original regex, 10-Ks and 10-Qs ultimately use a more lenient match condition, [match_10k_toc_title_to_section](https://github.com/Unstructured-IO/pipeline-sec-filings/blob/146d339/prepline_sec_filings/sec_document.py#L366). The better approach is likely to stick with the original matching regex.

The lenient post-TOC match is why the [EHC test](https://github.com/Unstructured-IO/pipeline-sec-filings/blob/146d339c8f98bb1bdfaffa5a25f7ac8bb763a531/test_real_docs/test_real_examples.py#L60) fails for the BUSINESS section, and may be the reason for other failures as well.
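The difference between the two strategies can be sketched as follows. This is a minimal illustration, not the repository's code: `BUSINESS_PATTERN`, `lenient_match`, `strict_match`, `find_section_start`, and the sample titles are all hypothetical stand-ins, and the prefix-based lenient check is only an assumed failure mode, not the confirmed cause of the EHC failure.

```python
import re
from typing import Callable, List, Optional

# Hypothetical stand-in for the BUSINESS section regex; the real
# SECSection pattern in the repository may differ.
BUSINESS_PATTERN = re.compile(r"^(item\s+1\.?\s*)?business\.?$", re.IGNORECASE)


def lenient_match(toc_title: str, candidate: str) -> bool:
    """Assumed lenient check: the candidate only has to start with the TOC title."""
    return candidate.strip().lower().startswith(toc_title.strip().lower())


def strict_match(pattern: "re.Pattern[str]", candidate: str) -> bool:
    """Reuse the same regex that identified the TOC entry."""
    return pattern.match(candidate.strip()) is not None


def find_section_start(
    titles: List[str], matcher: Callable[[str], bool]
) -> Optional[int]:
    """Return the index of the first title accepted by matcher, or None."""
    for index, title in enumerate(titles):
        if matcher(title):
            return index
    return None


# Hypothetical titles found in the document content after the TOC.
titles = [
    "Business Combinations and Divestitures",
    "Item 1. Business",
    "Item 1A. Risk Factors",
]
toc_title = "Business"

lenient_index = find_section_start(titles, lambda t: lenient_match(toc_title, t))
strict_index = find_section_start(titles, lambda t: strict_match(BUSINESS_PATTERN, t))
```

In this sketch the lenient check accepts the unrelated "Business Combinations and Divestitures" title first, while reusing the original regex skips it and lands on "Item 1. Business".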

Definition of Done

* Section extraction logic is updated so that fewer tests are marked as xfail, in particular the EHC case mentioned above.


Improve Narrative Section Extraction #69
