PDF parsing routines #756

@stucka

Notes dump:

OK, this is going to be messy. The higher-level overview:
We get lists of strings from the PDF; each list is ostensibly one PDF row.

Some of these lists are going to be headers. The headers, of course, need to be detected initially.

But headers can also repeat across pages, so we need to detect those repeats too.

To add to the fun, each of these rows from the PDF may be just part of another logical row,
from when cells are divided horizontally to hold multiple data points.

We need to detect those fragmentary lines, mostly by checking to see if most cells are empty.
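
A minimal sketch of that check, assuming each row is a list of strings (or `None` for an empty cell). The name `is_fragment` and the 0.5 threshold are made up for illustration, not anything settled:

```python
def is_fragment(row, max_filled_ratio=0.5):
    """Guess whether a PDF row is a fragment: not blank, but mostly empty cells."""
    if not row:
        return False
    # Strip only for counting; the row itself is left untouched.
    filled = sum(1 for cell in row if cell and cell.strip())
    if filled == 0:
        return False  # entirely blank rows are a separate case
    return filled / len(row) < max_filled_ratio
```

The threshold would probably need tuning per PDF, since a legitimately sparse data row looks the same as a fragment to this test.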

If they're a fragment of a header, we need to track it somehow and build a structure to hold the fragment.
And remember header fragments may occur on multiple pages with multipage headers.
That means we need to build an initial structure to hold the headers, then skip some rows if we see the header again.
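
One possible shape for this, as a rough sketch: assume the header occupies the first `header_row_count` PDF rows (fragments included), remember a normalized copy of those rows, and skip any later row that matches one of them. `split_headers` and `header_row_count` are hypothetical names, and knowing the header height up front is an assumption:

```python
def normalize(row):
    """Comparison key for a row: stripped, lowercased cells."""
    return tuple((cell or "").strip().lower() for cell in row)


def split_headers(rows, header_row_count=1):
    """Return (header_rows, data_rows), dropping repeats of the header rows."""
    header_rows = rows[:header_row_count]
    seen = {normalize(row) for row in header_rows}
    data_rows = [row for row in rows[header_row_count:] if normalize(row) not in seen]
    return header_rows, data_rows
```

Because each header row is remembered separately, a multi-row header that repeats on page 2 gets skipped fragment by fragment.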

For non-header fragments, we need to append the data to the previous line in an appropriate data structure.
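
A sketch of the append step, assuming the fragment's cells line up by position with the previous row's cells. `merge_fragment` is a hypothetical helper; joining with a newline is a guess that keeps the pieces separable later:

```python
def merge_fragment(previous, fragment, sep="\n"):
    """Return a copy of `previous` with each non-empty fragment cell appended."""
    # Pad in case the fragment has more cells than the previous row.
    merged = list(previous) + [""] * max(0, len(fragment) - len(previous))
    for i, cell in enumerate(fragment):
        if cell and cell.strip():
            merged[i] = f"{merged[i]}{sep}{cell}" if merged[i] else cell
    return merged
```

For example, `merge_fragment(["Acme Corp", "Reno", "50"], ["", "NV 89501", ""])` gives `["Acme Corp", "Reno\nNV 89501", "50"]`.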

But wait! There's more!

PDF data tends to be really dirty, with lots of junky white space.

Some people will use multiline data to show multiple data points in a single cell, such as Company name<newline>City, State ZIP.
If we strip off white space, we lose a way to segregate and process that data later. So we can't clean it up until later.
The exception is fragment detection: to know a row is fragmentary we have to count its empty cells, and stray white space will wreck the count.

And of course lots of rows are entirely white space, just blank data rows left in a PDF. Those we just drop.

To sum up:
Just about every PDF row can be one of:
A full header row
A fragmentary header
A full data row
A fragmentary data row
A blank row
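
The five cases above could be sketched as a small classifier. Everything here is an assumption for illustration: the `RowKind` enum, the idea that a row counts as header-like when every filled cell matches a known header string, and the 0.5 cutoff for "fragmentary":

```python
from enum import Enum


class RowKind(Enum):
    HEADER = "full header row"
    HEADER_FRAGMENT = "fragmentary header"
    DATA = "full data row"
    DATA_FRAGMENT = "fragmentary data row"
    BLANK = "blank row"


def classify(row, header_cells, max_filled_ratio=0.5):
    """Classify one PDF row; `header_cells` is a set of lowercased header strings."""
    filled = [cell.strip().lower() for cell in row if cell and cell.strip()]
    if not filled:
        return RowKind.BLANK
    is_header = all(cell in header_cells for cell in filled)
    is_partial = len(filled) / len(row) < max_filled_ratio
    if is_header:
        return RowKind.HEADER_FRAGMENT if is_partial else RowKind.HEADER
    return RowKind.DATA_FRAGMENT if is_partial else RowKind.DATA
```

A data cell that happens to equal a header string would fool this, so a real version likely needs more context, such as position on the page.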

We need many little trackers to go through here and figure out what we're looking at.

We need code to clean up whitespace in cells and rows.
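
A possible cleanup sketch that keeps newlines inside a cell by default, since those may separate data points, and collapses everything only when asked. `clean_cell` and `clean_row` are hypothetical names:

```python
import re


def clean_cell(cell, keep_newlines=True):
    """Collapse runs of white space; optionally preserve line breaks in the cell."""
    if cell is None:
        return ""
    if keep_newlines:
        lines = [re.sub(r"[ \t\r\f\v]+", " ", line).strip() for line in cell.split("\n")]
        return "\n".join(line for line in lines if line)
    return re.sub(r"\s+", " ", cell).strip()


def clean_row(row, keep_newlines=True):
    """Apply clean_cell to every cell in a row."""
    return [clean_cell(cell, keep_newlines) for cell in row]
```

With the default, `"  Acme   Corp \n  Reno,  NV 89501 "` becomes `"Acme Corp\nReno, NV 89501"`.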

We need a function to delete rows with fewer than a certain number of data points (e.g., contents of a summary table).
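
Something like this might do, assuming "data points" means non-blank cells. `drop_sparse_rows` and the default of 3 are placeholders:

```python
def drop_sparse_rows(rows, min_filled=3):
    """Keep only rows that have at least `min_filled` non-blank cells."""
    return [
        row
        for row in rows
        if sum(1 for cell in row if cell and cell.strip()) >= min_filled
    ]
```

This would have to run after fragments are merged into their parent rows; otherwise the fragments themselves get thrown away as sparse.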

We need a function that allows us to standardize header names.
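
A sketch of one way to do it: reduce each header to a lowercase, underscore-separated key, then optionally map known variants to a preferred name. `standardize_header` and the `aliases` mapping are hypothetical:

```python
import re


def standardize_header(name, aliases=None):
    """Turn a raw header cell into a snake_case key, then apply optional aliases."""
    key = re.sub(r"[^a-z0-9]+", "_", (name or "").lower()).strip("_")
    return (aliases or {}).get(key, key)
```

So `standardize_header("Company\nName ")` gives `"company_name"`, and passing `{"of_employees": "employees"}` as `aliases` maps `"# of Employees"` to `"employees"`.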

We probably want code that tells us which PDF each record was pulled from, and from which row.
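
A minimal sketch of carrying that provenance along, assuming we tag the raw rows before any merging or dropping so the row numbers still point at the original PDF rows. `tag_rows` and the dictionary keys are made up for illustration:

```python
def tag_rows(rows, source_pdf):
    """Wrap each raw row with the PDF it came from and its 1-based row number."""
    return [
        {"source_pdf": source_pdf, "source_row": number, "cells": row}
        for number, row in enumerate(rows, start=1)
    ]
```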
