extract_tables(): multiline cells and artifacts when parsing tables (stream) #170

@MoyizGIT

Description

When I extract tables from a PDF file, some cells contain multiple lines of text. For example, the name of a chemical substance can be very long, or contain synonyms, annotations, or comments. The problem is that the extract_tables() function reads each line of the PDF as a distinct row in the table, even though some lines are just the continuation of a cell. This causes errors: a substance is sometimes split across multiple rows, or information ends up associated with the wrong substance.
Here is how the table looks after extraction:
Image

Reproducible example

I can't provide the real data that I use.

library(tabulapdf)

# I can't provide the data for now
f <- system.file("examples", "data.pdf", package = "tabulapdf")

t1 <- extract_tables(f, pages = 2, guess = FALSE, method = "stream", output = "tibble")

Expected result

Here is what I expect to get when I use the function:

Image

#########
I created two alternative functions to fix the multiline-cell problem and one function to remove the artifacts. But I think it would be better to fix this directly in the tabulapdf package, in the extract_tables() function.
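
To make the multiline-cell problem concrete, here is a minimal base R sketch of one possible post-processing step. It is not the code mentioned above and not part of tabulapdf: `merge_continuation_rows()` is a hypothetical helper, the sample rows are made up, and it assumes a continuation row can be recognised by an empty or NA key column.

```r
# Hypothetical helper (not part of tabulapdf): fold continuation rows into
# the row above them. Assumption: a row is a continuation when its key
# column is NA or empty.
merge_continuation_rows <- function(df, key = 1, sep = " ") {
  key_vals <- df[[key]]
  is_start <- !(is.na(key_vals) | trimws(as.character(key_vals)) == "")
  if (nrow(df) > 0) is_start[1] <- TRUE  # the first row always starts a record
  group <- cumsum(is_start)              # one id per logical table row

  # Paste the non-empty fragments of one column within one logical row.
  collapse <- function(x) {
    x <- trimws(as.character(x))
    x <- x[!is.na(x) & x != ""]
    if (length(x) == 0) NA_character_ else paste(x, collapse = sep)
  }

  merged <- lapply(df, function(col) {
    unname(vapply(split(col, group), collapse, character(1)))
  })
  as.data.frame(merged, stringsAsFactors = FALSE)
}

# Made-up rows imitating the kind of split described above.
raw <- data.frame(
  id        = c("1", NA, "2", "3", NA),
  substance = c("Sodium", "chloride", "Ethanol", "Acetic", "acid (glacial)"),
  cas       = c("7647-14-5", NA, "64-17-5", "64-19-7", NA),
  stringsAsFactors = FALSE
)

fixed <- merge_continuation_rows(raw, key = "id")
print(fixed)
```

This sketch returns every column as character and only works when the key column really is empty on continuation lines. If the continuation text can land in any column, or appears above the line that carries the key, a different rule is needed, which is part of why a fix inside extract_tables() would be preferable.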

Session info

sessionInfo()
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=French_France.utf8 LC_CTYPE=French_France.utf8 LC_MONETARY=French_France.utf8
[4] LC_NUMERIC=C LC_TIME=French_France.utf8

time zone: Europe/Paris
tzcode source: internal

attached base packages:
[1] stats graphics grDevices datasets utils methods base

other attached packages:
[1] shinycssloaders_1.1.0 DT_0.33 writexl_1.5.1 readxl_1.4.3 tabulapdf_1.0.5-5
[6] lubridate_1.9.4 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.4
[11] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0
[16] pdftools_3.4.1 shiny_1.10.0

loaded via a namespace (and not attached):
[1] shinyalert_3.1.0 sass_0.4.9 generics_0.1.3 renv_1.1.0 stringi_1.8.4 hms_1.1.3
[7] digest_0.6.37 magrittr_2.0.3 evaluate_1.0.3 timechange_0.3.0 grid_4.3.2 fastmap_1.2.0
[13] cellranger_1.1.0 jsonlite_1.8.9 promises_1.3.2 crosstalk_1.2.1 scales_1.3.0 jquerylib_0.1.4
[19] cli_3.6.4 rlang_1.1.5 shinythemes_1.2.0 munsell_0.5.1 yaml_2.3.10 withr_3.0.2
[25] cachem_1.1.0 tools_4.3.2 tzdb_0.4.0 memoise_2.0.1 colorspace_2.1-1 httpuv_1.6.15
[31] vctrs_0.6.5 R6_2.6.0 mime_0.12 png_0.1-8 lifecycle_1.0.4 htmlwidgets_1.6.4
[37] fontawesome_0.5.3 pkgconfig_2.0.3 rJava_1.0-11 pillar_1.10.1 bslib_0.9.0 later_1.4.1
[43] gtable_0.3.6 glue_1.8.0 Rcpp_1.0.14 xfun_0.50 tidyselect_1.2.1 knitr_1.49
[49] rstudioapi_0.17.1 xtable_1.8-4 htmltools_0.5.8.1 qpdf_1.3.4 compiler_4.3.2 askpass_1.2.1
