Prework
- Read and agree to the code of conduct and contributing guidelines.
Description
When I extract tables from a PDF file, some cells contain multiple lines of text. For example, the name of a chemical substance can be very long, or contain synonyms, annotations, or comments. The problem is that the extract_tables() function reads each line of the PDF as a distinct row in the table, even though some lines are just the continuation of a cell. This causes errors: a substance is sometimes split across multiple rows, or information ends up associated with the wrong entry.
Here is how the table looks after extraction:
Reproducible example
I can't provide the real data that I use.
library(tabulapdf)
# I can't provide the data for now
f <- system.file("examples", "data.pdf", package = "tabulapdf")
t1 <- extract_tables(f, pages = 2, guess = FALSE, method = "stream", output = "tibble")
Expected result
Here is what I expect to get when I use the function:
#########
I created two alternative functions to fix the multiline cell problem and one function to handle artifacts, but I think it would be better to fix this directly in the tabulapdf package or in the extract_tables() function.
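The workaround functions mentioned above are not shown in this issue. As a rough illustration of the kind of post-processing involved, here is a minimal base R sketch. It is not the author's code and not part of tabulapdf: `merge_continuation_rows()` is a hypothetical helper, the toy data is made up, and it assumes a continuation row can be recognized by an empty or NA value in a key column, which may not hold for every PDF.

```r
# Hypothetical helper (not part of tabulapdf): merge continuation rows into
# the record that precedes them.
# Assumption: a row is a continuation when its key column is NA or empty.
merge_continuation_rows <- function(df, key_col = 1, sep = " ") {
  key <- df[[key_col]]
  is_start <- !is.na(key) & trimws(as.character(key)) != ""
  if (nrow(df) == 0 || !is_start[1]) {
    stop("the first row must start a record (non-empty key column)")
  }
  # Each record start opens a new group; continuation rows inherit it.
  group <- cumsum(is_start)
  pieces <- lapply(split(df, group), function(rows) {
    merged <- lapply(rows, function(col) {
      col <- trimws(as.character(col))
      col <- col[!is.na(col) & col != ""]
      # Join the non-empty fragments of the cell; keep NA if there are none.
      if (length(col) == 0) NA_character_ else paste(col, collapse = sep)
    })
    as.data.frame(merged, stringsAsFactors = FALSE, check.names = FALSE)
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out  # note: all columns are returned as character
}

# Made-up data imitating a table where long names spill onto extra rows.
raw <- data.frame(
  id = c("1", NA, "2", "3", ""),
  substance = c("Sodium", "chloride", "Ethanol", "Acetic acid", "(glacial)"),
  stringsAsFactors = FALSE
)
clean <- merge_continuation_rows(raw, key_col = "id")
print(clean)
```

The sketch could be applied to each table returned by extract_tables(), but it only helps when at least one column is reliably filled on the first line of every record. Doing the equivalent inside the package would avoid that restriction.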
Session info
sessionInfo()
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=French_France.utf8 LC_CTYPE=French_France.utf8 LC_MONETARY=French_France.utf8
[4] LC_NUMERIC=C LC_TIME=French_France.utf8
time zone: Europe/Paris
tzcode source: internal
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] shinycssloaders_1.1.0 DT_0.33 writexl_1.5.1 readxl_1.4.3 tabulapdf_1.0.5-5
[6] lubridate_1.9.4 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.4
[11] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0
[16] pdftools_3.4.1 shiny_1.10.0
loaded via a namespace (and not attached):
[1] shinyalert_3.1.0 sass_0.4.9 generics_0.1.3 renv_1.1.0 stringi_1.8.4 hms_1.1.3
[7] digest_0.6.37 magrittr_2.0.3 evaluate_1.0.3 timechange_0.3.0 grid_4.3.2 fastmap_1.2.0
[13] cellranger_1.1.0 jsonlite_1.8.9 promises_1.3.2 crosstalk_1.2.1 scales_1.3.0 jquerylib_0.1.4
[19] cli_3.6.4 rlang_1.1.5 shinythemes_1.2.0 munsell_0.5.1 yaml_2.3.10 withr_3.0.2
[25] cachem_1.1.0 tools_4.3.2 tzdb_0.4.0 memoise_2.0.1 colorspace_2.1-1 httpuv_1.6.15
[31] vctrs_0.6.5 R6_2.6.0 mime_0.12 png_0.1-8 lifecycle_1.0.4 htmlwidgets_1.6.4
[37] fontawesome_0.5.3 pkgconfig_2.0.3 rJava_1.0-11 pillar_1.10.1 bslib_0.9.0 later_1.4.1
[43] gtable_0.3.6 glue_1.8.0 Rcpp_1.0.14 xfun_0.50 tidyselect_1.2.1 knitr_1.49
[49] rstudioapi_0.17.1 xtable_1.8-4 htmltools_0.5.8.1 qpdf_1.3.4 compiler_4.3.2 askpass_1.2.1