extract_text jumbles text order #132

@gpilgrim2670

Please specify whether your issue is about:

  • a possible bug
  • a question about package functionality
  • a suggested code or documentation change, improvement to the code, or feature request

Put your code here:

## rJava loads successfully
# install.packages("rJava")
library("rJava")

## load package
library("tabulizer")

## code goes here
file <- "https://cdn.swimswam.com/wp-content/uploads/2019/03/D3.NCAA-2013.pdf" # source pdf
raw <- extract_text(file) # text from file is read in, but order is jumbled

# to make this clearer, split `raw` into lines
raw_results <- unlist(strsplit(raw, "\n", fixed = TRUE))
raw_results[10]
# [1] "Williams SR1 4:47.16Wilson, Caroline"
# same order of text as in raw, just a smaller piece to make viewing easier

# should instead be this (perhaps with different whitespace)
# can check by viewing the file at the link provided, first column
# [1] "1 Wilson, Caroline SR Williams 4:47.16"
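
To get a rough sense of how many lines come back in the wrong order, a check along these lines might help. This is only a minimal base R sketch: `starts_with_place` is a hypothetical helper (not part of tabulizer), and the pattern assumes that a correctly ordered result row begins with a place number followed by "Last, First", as in the expected line above.

```r
# Hypothetical helper, not part of tabulizer: TRUE when a line begins with a
# place number followed by "Last, First" (the order seen in the source PDF).
starts_with_place <- function(lines) {
  grepl("^[0-9]+ [A-Za-z' -]+, [A-Za-z' -]+", lines)
}

observed <- "Williams SR1 4:47.16Wilson, Caroline"    # raw_results[10] as reported
expected <- "1 Wilson, Caroline SR Williams 4:47.16"  # order shown in the PDF

# FALSE for the observed (jumbled) line, TRUE for the expected one
starts_with_place(c(observed, expected))
```

Applied to the real data, `sum(!starts_with_place(raw_results))` would count the lines that do not match. That count would also include headers and other non-result lines, so it is an upper bound rather than an exact number of jumbled rows.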

session info for your system

R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] SwimmeR_0.7.2 tabulizer_0.2.2

loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 pillar_1.4.6 compiler_4.0.2 pdftools_2.3.1 remotes_2.2.0 prettyunits_1.1.1 tools_4.0.2
[8] testthat_2.3.2 digest_0.6.27 packrat_0.5.0 pkgbuild_1.1.0 pkgload_1.1.0 memoise_1.1.0 lifecycle_0.2.0
[15] tibble_3.0.4 pkgconfig_2.0.3 png_0.1-7 rlang_0.4.8 cli_2.1.0 rstudioapi_0.11 xfun_0.18
[22] rJava_0.9-13 httr_1.4.2 xml2_1.3.2 knitr_1.30 stringr_1.4.0 roxygen2_7.1.1 withr_2.3.0
[29] dplyr_1.0.2 hms_0.5.3 askpass_1.1 fs_1.5.0 generics_0.0.2 desc_1.2.0 vctrs_0.3.4
[36] devtools_2.3.2 rprojroot_1.3-2 tidyselect_1.1.0 glue_1.4.2 qpdf_1.1 R6_2.5.0 processx_3.4.4
[43] fansi_0.4.1 sessioninfo_1.1.1 readr_1.4.0 purrr_0.3.4 callr_3.5.1 magrittr_1.5 usethis_1.6.3
[50] tabulizerjars_1.0.1 backports_1.1.10 ps_1.4.0 ellipsis_0.3.1 rvest_0.3.6 assertthat_0.2.1 stringi_1.5.3
[57] crayon_1.3.4

