PDF scraping #8

calthaus · 2023-09-19T14:51:02Z

calthaus
Sep 19, 2023
Collaborator

The objective of this task is to evaluate tools for extracting text, tables, and images from PDF reports, develop a roadmap for further development of this workflow, and write vignettes of existing solutions.

Packages:

tabulizer
pdftools
tesseract
Not all packages are on CRAN

Dependencies:

Some packages (tabulizer) require Java

Other tools:

ChatGPT-like services

Data:

Selection of publicly available PDF reports with different levels of complexity to extract data (e.g., numbers, time series, epidemic curves, maps, barplots, etc.)

Obstacles:

Financial barriers (subscription fees)
Technical barriers (no interface to R)

Outputs:

Wrapper for existing packages customized to different types of situation reports
Vignettes (could be used to extend the "First 100 Lines of Code" document from the earlier Epiverse workshop)
Guideline document for publishing PDF reports that contain easily extractable data
Roadmap for further development

adamkucharski · 2023-09-19T15:27:27Z

adamkucharski
Sep 19, 2023
Collaborator

If useful, I previously put together a script that attempted PDF extraction: https://github.com/adamkucharski/scrapR – not very elegant, and requires some manual input, but some of the underlying libraries/functions may be relevant.

On the LLM side, GPT code interpreter might be an option, although the API is currently Python-only (e.g. https://github.com/shroominic/codeinterpreter-api), and a quick play around with their chat interface on a published paper suggests it struggles to identify the location and dimensions of a plot on a page without a lot of assistance.

0 replies

Bisaloo · 2023-09-19T15:29:05Z

Bisaloo
Sep 19, 2023
Collaborator

Another output would be a collection of "difficult cases", where current solutions fail or don't perform well. This could serve as a benchmark and test cases for potential future development.
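One way such a collection could be used as a benchmark is sketched below. This is only an illustration under stated assumptions: the `Case` structure, the `run_benchmark` function, and the three stand-in extractors are hypothetical names, not part of any existing package, and a real harness would call the actual tools instead of the stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Table = List[List[str]]


@dataclass
class Case:
    """One difficult case (hypothetical structure)."""
    name: str        # short label describing why the case is hard
    source: str      # path or identifier passed to each extractor
    expected: Table  # hand-checked ground truth


def run_benchmark(
    extractors: Dict[str, Callable[[str], Table]],
    cases: List[Case],
) -> Dict[str, Dict[str, str]]:
    """Run every extractor on every case and record pass / mismatch / error."""
    results: Dict[str, Dict[str, str]] = {}
    for tool, extract in extractors.items():
        results[tool] = {}
        for case in cases:
            try:
                ok = extract(case.source) == case.expected
                outcome = "pass" if ok else "mismatch"
            except Exception as err:  # a crash is itself a benchmark result
                outcome = f"error: {type(err).__name__}"
            results[tool][case.name] = outcome
    return results


# Stand-in extractors (assumptions for illustration only).
def perfect(source: str) -> Table:
    return [["Division", "Tests"], ["Dhaka", "21646"]]


def drops_row(source: str) -> Table:
    return [["Division", "Tests"]]


def crashes(source: str) -> Table:
    raise ValueError("unsupported layout")


cases = [
    Case(
        name="toy table",
        source="report.pdf",  # hypothetical path, never opened here
        expected=[["Division", "Tests"], ["Dhaka", "21646"]],
    )
]
report = run_benchmark(
    {"perfect": perfect, "drops_row": drops_row, "crashes": crashes}, cases
)
print(report)
```

Recording crashes as results rather than letting them stop the run keeps the failing cases visible, which is the point of the collection.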

0 replies

fitznerj · 2023-09-19T15:40:54Z

fitznerj
Sep 19, 2023
Maintainer

WHO is also working with a company called adapt on a tool to scrape websites, images, and PDFs. It has yet to be deployed, though.

0 replies

henryls1 · 2023-09-20T07:59:21Z

henryls1
Sep 20, 2023
Collaborator

Please use this repo for the time being: https://github.com/henryls1/pdfscraping

0 replies

PatriciaRose963 · 2023-09-21T09:37:41Z

PatriciaRose963
Sep 21, 2023
Maintainer

Link to the Google doc: https://docs.google.com/document/d/1ruPLqZWQFNZEPUtAXPacdw3O4Z8xbcIaCfa1UD0RGMg/edit
Link to repo: https://github.com/henryls1/pdfscraping

1 reply

mathiasleroy Sep 21, 2023

public repo https://github.com/mathiasleroy/pdfscraping

mathiasleroy · 2023-09-21T09:53:15Z

mathiasleroy
Sep 21, 2023

all code here : https://github.com/mathiasleroy/pdfscraping

0 replies

adamkucharski · 2023-09-21T10:31:00Z

adamkucharski
Sep 21, 2023
Collaborator

This is looking great, really nice to see everything compared side-by-side. A few comments from @sbfnk @martinamcm @rebeccanash @adamkucharski:

Suggested edits to Rmarkdown:

It would be useful to expand the summary table, to have a neat TLDR for users at the start of the document. In particular, it would be good to include:
- Measure of performance. Ideally quantitative (e.g. % of cells correctly parsed across the four test datasets), although a quick qualitative conclusion would also be helpful.
- Summary of ability to parse different languages
Would be clearer to have a summary of the four datasets at the start of the document (a table of tables?) with brief info on what they are and why they might be hard to extract, such as the types of data within (e.g. percentages, delineators for large numbers) as well as what isn't covered (e.g. strings of text). Then move on to the specific table used in the vignette illustrations.
Would suggest moving the data loading code/paths to the start of the Rmd, so a user can load all the test files at once, before moving through the analysis options. This would also allow users to test alternative methods in an easy way (and the above summary table would give a useful indication of how they can benchmark).
Could also include a 'Contribution' section making it clear how users could test an additional new method, e.g. what data to analyse, what metrics they should compare against.
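As a rough illustration of the quantitative measure suggested above (% of cells correctly parsed), here is a minimal sketch. It assumes that both the extracted table and a hand-checked reference are available as lists of rows of strings, and that cells are compared position by position. The function names and the toy data are hypothetical and not taken from the repo.

```python
from typing import List

Table = List[List[str]]


def normalise(cell: str) -> str:
    """Trim a cell; for numeric-looking cells also drop internal spaces,
    so '21 646' and '21646' compare as equal (space as thousands separator)."""
    compact = "".join(cell.split())
    digits = compact.replace(".", "", 1).replace("-", "", 1).replace("%", "", 1)
    return compact if digits.isdigit() else cell.strip()


def cell_accuracy(extracted: Table, truth: Table) -> float:
    """Percentage of reference cells reproduced at the same row and column."""
    total = sum(len(row) for row in truth)
    if total == 0:
        raise ValueError("reference table is empty")
    correct = 0
    for r, truth_row in enumerate(truth):
        if r >= len(extracted):
            continue  # missing rows count as all-wrong
        for c, truth_cell in enumerate(truth_row):
            if c < len(extracted[r]) and normalise(extracted[r][c]) == normalise(truth_cell):
                correct += 1
    return 100.0 * correct / total


# Toy example (made-up reference, one deliberately wrong cell).
truth = [["Division", "New tests"], ["Dhaka", "21646"], ["Khulna", "277"]]
extracted = [["Division", "New tests"], ["Dhaka", "21 646"], ["Khulna", "271"]]
score = cell_accuracy(extracted, truth)
print(f"{score:.1f}% of cells correct")
```

Positional comparison is a design choice: a tool that shifts a whole column is penalised for every shifted cell, which may or may not be the behaviour wanted for the summary table.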

Questions around next steps

Where will this be stored in the long-term? Could see it as a useful case study, potentially for an organisation like AppliedEpi to host and/or with comparison code hosted with WHO collaboratory? Epiverse can also host vignettes and blogs if useful for dissemination, but you may have other ideas!

0 replies

adamkucharski · 2023-09-21T12:55:19Z

adamkucharski
Sep 21, 2023
Collaborator

Action: @Bisaloo to reach out to rOpenSci about potential blog.

0 replies

Bisaloo · 2023-09-26T13:49:40Z

Bisaloo
Sep 26, 2023
Collaborator

Found these additional useful resources:

0 replies

adamkucharski · 2023-10-16T11:19:40Z

adamkucharski
Oct 16, 2023
Collaborator

I got access to GPT4 vision, so thought I'd give it a go – I gave it the example table in the above Google doc with this prompt 'Extract the values in this table and output in CSV form', and got this response:

Division,New Tests (Last 7 Days),New Tests (Last 8-14 Days),Tests/100K/Week,% change in new tests,% of new tests,Test Positivity,% change in test positivity,Test/Case
Barishal,123,167,1.3,-26.3%,0.5%,3.3%,30.8
Chattogram,1 883,2 404,5.6,-21.7%,7.2%,6.7%,-37%
Dhaka,21 646,21 089,50.2,2.6%,82.9%,3.5%,-37%,28.7
Khulna,277,397,1.5,-30.2%,1.1%,4.0%,-44%,25.2
Mymensingh,521,594,4.0,-12.3%,2.0%,5.2%,-48%,19.3
Rajshahi,584,903,2.7,-35.3%,2.2%,9.9%,-18%,10.1
Rangpur,284,807,1.5,-64.8%,1.1%,3.9%,42%,25.8
Sylhet,800,807,6.8,-0.9%,3.1%,2.4%,-51%,42.1
National,26 118,27 168,15.3,-3.9%,100.0%,3.9%,-38%,25.8

Looks pretty good, so maybe worth adding a mention to the GPT section of document? Obvious limitations are that it's not free, and can't run locally, but useful to show recent progress I think.
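A quick structural check can complement a visual read of output like this. The sketch below counts fields per row against the header, using the header and first three rows of the response quoted above. The function name is hypothetical, and the check only detects rows of the wrong width: it cannot say which cell is missing or whether the values are right.

```python
import csv
import io

# Header and first three rows, copied from the response quoted above.
gpt_output = """Division,New Tests (Last 7 Days),New Tests (Last 8-14 Days),Tests/100K/Week,% change in new tests,% of new tests,Test Positivity,% change in test positivity,Test/Case
Barishal,123,167,1.3,-26.3%,0.5%,3.3%,30.8
Chattogram,1 883,2 404,5.6,-21.7%,7.2%,6.7%,-37%
Dhaka,21 646,21 089,50.2,2.6%,82.9%,3.5%,-37%,28.7
"""


def rows_with_wrong_width(csv_text: str):
    """Return (first cell, field count) for rows whose width differs from the header."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, body = rows[0], rows[1:]
    return [(row[0], len(row)) for row in body if len(row) != len(header)]


short = rows_with_wrong_width(gpt_output)
print(short)  # [('Barishal', 8), ('Chattogram', 8)]
```

On these rows the header has 9 columns while the Barishal and Chattogram rows have 8 fields each, so at least one value was dropped from each of them.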

0 replies
