-
If useful, I previously put together a script that attempts PDF extraction (https://github.com/adamkucharski/scrapR). It's not very elegant and requires some manual input, but some of the underlying libraries/functions may be relevant. On the LLM side, GPT code interpreter might be an option, although the API is currently Python-only (e.g. https://github.com/shroominic/codeinterpreter-api), and a quick play around with their chat interface on a published paper suggests it struggles to identify the location and dimensions of the plot on a page without a lot of assistance.
-
Another output would be a collection of "difficult cases", where current solutions fail or don't perform well. This could serve as a benchmark and test cases for potential future development.
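As a rough illustration of how such a benchmark might be structured, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `DifficultCase` record, the `cell_accuracy` score, and the idea that each tool can be wrapped as a function taking a file path and returning a table as a list of rows. It is not taken from any of the repos linked in this thread.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Table = List[List[str]]


@dataclass
class DifficultCase:
    """One hand-checked benchmark entry (hypothetical structure)."""
    name: str        # short label for the failure mode, e.g. "merged_header"
    source: str      # path or URL of the PDF; a placeholder in this sketch
    expected: Table  # manually verified table contents


def cell_accuracy(expected: Table, actual: Table) -> float:
    """Fraction of expected cells reproduced exactly at the same position."""
    total = sum(len(row) for row in expected)
    if total == 0:
        return 1.0
    correct = 0
    for i, row in enumerate(expected):
        for j, cell in enumerate(row):
            in_bounds = i < len(actual) and j < len(actual[i])
            if in_bounds and actual[i][j].strip() == cell.strip():
                correct += 1
    return correct / total


def run_benchmark(
    cases: List[DifficultCase],
    extractors: Dict[str, Callable[[str], Table]],
) -> Dict[str, Dict[str, float]]:
    """Score every extractor on every case; a crash scores as an empty table."""
    results: Dict[str, Dict[str, float]] = {}
    for tool, extract in extractors.items():
        results[tool] = {}
        for case in cases:
            try:
                actual = extract(case.source)
            except Exception:
                actual = []  # treat a failed extraction as extracting nothing
            results[tool][case.name] = cell_accuracy(case.expected, actual)
    return results
```

Each real tool would need a thin wrapper that returns its output in this row/column form. Position-based exact matching is a deliberately strict choice; a tolerant score (e.g. ignoring shifted rows) may suit some cases better.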
-
WHO is also working on a tool, with a company called adapt, to scrape websites, images, and PDFs. It has not been deployed yet, though.
-
Please use this repo for the time being: https://github.com/henryls1/pdfscraping
-
All the code is here: https://github.com/mathiasleroy/pdfscraping
-
This is looking great, really nice to see everything compared side-by-side. A few comments from @sbfnk @martinamcm @rebeccanash @adamkucharski:
Suggested edits to Rmarkdown:
Questions around next steps:
-
Action: @Bisaloo to reach out to rOpenSci about a potential blog post.
-
Found these additional useful resources:
-
I got access to GPT4 vision, so thought I'd give it a go. I gave it the example table in the above Google doc with the prompt 'Extract the values in this table and output in CSV form', and got this response:
Looks pretty good, so maybe worth adding a mention to the GPT section of the document? Obvious limitations are that it's not free and can't run locally, but I think it's useful to show recent progress.
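If it helps when judging "pretty good" more systematically, a model's CSV reply can be parsed and compared cell by cell against a hand-checked version of the table. The sketch below uses only the Python standard library. The helper names `parse_csv_response` and `mismatches` are hypothetical, and it assumes the reply is plain CSV text, possibly wrapped in a markdown code fence. It does not call any model API.

```python
import csv
import io
from typing import List, Optional, Tuple

Table = List[List[str]]
FENCE = "`" * 3  # markdown code-fence marker that a chat reply may include


def parse_csv_response(response: str) -> Table:
    """Parse CSV text from a model reply, dropping any code-fence lines."""
    lines = [
        line
        for line in response.strip().splitlines()
        if not line.strip().startswith(FENCE)
    ]
    reader = csv.reader(io.StringIO("\n".join(lines)))
    # Strip stray whitespace around cells and skip blank rows.
    return [[cell.strip() for cell in row] for row in reader if row]


def mismatches(
    expected: Table, actual: Table
) -> List[Tuple[int, int, Optional[str], Optional[str]]]:
    """Return (row, col, expected, actual) for every differing cell.

    A cell missing on one side is reported as None on that side.
    """
    diffs = []
    for i in range(max(len(expected), len(actual))):
        exp_row = expected[i] if i < len(expected) else []
        act_row = actual[i] if i < len(actual) else []
        for j in range(max(len(exp_row), len(act_row))):
            e = exp_row[j] if j < len(exp_row) else None
            a = act_row[j] if j < len(act_row) else None
            if e != a:
                diffs.append((i, j, e, a))
    return diffs
```

An empty list from `mismatches` means the reply reproduces the reference table exactly. Anything else pinpoints the cells to check by eye, which matters because a vision model can return plausible-looking but wrong numbers.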
-
The objective of this task is to evaluate the tools available for extracting text, tables, and images from PDF reports, develop a roadmap for further development of this workflow, and write vignettes of existing solutions.
Packages:
Dependencies:
Other tools:
Data:
Obstacles:
Outputs: