fulltext/08-fetch.Rmd at master · ropensci-books/fulltext · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
# Fetch {#fetch}

The `ft_get` function makes it easy to fetch full text articles. There are a few different ways to use `ft_get`:

* Pass in only DOIs - leave `from` parameter `NULL`. This route will first query Crossref API for the publisher of the DOI, then we'll use the appropriate method to fetch full text from the publisher. If a publisher is not found for the DOI, then we'll throw back a message telling you a publisher was not found.
* Pass in DOIs (or other pub IDs) and use the `from` parameter. This route means we don't have to make an extra API call to Crossref (thus, this route is faster) to determine the publisher for each DOI. We go straight to getting full text based on the publisher.
* Use `ft_search()` to search for articles. Then pass that output to this function, which will use info in that object. This behaves the same as the previous option in that each DOI has publisher info so we know how to get full text for each DOI.

Note that some publishers are available through other data sources, e.g., through Entrez's Pubmed.

`ft_get` is a bit complicated. These are just some of the hurdles we're jumping over:

* Negotiating various user inputs, likely seeing new publishers we've not dealt with
* Dealing with authentication and trying to make it easier for users
* Users sometimes being at an IP address that has access to a publisher and sometimes not
* Caching results to avoid unnecessary downloads if the content has already been acquired

Thus, expect some hiccups here, and please do report problems, and if a certain publisher is not supported yet.

## Data formats {#fetch-data-formats}

You can specify whether you want PDF, XML or plaint text with the `type` parameter. It is sometimes ignored, sometimes used, depending on the data source. For certain data sources, they only accept one type. Details by data source/publisher:

- PLOS: pdf and xml
- Entrez: only xml
- eLife: pdf and xml
- Pensoft: pdf and xml
- arXiv: only pdf
- BiorXiv: only pdf
- Elsevier: pdf and plain
- Wiley: only pdf
- Peerj: pdf and xml
- Informa: only pdf
- FrontiersIn: pdf and xml
- Copernicus: pdf and xml
- Scientific Societies: only pdf
- Crossref: depends on the publisher
- other data sources/publishers: there are too many to cover here - will try to make a helper in the future for what is covered by different publishers

## How data is stored  {#how-data-is-stored}

This depends on what `backend` value you use. If you use the default (rds) we store all data in `.rds` files. These are binary compressed files that are specific to R. Because they are specific to R, you don't want to use this option if part of your downstream workflow is using another tool/programming language.

The three types are stored in differnt ways. xml and plain text are parsed to plain text then stored with whatever backend you choose. However, pdf is retrived as raw bytes and stored as such. Thus, we no longer write pdf files to disk. However, you can easily do that yourself with `ft_extract()` or yourself by using `pdftools::pdf_text` which accepts a file path to a pdf or raw bytes.

## Usage {#fetch-usage}

```{r}
library(fulltext)
```

List backends available

```{r}
ft_get_ls()
```

The simplest approach is passing a DOI directly to `ft_get`

```{r}
(res <- ft_get('10.1371/journal.pone.0086169'))
res$plos
```

You can pass many DOIs in at once

```{r}
(res <- ft_get(c('10.3389/fphar.2014.00109', '10.3389/feart.2015.00009')))
res$frontiersin
```

## Errors {#fetch-errors}


`ft_get()` for each article has an `error` slot.

If the `error` slot is `NULL` then there is no error.

If the `error` slot is not `NULL` there was an error, and the error message
will be a character string in that slot.

Possible errors include:

- An error reported by the web service
    - e.g. "Timeout was reached: Connection timed out after 10003 milliseconds"
- No link found
    - e.g. "no link found from Crossref" OR "has no link available"
- We attempted to fetch the article but the content type wasn't what was expected. In this case we skip to the next article.
    - e.g. "type was supposed to be `pdf`, but was `text/html; charset=UTF-8`"
- Weird uninformative errors
    - e.g. "Recv failure: Operation timed out" OR "Operation was aborted by an application callback"
- An error associated mostly with PLOS. PLOS gives DOIs for parts of articles, like figures, so it doesn't make sense to get full text of a figure.
    - e.g., "was not found or may be a DOI for a part of an article"

An additional top level slot called `errors` has a data.frame of all errors
from each article, like:

```r
res <- ft_get(c('10.7554/eLife.03032', '10.7554/eLife.aaaa'), from = "elife")
res$elife$errors
                   id                   error
1 10.7554/eLife.03032                    <NA>
2  10.7554/eLife.aaaa subscript out of bounds
```

Where the second DOI was invalid. Granted the error in the data.frame "subscript out of bounds"
isn't very informative, but we can work on that.

## Cleanup {#fetch-cleanup}

The above section about errors suggests that we often run into errors. When we run into errors
downloading full text we capture the error message, if there is one, and delete the file we
were trying to create. That is, we cleanup upon hitting an error such that you shouldn't end
up with blank files on your machine. Let us know if this isn't true in your case and we'll
get it fixed.

Note that even if you exit out of the all to `ft_get()` it should clean up a file
if it is not completey done creating it, so you shouldn't end up with bad files
if you exit out of the function when it's running.

## Internals {#fetch-internals}

What's going on under the hood in `ft_get()`? It goes like this:

* If you request data from a specific data source (use of `from`, only allowed for PLOS, Entrez, eLife, Pensoft, arXiv, BiorXiv, Elsevier and Wiley):
    * Grab publisher specific collector function
    * Each of these functions has specific code for that publisher to pull full text given an article identifier
* If you don't request a specific data source:
    * Guess which publisher the DOI comes from
    * If publisher discovered that we have plugins for
        * Get publisher specific collector function
    * If publisher not discovered
        * Ping the <https://ftdoi.org> API
        * If publisher found with ftdoi API:
            * Link for full text used from the ftdoi API response
        * If publisher not found with ftdoi API:
            * Attempt fetch via Crossref API - if links found we try those, some work and some don't

## Notes about specific data sources {#fetch-notes}

### Elsevier

When you don't have access to the full text of Elsevier articles they will
often still give you something, but it will sometimes be just metadata
of the paper or sometimes an abstract if you're lucky. When you go to extract
the text this will be rather obvious.