
Commit 8e52f43

Merge branch 'main' of github.com:howisonlab/software-mentions-dataset-analysis
2 parents cb3c364 + 87de29d commit 8e52f43

File tree

3 files changed (+139 −10 lines)


.gitignore

Lines changed: 0 additions & 1 deletion
```diff
@@ -1,5 +1,4 @@
 /.idea
 /out
 
-*.png
 *.prof
```

README.md

Lines changed: 139 additions & 9 deletions
# Softcite Extractions from Open Access Literature

The softcite-extractions-oa dataset is a collection of ML-identified mentions of software detected in about 24 million academic papers, covering the open access papers available circa 2024. The extractions were created from academic PDFs using the [Softcite mention extraction toolchain](https://github.com/softcite#mention-extraction-tool-chain), which is built on the [Grobid](https://github.com/kermitt2/grobid) model trained on the [Softcite Annotations dataset v2](https://github.com/softcite/softcite_dataset_v2). More details are available at the [Softcite Org home page](https://github.com/softcite/).

This work used Jetstream2 at Indiana University through allocation CIS220172 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.

## The data

The data files are hosted outside GitHub, on Zenodo at <https://zenodo.org/records/15066399>. This GitHub repo provides the documentation and hosts the files used to convert the data from JSON into tabular Parquet format.

### Reporting extraction errors/omissions

These extractions are the result of a machine learning model; they are probabilistic and include both false positives and false negatives. The model achieves F-scores around 0.8 (see <https://doi.org/10.1145/3459637.3481936> for full details, and <https://github.com/softcite/#papers> for more detail, including the annotation scheme used in the underlying gold-standard dataset). See <https://github.com/softcite/> for the training data, models, tools, and papers.

Please create [Issues](https://github.com/softcite/softcite-extractions-oa/issues) in this repository when you encounter problems in the dataset. We can't correct individual extractions manually, but any explanation you can give will help us improve the training data and the model. Please also share any transformations you apply to the dataset in your own work with it.

## The data model

A __paper__ can contain many __mentions__, each of which was found in a full-text snippet of __context__, and extracts the (raw and normalized) __software name__, __version number__, __creator__, and __url__, as well as an associated __citation__ to the reference list of the paper.

Each __mention__ has multiple __purpose assessments__ about the relationship between the software and the paper: Was the software __used__ in the research?, Was it __created__ in the course of the research?, Was the software __shared__ alongside this paper? These probabilistic assessments (0..1 range) are made in two ways: using only the information from the specific mention and using all the mentions within a single paper together (mention-level vs document-level); thus each mention has six __purpose assessments__.
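The six assessments per mention can be inspected directly. A minimal sketch (the mention id below is hypothetical; the `scope`, `purpose`, and `certainty_score` columns are those used in the examples later in this README):

```R
library(tidyverse)
library(arrow)

# Open the purpose assessments table (5% subset used throughout this README)
purposes <- open_dataset("p05_five_percent_random_subset/purpose_assessments.pdf.parquet")

# Expect six rows for one mention:
# {mention, document} scope x {used, created, shared} purpose
purposes |>
  filter(software_mention_id == "example-mention-id") |>  # hypothetical id
  select(scope, purpose, certainty_score) |>
  arrange(scope, purpose) |>
  collect()
```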

<img src="class-diagram.png" alt="Class diagram of the data model" width="300"/>

## Getting Started

### Getting the Parquet files

Parquet files are available from Zenodo at <https://zenodo.org/records/15066399>. There are three sub-folders:

```
full_dataset
p01_one_percent_random_subset
p05_five_percent_random_subset
```

The random subsets are drawn at the paper level: each contains a random sample of papers together with all of the extractions from those papers. We created these to make prototyping analyses easier. Inside each folder are three files:

```
papers.parquet
mentions.pdf.parquet
purpose_assessments.pdf.parquet
```

### Example Analyses

These examples use the 5% subset of the data. They require the `tidyverse` and `arrow` packages, but should otherwise work as-is.

```R
library(tidyverse)
library(arrow)
```

1. How many papers mention OpenStreetMap?

This example filters by `software_normalized`, as it is less noisy than `software_raw`.

```R
> mentions <- open_dataset("p05_five_percent_random_subset/mentions.pdf.parquet")
> mentions |>
+   filter(software_normalized == "OpenStreetMap") |>
+   select(paper_id) |>
+   distinct() |>
+   count() |>
+   collect()
# A tibble: 1 × 1
      n
  <int>
1   376
```

2. How did the number of papers referencing STATA change each year from 2000 to 2020?

By joining the Mentions table with Papers, we can compute statistics that require access to paper metadata. Analyses like these are why we include fields such as `paper_id` in Mentions, even though it denormalizes the tables.

```R
> papers <- open_dataset("p05_five_percent_random_subset/papers.parquet")
> mentions <- open_dataset("p05_five_percent_random_subset/mentions.pdf.parquet")
>
> mentions |>
+   filter(software_normalized == "STATA") |>
+   select(paper_id) |>
+   distinct() |>
+   inner_join(papers, by = c("paper_id")) |>
+   filter(published_year >= 2000, published_year <= 2020) |>
+   count(published_year) |>
+   arrange(published_year) |>
+   collect()
# A tibble: 21 × 2
   published_year     n
            <int> <int>
 1           2000    11
 2           2001    14
 3           2002    20
 4           2003    29
 5           2004    51
 6           2005    32
 7           2006    42
 8           2007    49
 9           2008    77
10           2009    87
# ℹ 11 more rows
# ℹ Use `print(n = ...)` to see more rows
```
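The yearly counts can be visualized with `ggplot2` (loaded with `tidyverse`). A minimal sketch, assuming the collected result of the query above has been assigned to an object we'll call `stata_by_year` (our name, not part of the dataset):

```R
# Plot papers referencing STATA per year, 2000-2020
stata_by_year |>
  ggplot(aes(x = published_year, y = n)) +
  geom_line() +
  geom_point() +
  labs(x = "Publication year", y = "Papers mentioning STATA")
```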

3. What are the most popular software packages used since 2020, by number of distinct papers?

Answering this question requires joining all three tables.
Especially with the full dataset, we generally recommend using `select` statements before and after joins to reduce memory overhead.
Here we use the PurposeAssessments table to evaluate whether software was "used" in a paper.
The "document" scope is appropriate here as we're interested in whether the software was used by the paper, not whether particular mentions of the software indicate this.

```R
> papers <- open_dataset("p05_five_percent_random_subset/papers.parquet")
> mentions <- open_dataset("p05_five_percent_random_subset/mentions.pdf.parquet")
> purposes <- open_dataset("p05_five_percent_random_subset/purpose_assessments.pdf.parquet")
>
> papers |>
+   filter(published_year >= 2020) |>
+   select(paper_id) |>
+   inner_join(mentions, by = c("paper_id")) |>
+   select(software_mention_id, software_normalized) |>
+   inner_join(purposes, by = c("software_mention_id")) |>
+   filter(scope == "document", purpose == "used", certainty_score > 0.5) |>
+   select(paper_id, software_normalized) |>
+   distinct() |>
+   count(software_normalized) |>
+   arrange(desc(n)) |>
+   collect()
# A tibble: 79,730 × 2
   software_normalized     n
   <chr>               <int>
 1 SPSS                22596
 2 GraphPad Prism       8080
 3 Excel                6131
 4 ImageJ               5477
 5 MATLAB               5117
 6 SAS                  3480
 7 SPSS Statistics      3065
 8 Stata                2545
 9 script               2247
10 Matlab               2225
# ℹ 79,720 more rows
# ℹ Use `print(n = ...)` to see more rows
```
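Note that normalized names are not fully canonical: both `MATLAB` and `Matlab` appear in the results above. A hedged sketch that folds case before counting, assuming `arrow` can translate `stringr::str_to_lower` for this dataset (if not, move the `mutate` after `collect()`):

```R
# Merge case variants such as "MATLAB"/"Matlab" before counting
papers |>
  filter(published_year >= 2020) |>
  select(paper_id) |>
  inner_join(mentions, by = c("paper_id")) |>
  select(software_mention_id, software_normalized) |>
  inner_join(purposes, by = c("software_mention_id")) |>
  filter(scope == "document", purpose == "used", certainty_score > 0.5) |>
  mutate(software_folded = str_to_lower(software_normalized)) |>
  select(paper_id, software_folded) |>
  distinct() |>
  count(software_folded) |>
  arrange(desc(n)) |>
  collect()
```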

## Additional details and provenance

The Grobid extraction pipeline worked with multiple sources for each paper, including PDFs and XML sources from publishers, such as JATS and TEI XML. This produced JSON files, which were then processed into tabular Parquet format.

The tabular dataset includes only extractions from PDF sources, to avoid the complexity of handling multiple source types for a single paper. This decision was made easier by the fact that a PDF was available for every paper, while other source types were available only for smaller subsets.

Details of the full JSON data, from all source document types, and the way those were read and mapped to tabular data are available in [Extracting Tables](EXTRACTING_TABLES.md).

class-diagram.png

275 KB