Commit 44171fd

Update docs to include field name descriptions (#25)
* Add more docs on table definitions
  WIP
  Signed-off-by: Will Beason <willbeason@gmail.com>
* Remove stale table definitions
  These were old definitions for the first iteration.
  Signed-off-by: Will Beason <willbeason@gmail.com>
* Add explanation of every field in each table.
  Add appendices for dictionary types with small dictionaries.
  Signed-off-by: Will Beason <willbeason@gmail.com>
* Move table documentation to its own file.
  Signed-off-by: Will Beason <willbeason@gmail.com>
* Update golangci-lint
  Signed-off-by: Will Beason <willbeason@gmail.com>
* Update to Go 1.24
  Causing CI failures
  Signed-off-by: Will Beason <willbeason@gmail.com>
* Document 'gg' pattern
---------
Signed-off-by: Will Beason <willbeason@gmail.com>
1 parent 043223f commit 44171fd

7 files changed: +130 -67 lines changed

.github/workflows/golangci-lint.yml

Lines changed: 1 addition & 1 deletion
@@ -21,4 +21,4 @@ jobs:
       - name: golangci-lint
         uses: golangci/golangci-lint-action@v6
         with:
-          version: v1.63.4
+          version: v1.64.3

EXTRACTING_TABLES.md

Lines changed: 9 additions & 4 deletions
@@ -108,11 +108,16 @@ These are currently a work in progress.

 ### Extracting Tables

-To extract tables, run `extract-columns` on the directory containing the JSONL files.
+To extract tables, run `extract-columns`, passing both IN_DIR (the directory containing the JSONL files) and OUT_DIR (the directory to write tables to).
+
+To fully extract the tables, run the following three commands in order. The first and third commands are intentionally identical. The full process takes approximately one hour and uses about 10 GiB of memory while running.

 ```shell
-go run cmd/extract-columns/extract-columns.go [papers|software] IN_DIR OUT_DIR
+go run cmd/extract-columns/extract-columns.go papers IN_DIR OUT_DIR
+go run cmd/extract-columns/extract-columns.go pdf IN_DIR OUT_DIR
+go run cmd/extract-columns/extract-columns.go papers IN_DIR OUT_DIR
 ```

-For now there are only two tables, `papers` and `software`.
-You may define new Parquet table definitions that extract information from the JSONL files, but you must insert the reference in extract-columns.go to use them.
+The reason for this is a circular dependency between the datasets, which can only be resolved by iterating over at least one of the datasets twice:
+1. Papers has a "has_mentions" field, which requires knowledge from the Mentions table of whether any mentions exist for a paper.
+2. Mentions has a "paper_id" field, which is computed as part of extracting the Papers table.
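
The circular dependency described in EXTRACTING_TABLES.md can be illustrated with a small in-memory sketch of the three passes. Everything here is hypothetical and not taken from the repository: the `Paper` and `Mention` types, the `extractPapers` and `extractMentions` functions, and the sample UUIDs are stand-ins, and the diff does not show how the real tool carries state between runs.

```go
package main

import "fmt"

// Paper is a simplified, hypothetical stand-in for a row of the Papers table.
type Paper struct {
	SoftciteID  string // UUID from the original dataset
	PaperID     int    // assigned while extracting Papers
	HasMentions bool   // needs knowledge of the Mentions table
}

// Mention is a simplified, hypothetical stand-in for a row of the Mentions table.
type Mention struct {
	SoftciteID string // how the raw mention refers to its paper (assumed)
	PaperID    int    // resolved using the Papers pass
}

// extractPapers assigns paper IDs in input order and sets HasMentions from
// whatever is currently known. On the first pass, mentioned is nil, so every
// HasMentions value is false.
func extractPapers(ids []string, mentioned map[int]bool) []Paper {
	papers := make([]Paper, 0, len(ids))
	for i, id := range ids {
		papers = append(papers, Paper{SoftciteID: id, PaperID: i, HasMentions: mentioned[i]})
	}
	return papers
}

// extractMentions resolves each raw mention to a paper ID and records which
// papers have at least one mention. Mentions of unknown papers are skipped.
func extractMentions(raw []string, papers []Paper) ([]Mention, map[int]bool) {
	byUUID := make(map[string]int, len(papers))
	for _, p := range papers {
		byUUID[p.SoftciteID] = p.PaperID
	}
	mentioned := make(map[int]bool)
	mentions := make([]Mention, 0, len(raw))
	for _, id := range raw {
		pid, ok := byUUID[id]
		if !ok {
			continue
		}
		mentions = append(mentions, Mention{SoftciteID: id, PaperID: pid})
		mentioned[pid] = true
	}
	return mentions, mentioned
}

func main() {
	ids := []string{"uuid-a", "uuid-b", "uuid-c"}
	rawMentions := []string{"uuid-a", "uuid-c", "uuid-c"}

	papers := extractPapers(ids, nil)                           // pass 1: papers, IDs only
	mentions, mentioned := extractMentions(rawMentions, papers) // pass 2: mentions
	papers = extractPapers(ids, mentioned)                      // pass 3: papers again

	for _, p := range papers {
		fmt.Printf("paper_id=%d has_mentions=%t\n", p.PaperID, p.HasMentions)
	}
	fmt.Println("mentions extracted:", len(mentions))
}
```

The sketch shows why the first and third passes are the same operation: the Papers extraction is simply re-run once the information it was missing becomes available.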

cmd/extract-columns/extract-column.go

Lines changed: 2 additions & 0 deletions
@@ -106,6 +106,8 @@ func runE(_ *cobra.Command, args []string) error {
 	}
 }

+// The "gg" here is for a special file for two papers which were missing metadata
+// in the original SoftCite dataset.
 var (
 	paperPattern = regexp.MustCompile(`([0-9a-f]{2}|gg)\.jsonl.gz`)
 	pdfPattern = regexp.MustCompile(`[0-9a-f]{2}\.software\.jsonl\.gz`)
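
To illustrate what the two patterns accept, here is a minimal sketch using Go's standard `regexp` package. The `classify` helper and the sample file names are hypothetical, chosen only to exercise the patterns, and every literal dot is escaped in this sketch so that `.gz` cannot match an arbitrary character.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Two hex digits, or the special "gg" prefix, followed by ".jsonl.gz".
	paperPattern = regexp.MustCompile(`([0-9a-f]{2}|gg)\.jsonl\.gz`)
	// Two hex digits followed by ".software.jsonl.gz".
	pdfPattern = regexp.MustCompile(`[0-9a-f]{2}\.software\.jsonl\.gz`)
)

// classify is a hypothetical helper that reports which pattern a file name
// matches: "pdf", "papers", or "" when neither matches.
func classify(name string) string {
	switch {
	case pdfPattern.MatchString(name):
		return "pdf"
	case paperPattern.MatchString(name):
		return "papers"
	default:
		return ""
	}
}

func main() {
	names := []string{
		"0a.jsonl.gz",          // ordinary papers shard
		"gg.jsonl.gz",          // the special "gg" file
		"0a.software.jsonl.gz", // software mentions shard
		"zz.jsonl.gz",          // not hex and not "gg"
	}
	for _, name := range names {
		fmt.Printf("%-22s -> %q\n", name, classify(name))
	}
}
```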

go.mod

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 module github.com/willbeason/software-mentions

-go 1.23.1
+go 1.24

 require (
 	github.com/VividCortex/ewma v1.2.0

pkg/tables/papers.go

Lines changed: 7 additions & 4 deletions
@@ -43,16 +43,16 @@ var Papers = arrow.NewSchema([]arrow.Field{
 		).Build(),
 		Nullable: true,
 	},
-	{Name: "journal_name",
+	{Name: "publication_venue",
 		Type: arrow.BinaryTypes.String,
 		Metadata: NewMetadataBuilder().Add(
-			comment, "The parsed journal the paper was published in",
+			comment, "The parsed venue the paper was published in",
 		).Build(),
 	},
 	{Name: "publisher_name",
 		Type: arrow.BinaryTypes.String,
 		Metadata: NewMetadataBuilder().Add(
-			comment, "The parsed publisher of the paper's journal",
+			comment, "The parsed publisher of the paper's venue",
 		).Build(),
 		Nullable: true,
 	},
@@ -72,7 +72,7 @@ var Papers = arrow.NewSchema([]arrow.Field{
 	{Name: "pmid",
 		Type: arrow.BinaryTypes.String,
 		Metadata: NewMetadataBuilder().Add(
-			comment, "The PubMed Identifier of the paper",
+			comment, "The PubMed identifier of the paper",
 		).Build(),
 		Nullable: true,
 	},
@@ -82,6 +82,9 @@ var Papers = arrow.NewSchema([]arrow.Field{
 			ValueType: arrow.BinaryTypes.String,
 			Ordered: false,
 		},
+		Metadata: NewMetadataBuilder().Add(
+			comment, "The type of document the paper is, such as a journal article or a book",
+		).Build(),
 		Nullable: true,
 	},
 	{Name: "license_type",

pkg/tables/tables.go

Lines changed: 1 addition & 57 deletions
@@ -1,62 +1,6 @@
 package tables

-import "github.com/apache/arrow/go/v18/arrow"
-
 const (
-	Software = "software"
-
+	Software   = "software"
 	ParquetExt = ".parquet"
 )
-
-var (
-	GrobidRunSchema = arrow.NewSchema([]arrow.Field{
-		{Name: "uuid", Type: arrow.BinaryTypes.String},
-		{Name: "application", Type: &arrow.DictionaryType{
-			IndexType: arrow.PrimitiveTypes.Uint8,
-			ValueType: arrow.BinaryTypes.String,
-			Ordered: false,
-		}},
-		{Name: "date", Type: arrow.BinaryTypes.String},
-		{Name: "file", Type: arrow.BinaryTypes.String},
-		{Name: "softcite_file_name", Type: arrow.BinaryTypes.String},
-		{Name: "id", Type: arrow.BinaryTypes.String},
-		{Name: "md5", Type: arrow.BinaryTypes.String},
-		{Name: "metadata.id", Type: arrow.BinaryTypes.String},
-		{Name: "original_file_path", Type: arrow.BinaryTypes.String},
-		{Name: "runtime", Type: arrow.PrimitiveTypes.Uint32},
-		{Name: "version", Type: &arrow.DictionaryType{
-			IndexType: arrow.PrimitiveTypes.Uint8,
-			ValueType: arrow.BinaryTypes.String,
-			Ordered: false,
-		}},
-	}, nil)
-
-	PapersSchema = arrow.NewSchema([]arrow.Field{
-		{Name: "uuid", Type: arrow.BinaryTypes.String},
-		{Name: "doi", Type: arrow.BinaryTypes.String},
-		{Name: "year", Type: arrow.PrimitiveTypes.Uint16},
-	}, nil)
-
-	SoftwareSchema = arrow.NewSchema([]arrow.Field{
-		{Name: "normalizedForm", Type: arrow.BinaryTypes.String},
-		{Name: "wikidataId", Type: arrow.BinaryTypes.String},
-		//{Name: "softwareType", Type: &arrow.DictionaryType{
-		// IndexType: arrow.PrimitiveTypes.Uint8,
-		// ValueType: arrow.BinaryTypes.String,
-		// Ordered: false,
-		//},
-		//},
-	}, nil)
-
-	MentionsSchema = arrow.NewSchema([]arrow.Field{
-		{Name: "paperId", Type: arrow.BinaryTypes.String},
-		{Name: "mentionIndex", Type: arrow.PrimitiveTypes.Uint16},
-		{Name: "normalizedForm", Type: arrow.BinaryTypes.String},
-		{Name: "documentContextAttributes.created.value", Type: arrow.FixedWidthTypes.Boolean},
-		{Name: "documentContextAttributes.shared.value", Type: arrow.FixedWidthTypes.Boolean},
-		{Name: "documentContextAttributes.used.value", Type: arrow.FixedWidthTypes.Boolean},
-		{Name: "mentionContextAttributes.created.value", Type: arrow.FixedWidthTypes.Boolean},
-		{Name: "mentionContextAttributes.shared.value", Type: arrow.FixedWidthTypes.Boolean},
-		{Name: "mentionContextAttributes.used.value", Type: arrow.FixedWidthTypes.Boolean},
-	}, nil)
-)

tables.md

Lines changed: 109 additions & 0 deletions
@@ -0,0 +1,109 @@

# Tables

This document lists every table in

## Table Definitions

The Parquet files hold three tables derived from the SoftCite data.
They do not contain all fields in the SoftCite dataset, but are a (hopefully useful) subset specifically related to mentions.

Much of the information below can be gleaned from the metadata field `comment`, which is present in every table and for every field.
Where this documentation conflicts with what is in `comment`, trust what is in `comment`.

Where field names are repeated between tables, they have identical meaning (e.g. "paper_id").

The tables are mostly normalized, but several technically redundant precomputed fields (such as "published_year" from "published_date") have been added for convenience.

### Papers

This table contains paper metadata.
Each entry represents a single paper analyzed by SoftCite.
Many papers do not have any associated mentions; see the `has_mentions` field.

- **paper_id** is a unique key for each paper, specific to this dataset.
- **softcite_id** is the UUID for each paper in the original SoftCite dataset.
- **title** is the title of the paper as parsed by SoftCite.
- **published_year** is the year the paper was published, calculated from published_date.
- **published_date** is the publication date of the paper as parsed by SoftCite.
- **publication_venue** is the venue the paper was published in. This covers
- **publisher_name** is the publisher of the paper's venue.
- **doi** is the raw DOI of the paper (non-URL form).
- **pmcid** is the PubMed Central identifier for the paper, if one exists.
- **pmid** is the PubMed identifier of the paper, if one exists.
- **genre** is the type of document the paper is, such as a journal article or a book. The full list of genres is shown [below](#genres).
- **license_type** is the license of the document as parsed by SoftCite. The full list of licenses is shown [below](#licenses).
- **has_mentions** indicates whether SoftCite identified any software mentions for the paper.

### Mentions

This table contains an entry for every identified mention of a piece of software in the analyzed papers.

- **software_mention_id** is a unique key for each software mention. It is a composite of _paper_id_, _source_file_type_, and _mention_index_.
- **paper_id** is identical to _paper_id_ in the Papers table.
- **source_file_type** is the format of the document parsed by SoftCite. Currently this is always "pdf"; other formats may be added in the future.
- **mention_index** is a unique key for each mention within a paper.
- **software_raw** is the raw string of the mentioned software.
- **software_normalized** is a normalized form of _software_raw_.
- **version_raw** is the version of the mentioned software, if present in the mention.
- **version_normalized** is a normalized form of _version_raw_.
- **publisher_raw** is the raw string of the publisher of the mentioned software, if present in the mention.
- **publisher_normalized** is a normalized form of _publisher_raw_.
- **language_raw** is the raw string of the mentioned software's programming language, if present in the mention.
- **language_normalized** is a normalized form of _language_raw_.
- **url_raw** is the raw string of the URL for the mentioned software, if present in the mention.
- **url_normalized** is a normalized form of _url_raw_.
- **context_full_text** is the surrounding context of the software mention in the paper, as parsed by SoftCite. This is often a sentence, but can be a fragment.
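
The composite nature of _software_mention_id_ can be modelled in Go with a comparable struct used as a map key. This is only a sketch: the `MentionKey` type, its field types, the `firstDuplicate` helper, and the sample values are assumptions for illustration, and the actual encoding of the key in the Parquet files is not described in this document.

```go
package main

import "fmt"

// MentionKey is a hypothetical model of the three components that make up
// software_mention_id. The field types are assumptions.
type MentionKey struct {
	PaperID        string
	SourceFileType string
	MentionIndex   int
}

// firstDuplicate returns the first key that occurs more than once, if any.
// Structs whose fields are all comparable can be used directly as map keys.
func firstDuplicate(keys []MentionKey) (MentionKey, bool) {
	seen := make(map[MentionKey]struct{}, len(keys))
	for _, k := range keys {
		if _, ok := seen[k]; ok {
			return k, true
		}
		seen[k] = struct{}{}
	}
	return MentionKey{}, false
}

func main() {
	keys := []MentionKey{
		{PaperID: "p1", SourceFileType: "pdf", MentionIndex: 0},
		{PaperID: "p1", SourceFileType: "pdf", MentionIndex: 1},
		{PaperID: "p2", SourceFileType: "pdf", MentionIndex: 0},
	}
	if dup, ok := firstDuplicate(keys); ok {
		fmt.Println("duplicate key:", dup)
	} else {
		fmt.Println("all keys are unique")
	}
}
```

Because _mention_index_ is only unique within a paper, the same index can appear under different papers without making the composite key collide.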

### PurposeAssessments

Each mention has Purpose Assessments which try to determine whether the mention has a given purpose.
Each Mention in the Mentions table has exactly six of these assessments, one for each possible combination of scope and purpose (see below).

- **software_mention_id** is identical to _software_mention_id_ in the Mentions table.
- **paper_id** is identical to _paper_id_ in the Papers table.
- **source_file_type** is identical to _source_file_type_ in the Mentions table.
- **mention_index** is identical to _mention_index_ in the Mentions table.
- **scope** is either "document" or "local". A "local" scope indicates the analysis was done specifically on the local context of the mention when determining its purpose. A "document" scope indicates that the analysis covered the entire document.
- **purpose** is one of "created", "used", or "shared", representing the reason the software was mentioned in this context. These purposes are not mutually exclusive: a mention could indicate both that some software was created by the paper's authors and that it is available on GitHub, for instance, making it both "created" and "shared".
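
The count of six assessments per mention follows from two scopes times three purposes. A minimal sketch that enumerates the combinations is given here; the `PurposeAssessment` type and the `combinations` function are hypothetical names used only for illustration, while the scope and purpose values are the ones listed in this section.

```go
package main

import "fmt"

var (
	scopes   = []string{"document", "local"}
	purposes = []string{"created", "used", "shared"}
)

// PurposeAssessment is a hypothetical, reduced model of one row: only the
// scope and purpose columns are represented.
type PurposeAssessment struct {
	Scope   string
	Purpose string
}

// combinations returns every (scope, purpose) pair, scope-major.
func combinations() []PurposeAssessment {
	out := make([]PurposeAssessment, 0, len(scopes)*len(purposes))
	for _, s := range scopes {
		for _, p := range purposes {
			out = append(out, PurposeAssessment{Scope: s, Purpose: p})
		}
	}
	return out
}

func main() {
	all := combinations()
	for _, a := range all {
		fmt.Printf("scope=%s purpose=%s\n", a.Scope, a.Purpose)
	}
	fmt.Println("assessments per mention:", len(all))
}
```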

### Appendix

#### Genres

For reference, these are the known values for the "genre" field:

- "book"
- "book-chapter"
- "book-part"
- "book-section"
- "book-series"
- "book-set"
- "database"
- "dataset"
- "dissertation"
- "edited-book"
- "grant"
- "journal"
- "journal-article"
- "journal-issue"
- "journal-volume"
- "monograph"
- "other"
- "peer-review"
- "posted-content"
- "proceedings"
- "proceedings-article"
- "proceedings-series"
- "reference-book"
- "reference-entry"
- "report"
- "report-component"
- "report-series"
- "standard"
- NA (not present)

#### Licenses

For reference, these are the known values for the "license_type" field:

TBD.
