Commit 1226a79

Merge pull request #3 from Str3am786/main
Ready for First Release
2 parents beb7e26 + 812ac85 commit 1226a79

115 files changed

Lines changed: 7500 additions & 2025 deletions

README.md

Lines changed: 198 additions & 48 deletions
@@ -1,77 +1,227 @@
11

22

3-
# Research Software Extraction Framework (RSEF)
4-
README IN PROGRESS
5-
## Introduction
6-
7-
This tool verifies the link between a scientific paper and a software repository. It accomplishes this by locating the URL of the software repository within the scientific paper. It then extracts the repository's metadata to find any URLs associated with scientific papers and checks if they lead back to the original paper. If a bidirectional link is established, it marks it as "bidirectional".
83

9-
There is also a "unidirectional" metric, which finds a repository url and see's within the repository if the paper is named.
10-
11-
## Dependencies
12-
- Python 3.10
13-
- Java 8 or above (please see [Tika requirements](https://tika.apache.org))
14-
15-
## Installation
16-
17-
Install the required dependencies by running:
18-
```
19-
pip install -r requirements.txt
20-
```
21-
Highly recommended steps:
22-
23-
```text
24-
somef configure
25-
```
26-
You will be asked to provide:
27-
28-
* A GitHub authentication token [**optional, leave blank if not used**], which SOMEF uses to retrieve metadata from GitHub. If you don't include an authentication token, you can still use SOMEF. However, you may be limited to a series of requests per hour. For more information, see [https://help.github.com/en/github/authenticating-to-github/creating-a-personal-access-token-for-the-command-line](https://help.github.com/en/github/authenticating-to-github/creating-a-personal-access-token-for-the-command-line)
4+
# Research Software Extraction Framework (RSEF)
5+
296

30-
* The path to the trained classifiers (pickle files). If you have your own classifiers, you can provide them here. Otherwise, you can leave it blank
7+
## Introduction
318

32-
### Docker
33-
TODO
9+
This tool verifies the link between a scientific paper and a software repository. It accomplishes this by locating the URL of the software repository within the scientific paper. It then extracts the repository's metadata to find any URLs associated with scientific papers and checks if they lead back to the original paper. If a bidirectional link is established, it marks it as "bidirectional".
3410

35-
## Usage
3611

37-
To see an example of usage please look at [example.ipynb](./example/example.ipynb)
12+
13+
There is also a "unidirectional" metric, which finds a repository URL in the paper and checks whether the paper is named within the repository.
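
The two link types described above can be illustrated with a minimal sketch. It assumes the paper's URLs and the repository's metadata have already been extracted. The function names and the matching rule (a case-insensitive substring match on the title or DOI) are hypothetical and are not RSEF's actual implementation; the DOI in the usage note is a placeholder.

```python
def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so comparisons are less brittle."""
    return " ".join(text.lower().split())


def is_unidirectional(paper_title: str, repo_text: str) -> bool:
    """Assumed rule: the repository text (e.g. its README) names the paper."""
    return normalize(paper_title) in normalize(repo_text)


def is_bidirectional(paper_doi: str, repo_paper_urls: list[str]) -> bool:
    """Assumed rule: some URL found in the repository metadata contains the paper's DOI."""
    doi = paper_doi.lower()
    return any(doi in url.lower() for url in repo_paper_urls)
```

For instance, `is_bidirectional("10.1234/example", ["https://doi.org/10.1234/example"])` evaluates to `True` under this rule, because the repository points back to the same DOI.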
14+
15+
## Dependencies
16+
17+
- Python 3.9
18+
19+
- Java 8 or above (please see [Tika requirements](https://pypi.org/project/tika/))
20+
21+
## Installation
22+
23+
Install the required dependencies by running:
24+
25+
```
26+
27+
pip install -e .
28+
29+
```
30+
31+
Highly recommended steps:
32+
33+
```text
34+
35+
somef configure
36+
37+
```
38+
39+
You will be asked to provide:
40+
41+
* A GitHub authentication token [**optional, leave blank if not used**], which SOMEF uses to retrieve metadata from GitHub. If you don't include an authentication token, you can still use SOMEF, but you may be limited to a certain number of requests per hour. For more information, see [https://help.github.com/en/github/authenticating-to-github/creating-a-personal-access-token-for-the-command-line](https://help.github.com/en/github/authenticating-to-github/creating-a-personal-access-token-for-the-command-line)
42+
43+
* The path to the trained classifiers (pickle files). If you have your own classifiers, you can provide them here. Otherwise, you can leave it blank
44+
3845

39-
### The repository is divided into the following directories:
46+
47+
## Usage
48+
49+
```text
50+
51+
Usage: rsef [OPTIONS] COMMAND [ARGS]...
52+
53+
RRRRRRRRR SSSSSSSSS EEEEEEEEE FFFFFFFFF
54+
RRR RRR SSS SSS EEE FFF
55+
RRR RRR SSSS EEE FFF
56+
RRRRRRRRR SSSSSSSSS EEEEEEE FFFFFFF
57+
RRR RRR SSSS EEE FFF
58+
RRR RRR SSS SSS EEE FFF
59+
RRR RRR SSSSSSSS EEEEEEEEE FFF
4060
41-
1. Download_pdf
61+
Research Software Extraction Framework (RSEF)
62+
Find and assess Research Software within Research papers.
63+
64+
Usage:
65+
1. (assess) Assess doi for unidirectionality or bidirectionality
66+
2. (download) Download PDF (paper) from a doi or list
67+
3. (process) Process downloaded pdf to find urls and abstract
68+
69+
Options:
70+
--version Show the version and exit.
71+
-h, --help Show this message and exit.
72+
73+
Commands:
74+
assess
75+
download
76+
process
77+
78+
```
79+
80+
### Assess
81+
82+
The assess command determines whether the paper behind a given identifier (an arXiv ID or a DOI) is unidirectionally or bidirectionally linked to a repository.
83+
84+
The input can be a single DOI or arXiv ID, a `.txt` file listing identifiers, or a `processed_metadata.json` file.
85+
86+
87+
```text
88+
rsef assess -h
89+
Usage: rsef assess [OPTIONS]
90+
91+
Options:
92+
93+
-i, --input <name> DOI, path to .txt list of DOIs or path to processed_metadata.json [required]
94+
95+
-o, --output <path> Output csv file [default: output]
96+
97+
-U, --unidir Unidirectionality
98+
99+
-B, --bidir Bidirectionality
100+
101+
-h, --help Show this message and exit.
102+
```
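
The three input forms accepted by `-i` can be told apart with a check like the one below. This is a minimal sketch that mirrors the `endswith(...)` and `os.path.exists(...)` tests visible in the CLI module of this commit; the helper name `classify_input` and its return labels are hypothetical.

```python
import os


def classify_input(value: str) -> str:
    """Return 'txt_list', 'json_metadata' or 'identifier' for an --input value."""
    if value.endswith(".txt") and os.path.exists(value):
        return "txt_list"
    if value.endswith(".json") and os.path.exists(value):
        return "json_metadata"
    # Anything else is treated as a single DOI or arXiv identifier.
    return "identifier"
```

A path that ends in `.txt` or `.json` but does not exist falls through to `"identifier"`, so a mistyped path would be handled as if it were a DOI.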
103+
104+
### Download
105+
106+
The download command downloads a paper's PDF together with its metadata, given an identifier (an arXiv ID or a DOI). Alongside the PDFs folder it writes a `downloaded_metadata.json` containing the title, DOI, arXiv ID and file name/file path of each downloaded paper.
107+
```
108+
rsef download -h
109+
Usage: rsef download [OPTIONS]
110+
111+
Options:
112+
113+
-i, --input <name> DOI or path to .txt list of DOIs [required]
114+
115+
-o, --output <path> Output Directory [default: ./]
116+
117+
-h, --help Show this message and exit.
118+
```
119+
120+
### Process
121+
122+
The process command takes an identifier or an already downloaded paper and extracts its abstract and its GitHub and Zenodo URLs. The results are saved in a JSON file named `processed_metadata.json`.
123+
```
124+
rsef process -h
125+
Usage: rsef process [OPTIONS]
126+
127+
Options:
128+
129+
-i, --input <name> DOI, path to .txt list of DOIs or path to downloaded_metadata.json [required]
130+
131+
-o, --output <path> Output Directory [default: ./]
132+
133+
-h, --help Show this message and exit.
134+
```
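
As an illustration of the URL extraction step, the sketch below pulls GitHub and Zenodo links out of plain text with regular expressions. The patterns, the function name `find_repo_urls` and the output layout are assumptions made for this example; RSEF's own extraction (which relies on Tika for the PDF text) may differ.

```python
import re

# Hypothetical patterns: github.com/<owner>/<repo> and common zenodo.org link shapes.
GITHUB_RE = re.compile(r"https?://(?:www\.)?github\.com/[\w.-]+/[\w.-]+", re.IGNORECASE)
ZENODO_RE = re.compile(
    r"https?://(?:www\.)?zenodo\.org/(?:records|record|badge|doi)/[\w./-]+", re.IGNORECASE
)


def find_repo_urls(text: str) -> dict[str, list[str]]:
    """Collect unique GitHub and Zenodo URLs from text, in order of first appearance."""

    def unique(matches: list[str]) -> list[str]:
        # Strip trailing punctuation that often follows a URL in running text.
        cleaned = (m.rstrip(".,;)") for m in matches)
        return list(dict.fromkeys(cleaned))

    return {
        "github": unique(GITHUB_RE.findall(text)),
        "zenodo": unique(ZENODO_RE.findall(text)),
    }
```

Deduplicating with `dict.fromkeys` keeps the first-seen order, which is convenient when the first URL mentioned in a paper is the most likely candidate for its own repository.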
135+
136+
137+
138+
139+
### The repository is divided into the following directories:
140+
141+
1. Download_pdf
142+
42143
2. Metadata
144+
43145
3. Extraction
44-
4. Object_creator
146+
147+
4. Object_creator
148+
45149
5. Modelling
150+
46151
6. Prediction
47-
48-
### Download_pdf
49-
Pertains to all the downloading of pdfs.
50-
Downloaded_obj is a representation of downloaded papers which have not been processed yet.
152+
153+
7. Utils
154+
51155

52156
### Metadata
53-
TODO
54-
Encompasses petitions to OpenAlex for fetching the paper's metadata.
157+
158+
159+
Encompasses all requests to OpenAlex and other APIs for fetching a paper's metadata, as well as general requests.
160+
55161
MetadataObj contains the metadata from OpenAlex: doi, arxiv and its title.
56162

163+
### Download_pdf
164+
165+
Handles all downloading of PDFs.
166+
167+
Downloaded_obj is a representation of downloaded papers which have not been processed yet.
168+
169+
Contains:
170+
171+
- Title
172+
- DOI
173+
- ArXiv
174+
- file_path
175+
- file_name
176+
177+
These objects are normally saved into a `downloaded_metadata.json`
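
A record with the fields listed above could be modelled and serialized as follows. This is a sketch only: the class name `DownloadedRecord`, the lower-case field names and the choice to key the JSON by DOI are assumptions, not the layout RSEF actually writes to `downloaded_metadata.json`.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class DownloadedRecord:
    """Hypothetical stand-in for Downloaded_obj with the fields listed above."""

    title: str
    doi: Optional[str]
    arxiv: Optional[str]
    file_path: str
    file_name: str


def to_metadata_json(records: list[DownloadedRecord]) -> str:
    """Serialize records keyed by DOI, falling back to arXiv ID, then file name."""
    payload = {(r.doi or r.arxiv or r.file_name): asdict(r) for r in records}
    return json.dumps(payload, indent=2)
```

Keying by identifier makes it cheap to look up whether a paper has already been downloaded before requesting it again.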
178+
179+
180+
57181
### Extraction
58-
TODO
182+
183+
184+
59185
This module contains the Tika scripts that open a PDF and extract its URLs.
60-
PaperObj is created once the downloadedObj's pdf has been processed to locate all its urls. Contains: doi, arxiv, title, file_path, urls.
61-
Finally, the necessary functions dowloading a repository and extracting its metadata with SOMEF
186+
187+
PaperObj is created once the downloadedObj's pdf has been processed to locate all its urls.
188+
Contains:
189+
- DOI
190+
- arXiv
191+
- Abstract
192+
- Title
193+
- File_path
194+
- File_name
195+
- URLs
196+
197+
Finally, it contains the functions needed to download a repository and extract its metadata with SOMEF.
198+
199+
62200

63201
### Modelling
64-
Contains all assessment of bidirectionality and unidirectionality.
65-
Mainly receives a paperObj and a repository_metadata json.
202+
203+
Contains all assessment of bi-directionality and uni-directionality.
204+
205+
Receives a paperObj and a repository_metadata json.
206+
207+
66208

67209
### Object Creator
68-
This is the pipeline broken down into its main parts. Please look at [pipeline.py](./object_creator/pipeline.py) and [example.ipynb](./example/example.ipynb) to view the execution process.
210+
211+
This is the pipeline broken down into its main parts. Please look at [pipeline.py](./object_creator/pipeline.py) to view the execution process.
212+
213+
69214

70215
### Prediction
216+
71217
Assesses the program against its corpus. The corpus is in [corpus.csv](./predicition/corpus.csv), the F1 score obtained for bidirectionality is in [corpus_eval_bidir.json](./predicition/corpus_eval_bidir.json), and the unidirectional results use the same file name with the `_unidir` suffix.
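
For reference, the F1 score mentioned above is the harmonic mean of precision and recall. The sketch below computes it for boolean predictions against boolean ground truth; the function name and the input format are assumptions and do not reflect how the corpus files are actually structured.

```python
def f1_score(predicted: list[bool], expected: list[bool]) -> float:
    """F1 for the positive class; returns 0.0 when there are no true positives."""
    pairs = list(zip(predicted, expected))
    tp = sum(1 for p, e in pairs if p and e)      # predicted link, link exists
    fp = sum(1 for p, e in pairs if p and not e)  # predicted link, no link
    fn = sum(1 for p, e in pairs if e and not p)  # missed an existing link
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With one true positive, one false positive and one false negative, precision and recall are both 0.5, so the F1 score is also 0.5.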
72218

73219

220+
## Tests
221+
222+
Tests can be found in the `./tests` folder.
74223

75-
## License
76-
77-
This project is licensed under the [MIT License](LICENSE).
224+
225+
## License
226+
227+
This project is licensed under the [MIT License](LICENSE).

setup.cfg

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
[metadata]
2-
name = SSKG
3-
version = attr: SSKG.__version__
2+
name = RSEF
3+
version = attr: RSEF.__version__
44
author = Miguel Arroyo Márquez, Daniel Garijo
55
author_email = daniel.garijo@upm.es
66
description = TODO
@@ -16,7 +16,7 @@ package_dir =
1616
= src
1717
packages = find:
1818
include_package_data = True
19-
python_requires = >= 3.10.0
19+
python_requires = >= 3.9.0
2020
install_requires =
2121
somef >= 0.9.4
2222
arxiv
@@ -34,4 +34,4 @@ where = src
3434

3535
[options.entry_points]
3636
console_scripts =
37-
sskg = SSKG.__main__:cli
37+
rsef = RSEF.__main__:cli
Lines changed: 20 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -11,13 +11,15 @@
1111
@click.version_option(__version__)
1212
def cli():
1313
"""
14-
███████ ███████ ██ ██ ██████ \n
15-
██ ██ ██ ██ ██ \n
16-
███████ ███████ █████ ██ ███ \n
17-
██ ██ ██ ██ ██ ██ \n
18-
███████ ███████ ██ ██ ██████ \n
14+
RRRRRRRRR SSSSSSSSS EEEEEEEEE FFFFFFFFF\n
15+
RRR RRR SSS SSS EEE FFF\n
16+
RRR RRR SSS EEE FFF\n
17+
RRRRRRRRR SSSSSSSSS EEEEEEE FFFFFFF\n
18+
RRR RRR SSSS EEE FFF\n
19+
RRR RRR SSS SSS EEE FFF\n
20+
RRR RRR SSSSSSSSS EEEEEEEEE FFF\n
1921
20-
Scientific Software Knowledge Graphs (SSKG)\n
22+
Research Software Extraction Framework (RSEF)\n
2123
Find and assess Research Software within Research papers.\n
2224
2325
Usage:\n
@@ -50,11 +52,12 @@ def cli():
5052
# exit(1)
5153

5254
@cli.command()
53-
@click.option('--input','-i', required=True, help="DOI or path to .txt list of DOIs", metavar='<name>')
54-
@click.option('--output','-o', default="output", show_default=True, help="Output csv file", metavar='<path>')
55+
@click.option('--input', '-i', required=True, help="DOI, path to .txt list of DOIs or path to processed_metadata.json",
56+
metavar='<name>')
57+
@click.option('--output', '-o', default="output", show_default=True, help="Output csv file", metavar='<path>')
5558
@click.option('--unidir', '-U', is_flag=True, default = False, help="Unidirectionality")
5659
@click.option('--bidir', '-B', is_flag=True, default = False, help="Bidirectionality")
57-
def assess(input, output,unidir,bidir):
60+
def assess(input, output, unidir, bidir):
5861
from .object_creator.pipeline import dois_txt_to_unidir_json, dois_txt_to_bidir_json, single_doi_pipeline_unidir, \
5962
single_doi_pipeline_bidir, papers_json_to_unidir_json, papers_json_to_bidir_json
6063
if unidir:
@@ -84,10 +87,11 @@ def assess(input, output,unidir,bidir):
8487

8588

8689
@cli.command()
87-
@click.option('--input','-i', required=True, help="DOI or path to .txt list of DOIs", metavar='<name>')
88-
@click.option('--output','-o', default="./", show_default=True, help="Output Directory ", metavar='<path>')
90+
@click.option('--input', '-i', required=True, help="DOI or path to .txt list of DOIs", metavar='<name>')
91+
@click.option('--output', '-o', default="./", show_default=True, help="Output Directory ", metavar='<path>')
8992
def download(input, output):
9093
from .object_creator.create_downloadedObj import doi_to_downloadedJson, dois_txt_to_downloadedJson
94+
from .utils.regex import str_to_doiID
9195
if input.endswith(".txt") and os.path.exists(input):
9296
dois_txt_to_downloadedJson(dois_txt=input, output_dir=output)
9397
else:
@@ -97,17 +101,18 @@ def download(input, output):
97101
print(e)
98102
return
99103
@cli.command()
100-
@click.option('--input','-i', required=True, help="DOI or path to .txt list of DOIs", metavar='<name>')
104+
@click.option('--input', '-i', required=True, help="DOI, path to .txt list of DOIs or path to downloaded_metadata.json",
105+
metavar='<name>')
101106
@click.option('--output','-o', default="./", show_default=True, help="Output Directory ", metavar='<path>')
102-
def process(input,output):
107+
def process(input, output):
103108
from .object_creator.downloaded_to_paperObj import dwnlddJson_to_paperJson, dwnldd_obj_to_paper_json
104109
from .object_creator.create_downloadedObj import pdf_to_downloaded_obj
105110

106111
if os.path.isdir(input):
107112
_aux_pdfs_to_pp_json(input= input, output= output)
108113
return
109114
if input.endswith(".json") and os.path.exists(input):
110-
dwnlddJson_to_paperJson(input,output)
115+
dwnlddJson_to_paperJson(input, output)
111116
if input.endswith(".pdf") and os.path.exists(input):
112117
#TODO
113118
dwnldd = pdf_to_downloaded_obj(pdf= input, output_dir= output)
@@ -117,6 +122,7 @@ def process(input,output):
117122
print("Error")
118123
return
119124

125+
120126
def _aux_pdfs_to_pp_json(input, output):
121127
from .object_creator.create_downloadedObj import pdf_to_downloaded_obj
122128
from .object_creator.downloaded_to_paperObj import dwnldd_obj_to_paper_dic
