Commit 7da5117

Merge pull request #494 from allenai/053_upgrade

Update to latest spacy version

2 parents: 3da29c2 + b4cef3d

12 files changed: +42 −45 lines

Dockerfile (+1 −1)

@@ -18,7 +18,7 @@ WORKDIR /work
 COPY requirements.in .
 
 RUN pip install -r requirements.in
-RUN pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_sm-0.5.1.tar.gz
+RUN pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_core_sci_sm-0.5.3.tar.gz
 RUN python -m spacy download en_core_web_sm
 RUN python -m spacy download en_core_web_md
 
README.md (+9 −9)

@@ -19,7 +19,7 @@ pip install scispacy
 to install a model (see our full selection of available models below), run a command like the following:
 
 ```bash
-pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_sm-0.5.1.tar.gz
+pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_core_sci_sm-0.5.3.tar.gz
 ```
 
 Note: We strongly recommend that you use an isolated Python environment (such as virtualenv or conda) to install scispacy.

@@ -76,14 +76,14 @@ pip install CMD-V(to paste the copied URL)
 
 | Model | Description | Install URL
 |:---------------|:------------------|:----------|
-| en_core_sci_sm | A full spaCy pipeline for biomedical data with a ~100k vocabulary. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_sm-0.5.1.tar.gz)|
-| en_core_sci_md | A full spaCy pipeline for biomedical data with a ~360k vocabulary and 50k word vectors. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_md-0.5.1.tar.gz)|
-| en_core_sci_lg | A full spaCy pipeline for biomedical data with a ~785k vocabulary and 600k word vectors. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_lg-0.5.1.tar.gz)|
-| en_core_sci_scibert | A full spaCy pipeline for biomedical data with a ~785k vocabulary and `allenai/scibert-base` as the transformer model. You may want to [use a GPU](https://spacy.io/usage#gpu) with this model. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_scibert-0.5.1.tar.gz)|
-| en_ner_craft_md| A spaCy NER model trained on the CRAFT corpus.|[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_craft_md-0.5.1.tar.gz)|
-| en_ner_jnlpba_md | A spaCy NER model trained on the JNLPBA corpus.| [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_jnlpba_md-0.5.1.tar.gz)|
-| en_ner_bc5cdr_md | A spaCy NER model trained on the BC5CDR corpus. | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_bc5cdr_md-0.5.1.tar.gz)|
-| en_ner_bionlp13cg_md | A spaCy NER model trained on the BIONLP13CG corpus. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_bionlp13cg_md-0.5.1.tar.gz)|
+| en_core_sci_sm | A full spaCy pipeline for biomedical data with a ~100k vocabulary. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_core_sci_sm-0.5.3.tar.gz)|
+| en_core_sci_md | A full spaCy pipeline for biomedical data with a ~360k vocabulary and 50k word vectors. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_core_sci_md-0.5.3.tar.gz)|
+| en_core_sci_lg | A full spaCy pipeline for biomedical data with a ~785k vocabulary and 600k word vectors. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_core_sci_lg-0.5.3.tar.gz)|
+| en_core_sci_scibert | A full spaCy pipeline for biomedical data with a ~785k vocabulary and `allenai/scibert-base` as the transformer model. You may want to [use a GPU](https://spacy.io/usage#gpu) with this model. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_core_sci_scibert-0.5.3.tar.gz)|
+| en_ner_craft_md| A spaCy NER model trained on the CRAFT corpus.|[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_ner_craft_md-0.5.3.tar.gz)|
+| en_ner_jnlpba_md | A spaCy NER model trained on the JNLPBA corpus.| [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_ner_jnlpba_md-0.5.3.tar.gz)|
+| en_ner_bc5cdr_md | A spaCy NER model trained on the BC5CDR corpus. | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_ner_bc5cdr_md-0.5.3.tar.gz)|
+| en_ner_bionlp13cg_md | A spaCy NER model trained on the BIONLP13CG corpus. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_ner_bionlp13cg_md-0.5.3.tar.gz)|
 
 
 ## Additional Pipeline Components

RELEASE.md (+3 −8)

@@ -15,16 +15,11 @@ Update the version in version.py.
 
 #### Training new models
 
-For the release, new models should be trained using the `scripts/pipeline.sh` and `scripts/ner_pipeline.sh` scripts, for the small, medium and large models, and specialized NER models. Remember to export the `ONTONOTES_PATH` and `ONTONOTES_PERCENT` environment variables to mix in the ontonotes training data.
+The entire pipeline can be run using `spacy project run all`. This will train and package all the models.
 
-```
-bash scripts/pipeline.sh small
-bash scripts/pipeline.sh medium
-bash scripts/pipeline.sh large
-bash scripts/ner_pipeline.sh <path to medium base model>
-```
+The packages should then be uploaded to the `https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/{VERSION}` S3 bucket, and references to previous models (e.g. in the readme and in the docs) should be updated. You can find all these places using `git grep <previous version>`.
 
-these should then be uploaded to the `https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/{VERSION}` S3 bucket, and references to previous models (e.g in the readme and in the docs) should be updated. You can find all these places using `git grep <previous version>`.
+The scripts `install_local_packages.py`, `install_remote_packages.py`, `print_out_metrics.py`, `smoke_test.py`, and `uninstall_local_packages.py` are useful for testing at each step of the process. Before uploading, `install_local_packages.py` and `smoke_test.py` can be used to make sure the packages are installable and do a quick check of output. `print_out_metrics.py` can then be used to easily get the metrics that need to be updated in the README. Once the packages have been uploaded, `uninstall_local_packages.py`, `install_remote_packages.py`, and `smoke_test.py` can be used to ensure everything was uploaded correctly.
 
 #### Merge a PR with the above changes
 Merge a PR with the above changes, and publish a release with a tag corresponding to the commit from the merged PR. This should trigger the publish github action, which will create the `scispacy` package and publish it to pypi.
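The `git grep <previous version>` step above can also be sketched in plain Python. The helper below is hypothetical (not part of the scispacy repo) and uses only the standard library; it scans a tree for leftover references to the old release string, assuming the previous version was 0.5.2 as in this commit.

```python
import pathlib

# Hypothetical previous release string to hunt for; mirrors
# `git grep <previous version>` from the release instructions.
OLD_VERSION = "0.5.2"

def find_stale_references(root="."):
    """Return (path, line number, line) for every line mentioning OLD_VERSION."""
    hits = []
    for path in pathlib.Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for lineno, line in enumerate(text.splitlines(), 1):
            if OLD_VERSION in line:
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Unlike `git grep`, this also walks untracked files, which is usually fine for a quick pre-release check.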

configs/base_ner.cfg (+2 −2)

@@ -48,8 +48,8 @@ nO = null
 [components.ner.model.tok2vec.embed]
 @architectures = "spacy.MultiHashEmbed.v2"
 width = 96
-attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE", "SPACY"]
-rows = [5000, 2500, 2500, 2500, 100]
+attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"]
+rows = [5000, 1000, 2500, 2500]
 include_static_vectors = ${vars.include_static_vectors}
 
 [components.ner.model.tok2vec.encode]

configs/base_ner_scibert.cfg (+2 −2)

@@ -45,8 +45,8 @@ nO = null
 [components.ner.model.tok2vec.embed]
 @architectures = "spacy.MultiHashEmbed.v2"
 width = 96
-attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE", "SPACY"]
-rows = [5000, 2500, 2500, 2500, 100]
+attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"]
+rows = [5000, 1000, 2500, 2500]
 include_static_vectors = false
 
 [components.ner.model.tok2vec.encode]

configs/base_parser_tagger.cfg (+2 −2)

@@ -73,8 +73,8 @@ factory = "tok2vec"
 [components.tok2vec.model.embed]
 @architectures = "spacy.MultiHashEmbed.v2"
 width = ${components.tok2vec.model.encode.width}
-attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE", "SPACY"]
-rows = [5000, 2500, 2500, 2500, 100]
+attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE", "SPACY", "IS_SPACE"]
+rows = [5000, 1000, 2500, 2500, 50, 50]
 include_static_vectors = ${vars.include_static_vectors}
 
 [components.tok2vec.model.encode]
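Across these config changes, the NER pipelines drop the `SPACY` attribute while the parser/tagger pipeline keeps it and adds `IS_SPACE`. A minimal sketch of the invariant these two lists obey (assuming `MultiHashEmbed`'s requirement that each attribute gets exactly one embedding-table row count, so `attrs` and `rows` must be the same length):

```python
# Values taken from the updated base_parser_tagger.cfg above.
attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE", "SPACY", "IS_SPACE"]
rows = [5000, 1000, 2500, 2500, 50, 50]

# One hash-embedding table per attribute: the lists must stay aligned.
assert len(attrs) == len(rows), "each attr needs exactly one row count"
table_sizes = dict(zip(attrs, rows))
```

Editing one list without the other (as a careless merge might) would leave the config invalid, so a check like this is a cheap sanity test.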

docs/index.md (+16 −16)

@@ -17,14 +17,14 @@ pip install <Model URL>
 
 | Model | Description | Install URL
 |:---------------|:------------------|:----------|
-| en_core_sci_sm | A full spaCy pipeline for biomedical data. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_sm-0.5.1.tar.gz)|
-| en_core_sci_md | A full spaCy pipeline for biomedical data with a larger vocabulary and 50k word vectors. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_md-0.5.1.tar.gz)|
-| en_core_sci_scibert | A full spaCy pipeline for biomedical data with a ~785k vocabulary and `allenai/scibert-base` as the transformer model. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_scibert-0.5.1.tar.gz)|
-| en_core_sci_lg | A full spaCy pipeline for biomedical data with a larger vocabulary and 600k word vectors. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_lg-0.5.1.tar.gz)|
-| en_ner_craft_md| A spaCy NER model trained on the CRAFT corpus.|[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_craft_md-0.5.1.tar.gz)|
-| en_ner_jnlpba_md | A spaCy NER model trained on the JNLPBA corpus.| [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_jnlpba_md-0.5.1.tar.gz)|
-| en_ner_bc5cdr_md | A spaCy NER model trained on the BC5CDR corpus. | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_bc5cdr_md-0.5.1.tar.gz)|
-| en_ner_bionlp13cg_md | A spaCy NER model trained on the BIONLP13CG corpus. | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_bionlp13cg_md-0.5.1.tar.gz)|
+| en_core_sci_sm | A full spaCy pipeline for biomedical data. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_core_sci_sm-0.5.3.tar.gz)|
+| en_core_sci_md | A full spaCy pipeline for biomedical data with a larger vocabulary and 50k word vectors. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_core_sci_md-0.5.3.tar.gz)|
+| en_core_sci_scibert | A full spaCy pipeline for biomedical data with a ~785k vocabulary and `allenai/scibert-base` as the transformer model. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_core_sci_scibert-0.5.3.tar.gz)|
+| en_core_sci_lg | A full spaCy pipeline for biomedical data with a larger vocabulary and 600k word vectors. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_core_sci_lg-0.5.3.tar.gz)|
+| en_ner_craft_md| A spaCy NER model trained on the CRAFT corpus.|[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_ner_craft_md-0.5.3.tar.gz)|
+| en_ner_jnlpba_md | A spaCy NER model trained on the JNLPBA corpus.| [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_ner_jnlpba_md-0.5.3.tar.gz)|
+| en_ner_bc5cdr_md | A spaCy NER model trained on the BC5CDR corpus. | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_ner_bc5cdr_md-0.5.3.tar.gz)|
+| en_ner_bionlp13cg_md | A spaCy NER model trained on the BIONLP13CG corpus. | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_ner_bionlp13cg_md-0.5.3.tar.gz)|
 
 
 

@@ -34,18 +34,18 @@ Our models achieve performance within 3% of published state of the art dependency parsers
 
 | model | UAS | LAS | POS | Mentions (F1) | Web UAS |
 |:---------------|:----|:------|:------|:---|:---|
-| en_core_sci_sm | 89.03| 87.00 | 98.13 | 67.87 | 87.42 |
-| en_core_sci_md | 89.73| 87.85 | 98.40 | 69.53 | 87.79 |
-| en_core_sci_lg | 89.75| 87.79 | 98.49 | 69.69 | 87.74 |
-| en_core_sci_scibert | 92.21| 90.65 | 98.86 | 68.01 | 92.58 |
+| en_core_sci_sm | 89.39| 87.41 | 98.32 | 68.00 | 87.65 |
+| en_core_sci_md | 90.23| 88.39 | 98.39 | 68.95 | 87.63 |
+| en_core_sci_lg | 89.98| 88.15 | 98.50 | 68.67 | 88.21 |
+| en_core_sci_scibert | 92.54| 91.02 | 98.89 | 67.90 | 92.85 |
 
 
 | model | F1 | Entity Types|
 |:---------------|:-----|:--------|
-| en_ner_craft_md | 76.75|GGP, SO, TAXON, CHEBI, GO, CL|
-| en_ner_jnlpba_md | 72.28| DNA, CELL_TYPE, CELL_LINE, RNA, PROTEIN |
-| en_ner_bc5cdr_md | 84.53| DISEASE, CHEMICAL|
-| en_ner_bionlp13cg_md | 76.57| AMINO_ACID, ANATOMICAL_SYSTEM, CANCER, CELL, CELLULAR_COMPONENT, DEVELOPING_ANATOMICAL_STRUCTURE, GENE_OR_GENE_PRODUCT, IMMATERIAL_ANATOMICAL_ENTITY, MULTI-TISSUE_STRUCTURE, ORGAN, ORGANISM, ORGANISM_SUBDIVISION, ORGANISM_SUBSTANCE, PATHOLOGICAL_FORMATION, SIMPLE_CHEMICAL, TISSUE |
+| en_ner_craft_md | 77.56|GGP, SO, TAXON, CHEBI, GO, CL|
+| en_ner_jnlpba_md | 72.98| DNA, CELL_TYPE, CELL_LINE, RNA, PROTEIN |
+| en_ner_bc5cdr_md | 84.23| DISEASE, CHEMICAL|
+| en_ner_bionlp13cg_md | 77.36| AMINO_ACID, ANATOMICAL_SYSTEM, CANCER, CELL, CELLULAR_COMPONENT, DEVELOPING_ANATOMICAL_STRUCTURE, GENE_OR_GENE_PRODUCT, IMMATERIAL_ANATOMICAL_ENTITY, MULTI-TISSUE_STRUCTURE, ORGAN, ORGANISM, ORGANISM_SUBDIVISION, ORGANISM_SUBSTANCE, PATHOLOGICAL_FORMATION, SIMPLE_CHEMICAL, TISSUE |
 
 
 ### Example Usage

project.yml (+1 −1)

@@ -2,7 +2,7 @@ title: "scispaCy pipeline"
 description: "All the steps needed in the scispaCy pipeline"
 
 vars:
-  version_string: "0.5.2"
+  version_string: "0.5.3"
   gpu_id: 0
   freqs_loc_s3: "s3://ai2-s2-scispacy/data/gorc_subset.freqs"
   freqs_loc_local: "assets/gorc_subset.freqs"

requirements.in (+2 −1)

@@ -1,5 +1,6 @@
 numpy
-spacy>=3.4.0,<3.5.0
+scipy<1.11
+spacy>=3.6.0,<3.7.0
 spacy-lookups-data
 pandas
 requests>=2.0.0,<3.0.0
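The new pins restrict spacy to the 3.6.x line and scipy to anything below 1.11. A minimal sketch of what such a half-open range accepts, using a hypothetical helper (real installs rely on pip's resolver and PEP 440 semantics, not this simplification):

```python
def version_tuple(v):
    """Parse a plain dotted version like "3.6.1" into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

def satisfies(version, lower, upper):
    """True when lower <= version < upper, i.e. ">=lower,<upper"."""
    return version_tuple(lower) <= version_tuple(version) < version_tuple(upper)
```

For example, `satisfies("3.6.1", "3.6.0", "3.7.0")` holds, while the previously pinned 3.4.x releases now fall outside the range.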

scispacy/version.py (+1 −1)

@@ -1,6 +1,6 @@
 _MAJOR = "0"
 _MINOR = "5"
-_REVISION = "2"
+_REVISION = "3"
 
 VERSION_SHORT = "{0}.{1}".format(_MAJOR, _MINOR)
 VERSION = "{0}.{1}.{2}".format(_MAJOR, _MINOR, _REVISION)

scripts/install_remote_packages.py (+1 −1)

@@ -4,7 +4,7 @@
 
 
 def main():
-    s3_prefix = "https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/"
+    s3_prefix = "https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/"
     model_names = [
         "en_core_sci_sm",
         "en_core_sci_md",
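Every URL touched by this commit follows the same pattern: the release prefix plus `<model name>-<version>.tar.gz`. A small sketch of that assembly (the `model_url` helper is hypothetical; only the prefix and naming scheme come from the diffs above):

```python
VERSION = "0.5.3"
S3_PREFIX = f"https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v{VERSION}/"

def model_url(name):
    """Build the download URL for a packaged model, e.g. en_core_sci_sm."""
    return f"{S3_PREFIX}{name}-{VERSION}.tar.gz"
```

Deriving the URLs from a single `VERSION` constant like this is exactly why `git grep <previous version>` finds every place that needs updating on release.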

setup.py (+2 −1)

@@ -41,7 +41,8 @@
     packages=find_packages(exclude=["*.tests", "*.tests.*", "tests.*", "tests"]),
     license="Apache",
     install_requires=[
-        "spacy>=3.4.0,<3.5.0",
+        "spacy>=3.6.0,<3.7.0",
+        "scipy<1.11",
         "requests>=2.0.0,<3.0.0",
         "conllu",
         "numpy",
