Commit 0f7bb3f

Refactor alignment (#1848)
As a follow-up to #1826, this PR fully re-implements all external registry download and processing functions to use a unified data model. Normally I would split up code and data changes, but the changes in the model required them to be made concurrently.
1 parent 4b2e14e commit 0f7bb3f

78 files changed

Lines changed: 195941 additions & 123617 deletions

exports/alignment/bartoc.tsv

Lines changed: 3 additions & 0 deletions
@@ -1651,6 +1651,7 @@ prefix name homepage description
1651 1651
18533 Skill Level http://vocabulary.curriculum.edu.au/skillLevel.html Skill level is defined as a function of the range and complexity of the set of tasks performed in a particular occupation. The greater the range and complexity of the set of tasks, the greater the skill level of an occupation.
1652 1652
18534 Language Modes http://vocabulary.curriculum.edu.au/languageModes.html Modes refer to the various forms of communication – listening, speaking, reading, viewing and writing.
1653 1653
18535 Australian Curriculum Element http://vocabulary.curriculum.edu.au/curriculumElement.html Elements used to organise the Australian Curriculum.
1654+
18536 Library of Congress Name Authority File http://id.loc.gov/authorities/names.html The Library of Congress Name Authority File (NAF) file provides authoritative data for names of persons, organizations, events, places, and titles. Its purpose is the identification of these entities and, through the use of such controlled vocabulary, to provide uniform access to bibliographic resources. Names descriptions also provide access to a controlled form of name through references from unused forms, e.g. a search under: Snodgrass, Quintus Curtius, 1835-1910 will lead users to the authoritative name for Mark Twain, which is, 'Twain, Mark, 1835-1910.' Names may also be used as subjects in bibliographic descriptions, so they may be combined with controlled values from subject heading schemes, such as LCSH. Library of Congress Names includes over 8 million descriptions created over many decades and according to different cataloging policies. LC Names is officially called the NACO Authority File and is a cooperative effort in which participants follow a common set of standards and guidelines.
1654 1655
18537 Library of Congress Children's Subject Headings http://id.loc.gov/authorities/childrensSubjects.html The Library of Congress Subject Headings Supplemental Vocabularies: Children’s Headings (LCSHAC) is a thesaurus which is used in conjunction with LCSH. It is not a self-contained vocabulary, but is instead designed to complement LCSH and provide tailored subject access to children and young adults when LCSH does not provide suitable terminology, form, or scope for children. LCSHAC records can be identified by the LCCN prefix 'sj'.
1655 1656
18538 Library of Congress Genre/Form Terms http://id.loc.gov/authorities/genreForms.html The Library of Congress Genre/Form Terms for Library and Archival Materials (LCGFT) is a thesaurus that describes what a work is versus what it is about. For instance, the subject heading Horror films, with appropriate subdivisions, would be assigned to a book about horror films. A cataloger assigning headings to the movie The Texas Chainsaw Massacre would also use Horror films, but it would be a genre/form term since the movie is a horror film, not a movie about horror films. The thesaurus combines both genres and forms. Form is defined as a characteristic of works with a particular format and/or purpose. A 'short' is a particular form, for example, as is 'animation.' Genre refers to categories of works that are characterized by similar plots, themes, settings, situations, and characters. Examples of genres are westerns and thrillers. In the term Horror films 'horror' is the genre and 'films' is the form.
1656 1657
18539 PBCoreAssetType Vocabulary http://pbcore.org/pbcore-controlled-vocabularies/pbcoreassettype-vocabulary/ pbcoreAssetType is a broad definition of the type of intellectual content being described. Asset types might include those without associated instantiations (a collection or series), or those with instantiations (programs, episodes, clips, etc.) Best practice: The asset type should broadly describe all related instantiations — for example, if an asset includes many instantiations representing different generations of a program, the asset type ‘program’ remains accurate for all of them.
@@ -2563,6 +2564,7 @@ prefix name homepage description
2563 2564
20389 Currency from IKMK http://uri.gbv.de/terminology/ikmk_waehrung/
2564 2565
2039 Sweden´s national term bank http://www.rikstermbanken.se/ At its grand opening, Rikstermbanken contained more than 50 000 term records, and it will continue to grow contentwise. Although it is a Swedish term bank, it does not only contain Swedish terms but also terms in other languages, e.g. English, French, German, and Finnish, to name a few. Terminologicentrum TNC (The Swedish Centre for Terminology) has designed the term bank and its software, and is also the responsible body for the term bank and its content. The Swedish Ministry of Industry, Employment and Communications has contributed financially to its realization. More than 70 organizations – mostly public bodies but also associations and private companies – have contributed as suppliers of terminology collections. The terminology in Rikstermbanken belongs to various domains – geology, economics, planning and construction, cleaning, and interior decoration. Most of TNC’s own glossaries are also included. Thus, Rikstermbanken hopefully represents the accurate and actual terminology usage of these various domains.
2565 2566
20390 Access type from IKMK http://uri.gbv.de/terminology/ikmk_zugangsart/
2567+
20391 The European Science Vocabulary https://op.europa.eu/web/eu-vocabularies/concept-scheme/-/resource?uri=http://data.europa.eu/8mn/euroscivoc/40c0f173-baa3-48a3-9fe6-d6e8fb366a00 EuroSciVoc is a multilingual taxonomy that represents all the main fields of science that were discovered from CORDIS content and organised through a semi-automatic process based on NLP techniques. It contains more than 1000 categories in 6 languages (English, French, German, Italian, Polish and Spanish) and each category is enriched with relevant keywords extracted from the textual description of CORDIS projects. EuroSciVoc is managed by the Publications Office of the EU, and is currently used by the CORDIS website. It is specifically developed as a reference vocabulary for the Open Science community and is aligned with Linked Open Data standards.
2566 2568
20392 EUropean Research Information Ontology https://op.europa.eu/en/web/eu-vocabularies/dataset/-/resource?uri=http://publications.europa.eu/resource/dataset/eurio&version=1.0 EURIO (EUropean Research Information Ontology) conceptualises, formally encodes and makes available in an open, structured and machine-readable format data about resarch projects funded by the EU's framework programmes for research and innovation.
2567 2569
20393 BLL-Thesaurus https://data.linguistik.de/bll/index.html The Bibliography of Linguistic Literature (BLL) is one of the most comprehensive linguistic bibliographies worldwide. It covers general linguistics with all its neighbouring disciplines and subdomains as well as English, German and Romance linguistics. The BLL dates back as far as 1971 and lists circa 500,000 bibliographic references. Furthermore, the BLL provides a hierarchically categorised thesaurus of domain-specific index terms in German and English. The BLL Thesaurus comprises more than 9,000 subject terms.
2568 2570
20395 https://op.europa.eu/en/web/eu-vocabularies/dataset/-/resource?uri=http://publications.europa.eu/resource/dataset/eurio EURIO (EUropean Research Information Ontology) conceptualises, formally encodes and makes available in an open, structured and machine-readable format data about resarch projects funded by the EU's framework programmes for research and innovation.
@@ -2675,6 +2677,7 @@ prefix name homepage description
2675 2677
20506 The Digitization of Gandharan Artefacts Thesaurus https://w3id.org/diga/terms The DiGA Thesaurus is a resource for the standard description of Gandharan Buddhist art. It draws upon the “Repertorio terminologico per la schedatura delle sculture dell’arte gandharica” published by Domenico Faccenna and Anna Filigenzi in 2007 (IsIAO, Rome). This printed resource has been revised, expanded and restructured to fit a digital use. It now covers the following categories: animals, architecture, ceremonial objects, decorative motifs, everyday objects, figures (general features), figures (protagonists), means of transports, musical instruments, narratives, the sculptor's work, vegetation, weapons. Concerning the new categories of ‘narratives’ and ‘figures’, these are based on a variety of complementary sources.
2676 2678
20507 The Digitization of Gandharan Artefacts Gazetteer https://w3id.org/diga/gazetteer The DiGA Gazetteer is a resource for the standard description of Gandharan Buddhist art.
2677 2679
20508 Analytical methods for geochemistry and cosmochemistry https://vocabs.ardc.edu.au/viewById/650 This concept scheme contains skos concepts for analysis methods used to produce observation results with information about the physical properties, chemical or isotopic composition, crystallography, or molecular structure of material samples. Based on spreadsheet compilation of method vocabularies from Geo.X, GEOROC, PetDB and OSIRIS-REx. Definitions added and updated based on web research, and SKOS serialization by S.M. Richard. Note that although there are high-level method categories for 'Physical property measurement', 'Geochronology technique', and 'Bioanalytical method', these are placeholders and only include a few examples that are relevant to analytical methods in geo- or cosmochemistry.
2680+
20509 Chemical Methods Ontology https://github.com/rsc-ontologies/rsc-cmo CHMO, the chemical methods ontology, describes methods used to collect data in chemical experiments, such as mass spectrometry and electron microscopy prepare and separate material for further analysis, such as sample ionisation, chromatography, and electrophoresis synthesise materials, such as epitaxy and continuous vapour deposition It also describes the instruments used in these experiments, such as mass spectrometers and chromatography columns. It is intended to be complementary to the Ontology for Biomedical Investigations (OBI).
2678 2681
20511 Thesaurus of Archaeological Terminology https://teater.aiscr.cz/ The Thesaurus of Archaeological Terminology is a web thesaurus aimed at making the terminology of archaeology and the related disciplines more available. TEATER is intended for a wide sphere of users ranging from the lay public to amateur archaeologists, beginning students of archaeology, librarians of not only archaeological institutions and professional archaeologists. The entries are available in an interactive interface that enables both full-text searching and hierarchical browsing. TEATER’s content is available in three language versions – Czech, English and German. Entries can be exported from TEATER in .json format or linked using their PURL. More information on the possibilities of use of the thesaurus by various groups of users, the search application, the structure and selection of the entries is available in the Help section.
2679 2682
20512 IAMO Library Classification
2680 2683
20513 IAMO Thesaurus

exports/alignment/cellosaurus.tsv

Lines changed: 34 additions & 34 deletions
@@ -1,35 +1,35 @@
1 1
prefix name homepage category uri_format
2-
Abeomics Abeomics cell line products https://www.abeomics.com/ Cell line collections (Providers) https://www.abeomics.com/advanced-search-result?keywords=$1
3-
BioGRID_ORCS_Cell_line BioGRID Open Repository of CRISPR Screens cell lines https://orcs.thebiogrid.org/ CRISP screens repositories https://orcs.thebiogrid.org/Search?searchType=10&search=$1&organism=all
4-
cancercelllines cancercelllines.org - cancer cell line oncogenomic online resource https://cancercelllines.org/ Cell line databases/resources https://cancercelllines.org/cellline/?id=cellosaurus:$1
5-
CancerTools CancerTools.org https://www.cancertools.org/cell-lines Cell line collections (Providers) https://www.cancertools.org/cell-lines/$1
6-
Cosmic-CLP COSMIC Cell lines Project https://cancer.sanger.ac.uk/cell_lines Cell line databases/resources https://cancer.sanger.ac.uk/cell_lines/sample/overview?id=$1
7-
DSHB Developmental Studies Hybridoma Bank https://dshb.biology.uiowa.edu/ Cell line collections (Providers) https://dshb.biology.uiowa.edu/$1
8-
Evercyte Evercyte cell line products https://evercyte.com/ Cell line collections (Providers) https://evercyte.com/?s=$1
9-
FANTOM5_SSTAR Functional Annotation of the Mammalian Genome - Samples, Transcription initiation And Regulators https://fantom.gsc.riken.jp/5/sstar/Main_Page Gene expression databases https://fantom.gsc.riken.jp/5/sstar/FF:$1
10-
FCDI FujiFilm Cellular Dynamics, Inc https://www.fujifilmcdi.com/cirm-ipsc-products/ Cell line collections (Providers)
11-
FlyBase_Gene Drosophila genome database; gene entry https://flybase.org Organism-specific databases https://flybase.org/reports/$1.htm
12-
FlyBase_Strain Drosophila genome database; strain entry https://flybase.org Organism-specific databases https://flybase.org/reports/$1.htm
13-
FPbase Fluorescent Protein database https://www.fpbase.org/ Sequence databases https://www.fpbase.org/protein/$1
14-
GeneCopoeia GeneCopoeia cell line products https://www.genecopoeia.com/ Cell line collections (Providers) https://www.genecopoeia.com/product/search2/?s=$1
15-
Genomeditech Genomeditech cell line products https://en.genomeditech.com/product?id=9 Cell line collections (Providers) https://en.genomeditech.com/search?kwd=$1
16-
GenScript GenScript cell line products https://www.genscript.com/cell_lines.html Cell line collections (Providers) https://www.genscript.com/search?q=$1
17-
Hysigen Hysigen cell line collection https://hysigen.com/ Cell line collections (Providers) https://hysigen.com/$1.html
18-
IARC_TP53 IARC TP53 Database https://tp53.isb-cgc.org/explore_cl Polymorphism and mutation databases
19-
IBRC Iranian Biological Research Center cell line collection http://www.en.ibrc.ir/ Cell line collections (Providers)
20-
Innoprot Innoprot cell line products https://innoprot.com/ Cell line collections (Providers) https://innoprot.com/?s=$1&lang=en
21-
InvivoGen InvivoGen cell line products https://www.invivogen.com/cell-lines Cell line collections (Providers) https://www.invivogen.com/search?sq=$1
22-
IZSLER Istituto Zooprofilattico Sperimentale della Lombardia e dell'Emilia Romagna biobank http://www.ibvr.org/Services/CellCultures.aspx Cell line collections (Providers)
23-
KCB Kunming Cell Bank of Type Culture Collection http://www.kmcellbank.com/ Cell line collections (Providers)
24-
LINCS_HMS Harvard Medical School (HMS) LINCS Center http://lincs.hms.harvard.edu/db/cells/ Cell line databases/resources http://lincs.hms.harvard.edu/db/cells/$1
25-
MCCL Molecular Connection Cell Line ontology https://bioportal.bioontology.org/ontologies/MCCL Cell line databases/resources
26-
NCBI_Iran National Cell Bank of Iran https://en.pasteur.ac.ir/Department%20of%20Cell%20Bank Cell line collections (Providers)
27-
NCI-DTP NCI Development Therapeutics Program https://dtp.cancer.gov/repositories/ Cell line collections (Providers)
28-
NISES National Institute of Sericultural and Entomological Science Cell Database https://web.archive.org/web/20160709065305/https://www.gene.affrc.go.jp/ex-nises/NISESCells/CellindexE1.html Cell line collections (Providers)
29-
PubChem_Cell_line PubChem compound database; cell line pages https://pubchem.ncbi.nlm.nih.gov Chemistry resources https://pubchem.ncbi.nlm.nih.gov/cell/$1
30-
Revvity/PerkinElmer Revvity/PerkinElmer cell line collection https://www.revvity.com Cell line collections (Providers) https://www.revvity.com/ch-en/search?q=$1
31-
RIKEN_BRC_EPD RIKEN BRC Experimental Plant Division cell lines https://epd.brc.riken.jp/en/ Cell line collections (Providers) https://plant.rtc.riken.jp/resource/cell_line/cell_line_detail.html?brcno=%S
32-
Rockland Rockland cell line products https://www.rockland.com/categories/cell-lines-and-lysates/ Cell line collections (Providers) https://www.rockland.com/search/?searchString=$1
33-
RSCB Royan Stem Cell Bank https://web.archive.org/web/20201001144644/http://www.royaninstitute.org/cmsen/index.php?option=com_content&task=view&id=205&Itemid=40 Cell line collections (Providers)
34-
SKY/M-FISH/CGH SKY/M-FISH and CGH database https://www.ncbi.nlm.nih.gov/dbvar/studies/nstd136/ Cell line databases/resources
35-
Ubigene Ubigene Biosciences cell line products https://www.ubigene.us/ Cell line collections (Providers) https://www.ubigene.us/product/?cate=0&mykeyword=$1
2+
Abeomics Abeomics cell line products https://www.abeomics.com/ https://www.abeomics.com/advanced-search-result?keywords=$1
3+
BioGRID_ORCS_Cell_line BioGRID Open Repository of CRISPR Screens cell lines https://orcs.thebiogrid.org/ https://orcs.thebiogrid.org/Search?searchType=10&search=$1&organism=all
4+
cancercelllines cancercelllines.org - cancer cell line oncogenomic online resource https://cancercelllines.org/ https://cancercelllines.org/cellline/?id=cellosaurus:$1
5+
CancerTools CancerTools.org https://www.cancertools.org/cell-lines https://www.cancertools.org/cell-lines/$1
6+
Cosmic-CLP COSMIC Cell lines Project https://cancer.sanger.ac.uk/cell_lines https://cancer.sanger.ac.uk/cell_lines/sample/overview?id=$1
7+
DSHB Developmental Studies Hybridoma Bank https://dshb.biology.uiowa.edu/ https://dshb.biology.uiowa.edu/$1
8+
Evercyte Evercyte cell line products https://evercyte.com/ https://evercyte.com/?s=$1
9+
FANTOM5_SSTAR Functional Annotation of the Mammalian Genome - Samples, Transcription initiation And Regulators https://fantom.gsc.riken.jp/5/sstar/Main_Page https://fantom.gsc.riken.jp/5/sstar/FF:$1
10+
FCDI FujiFilm Cellular Dynamics, Inc https://www.fujifilmcdi.com/cirm-ipsc-products/
11+
FlyBase_Gene Drosophila genome database; gene entry https://flybase.org https://flybase.org/reports/$1.htm
12+
FlyBase_Strain Drosophila genome database; strain entry https://flybase.org https://flybase.org/reports/$1.htm
13+
FPbase Fluorescent Protein database https://www.fpbase.org/ https://www.fpbase.org/protein/$1
14+
GeneCopoeia GeneCopoeia cell line products https://www.genecopoeia.com/ https://www.genecopoeia.com/product/search2/?s=$1
15+
Genomeditech Genomeditech cell line products https://en.genomeditech.com/product?id=9 https://en.genomeditech.com/search?kwd=$1
16+
GenScript GenScript cell line products https://www.genscript.com/cell_lines.html https://www.genscript.com/search?q=$1
17+
Hysigen Hysigen cell line collection https://hysigen.com/ https://hysigen.com/$1.html
18+
IARC_TP53 IARC TP53 Database https://tp53.isb-cgc.org/explore_cl
19+
IBRC Iranian Biological Research Center cell line collection http://www.en.ibrc.ir/
20+
Innoprot Innoprot cell line products https://innoprot.com/ https://innoprot.com/?s=$1&lang=en
21+
InvivoGen InvivoGen cell line products https://www.invivogen.com/cell-lines https://www.invivogen.com/search?sq=$1
22+
IZSLER Istituto Zooprofilattico Sperimentale della Lombardia e dell'Emilia Romagna biobank http://www.ibvr.org/Services/CellCultures.aspx
23+
KCB Kunming Cell Bank of Type Culture Collection http://www.kmcellbank.com/
24+
LINCS_HMS Harvard Medical School (HMS) LINCS Center http://lincs.hms.harvard.edu/db/cells/ http://lincs.hms.harvard.edu/db/cells/$1
25+
MCCL Molecular Connection Cell Line ontology https://bioportal.bioontology.org/ontologies/MCCL
26+
NCBI_Iran National Cell Bank of Iran https://en.pasteur.ac.ir/Department%20of%20Cell%20Bank
27+
NCI-DTP NCI Development Therapeutics Program https://dtp.cancer.gov/repositories/
28+
NISES National Institute of Sericultural and Entomological Science Cell Database https://web.archive.org/web/20160709065305/https://www.gene.affrc.go.jp/ex-nises/NISESCells/CellindexE1.html
29+
PubChem_Cell_line PubChem compound database; cell line pages https://pubchem.ncbi.nlm.nih.gov https://pubchem.ncbi.nlm.nih.gov/cell/$1
30+
Revvity/PerkinElmer Revvity/PerkinElmer cell line collection https://www.revvity.com https://www.revvity.com/ch-en/search?q=$1
31+
RIKEN_BRC_EPD RIKEN BRC Experimental Plant Division cell lines https://epd.brc.riken.jp/en/ https://plant.rtc.riken.jp/resource/cell_line/cell_line_detail.html?brcno=%S
32+
Rockland Rockland cell line products https://www.rockland.com/categories/cell-lines-and-lysates/ https://www.rockland.com/search/?searchString=$1
33+
RSCB Royan Stem Cell Bank https://web.archive.org/web/20201001144644/http://www.royaninstitute.org/cmsen/index.php?option=com_content&task=view&id=205&Itemid=40
34+
SKY/M-FISH/CGH SKY/M-FISH and CGH database https://www.ncbi.nlm.nih.gov/dbvar/studies/nstd136/
35+
Ubigene Ubigene Biosciences cell line products https://www.ubigene.us/ https://www.ubigene.us/product/?cate=0&mykeyword=$1
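
The `uri_format` column in the rows above holds URL templates in which `$1` appears to mark where a record's local identifier goes. The following is a minimal sketch of expanding such a template, assuming plain string substitution. The function name `expand_uri_format` and the identifier used in the usage lines are hypothetical and are not taken from this commit. One row (RIKEN_BRC_EPD) uses `%S` rather than `$1`, so the sketch returns `None` for it instead of guessing.

```python
from typing import Optional

# Placeholder used by most uri_format values in the table above.
PLACEHOLDER = "$1"


def expand_uri_format(uri_format: Optional[str], local_id: str) -> Optional[str]:
    """Substitute local_id into a uri_format template.

    Returns None when the template is missing or does not contain the
    "$1" placeholder, so callers can spot rows that need manual review.
    """
    if not uri_format or PLACEHOLDER not in uri_format:
        return None
    return uri_format.replace(PLACEHOLDER, local_id)


# Usage with a hypothetical identifier "EXAMPLE-ID":
fpbase_url = expand_uri_format("https://www.fpbase.org/protein/$1", "EXAMPLE-ID")
riken_url = expand_uri_format(
    "https://plant.rtc.riken.jp/resource/cell_line/cell_line_detail.html?brcno=%S",
    "EXAMPLE-ID",
)
```

Returning `None` for rows without a `$1` placeholder keeps templates such as the `%S` one visible for manual review instead of silently producing a broken link.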
