Draft
Commits
181 commits
3bacd04
Updated UMLS and RxNorm.
gaurav Aug 2, 2025
885cc94
First stab at writing out a metadata file for every compendium.
gaurav Jun 26, 2025
e800dfd
Added counts to metadata.
gaurav Jun 26, 2025
4db83fe
Added metadata files to build_* methods.
gaurav Jun 26, 2025
5d207f2
Added metadata YAMLs to Snakemake dependencies.
gaurav Jun 26, 2025
8abbb8b
First stab at metadata files.
gaurav Jun 30, 2025
c8ed59b
Fixed typo.
gaurav Jun 30, 2025
a5af518
Fixed build_anatomy_umls_relationships() params.
gaurav Jun 30, 2025
519641b
Added metadata for CellLine.
gaurav Jun 30, 2025
0c5bc87
Fixed some metadata output.
gaurav Jun 30, 2025
650ebb4
Fixed some issues.
gaurav Jun 30, 2025
2fa76f3
Fixed some cell_line bugs.
gaurav Jun 30, 2025
baf970b
First stab at adding metadata to Chemical.
gaurav Jun 30, 2025
48c9064
Turned off DRY_RUN for testing.
gaurav Jul 1, 2025
e97ac42
Added metadata outputs for cell_line and anatomy.
gaurav Jul 1, 2025
b444e1c
Fixed sources.
gaurav Jul 1, 2025
fe9ed81
First stab at Chemical concord metadata.
gaurav Jul 1, 2025
9adc3ff
Added check for type of name (in case people try passing in an object).
gaurav Jul 1, 2025
963197c
Fixed metadata.yaml output.
gaurav Jul 1, 2025
b49384b
Updated behavior of concord combinations.
gaurav Jul 1, 2025
640fbb7
Fixed write_concord_metadata() calls.
gaurav Jul 1, 2025
edd78ab
Added metadata to module targets.
gaurav Jul 1, 2025
54802a7
Added target to chemical.
gaurav Jul 1, 2025
4e9e1e7
Centralized UMLS build_sets() metadata generation.
gaurav Jul 1, 2025
2fc7053
Added provenance metadata to diseasephenotype module.
gaurav Jul 1, 2025
b424235
Added metadata to DrugChemical conflations.
gaurav Jul 1, 2025
da2dc1d
Added concords to the Gene module.
gaurav Jul 1, 2025
2009778
Added metadata for genefamilies.
gaurav Jul 2, 2025
a44eda1
Added metadata to a method we probably don't use any more.
gaurav Jul 2, 2025
188cdd0
Added MacromolecularComplex metadata requirement.
gaurav Jul 2, 2025
79a081a
Fixed metadata for MacromolecularComplex.
gaurav Jul 2, 2025
4682c4b
Added provenance metadata to process.
gaurav Jul 2, 2025
62370f0
Added metadata to taxon.
gaurav Jul 2, 2025
142357e
Added publication metadata.
gaurav Jul 2, 2025
4029519
Added metadata for module protein.
gaurav Jul 2, 2025
febac77
Turned DRY_RUN back on.
gaurav Jul 2, 2025
e500b1d
Oops, left off metadata.yaml file.
gaurav Jul 2, 2025
aa4f819
Fixed metadata.yaml for get_protein_pr_uniprotkb_relationships.
gaurav Jul 7, 2025
97772b9
Fixed YAML.
gaurav Jul 7, 2025
8c219cf
Fixed typo.
gaurav Jul 7, 2025
20caac4
Added syntax to ensure that most prov methods are keyword-based.
gaurav Jul 8, 2025
83d129d
Added concord_filename everywhere.
gaurav Jul 9, 2025
ac908b3
Fixed UMLS provenance metadata arguments.
gaurav Jul 9, 2025
3e42f31
Added missing name.
gaurav Jul 9, 2025
b4ca2de
Fixed untyped compendium metadata for chemicals.
gaurav Jul 10, 2025
4f7fc3a
Removed unnecessary counts.
gaurav Jul 11, 2025
b255b12
Fixed metadata file.
gaurav Jul 15, 2025
4553434
Added manual concord for DrugChemical.
gaurav Jul 15, 2025
8f4d5de
Update manual concord predicate count so it's in the right format.
gaurav Jul 15, 2025
3360e82
Improve build_anatomy_obo_relationships() metadata generation.
gaurav Aug 3, 2025
54bc8dd
Fixed name for metadata.
gaurav Aug 3, 2025
57bfcf3
Improved MeSH metadata.
gaurav Aug 3, 2025
7020b7f
Improved PubMed metadata.
gaurav Aug 3, 2025
21e220b
Removed unnecessary change.
gaurav Aug 3, 2025
63a05d0
Merge branch 'add-metadata' into babel-v1.11.0-2025aug2
gaurav Aug 3, 2025
c682498
Renamed duplicate functions so their purpose is clearer.
gaurav Aug 3, 2025
62629ce
Renamed write_ensembl_ids() to clarify what's going on.
gaurav Aug 3, 2025
e0f64ad
Removed redundant function.
gaurav Aug 3, 2025
e873cbe
Added a properties file for CHEBI.
gaurav Aug 3, 2025
1f3491e
First stab at a PropertyStore.
gaurav Aug 3, 2025
255ee46
Got property store working.
gaurav Aug 3, 2025
eb5b3db
Merge branch 'remove-redundant-ensembl-id-code' into babel-v1.11.0-20…
gaurav Aug 3, 2025
83c0a7c
Turned exception into warning, strip() line before splitting.
gaurav Aug 4, 2025
2afcbf7
Fixed bug in Snakemake file: dir() -> directory().
gaurav Aug 4, 2025
ce263ec
Setting PubChem Input encoding to ISO-8859.
gaurav Aug 4, 2025
686c215
Corrected encoding name.
gaurav Aug 4, 2025
5956184
Merge branch 'babel-v1.11.0-2025aug2' into add-chebi-secondary-ids
gaurav Aug 7, 2025
b9a55d9
Rewrote properties in a way that made more sense.
gaurav Aug 7, 2025
fb50c2b
Tweaked the data model further.
gaurav Aug 7, 2025
29c3fd2
Updated usage of chemical properties to match the new format.
gaurav Aug 7, 2025
b921ec0
First stab at incorporating HAS_ADDITIONAL_ID into write_compendium().
gaurav Aug 7, 2025
f322cb2
General code improvements.
gaurav Aug 7, 2025
ba0f4ff
Fixed calling ChEBI identifiers.
gaurav Aug 7, 2025
22ea646
Added support for creating the properties directory.
gaurav Aug 7, 2025
57b9bd2
Fixed typo in filename.
gaurav Aug 7, 2025
6d24ae8
Slightly improved error case.
gaurav Aug 10, 2025
9b4738c
Removed some redundant code, improved an exception.
gaurav Aug 10, 2025
d62a9f0
Added some memory tracking code into the protein compendium.
gaurav Aug 10, 2025
8ece61a
Fixed output so we don't just write out every single frozenset.
gaurav Aug 10, 2025
d3b5de4
Updated memory usage to GB.
gaurav Aug 10, 2025
3a8f196
Added humanfriendly.
gaurav Aug 10, 2025
17f8e2b
Updated requirements.lock.
gaurav Aug 10, 2025
ce3542b
Created get_logger() function to replace LoggingUtil.
gaurav Aug 10, 2025
abc3bdd
Standardized logging in node.py.
gaurav Aug 10, 2025
d4e83de
Documented a future improvement.
gaurav Aug 10, 2025
85e6b46
Reduced number of log outputs.
gaurav Aug 10, 2025
8d430f9
Fixed build_compendium() call.
gaurav Aug 10, 2025
21114c1
Merge branch 'debug-protein-compendium-slowdown' into standardized-lo…
gaurav Aug 10, 2025
3e37331
Improved log message.
gaurav Aug 11, 2025
3ea3426
Standardized logging of new protein compendium building code.
gaurav Aug 11, 2025
6a047db
Added more memory checks in various places.
gaurav Aug 11, 2025
7dca095
Increased the frequency of updates.
gaurav Aug 11, 2025
b9eb998
Tweaks.
gaurav Aug 11, 2025
cb57fdf
Make loggers singletons (sort of).
gaurav Aug 11, 2025
785430b
Improved logger, output.
gaurav Aug 11, 2025
9571698
Updated datefmt to be a bit more ISO8601ish.
gaurav Aug 11, 2025
eb81148
Only set up basicConfig() if we don't have any handlers.
gaurav Aug 11, 2025
d8e76f0
Improved logging.
gaurav Aug 11, 2025
f989836
Improved logging.
gaurav Aug 11, 2025
4a01e06
Can we save memory by using dict[list] instead of defaultdict[set]?
gaurav Aug 11, 2025
fac33b3
Some other improvements.
gaurav Aug 11, 2025
f6f3f11
Minor fixes.
gaurav Aug 11, 2025
f3b93ac
A semi-complete TSVDuckDBLoader implementation.
gaurav Aug 12, 2025
7e043bd
Replaced TaxonFactory by wrapping a TSVDuckDBLoader.
gaurav Aug 12, 2025
a812db0
Turned off some unnecessary log entries.
gaurav Aug 12, 2025
3d20a59
Commit database before querying it.
gaurav Aug 12, 2025
287d0be
Add index, optimize query.
gaurav Aug 12, 2025
32714a4
Changed DuckDB to be in-memory and sorted by curie1.
gaurav Aug 12, 2025
f89e377
Make curie1 case-insensitive.
gaurav Aug 12, 2025
99767ef
Fix: bug in the case-insensitive code.
gaurav Aug 12, 2025
c5635eb
Replaced DuckDB loader with an SQLite loader.
gaurav Aug 12, 2025
6eb2a3f
Added sqlite3 to requirements.
gaurav Aug 12, 2025
ea2dae3
We should make the index after loading the data.
gaurav Aug 12, 2025
2bce0b1
Simplified "test".
gaurav Aug 12, 2025
b9a7ffb
Improved SQLite loader.
gaurav Aug 12, 2025
9e1c7d5
Improved loader.
gaurav Aug 12, 2025
af1fd80
We should save the SQLite so we can query it later.
gaurav Aug 12, 2025
2339e69
Added a note.
gaurav Aug 12, 2025
1dc03bc
Set up overall TMPDIR setting so that SQLite temp files go there.
gaurav Aug 12, 2025
025f6e3
Slightly sped up prefix checks.
gaurav Aug 12, 2025
edc6ec5
Removed "sqlite3", which is a core module apparently.
gaurav Aug 13, 2025
1c18793
Improved seconds/clique display.
gaurav Aug 13, 2025
26abcd8
Fixed a bug in checking if a prefix has already been loaded.
gaurav Aug 13, 2025
37b6f08
Improved seconds/clique display.
gaurav Aug 13, 2025
9d58f69
Added a TMPDIR placed within our downloads directory.
gaurav Aug 12, 2025
448a500
Cache get_config() so we don't keep reparsing the YAML file.
gaurav Aug 13, 2025
f61abf2
Prevent factories from running get_config() on every node.
gaurav Aug 13, 2025
9e2d849
Added py-spy to requirements.
gaurav Aug 13, 2025
d938217
Fixed bug in accessible instance variable.
gaurav Aug 13, 2025
c086eee
Cleaned up code.
gaurav Aug 13, 2025
c456770
Merge branch 'standardized-logging' into update-factories-to-duckdb
gaurav Aug 14, 2025
2b8ee41
Merge branch 'update-factories-to-duckdb' into update-factories-to-sq…
gaurav Aug 14, 2025
8644adc
Add time elapsed to write_compendium().
gaurav Aug 14, 2025
eb1e24e
Increased sig digits for seconds/clique rate.
gaurav Aug 14, 2025
a0416eb
Fixed bug in TSVSQLiteLoader.
gaurav Aug 14, 2025
dcd9555
Made the write_compendium log a little more configurable.
gaurav Aug 14, 2025
f3af423
Made the write_compendium log a little more configurable.
gaurav Aug 14, 2025
a6293c4
Merge branch 'standardized-logging' into update-factories-to-duckdb
gaurav Aug 14, 2025
c8aea8b
Merge branch 'update-factories-to-duckdb' into update-factories-to-sq…
gaurav Aug 14, 2025
e809642
Replaced sources with tuples so that Property is hashable.
gaurav Aug 14, 2025
ab7df9c
Renamed HAS_ADDITIONAL_ID to HAS_ALTERNATIVE_ID.
gaurav Aug 14, 2025
c0edf6e
Added sources to test.
gaurav Aug 14, 2025
b6eb1ec
Made sources into a single string.
gaurav Aug 14, 2025
1f5576f
Fixed bug in converting sources into a list.
gaurav Aug 16, 2025
ed81818
Fixed bug in metadata_yamls where a single item was passed in.
gaurav Aug 16, 2025
c2403c3
write_concord_metadata() now handles single YAML input files.
gaurav Aug 16, 2025
297d47f
Modified behavior when node_factory.create_node() fails.
gaurav Aug 17, 2025
ec6dae3
Fixed input_yamls (prev dict, now just a list of filenames).
gaurav Aug 17, 2025
e07776b
Marked synonyms/Publication.txt as temp().
gaurav Aug 17, 2025
6b6a516
Improved error message when a directory content test fails.
gaurav Aug 17, 2025
d2ad9c8
Fixed bug in error output.
gaurav Aug 17, 2025
e8ed050
Moved DrugChemical metadata into metadata/
gaurav Aug 17, 2025
ec78408
Cleaned up code.
gaurav Aug 17, 2025
6a71101
Moved one configuration option into its own section.
gaurav Aug 18, 2025
45e8e8e
Ack no that's not a temp file.
gaurav Aug 18, 2025
675671e
Ack no that's not a temp file.
gaurav Aug 18, 2025
eca8284
Added GeneProteinConflated.txt.gz as an output.
gaurav Aug 19, 2025
6dbf5be
Tweaked code so it lines up with DrugChemicalConflated.
gaurav Aug 19, 2025
0546767
Added to various reports.
gaurav Aug 19, 2025
4cca517
Merge branch 'add-geneprotein-conflated-synonyms' into babel-v1.12.0
gaurav Aug 19, 2025
531692d
Fixed bug in handling taxa.
gaurav Aug 20, 2025
96eaa59
Merge branch 'add-geneprotein-conflated-synonyms' into babel-v1.12.0
gaurav Aug 21, 2025
675939f
Fixed some bugs in actually exporting the ChEBI alternate properties.
gaurav Aug 26, 2025
7a53ce6
Attempt to fix a bug in KEGG xrefs from ChEBI.
gaurav Aug 26, 2025
4ffb0ad
Added a UMLS-MeSH concord to proteins.
gaurav Aug 27, 2025
9495a6d
Added DRUGBANK mappings to UMLS/protein concords.
gaurav Aug 27, 2025
0fe0690
DuckDB files are temporary, but regenerating them is a pain.
gaurav Aug 27, 2025
14219c3
Added comments to document what chemicals.build_compendia() is doing.
gaurav Aug 27, 2025
d170020
Merge branch 'babel-v1.12.0' into add-umls-mesh-mappings-for-proteins
gaurav Aug 27, 2025
656032e
Fixed PropertyList count.
gaurav Aug 28, 2025
97fd066
Fixed PropertyList count.
gaurav Aug 28, 2025
3afc6f9
Improved log message.
gaurav Aug 28, 2025
30eb4e8
Fixed bug in PropertyList count.
gaurav Aug 28, 2025
a304e71
Merge branch 'babel-v1.12.0' into add-umls-mesh-mappings-for-proteins
gaurav Aug 28, 2025
472ff59
Added DRUGBANK as an extra prefix for the Protein compendia.
gaurav Aug 28, 2025
55fe9fe
Fixed minor bug in properties.
gaurav Aug 28, 2025
9d8a0ce
Fixed bug in getting the alternative ID.
gaurav Aug 28, 2025
2dafc67
Merge branch 'babel-v1.12.0' into add-umls-mesh-mappings-for-proteins
gaurav Aug 28, 2025
b5731ed
Major cleanup of additional CURIEs with labels and properties.
gaurav Aug 28, 2025
4a427be
Merge branch 'babel-v1.12.0' into add-umls-mesh-mappings-for-proteins
gaurav Aug 28, 2025
ece1da8
Added DuckDB to the Dockerfile so we can use the CLI.
gaurav Aug 28, 2025
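Several of the commits above describe moving the factory lookups from DuckDB to an SQLite loader, building the index only after the data is loaded, and making `curie1` lookups case-insensitive. The following is a minimal sketch of that pattern, not the PR's `TSVSQLiteLoader`: the function names, the table name `pairs`, and the two-column layout are assumptions chosen for illustration.

```python
import csv
import io
import sqlite3


def build_lookup(tsv_text: str) -> sqlite3.Connection:
    """Load two-column TSV text into an in-memory SQLite table (hypothetical layout)."""
    conn = sqlite3.connect(":memory:")
    # COLLATE NOCASE makes comparisons on curie1 ignore ASCII case.
    conn.execute("CREATE TABLE pairs (curie1 TEXT COLLATE NOCASE, curie2 TEXT)")
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    conn.executemany(
        "INSERT INTO pairs VALUES (?, ?)",
        (row for row in reader if len(row) == 2),  # skip malformed lines
    )
    # Create the index after the bulk load, so it is built once
    # instead of being updated on every insert.
    conn.execute("CREATE INDEX pairs_curie1 ON pairs (curie1)")
    conn.commit()
    return conn


def lookup(conn: sqlite3.Connection, curie: str) -> list:
    """Return every curie2 value recorded for curie1, ignoring case."""
    cursor = conn.execute(
        "SELECT curie2 FROM pairs WHERE curie1 = ? ORDER BY curie2", (curie,)
    )
    return [row[0] for row in cursor]


conn = build_lookup("NCBITaxon:9606\tHomo sapiens\nNCBITaxon:10090\tMus musculus\n")
print(lookup(conn, "ncbitaxon:9606"))
```

An on-disk database (a file path instead of `":memory:"`) would let the table be queried again later, which is what the commit "We should save the SQLite so we can query it later" appears to be after.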
1 change: 1 addition & 0 deletions Dockerfile
@@ -27,6 +27,7 @@ RUN apt-get install -y screen
RUN apt-get install -y vim
RUN apt-get install -y rsync
RUN apt-get install -y jq
RUN apt-get install -y duckdb

# Create a non-root-user.
RUN adduser --home ${ROOT} --uid 1000 nru
5 changes: 5 additions & 0 deletions Snakefile
@@ -20,6 +20,11 @@ include: "src/snakefiles/duckdb.snakefile"
include: "src/snakefiles/reports.snakefile"
include: "src/snakefiles/exports.snakefile"

# Some global settings.
import os
os.environ['TMPDIR'] = config['tmp_directory']

# Top-level rules.
rule all:
input:
# See rule all_outputs later in this file for how we generate all the outputs.
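The `TMPDIR` assignment above redirects temporary files (the commit message mentions SQLite temp files) to the configured directory. As a rough sketch of the same idea in plain Python: the helper name `use_tmpdir` is hypothetical, and whether Babel creates the directory elsewhere is not shown in this diff, so the sketch creates it itself. Python's `tempfile` module caches its choice of directory, so the cache is reset here to make the change visible in the current process.

```python
import os
import tempfile


def use_tmpdir(path: str) -> str:
    """Point TMPDIR at `path` and return the directory tempfile will now use."""
    # The directory must exist and be writable, or tempfile falls back elsewhere.
    os.makedirs(path, exist_ok=True)
    os.environ["TMPDIR"] = path
    # tempfile remembers the first directory it picked; clear that cache
    # so gettempdir() re-reads the environment.
    tempfile.tempdir = None
    return tempfile.gettempdir()


if __name__ == "__main__":
    # "babel_downloads/tmp" mirrors the tmp_directory value in config.yaml.
    print(use_tmpdir(os.path.join("babel_downloads", "tmp")))
```

Child processes started after the assignment inherit the variable, which is presumably why the Snakefile sets it at the top level before any rules run.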
21 changes: 19 additions & 2 deletions config.yaml
@@ -1,15 +1,28 @@
# Overall inputs and outputs.
input_directory: input_data
download_directory: babel_downloads
intermediate_directory: babel_outputs/intermediate
output_directory: babel_outputs
tmp_directory: babel_downloads/tmp

# Versions that need to be updated on every release.
biolink_version: "4.2.6-rc5"
-umls_version: "2024AB"
-rxnorm_version: "03032025"
+umls_version: "2025AA"
+rxnorm_version: "07072025"
drugbank_version: "5-1-13"

#
# PROTEINS
#

# Chris Bizon prepared a list of UMLS/UniProtKB mappings which we download and use.
UMLS_UniProtKB_download_raw_url: "https://raw.githubusercontent.com/cbizon/UMLS_UniProtKB/refs/heads/main/outputs/UMLS_UniProtKB.tsv"

#
# The rest of these configs need to be cleaned up.
#


ncbi_files:
- gene2ensembl.gz
- gene_info.gz
@@ -145,6 +158,7 @@ protein_concords:
- PR
- NCIT_UniProtKB
- NCIT_UMLS
- UMLS
- UMLS_UniProtKB

protein_outputs:
@@ -282,6 +296,9 @@ chemical_outputs:
drugchemicalconflated_synonym_outputs:
- DrugChemicalConflated.txt

geneproteinconflated_synonym_outputs:
- GeneProteinConflated.txt

taxon_labels:
- NCBITaxon
- MESH
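One commit in this PR caches `get_config()` to avoid reparsing the YAML file on every call. A minimal sketch of that caching pattern follows, assuming the cached file is a config like the one above. The real function's signature is not shown in this diff, and `parse_simple_config` is a hypothetical stand-in that only handles flat `key: value` lines; the actual code presumably uses a proper YAML parser.

```python
import functools


def parse_simple_config(text: str) -> dict:
    """Stand-in parser for flat `key: value` lines (NOT a real YAML parser)."""
    config = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments (naive: breaks on '#' in values)
        if ":" not in line or line.startswith("-"):
            continue  # skip blanks and list items
        key, value = line.split(":", 1)
        value = value.strip().strip('"')
        if value:  # keys that introduce a nested list have no inline value
            config[key.strip()] = value
    return config


@functools.lru_cache(maxsize=None)
def get_config(path: str = "config.yaml") -> dict:
    """Read and parse the config once per path; later calls reuse the result."""
    with open(path, encoding="utf-8") as f:
        return parse_simple_config(f.read())
```

Because `lru_cache` hands back the same dictionary object on every call, callers should treat the result as read-only; `get_config.cache_clear()` forces a reload if the file changes.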
6 changes: 4 additions & 2 deletions requirements.lock
@@ -92,7 +92,7 @@ pronto==2.7.0
propcache==0.3.1
psutil==7.0.0
psycopg2-binary==2.9.10
-PuLP==3.1.1
+PuLP==2.7.0
pydantic==2.11.4
pydantic_core==2.33.2
PyJSG==0.11.10
@@ -128,7 +128,7 @@ ShExJSG==0.8.2
six==1.17.0
smart-open==7.1.0
smmap==5.0.2
-snakemake==9.3.3
+snakemake==7.32.4
snakemake-interface-common==1.17.4
snakemake-interface-executor-plugins==9.3.5
snakemake-interface-logger-plugins==1.2.3
@@ -142,10 +142,12 @@ SQLAlchemy==2.0.40
SQLAlchemy-Utils==0.38.3
sssom==0.4.15
sssom-schema==1.0.0
stopit==1.1.2
stringcase==1.2.0
tabulate==0.9.0
tenacity==8.5.0
throttler==1.2.2
toposort==1.10
tqdm==4.67.1
traitlets==5.14.3
types-python-dateutil==2.9.0.20241206
5 changes: 5 additions & 0 deletions requirements.txt
@@ -31,3 +31,8 @@ duckdb
# on checking if it's online via http://httpstat.us/200, which is often offline. My branch of this
# https://github.com/gaurav/apybiomart/tree/change-check-url and changes that to https://example.org.
git+https://github.com/gaurav/apybiomart.git@change-check-url

# Added by Gaurav, Aug 2025 to check for memory information while Babel is running.
psutil
humanfriendly
py-spy
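The comment above says `psutil` and `humanfriendly` were added to report memory use while Babel runs, and an earlier commit switched that output to GB. A minimal sketch of such a check is below; the function names are hypothetical and this is not the PR's code. It uses `psutil` when installed and otherwise falls back to the standard-library `resource` module (Unix only), which reports peak rather than current usage.

```python
import os
import sys

try:
    import psutil  # third-party; listed in requirements.txt above
except ImportError:
    psutil = None  # fall back to the standard library below


def current_memory_bytes() -> int:
    """Memory used by this process: current RSS via psutil, else peak RSS."""
    if psutil is not None:
        return psutil.Process(os.getpid()).memory_info().rss
    import resource  # Unix-only fallback

    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is in bytes on macOS and in kilobytes on Linux.
    return peak if sys.platform == "darwin" else peak * 1024


def format_gb(num_bytes: int) -> str:
    """Format a byte count in GB (1 GB = 1024**3 bytes here)."""
    return f"{num_bytes / 1024 ** 3:.2f} GB"


def log_memory(label: str) -> str:
    """Print and return a one-line memory report for the given checkpoint."""
    message = f"{label}: {format_gb(current_memory_bytes())} in use"
    print(message)
    return message


if __name__ == "__main__":
    log_memory("after loading identifiers")
```

`humanfriendly.format_size(num_bytes, binary=True)` could replace `format_gb()` if automatically chosen units are preferred over a fixed GB scale.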