

@gaurav gaurav commented Aug 18, 2025

Changes in this PR:

  • Rewrote part of the HAS_ADDITIONAL_ID logic so that additional identifiers are added later in the process, which lets us label them without node_factory.create_node() re-sorting them.
  • Modified the description factory so that it processes lists of identifiers instead of a node.
  • Increased ephemeral storage for the Babel pod to 1G so we have a bit more spare capacity before we run out of ephemeral space.
  • Made DuckDB files non-temporary (so we don't have to regenerate them if we rebuild something), but added an rsync-to-server.sh script that excludes the DuckDB files when rsyncing the Babel Outputs to a directory.
  • KEGG.COMPOUND cross-references in ChEBI are now a list instead of a single ID, so they need to be handled separately. Closes "KEGG.COMPOUND CURIEs broken in Babel 2025aug17" (#493) (7a53ce6).
  • Added unique counts for property lists.
  • Improved logging and memory usage for createcompendia/chemicals.py.
  • Added memory usage logging for exporters/kgx.py and in other parts of write_compendia().
  • Added some comments to config.yaml.
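
The KEGG.COMPOUND change above comes down to accepting a cross-reference value that may be either a single ID or a list of IDs. A minimal sketch of that idea follows; the function name `normalize_xrefs` and its parameters are hypothetical and are not taken from the Babel codebase, which may handle this differently.

```python
def normalize_xrefs(value, prefix="KEGG.COMPOUND"):
    """Return a list of CURIEs from a value that may be None, one ID, or a list of IDs.

    Hypothetical helper for illustration only; not part of Babel.
    """
    if value is None:
        return []

    # Wrap a single ID so that both cases share one code path.
    ids = [value] if isinstance(value, str) else list(value)

    curies = []
    for raw in ids:
        local_id = str(raw).strip()
        if not local_id:
            continue  # skip empty entries
        # Avoid double-prefixing IDs that already arrive as CURIEs.
        if local_id.upper().startswith(prefix.upper() + ":"):
            local_id = local_id[len(prefix) + 1:]
        curies.append(f"{prefix}:{local_id}")
    return curies


print(normalize_xrefs("C00001"))
print(normalize_xrefs(["C00001", "KEGG.COMPOUND:C00002"]))
```

Returning a list in every case means callers never have to branch on the input type, which is the kind of mismatch that can produce malformed CURIEs when a list is formatted as if it were a string.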

Important bug fixes in this release:

| Filename | 2025mar31 | 2025aug17 | Diff | % Diff |
|---|---:|---:|---:|---:|
| Count of CURIEs in all files | 677,806,537 | 688,567,091 | +10,760,554 | 1.59% |
| Count of cliques in all files | 482,038,965 | 490,614,550 | +8,575,585 | 1.78% |
| AnatomicalEntity | 249,164 | 249,584 | +420 | 0.17% |
| BiologicalProcess | 68,110 | 67,929 | -181 | -0.27% |
| Cell | 12,920 | 13,175 | +255 | 1.97% |
| CellLine | 38,810 | 38,810 | 0 | 0.00% |
| CellularComponent | 14,676 | 14,696 | +20 | 0.14% |
| ChemicalEntity | 662,991 | 4,086,825 | +3,423,834 | 516.42% |
| ChemicalMixture | 529 | 527 | -2 | -0.38% |
| ComplexMolecularMixture | 286 | 276 | -10 | -3.50% |
| Disease | 628,736 | 632,330 | +3,594 | 0.57% |
| Drug | 359,131 | 360,953 | +1,822 | 0.51% |
| Gene | 75,520,611 | 79,427,652 | +3,907,041 | 5.17% |
| GeneFamily | 27,985 | 28,050 | +65 | 0.23% |
| GrossAnatomicalStructure | 15,655 | 15,709 | +54 | 0.34% |
| MacromolecularComplex | 1,258 | 1,258 | 0 | 0.00% |
| MolecularActivity | 203,862 | 206,636 | +2,774 | 1.36% |
| MolecularMixture | 21,150,406 | 21,758,582 | +608,176 | 2.88% |
| OrganismTaxon | 3,455,969 | 3,543,867 | +87,898 | 2.54% |
| Pathway | 52,815 | 53,125 | +310 | 0.59% |
| PhenotypicFeature | 480,603 | 483,108 | +2,505 | 0.52% |
| Polypeptide | 188 | 167 | -21 | -11.17% |
| Protein | 274,887,335 | 275,407,170 | +519,835 | 0.19% |
| Publication | 78,061,642 | 79,773,973 | +1,712,331 | 2.19% |
| SmallMolecule | 221,021,085 | 221,504,843 | +483,758 | 0.22% |
| umls | 891,770 | 897,846 | +6,076 | 0.68% |
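
For reference, the Diff and % Diff columns can be reproduced from the two count columns. The sketch below assumes % Diff is the change relative to the 2025mar31 count, which matches the figures in the table; the function name `diff_row` is hypothetical and is not the script used to generate the table.

```python
def diff_row(old, new):
    """Return (diff, percent diff) formatted like the table columns.

    Hypothetical helper; assumes the percentage is relative to the old count.
    """
    diff = new - old
    pct = (diff / old * 100) if old else float("nan")
    # Signed, comma-grouped difference; a zero difference is shown as a bare "0".
    diff_text = f"{diff:+,}" if diff else "0"
    return diff_text, f"{pct:.2f}%"


print(diff_row(677_806_537, 688_567_091))  # counts of CURIEs in all files
print(diff_row(188, 167))                  # Polypeptide
```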

@gaurav gaurav changed the base branch from master to fix-chemical-properties August 18, 2025 05:32
@gaurav gaurav changed the base branch from fix-chemical-properties to add-geneprotein-conflated-synonyms August 19, 2025 03:32
Base automatically changed from add-geneprotein-conflated-synonyms to master September 15, 2025 22:45
@gaurav gaurav marked this pull request as ready for review September 15, 2025 23:07
@gaurav gaurav merged commit a82ed7a into master Sep 15, 2025
@gaurav gaurav deleted the babel-v1.12.0 branch September 15, 2025 23:08