Databases (#185)

jessicaw9910 · pre-commit-ci[bot] · claude · web-flow · commit 3c77619764b8 · 2025-12-09T19:11:59.000-05:00
* removed comment
* removed kinase_schema.CollectionKinaseInfo
* comment on PRKD2 and AlphaMissense
* temporary scratch for aligning sequences to DiscoverX
* implemented new class ChEMBLMolecule to query for molecule details
* added xlrd to package dependencies to process Davis dataset
* preliminary info for davis harmonization
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* add check_molecules to ChEMBL; updated wrong ChEMBLMolecule argument
* add check_molecules to ChEMBL; updated wrong ChEMBLMolecule argument
* make rdkit a package dependency
* cli for querying ChEMBL for dataset preprocessing
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* moved davis and pkis2 modules to datasets
* changed error message for maybe_get_symbol_from_hgnc_search if custom_field provided
* updates to pkis2 and davis datasets modules
* removed commented out PR CIs for databases and schema
* fixed chembl search error - default empty list not None
* added adjudicate_kd_start and adjudicate_kd_end for dataset incorporation purposes
* added docstring for bool_offset
* allow for str_fasta to be used if need to hardcode for errors
* removed pytest.mark.skip as NCBI API is currently running
* added function to check if lipid kinase
* specified input_is_hgnc_symbol default in docstring
* added Pfam docstring
* UniProtRefSeqProteinGET and query_uniprotbulk_api to uniprot module; modifies nf-rnaseq package tooling
* fully working initial commit of discoverx module; construct to KD/KLIFS mapping outstanding
* added verbose flag to the KinaseInfo functions rather than logging by default
* added verbose flags
* added and commented out pip install nf-rnaseq from github; uncomment for testing if in use
* import only UniProtFASTA rather than entire uniprot module to avoid nf-rnaseq import errors; fix if want to test this functionality
* uncommented nf-rnaseq
* in progress datasets commit
* used verbose flag for caplog tests
* dict_refseq_indices working correctly
* dict_construct_sequences finalized - use this to generate harmonized representations
* generate the dataset csv files
* process now contains all code necessary to generate different aligned input sequences
* conformed to latest process module structure
* added dataset csv CLI to pyproject.toml
* added plotting functions for discoverx
* upgrades for discoverx plotting
* CLI script to generate poster dataset plots
* plot both SVG and PNG formats for all
* added plot dynamic range to the plotting CLI, need to fix font size
* fixed svg in plot_dynamic_range - font still looks a little off; added docstrings and fixed comment format
* Fix test_pfam and test_ncbi to handle API 500 errors gracefully

  Handle RetryError exceptions when external APIs return 500 errors by skipping tests instead of failing. This prevents CI failures due to unpredictable external API availability.

  Changes:
  - Wrap test_pfam API calls in try-except block
  - Wrap test_ncbi API calls in try-except block
  - Skip tests with informative messages when 500 errors occur
  - Re-raise other exceptions to catch real issues

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-Authored-By: Claude <noreply@anthropic.com>
* Refactor plotting code and fix SVG font rendering issues

  This commit improves the plotting functionality by:
  1. Creating a reusable save_plot() helper function to reduce code duplication
  2. Fixing SVG font rendering issues by converting text to paths
  3. Improving mathtext rendering for subscripts (K_d, log_10)

  Changes:
  - Add save_plot() function to handle saving both SVG and PNG formats
  - Replace repetitive save code in all 5 plotting functions
  - Change svg.fonttype from "none" to "path" for consistent rendering
  - Update mathtext from \mathregular to \mathrm for proper subscript rendering
  - Ensure plots render consistently in browsers, VS Code, and vector editors

  Benefits:
  - SVG files now render perfectly in all viewers without spacing/kerning issues
  - Reduced code duplication by ~60 lines
  - Easier maintenance with centralized save logic
  - Consistent behavior across all plotting functions

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-Authored-By: Claude <noreply@anthropic.com>
* absolute filepath for cwd instead of '.'
* fixed KinaseMissenseMutations.dict_replace - only do this if key in original dataset
* make the checks and post_init optional in case loading from a CSV file for a cohort that requires a VPN - logger errors are now warnings; allow load_from_csv from an input str if loading from multiple dataframes (e.g., KinaseMissenseMutations._df and ._df_filter); added pathfile_filter to KinaseMissenseMutations
* updated databases for kw_only arg study_id in Mutations
* fixed bug in dict_kinase_cbio in get_kinase_missense_mutations function - need to check if mkt_name is in dict_kinase_cbio rather than cbio_name
* changed HGNC name and mismatch error logging

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
diff --git a/missense_kinase_toolkit/databases/mkt/databases/cbioportal.py b/missense_kinase_toolkit/databases/mkt/databases/cbioportal.py
@@ -459,9 +459,9 @@ def query_hgnc_gene_names(
                 )["uniprot_ids"][0][0]
                 dict_hgnc2uniprot[hgnc_name] = uniprot_id
             except Exception as e:
-                logger.error(f"Error retrieving Uniprot ID for {hgnc_name}: {e}")
-                list_err.append(hgnc_name)
-        logger.error(f"List errors:\n{list_err}")
+                list_err.append(f"{hgnc_name}: {e}")
+        str_errors = "\n".join(list_err)
+        logger.error(f"Errors retrieving HGNC gene names:\n{str_errors}")
 
         # replace any HGNC gene names in the dictionary
         for cbio_name, mkt_name in self.dict_replace.items():
@@ -585,7 +585,11 @@ def remove_mismatched_uniprot_mutations(
 
         # TODO: check non-mismatches for list_set_kinase_mismatch gene_hugoGeneSymbol
         set_kinase_mismatch = {i.split("_")[1] for i in list_mismatch + list_err}
-        logger.error(f"HGNC gene names with mismatches: {set_kinase_mismatch}")
+        str_errors = "\n".join(set_kinase_mismatch)
+        logger.error(
+            "HGNC gene names of kinases with mismatches between "
+            f"cBioPortal and canonical Uniprot sequences:\n {str_errors}"
+        )
         df_filtered = df.loc[
             ~df["gene_hugoGeneSymbol"].isin(set_kinase_mismatch), :
         ].reset_index(drop=True)
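
The logging change in the first hunk replaces one `logger.error` call per failed lookup with a single aggregated message. A minimal standalone sketch of that pattern follows. The names `collect_lookup_errors` and `fake_lookup` and the small accession table are hypothetical stand-ins, not part of the toolkit, and the `if list_err:` guard is an assumption of this sketch; the visible hunk does not show whether the real method checks for an empty list before logging.

```python
import logging

logger = logging.getLogger(__name__)


def collect_lookup_errors(names, lookup):
    """Apply `lookup` to each name, collecting failures instead of logging each one.

    Returns a (results, list_err) tuple; list_err holds "name: error" strings.
    """
    results = {}
    list_err = []
    for name in names:
        try:
            results[name] = lookup(name)
        except Exception as e:
            # store the name together with its error, as in the diff
            list_err.append(f"{name}: {e}")
    # assumption: only log when something actually failed
    if list_err:
        str_errors = "\n".join(list_err)
        logger.error(f"Errors retrieving HGNC gene names:\n{str_errors}")
    return results, list_err


def fake_lookup(name):
    """Hypothetical stand-in for the HGNC-to-UniProt query; raises KeyError on a miss."""
    table = {"EGFR": "P00533", "BRAF": "P15056"}
    return table[name]


results, errors = collect_lookup_errors(["EGFR", "NOTAGENE", "BRAF"], fake_lookup)
```

Here `results` holds the two successful lookups and `errors` holds one entry for `NOTAGENE`, so the log receives one multi-line message rather than one line per failure.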