[Feature #289] Integrate PostgreSQL-based marker gene functionality #296

ke4 · 2025-08-11T13:06:28Z

Create a new DAO to fetch gene expressions from the new `gxa_marker_gene` table in our PostgreSQL database.

The specific genes in each assay group are clustered. The new table contains the priority of each gene in the `marker_gene_rank` column. Within each assay group, the data should be ordered by that column.

The assay group IDs (the assay field in the DB table, also known as the samples in the given experiment) are calculated beforehand and passed into this service, where we use them as a query filter. They represent the column names in the heatmap table.

We should also select the top X genes for every assay group, where X is 50 divided by the number of assay groups, rounded down to an integer.

For example:

- Number of assay groups = 18 -> floor(50 / 18) = 2. In this case we show 36 rows (2 * 18).
- Number of assay groups = 9 -> floor(50 / 9) = 5. In this case we show 45 rows (5 * 9).

Note: Later on, this value could come from a slider in the UI.

The JSON output should have exactly the same format as the previous one, to avoid further changes on the frontend side.
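
The per-assay-group limit described above can be sketched as follows. This is a minimal illustration rather than the code from this PR: the class name, method names, and the `MAX_ROWS` constant are hypothetical, and only the budget of 50 and the round-down rule come from the description.

```java
/** Hypothetical helper; names are illustrative and not taken from the PR's code. */
final class MarkerGeneLimit {
    // Assumed overall row budget for the heatmap, as stated in the description.
    static final int MAX_ROWS = 50;

    /** Genes to show per assay group: floor(MAX_ROWS / assayGroupCount). */
    static int genesPerAssayGroup(int assayGroupCount) {
        if (assayGroupCount <= 0) {
            throw new IllegalArgumentException("assayGroupCount must be positive");
        }
        // Integer division of two positive ints truncates, which equals rounding down.
        return MAX_ROWS / assayGroupCount;
    }

    /** Total heatmap rows that result from the per-group limit. */
    static int totalRows(int assayGroupCount) {
        return genesPerAssayGroup(assayGroupCount) * assayGroupCount;
    }

    public static void main(String[] args) {
        System.out.println(genesPerAssayGroup(18) + " genes, " + totalRows(18) + " rows"); // 2 genes, 36 rows
        System.out.println(genesPerAssayGroup(9) + " genes, " + totalRows(9) + " rows");   // 5 genes, 45 rows
    }
}
```

If the value later comes from a UI slider, `MAX_ROWS` would presumably become a parameter instead of a constant.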

Changes:

- Add MarkerGeneDao to fetch marker gene profiles and counts from PostgreSQL when looking for Most Specific genes.
- Updated BaselineExperimentProfilesService to conditionally use the new DAO for marker genes or the existing Solr-based implementations.
- Included unit tests for MarkerGeneDao and updated service tests for comprehensive coverage.
- Modified some related services / tests

Add `MarkerGeneDao` to fetch marker gene profiles and counts from PostgreSQL when `specific` flag in preferences is true. Updated `BaselineExperimentProfilesService` to conditionally use the new DAO for marker genes or the existing Solr-based implementations. Included unit tests for `MarkerGeneDao` and updated service tests for comprehensive coverage.

Introduced new database fixture and cleanup scripts for the `gxa_marker_gene` table to support integration testing. Updated relevant test classes to populate and clean this table during test setup and teardown phases. These changes enable testing scenarios that involve marker gene data.

Update `ExpressionUnit` enums to include a `getDatabaseValue()` method for consistent database representation. Adjust related service and DAO logic to use this method, ensuring accurate parameter handling. Enhance test coverage to validate new functionality.
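
As an illustration of the idea, an enum can carry its database representation separately from its constant name. The sketch below is a simplified stand-in, not the project's `ExpressionUnit` hierarchy: the enum name `RnaExpressionUnit` is hypothetical, and the stored values `fpkms` and `tpms` are assumed from the later commits in this PR.

```java
/** Simplified, hypothetical stand-in for the project's ExpressionUnit enums. */
enum RnaExpressionUnit {
    // Database values assumed from later commits in this PR ("fpkms", "tpms").
    FPKM("fpkms"),
    TPM("tpms");

    private final String databaseValue;

    RnaExpressionUnit(String databaseValue) {
        this.databaseValue = databaseValue;
    }

    /** Returns the value as stored in the database, independent of the constant's name. */
    String getDatabaseValue() {
        return databaseValue;
    }
}
```

Callers and tests can then pass `RnaExpressionUnit.TPM.getDatabaseValue()` as a query parameter instead of a hardcoded string, so a change in the stored spelling only needs to be made in one place.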

Extracted reusable SQL fragments and modularized query-building logic for better readability and maintainability. Added helper methods to streamline SQL execution, parameter building, and results processing. Introduced stricter null constraints for method parameters to improve type safety.

…tion Detailed javadoc comments were added to DAO and service classes to explain their responsibilities, methods, and parameters. This improves code readability and helps developers understand the purpose and usage of each component more effectively.

Streamlined the test setup by introducing helper methods for initialization. Renamed mock variable names for clarity and updated to reflect switch from MarkerGeneDao to PostgresDao usage. Improved readability and maintained consistency across test cases.

Replaced all instances of 'TPM' with 'tpms' in the GXA marker gene fixture to ensure consistency with expected data formats. This change aligns the test data with the application's case sensitivity requirements.

Added logic to match assay names with assay groups using column headers, along with supporting utility methods for safer JSON handling and improved logging for unmatched cases.

…selineExperimentProfilesServiceTest for the updated code This update introduces mockColumnHeaders to ensure proper handling of assay group metadata in test cases.

… correct value from DB Replaced hardcoded strings ("TPM", "FPKM") with `ExpressionUnit.Absolute.Rna.getDatabaseValue()` in test cases. This ensures consistency with database values and reduces potential errors from string mismatches.

Bump the version from 37.1.1 to 37.2.0 in the Gradle build script. This update likely includes changes or improvements aligning with the new version.

Added filtering by assay names and optional gene ids passed in by params. Introduced a constant for maximum marker genes and adjusted logic accordingly.

Refactored SQL queries and parameters to handle gene filtering by both gene IDs and gene names. Introduced additional checks for empty gene queries and adjusted query construction accordingly. Updated expression unit enums to include lowercase variants for consistent handling.

Consolidated logic for fetching marker gene profiles by replacing `fetchSpecificGeneProfiles` with `fetchMarkerGeneProfiles`. Introduced helper methods for cleaner query parameter handling and improved readability. Updated affected tests to align with the refactored method signatures.

Simplified test methods by removing redundant comments, consolidating common logic, and improving naming for better clarity. Introduced helper methods and reused constants to reduce duplication and ensure consistency across tests.

Added a default SemanticQuery for gene queries in `RnaSeqBaselineRequestPreferences` initialization. This ensures consistent handling of gene-related queries during baseline test setups.

Simplified logic in `BaselineExperimentProfilesService` by removing specific gene search handling.

Add a safeguard in MarkerGeneDao to set a minimum marker gene rank limit of 1 when the calculated value is less than 1.
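
The safeguard matters once there are more assay groups than the maximum number of marker genes, because the integer division then yields 0. A minimal sketch, assuming a maximum of 50 as in the PR description; the class, method, and constant names are hypothetical:

```java
/** Hypothetical sketch of the minimum-rank safeguard; not the PR's actual code. */
final class RankLimitSafeguard {
    // Assumed maximum number of marker genes, taken from the PR description.
    static final int MAX_MARKER_GENES = 50;

    /** Rank limit per assay group, never lower than 1. */
    static int rankLimit(int assayGroupCount) {
        if (assayGroupCount <= 0) {
            throw new IllegalArgumentException("assayGroupCount must be positive");
        }
        // With more than MAX_MARKER_GENES groups the division gives 0, so clamp to 1.
        return Math.max(1, MAX_MARKER_GENES / assayGroupCount);
    }
}
```

For example, 60 assay groups would give a limit of 1 rather than 0, so every assay group still contributes its top-ranked gene.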

Replaced HashMap with LinkedHashMap to preserve insertion order. Updated SQL query to include ordering by assay names and descending expression levels. Adjusted query parameters to reflect the new ordering logic.
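
The reason for the map change can be shown with a small sketch: a `LinkedHashMap` iterates in insertion order, so profiles grouped from an ordered result set keep the order the query produced, whereas `HashMap` gives no such guarantee. The `Row` type and the method name below are hypothetical, since the DAO's real row type is not shown here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: group ordered query rows by gene while keeping query order. */
final class OrderedProfiles {
    /** Illustrative row shape; field names are assumptions. */
    static final class Row {
        final String geneId;
        final String assayGroupId;
        final double expressionLevel;

        Row(String geneId, String assayGroupId, double expressionLevel) {
            this.geneId = geneId;
            this.assayGroupId = assayGroupId;
            this.expressionLevel = expressionLevel;
        }
    }

    /** Groups rows by gene ID; keys iterate in the order genes first appear. */
    static Map<String, List<Row>> groupByGene(List<Row> rowsInQueryOrder) {
        Map<String, List<Row>> byGene = new LinkedHashMap<>();
        for (Row row : rowsInQueryOrder) {
            byGene.computeIfAbsent(row.geneId, key -> new ArrayList<>()).add(row);
        }
        return byGene;
    }
}
```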

Enhance SQL query in `MarkerGeneDao` to exclude entries with null `marker_gene_rank` for improved data accuracy.

Replaced assay names with assay group IDs for improved consistency and clarity in query logic and data mapping. Updated SQL query, helper methods, and associated logic to reflect this change. Removed unused code and redundant logging for better maintainability.

Eliminated the redundant "factorValue"/"name" parameter in fetchMarkerGeneProfiles method calls across the codebase. Updated corresponding test cases to align with the method signature changes for consistency and accuracy.

Updated SQL queries to use `StringSubstitutor` for dynamic parameter substitution, improving readability and maintainability. Replaced positional parameters with named placeholders and refactored query parameter handling to generate a map of values for substitution. Removed redundant parameters in `executeQuery`, since substitution simplifies query preparation.

Changed RNA expression unit enums from "fpkm" to "fpkms" and "tpm" to "tpms" to reflect pluralization. This ensures consistency and clarity in naming conventions.

Replaced references to `assay names` with `assay_group_id` for consistency with the updated data structure. Adjusted related test logic and method calls accordingly to reflect this change.

Replaced PostgreSQL's `ARRAY_POSITION` function with Java-based sorting to improve flexibility and maintainability. Updated the SQL query to remove `ARRAY_POSITION` and added a utility method to sort results by assay order in Java. This ensures consistent ordering of marker genes based on assay group IDs.
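
One way to do this kind of ordering in Java is to look up each row's position in the requested list of assay group IDs and sort on that position. The sketch below is an assumption about the approach, not the PR's utility method: the `Row` type, the method name, and the choice to place unknown IDs last are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of sorting rows by the requested assay group order. */
final class AssayOrderSorter {
    /** Illustrative row shape; field names are assumptions. */
    static final class Row {
        final String assayGroupId;
        final String geneId;

        Row(String assayGroupId, String geneId) {
            this.assayGroupId = assayGroupId;
            this.geneId = geneId;
        }
    }

    /** Returns a new list ordered by each row's position in assayGroupIds. */
    static List<Row> sortByAssayOrder(List<Row> rows, List<String> assayGroupIds) {
        // Precompute positions once instead of calling indexOf for every comparison.
        Map<String, Integer> position = new HashMap<>();
        for (int i = 0; i < assayGroupIds.size(); i++) {
            position.putIfAbsent(assayGroupIds.get(i), i);
        }
        List<Row> sorted = new ArrayList<>(rows);
        // List.sort is stable, so rows sharing an assay group keep their incoming order
        // (for example the rank order produced by the query). Unknown IDs sort last.
        sorted.sort(Comparator.comparingInt(
                (Row row) -> position.getOrDefault(row.assayGroupId, Integer.MAX_VALUE)));
        return sorted;
    }
}
```

Because the sort is stable, any ordering the SQL query already applied within an assay group survives the Java-side reordering.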

Updated the SQL schema to include a new `assay_id` column in the marker gene table and adjusted the test data accordingly. Refactored test constants and methods in `BaselineExperimentProfilesServiceTest` to align with the updated schema.

app/src/main/java/uk/ac/ebi/atlas/experimentpage/baseline/profiles/MarkerGeneDao.java

sandsebi

Added my comments.

Replaced JdbcTemplate with NamedParameterJdbcTemplate for better SQL parameter handling and readability. Updated related methods, queries, and tests to accommodate the change. This ensures improved maintainability and reduces the risk of SQL injection.

sandsebi

Looks good 👍

ke4 self-assigned this Aug 11, 2025

ke4 added the enhancement and high priority labels Aug 11, 2025

ke4 added 26 commits August 12, 2025 10:13

Update expression unit to lowercase 'tpms' in test fixture

1279c20

Get assay ID from column headers by assay name

de4a25a

Add mockColumnHeaders to align test cases in MarkerGeneDaoTest and Ba…

bed298f

…selineExperimentProfilesServiceTest for the updated code This update introduces mockColumnHeaders to ensure proper handling of assay group metadata in test cases.

Update project version to 37.2.0

5c1b85d

Refactor marker gene SQL queries that match the modified requirements

e6214a6

Set gene query in request preferences for baseline tests

68f557a

Refine gene profile fetching logic and Upgrade dependencies

e28dca4

Ensure minimum marker gene rank limit

3bc19ef

Order marker genes by assay names and expression level

296c8af

Filter marker genes without a rank

f26aa94

Remove unused parameter from fetchMarkerGeneProfiles calls

375f4c8

Update RNA expression units from singular to plural

e137e79

Update test to use assay_group_id instead of assay names

dbad376

ke4 marked this pull request as ready for review September 8, 2025 16:01

ke4 requested review from amnonkhen and sandsebi September 8, 2025 16:01

ke4 linked an issue Sep 8, 2025 that may be closed by this pull request

Load most-specific gene expressions for bulk baseline experiments from new Postgres table #289

Closed

6 tasks

sandsebi reviewed Sep 9, 2025

app/src/main/java/uk/ac/ebi/atlas/experimentpage/baseline/profiles/MarkerGeneDao.java

sandsebi reviewed Sep 9, 2025

app/src/main/java/uk/ac/ebi/atlas/experimentpage/baseline/profiles/MarkerGeneDao.java

sandsebi reviewed Sep 9, 2025

sandsebi approved these changes Sep 11, 2025

ke4 merged commit bae6fef into develop Sep 11, 2025
2 checks passed

ke4 deleted the feature/integrate_marker_gene_table branch September 11, 2025 13:01
