Add mutation domain classes and use cases #11650

i-am-leslie · 2025-08-01T19:08:13Z

Migrate Mutation to Clean Architecture with ClickHouse Support

This PR refactors the mutation data endpoints from the legacy architecture to clean architecture with ClickHouse support, as part of the broader backend migration to ClickHouse and clean architecture patterns.

ARCHITECTURAL CHANGES

Usecase Layer

FetchAllMetaMutationsInProfileUseCase: Use case for retrieving MetaMutation data
FetchAllMutationsInProfileUseCase: Use case for retrieving Mutation data
GetMutationDataUseCases: Provides a centralized way to access and use the use cases

Repository

MutationRepository: Interface that defines the methods for retrieving data from the repository

Infrastructure

ClickhouseMutationMapper: Maps the repository methods to the ClickhouseMutationMapper.xml file
ClickhouseMutationRepository: Implements the MutationRepository methods and communicates with the database through ClickhouseMutationMapper

Rest Layer

ColumnMutationController: New @Profile("clickhouse") REST controller supporting retrieval of Mutation and MutationMeta data

MapStruct Integration

MutationMapper: Main mapper with computed uniqueKey fields
AlleleSpecificCopyNumberMapper: Nested object transformation
GeneMapper: Nested object transformation
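
The layering described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class and interface names come from the description, but the method names (`getMutationsInProfile`, `selectMutations`, `execute`), the two-field `Mutation` record, and the lambda standing in for the MyBatis mapper are assumptions made for the example.

```java
import java.util.List;

final class MutationLayersSketch {
    // Domain object, trimmed to two fields for illustration only.
    record Mutation(String molecularProfileId, String sampleId) {}

    // Domain-facing port. The method name is hypothetical.
    interface MutationRepository {
        List<Mutation> getMutationsInProfile(String molecularProfileId);
    }

    // Stand-in for the MyBatis mapper interface backed by the XML file.
    // The method name is hypothetical.
    interface ClickhouseMutationMapper {
        List<Mutation> selectMutations(String molecularProfileId);
    }

    // Infrastructure adapter: implements the port by delegating to the mapper.
    static final class ClickhouseMutationRepository implements MutationRepository {
        private final ClickhouseMutationMapper mapper;

        ClickhouseMutationRepository(ClickhouseMutationMapper mapper) {
            this.mapper = mapper;
        }

        @Override
        public List<Mutation> getMutationsInProfile(String molecularProfileId) {
            return mapper.selectMutations(molecularProfileId);
        }
    }

    // Use case: depends only on the repository interface, not on ClickHouse.
    static final class FetchAllMutationsInProfileUseCase {
        private final MutationRepository repository;

        FetchAllMutationsInProfileUseCase(MutationRepository repository) {
            this.repository = repository;
        }

        List<Mutation> execute(String molecularProfileId) {
            return repository.getMutationsInProfile(molecularProfileId);
        }
    }

    // Aggregating record that a controller could receive as one dependency.
    record GetMutationDataUseCases(
        FetchAllMutationsInProfileUseCase fetchAllMutationsInProfileUseCase) {}

    // Wires the layers together with an in-memory mapper in place of the database.
    static List<Mutation> demo() {
        ClickhouseMutationMapper mapper = id -> List.of(new Mutation(id, "sample_1"));
        GetMutationDataUseCases useCases = new GetMutationDataUseCases(
            new FetchAllMutationsInProfileUseCase(new ClickhouseMutationRepository(mapper)));
        return useCases.fetchAllMutationsInProfileUseCase().execute("study_mutations");
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the use case only sees `MutationRepository`, the ClickHouse-specific pieces can be swapped or mocked in tests without touching the domain layer.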

Simple Flow Diagram (image not reproduced here)

Limitations

While reducing the number of joins and populating every field the Mutation object requires in the SUMMARY and DETAILED projections, I ran into the following limitation:
Because not all Mutation fields are available in the derived tables, some joins to base tables (e.g., mutation, mutation_event) are still necessary.
The number of joins was reduced:
SUMMARY projection – reduced to 4 joins
DETAILED projection – reduced to 6 joins
To achieve this, the queries start from the derived table and join back to the base tables. However, this approach sometimes produces a Cartesian effect that duplicates rows in the query results. The duplication does not occur for every query, but it is reproducible in certain cases.

The root cause is that genomic_event_derived has no unique field that can be correlated with the mutation table. For this reason, using the legacy SQL is the best approach right now.
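
To illustrate why a missing unique key causes the duplication, here is a small in-memory sketch of a join on a non-unique key. It is only a model of the effect: the row types, the choice of (sampleId, entrezGeneId) as the join key, and the sample values are assumptions for the example, not the actual columns or data used by the queries in this PR.

```java
import java.util.ArrayList;
import java.util.List;

final class JoinFanOutSketch {
    // Hypothetical row shapes. The derived row carries no mutation event id.
    record DerivedRow(String sampleId, int entrezGeneId) {}
    record MutationRow(long mutationEventId, String sampleId, int entrezGeneId) {}
    record Joined(DerivedRow derived, MutationRow mutation) {}

    // Nested-loop inner join on (sampleId, entrezGeneId).
    static List<Joined> join(List<DerivedRow> derived, List<MutationRow> mutations) {
        List<Joined> out = new ArrayList<>();
        for (DerivedRow d : derived) {
            for (MutationRow m : mutations) {
                if (d.sampleId().equals(m.sampleId())
                        && d.entrezGeneId() == m.entrezGeneId()) {
                    out.add(new Joined(d, m));
                }
            }
        }
        return out;
    }

    // Two mutations in the same gene of the same sample: each side has two
    // rows sharing one key, so the join yields 2 x 2 rows instead of 2.
    static int demo() {
        List<MutationRow> mutations = List.of(
            new MutationRow(101L, "S1", 672),
            new MutationRow(102L, "S1", 672));
        List<DerivedRow> derived = List.of(
            new DerivedRow("S1", 672),
            new DerivedRow("S1", 672));
        return join(derived, mutations).size();
    }

    public static void main(String[] args) {
        System.out.println("joined rows: " + demo());
    }
}
```

Samples with only one mutation per gene join one-to-one, which would be consistent with the duplication appearing only in certain cases.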

Apart from this query-level limitation, the application layer migration to Clean Architecture was completed successfully. The repository, use cases, and controllers are fully migrated, and the system runs with ClickHouse support. The outstanding work is primarily around optimizing the SQL.

Reviewer Notes
Please feel free to review the overall structure and approach. Feedback on the direction, naming, or anything architectural is welcome, especially in the use case logic, queries and dependency setup.

src/main/java/org/cbioportal/application/rest/vcolumnstore/ColumnMutationController.java

src/main/java/org/cbioportal/domain/mutation/usecase/GetMutationDataUseCases.java

fuzhaoyuan · 2025-08-04T18:12:44Z

src/main/java/org/cbioportal/domain/mutation/usecase/GetMutationDataUseCases.java

+ * @param fetchAllMutationsInProfileUseCase
+ */
+public record GetMutationDataUseCases(
+    FetchAllMetaMutationsInProfileUseCase fetchAllMetaMutationsInProfileUseCase,


The use case names will need to be changed, but that can wait until later.

fuzhaoyuan · 2025-08-04T19:11:35Z

src/main/java/org/cbioportal/domain/mutation/Mutation.java

+    String variantAllele, 
+    String uniqueSampleKey, 
+    String uniquePatientKey
+    ) implements Serializable {


I noticed we have a clean-architecture CancerStudyMetadata that was transitioned from the legacy CancerStudy and looks like this. Could you use the implementation of that endpoint's GetCancerStudyMetadataUseCase as a reference?


i-am-leslie · 2025-08-17T20:55:33Z

src/main/java/org/cbioportal/shared/MutationSearchCriteria.java

I am not 100% sure this is where this file belongs, but I used this record to hold all the search criteria for the endpoint I am currently transitioning to clean architecture. I would appreciate advice on where it should live.

i-am-leslie · 2025-08-17T21:11:39Z

...a/org/cbioportal/infrastructure/repository/clickhouse/mutation/ClickhouseMutationMapper.java

The snpOnly parameter is currently not used. In the legacy implementation, it was always hardcoded to false, and in the MyBatis SQL, it was only checked for true, meaning the condition was never satisfied. Should we clean it up or leave it?
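
As a rough sketch of why the condition can never fire, consider a query builder that appends a filter only when the flag is true. The method name `buildWhere`, the column name, and the placeholder filter text are hypothetical and not taken from the mapper XML; only the pattern (flag fixed to false, branch checked for true) reflects the situation described above.

```java
final class SnpOnlySketch {
    // Placeholder for whatever SNP filter the real mapper applies; not from the PR.
    static final String SNP_CONDITION = "/* snp-only filter */";

    // Appends the extra condition only when snpOnly is true, similar in spirit
    // to a MyBatis <if test="snpOnly == true"> block.
    static String buildWhere(boolean snpOnly) {
        StringBuilder where = new StringBuilder("WHERE molecular_profile_id = ?");
        if (snpOnly) {
            where.append(" AND ").append(SNP_CONDITION);
        }
        return where.toString();
    }

    public static void main(String[] args) {
        // The legacy caller always passed false, so the branch above never ran.
        boolean snpOnly = false;
        System.out.println(buildWhere(snpOnly));
    }
}
```

With the flag fixed to false, removing the parameter and the conditional would not change the generated SQL.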


alisman · 2025-09-06T22:29:47Z

src/main/resources/mappers/clickhouse/mutation/ClickhouseMutationDataMapper.xml

+
+<!--    For now only works for projection SUMMARY not DETAILED PROJECTION -->
+    <sql id="from">
+        FROM mutation


Hi @i-am-leslie, these joins are indeed going to slow things down. In ClickHouse we are trying to use the *_derived tables as much as possible because they contain much of the data denormalized (pre-joined). I would look at genomic_event_derived and filter for mutation. It should already have fields from mutation_event and gene. You can also join on sample_derived, which will already have the necessary patient data. If you find that fields are missing, we should add them to the derived tables, and I'm happy to help with that.

i-am-leslie · 2025-09-07

Thanks for reviewing my PR. Right now it is difficult to support the SUMMARY and DETAILED projections for mutations. Using genomic_event_derived alone doesn't give me all the required fields, so I end up falling back to heavy multi-table joins (mutation, mutation_event, driver_annotation, etc.), which is what we are trying to avoid.

Fields I get from genomic_event_derived that are needed for the DETAILED and SUMMARY projections:

molecularProfileId
sampleId
patientId
entrezGeneId
studyId

From mutation:

mutationStatus

From mutation_event:

variantType
mutationType
variant_type

From gene (DETAILED projection only):

hugoGeneSymbol

From alteration_driver_annotation:

driverFilter
driverTiersFilter

Fields needed for a complete Mutation that are not present in genomic_event_derived:

Needed from mutation:

center
validation_status
tumor_alt_count
tumor_ref_count
normal_alt_count
normal_ref_count
amino_acid_change
annotation_json

Needed from mutation_event:

chr
start_position
end_position
reference_allele
tumor_seq_allele
protein_change
ncbi_build
refseq_mrna_id
protein_pos_start
protein_pos_end
keyword

Needed from alteration_driver_annotation:

driver_filter_annotation
driver_tiers_filter_annotation

Needed from gene (DETAILED projection):

gene.type

Needed from allele_specific_copy_number (DETAILED projection):

ascnIntegerCopyNumber
ascnMethod
ccfExpectedCopiesUpper
ccfExpectedCopies
clonal
minorCopyNumber
expectedAltCopies
totalCopyNumber

Would it make sense to extend genomic_event_derived to include these fields (at least the ones needed for SUMMARY / DETAILED projections)?

Alternatively, should we consider a new mutation_derived table that flattens mutation + mutation_event + driver_annotation?

Happy to adjust my query approach, but right now it seems impossible to avoid multiple joins without these fields in the derived tables. Also, details about each projection can be found here https://docs.google.com/document/d/1DYG0xdx_GM8pJq-GgNJz2Q0SZNOi4M4VNqNCAn-5Qzg/edit?tab=t.0


i-am-leslie force-pushed the master-clean-arch-mutation branch from a0e4508 to 9668cbf August 2, 2025 14:00

fuzhaoyuan assigned i-am-leslie Aug 4, 2025

fuzhaoyuan reviewed Aug 4, 2025

src/main/java/org/cbioportal/application/rest/vcolumnstore/ColumnMutationController.java (outdated, resolved)

fuzhaoyuan reviewed Aug 4, 2025

src/main/java/org/cbioportal/application/rest/vcolumnstore/ColumnMutationController.java (outdated, resolved)

fuzhaoyuan reviewed Aug 4, 2025

src/main/java/org/cbioportal/domain/mutation/usecase/GetMutationDataUseCases.java (outdated, resolved)

fuzhaoyuan reviewed Aug 4, 2025

i-am-leslie added 16 commits August 13, 2025 20:12

Add mutation domain classes and use cases

8e5a38c

Made modifications to the usecases and added methods to the repository

8fa7c95

Working on the infrastructure layer

8ea0ec0

Moved the mutation infrastructure to the right folder

d93a153

finished up the mapper class for the infrastructure

dc7b0ce

Added rest endpoint and a record class for all use cases

c9f8e85

Added rest endpoint and a record class for all use cases

930b9ab

Added rest endpoint and a record class for all use cases

adb2887

Refractored the controller layer adhering to clean arch principles

b74f90d

Started with creating the sql for clickhouse repo mutation

8ed06a1

Added test for ferchAllMutationsProfileUseCase logic

0af5d3c

Added test for ferchAllMutationsProfileUseCase logic

f94e5f7

Added test for ferchAllMutationsProfileUseCase logic

832c7e6

Finshed testing the utility class and usecase logic

66e5822

cleaned up some classes

7ae555d

Wrote the sql for getMetaMutation use case. Starting with getMutation…

088adf8

… method

i-am-leslie force-pushed the master-clean-arch-mutation branch from 2203c1e to 088adf8 August 14, 2025 00:12

i-am-leslie added 3 commits August 13, 2025 21:48

Changed the naming for mutation controller

d41a619

Fixed some parameters for the datamapper to control information from …

cc99918

…the database

Refractored the mapper class method to handle each projection with it…

75272dc

…s seperate method

i-am-leslie commented Aug 17, 2025


Rough SQL for SUMMARY projection note, need to crosscheck

d2ed0e0

i-am-leslie added 2 commits August 26, 2025 14:38

Refactored repository layer to use molecularProfileCaseIdentifierUtil…

f561c7a

… for grouping profiles and removing duplicates

Updates

5485aa4

i-am-leslie force-pushed the master-clean-arch-mutation branch 2 times, most recently from 70f29a2 to 5485aa4 September 1, 2025 18:34

i-am-leslie added 8 commits September 1, 2025 17:05

fixed up summary query

eadf94d

Almost done with queries need to confirm some results

e7b7949

Finished up summary and detailed projection wroks now

c270509

Created dto's and maooers for data received from the clincal data mapper

b63e093

corrected field variantAllele

064a995

Refactored code to make use of projectionType

a8c51f3

Trying to adjust the queries to make use of clickhouse strengths

89fe9cc

put more comments for description

585b426

alisman reviewed Sep 6, 2025


i-am-leslie added 5 commits September 13, 2025 16:11

Trying to optimize the query for clickhouse by batch sending the whol…

1750883

…e information to clickhouse for faster processing and reducing the risk of doing redundant joins

Fixed the query to work for just molecularProfileId is provided

6c0f485

Done with the query optimization need to add comments

af6b53c

Finsished with the comments, Looking through the test to ensure its a…

129b5f6

…dequetely tested

Finished, i believe sql should still be checked for correectness

02d0b8e

i-am-leslie requested review from alisman and fuzhaoyuan September 28, 2025 16:09

i-am-leslie marked this pull request as ready for review September 28, 2025 16:10

Added condition to filter out null in mutation columns

9638e9f

i-am-leslie marked this pull request as draft September 30, 2025 12:04

i-am-leslie added 6 commits October 1, 2025 11:13

Tried controlling the result of the cartesian product

c6276db

Refactored Id to remove arguments the proejcting did not need

dfdc218

Started with the e2e test

5de7fda

working on the e2e test

52f3d98

e2e test complete awaiting review

90026ce

Added comments

7e35468

i-am-leslie mentioned this pull request Oct 13, 2025

Master clean arch mutation endpoint with legacy query #11750

Open
