-
-
Notifications
You must be signed in to change notification settings - Fork 727
Add mutation domain classes and use cases #11650
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Add mutation domain classes and use cases #11650
Conversation
a0e4508
to
9668cbf
Compare
src/main/java/org/cbioportal/application/rest/vcolumnstore/ColumnMutationController.java
Outdated
Show resolved
Hide resolved
src/main/java/org/cbioportal/application/rest/vcolumnstore/ColumnMutationController.java
Outdated
Show resolved
Hide resolved
src/main/java/org/cbioportal/domain/mutation/usecase/GetMutationDataUseCases.java
Outdated
Show resolved
Hide resolved
* @param fetchAllMutationsInProfileUseCase | ||
*/ | ||
public record GetMutationDataUseCases( | ||
FetchAllMetaMutationsInProfileUseCase fetchAllMetaMutationsInProfileUseCase, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use case names will need to be changed but we can wait until later
String variantAllele, | ||
String uniqueSampleKey, | ||
String uniquePatientKey | ||
) implements Serializable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed we have a Clean ArchCancerStudyMetadata
that transitioned from legacy CancerStudy
that looks like this. Do you think you can reference on how that endpoint GetCancerStudyMetadataUseCase
is implemented?
2203c1e
to
088adf8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not 100% sure this is where I should keep this file, but I used this record to hold all search criteria for the endpoint I am currently transitioning to clean architecture. I would like some advice on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The snpOnly parameter is currently not used. In the legacy implementation, it was always hardcoded to false, and in the MyBatis SQL, it was only checked for true, meaning the condition was never satisfied. Should we clean it up or leave it?
… for grouping profiles and removing duplicates
70f29a2
to
5485aa4
Compare
|
||
<!-- For now only works for projection SUMMARY not DETAILED PROJECTION --> | ||
<sql id="from"> | ||
FROM mutation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi @i-am-leslie, these joins are indeed going to slow things down. in clickhouse we are trying to use the *_derived tables as much as possible because these contain much of the data denormalized (pre JOINED). I would look at genomic_event_derived and filter for mutation. It already should have fields from mutation_event and gene. You can also join on sample_derived, which will already have necessary patient data. If you find that there are missing fields, we should add them to derived tables and i'm happy to help with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your review on my PR, right now, it is difficult to support the SUMMARY and DETAILED projections for mutations. Using genomic_event_derived alone doesn’t give me all the fields required, so I end up falling back to heavy multi-joins (mutation, mutation_event, driver_annotation, etc.) which is what we are trying to avoid.
Details i get from genomic_event_derieved needed for Detailed and Summary Projection
- molecularProfileId
- sampleId
- patientId
- entrezGeneId
- studyId
From mutation - mutationStatus
From mutation_event - variantType
- mutaionType
9)variant_type
From Gene (Only for Detailed projection ) - hugoGeneSymbol
From alteration_driver_annotation - driverFilter
- driverTiersFilter
Details needed for data completeness of mutation that are not present in genomic_event_derived
Needed from mutation:
- center
- validation_status
- tumor_alt_count
- tumor_ref_count
- normal_alt_count
- normal_ref_count
- amino_acid_change
- annotation_json
Needed from mutation_event: - chr
- start_position
- end_position
- reference_allele
- tumor_seq_allele
- protein_change
- ncbi_build
- refseq_mrna_id
- protein_pos_start
- protein_pos_end
- keyword
Needed from alteration_driver_annotation: - driver_filter_annotation
- driver_tiers_filter_annotation
Needed from gene: (Detailed projection) - gene.type
Needed from allele_specific_copy_number:(Detailed projection) - ascnIntegerCopyNumber
- ascnMethod
- ccfExpectedCopiesUpper
- ccfExpectedCopies
- clonal
- minorCopyNumber
- expectedAltCopies
- totalCopyNumber
Would it make sense to extend genomic_event_derived to include these fields (at least the ones needed for SUMMARY / DETAILED projections)?
Alternatively, should we consider a new mutation_derived table that flattens mutation + mutation_event + driver_annotation?
Happy to adjust my query approach, but right now it seems impossible to avoid multiple joins without these fields in the derived tables. Also, details about each projection can be found here https://docs.google.com/document/d/1DYG0xdx_GM8pJq-GgNJz2Q0SZNOi4M4VNqNCAn-5Qzg/edit?tab=t.0
…e information to clickhouse for faster processing and reducing the risk of doing redundant joins
Fix #11640
Migrate Mutation to Clean Architecture with ClickHouse Support
This PR refactors mutation data endpoints from legacy architecture to clean architecture with ClickHouse support, as part of the broader backend migration to ClickHouse and clean architecture patterns
ARCHITECTURAL CHANGES
Usecase Layer
FetchAllMetaMutationsInProfileUseCase- Use case for retrieving MetaMutation data
FetchAllMutationsInProfileUseCase- Use case for retrieving Mutation data
GetMutationDataUseCases- provides a centralized way to access and utilize the use cases
Repository
MutationRepository- Interface that defines the methods for retrieving data from the repository
Infrastructure
ClickhouseMutationMapper- Maps the repository methods to the ClickhouseMutationMapper.xml file
ClickhouseMutationRepository- Implements the MutationRepository methods used to communicate with the database using ClickhouseMutationMapper.
Rest Layer
ColumnMutationController- New @Profile("clickhouse") rest controller supporting retrieval of Mutation and MutationMeta data
MapStruct Integration
###Simple Flow Diagram

Limitations
In the effort to reduce the number of joins and ensure the population of all required fields needed for the Mutation object in SUMMARY and DETAILED projections, I faced the following limitation:
Since not all fields for Mutation are available in the derived tables, some joins to base tables (e.g., mutation, mutation_event) were still necessary.
The number of joins was reduced:
SUMMARY projection – reduced to 4 joins
DETAILED projection – reduced to 6 joins
To achieve this, the queries start from the derived table and join back to the base tables. However, this approach sometimes introduces a cartesian effect, causing duplicate rows in the query results. This duplication does not occur consistently, but it is reproducible in certain cases.
The reason for this is that we have no unique field in genomic_event_derived that we can correlate to the mutation table. Due to this reason, using the legacy SQL is the best approach right now
Apart from this query-level limitation, the application layer migration to Clean Architecture was completed successfully. The repository, use cases, and controllers are fully migrated, and the system runs with ClickHouse support. The outstanding work is primarily around optimizing the SQL.
Reviewer Notes
Please feel free to review the overall structure and approach. Feedback on the direction, naming, or anything architectural is welcome, especially in the use case logic, queries and dependency setup.