Skip to content

Conversation

i-am-leslie
Copy link

@i-am-leslie i-am-leslie commented Aug 1, 2025

Fix #11640

Migrate Mutation to Clean Architecture with ClickHouse Support

This PR refactors mutation data endpoints from legacy architecture to clean architecture with ClickHouse support, as part of the broader backend migration to ClickHouse and clean architecture patterns

ARCHITECTURAL CHANGES

Usecase Layer

FetchAllMetaMutationsInProfileUseCase- Use case for retrieving MetaMutation data
FetchAllMutationsInProfileUseCase- Use case for retrieving Mutation data
GetMutationDataUseCases- provides a centralized way to access and utilize the use cases

Repository

MutationRepository- Interface that defines the methods for retrieving data from the repository

Infrastructure

ClickhouseMutationMapper- Maps the repository methods to the ClickhouseMutationMapper.xml file
ClickhouseMutationRepository- Implements the MutationRepository methods used to communicate with the database using ClickhouseMutationMapper.

Rest Layer

ColumnMutationController- New @Profile("clickhouse") rest controller supporting retrieval of Mutation and MutationMeta data

MapStruct Integration

  • MutationMapper: Main mapper with computed uniqueKey fields
  • AlleleSpecificCopyNumberMapper: Nested object transformation
  • GeneMapper : Nested object transformation

###Simple Flow Diagram
Screenshot 2025-08-17 at 5 37 49 PM

Limitations

In the effort to reduce the number of joins and ensure the population of all required fields needed for the Mutation object in SUMMARY and DETAILED projections, I faced the following limitation:
Since not all fields for Mutation are available in the derived tables, some joins to base tables (e.g., mutation, mutation_event) were still necessary.
The number of joins was reduced:
SUMMARY projection – reduced to 4 joins
DETAILED projection – reduced to 6 joins
To achieve this, the queries start from the derived table and join back to the base tables. However, this approach sometimes introduces a cartesian effect, causing duplicate rows in the query results. This duplication does not occur consistently, but it is reproducible in certain cases.

The reason for this is that we have no unique field in genomic_event_derived that we can correlate to the mutation table. Due to this reason, using the legacy SQL is the best approach right now

Apart from this query-level limitation, the application layer migration to Clean Architecture was completed successfully. The repository, use cases, and controllers are fully migrated, and the system runs with ClickHouse support. The outstanding work is primarily around optimizing the SQL.

Reviewer Notes
Please feel free to review the overall structure and approach. Feedback on the direction, naming, or anything architectural is welcome, especially in the use case logic, queries and dependency setup.

* @param fetchAllMutationsInProfileUseCase
*/
public record GetMutationDataUseCases(
FetchAllMetaMutationsInProfileUseCase fetchAllMetaMutationsInProfileUseCase,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use case names will need to be changed but we can wait until later

String variantAllele,
String uniqueSampleKey,
String uniquePatientKey
) implements Serializable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed we have a Clean ArchCancerStudyMetadata that transitioned from legacy CancerStudy that looks like this. Do you think you can reference on how that endpoint GetCancerStudyMetadataUseCase is implemented?

@i-am-leslie i-am-leslie force-pushed the master-clean-arch-mutation branch from 2203c1e to 088adf8 Compare August 14, 2025 00:12
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not 100% sure this is where I should keep this file, but I used this record to hold all search criteria for the endpoint I am currently transitioning to clean architecture. I would like some advice on this?

Copy link
Author

@i-am-leslie i-am-leslie Aug 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The snpOnly parameter is currently not used. In the legacy implementation, it was always hardcoded to false, and in the MyBatis SQL, it was only checked for true, meaning the condition was never satisfied. Should we clean it up or leave it?

@i-am-leslie i-am-leslie force-pushed the master-clean-arch-mutation branch 2 times, most recently from 70f29a2 to 5485aa4 Compare September 1, 2025 18:34

<!-- For now only works for projection SUMMARY not DETAILED PROJECTION -->
<sql id="from">
FROM mutation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @i-am-leslie, these joins are indeed going to slow things down. in clickhouse we are trying to use the *_derived tables as much as possible because these contain much of the data denormalized (pre JOINED). I would look at genomic_event_derived and filter for mutation. It already should have fields from mutation_event and gene. You can also join on sample_derived, which will already have necessary patient data. If you find that there are missing fields, we should add them to derived tables and i'm happy to help with that.

Copy link
Author

@i-am-leslie i-am-leslie Sep 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review on my PR, right now, it is difficult to support the SUMMARY and DETAILED projections for mutations. Using genomic_event_derived alone doesn’t give me all the fields required, so I end up falling back to heavy multi-joins (mutation, mutation_event, driver_annotation, etc.) which is what we are trying to avoid.

Details i get from genomic_event_derieved needed for Detailed and Summary Projection

  1. molecularProfileId
  2. sampleId
  3. patientId
  4. entrezGeneId
  5. studyId
    From mutation
  6. mutationStatus
    From mutation_event
  7. variantType
  8. mutaionType
    9)variant_type
    From Gene (Only for Detailed projection )
  9. hugoGeneSymbol
    From alteration_driver_annotation
  10. driverFilter
  11. driverTiersFilter

Details needed for data completeness of mutation that are not present in genomic_event_derived

Needed from mutation:

  • center
  • validation_status
  • tumor_alt_count
  • tumor_ref_count
  • normal_alt_count
  • normal_ref_count
  • amino_acid_change
  • annotation_json
    Needed from mutation_event:
  • chr
  • start_position
  • end_position
  • reference_allele
  • tumor_seq_allele
  • protein_change
  • ncbi_build
  • refseq_mrna_id
  • protein_pos_start
  • protein_pos_end
  • keyword
    Needed from alteration_driver_annotation:
  • driver_filter_annotation
  • driver_tiers_filter_annotation
    Needed from gene: (Detailed projection)
  • gene.type
    Needed from allele_specific_copy_number:(Detailed projection)
  • ascnIntegerCopyNumber
  • ascnMethod
  • ccfExpectedCopiesUpper
  • ccfExpectedCopies
  • clonal
  • minorCopyNumber
  • expectedAltCopies
  • totalCopyNumber

Would it make sense to extend genomic_event_derived to include these fields (at least the ones needed for SUMMARY / DETAILED projections)?

Alternatively, should we consider a new mutation_derived table that flattens mutation + mutation_event + driver_annotation?

Happy to adjust my query approach, but right now it seems impossible to avoid multiple joins without these fields in the derived tables. Also, details about each projection can be found here https://docs.google.com/document/d/1DYG0xdx_GM8pJq-GgNJz2Q0SZNOi4M4VNqNCAn-5Qzg/edit?tab=t.0

@i-am-leslie i-am-leslie marked this pull request as ready for review September 28, 2025 16:10
@i-am-leslie i-am-leslie marked this pull request as draft September 30, 2025 12:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Migrate mutations endpoint to clean architecture with ClickHouse support

3 participants