Description
In a Slack discussion, it was suggested that a general API for managing materialized views related to the genotype_call table would be useful.
In that discussion I suggested that an API to manage the following could reduce duplicated effort and allow us all to benefit from optimizations for large data sets:
- Create materialized view tables and indices (support partitioning in a number of different ways and allow other tools to define the materialized view they use)
- Create the genotype_call table itself (not yet included in core Chado) for a consistent definition and indices.
- New Tripal 4 BioTasks for syncing these materialized views. BioTasks have more fine-grained control over locks and provide the flexibility needed to optimize this process (e.g. multiple queries, chunking of data to be added, truncating existing data, etc.)
- Extension of the new Tripal DBX to provide simplified querying which is aware of partitions.
To support the very different needs of the tools using the genotype_call table, this API would provide a means for each tool to describe the materialized view name, columns, composition, queries, indices, optimization approaches, etc. that the API should use. It is understood that a single best-practice materialized view for this type of data is not possible: with large datasets, performance depends on catering to the specific composition of the data and the needs of the tool.
This API is NOT trying to force us all to use the same materialized views or even the same optimization approaches. Rather, it is trying to provide a set of optimization approaches that each tool can select from to get the support that suits it best.
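To make the idea of a tool "describing" its materialized view more concrete, here is a rough sketch of what such a description and the resulting PostgreSQL statements might look like. It is written in Python purely for illustration; an actual implementation would live in Tripal. Everything in it is an assumption made for discussion, not an existing API: the `MviewSpec` class, the `create_statements` function, the view name `mview_example`, and the column names are all hypothetical, and the populate query is left as a placeholder.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MviewSpec:
    """Hypothetical description a tool would hand to the proposed API."""
    name: str                        # materialized view table name
    columns: Dict[str, str]          # column name -> PostgreSQL type
    populate_query: str              # SELECT used to fill the view
    indices: List[List[str]] = field(default_factory=list)  # one list of columns per index
    partition_by: Optional[str] = None  # column to LIST-partition on, if any


def create_statements(spec: MviewSpec) -> List[str]:
    """Build the CREATE TABLE / CREATE INDEX statements for one spec."""
    cols = ", ".join(f"{col} {pg_type}" for col, pg_type in spec.columns.items())
    ddl = f"CREATE TABLE {spec.name} ({cols})"
    if spec.partition_by:
        # Declarative partitioning; each tool picks the column that suits its data.
        ddl += f" PARTITION BY LIST ({spec.partition_by})"
    statements = [ddl]
    for index_cols in spec.indices:
        index_name = f"{spec.name}_{'_'.join(index_cols)}_idx"
        statements.append(
            f"CREATE INDEX {index_name} ON {spec.name} ({', '.join(index_cols)})"
        )
    return statements


# Illustrative spec only: the names below are made up for this example.
example = MviewSpec(
    name="mview_example",
    columns={"project_id": "bigint", "variant_id": "bigint", "allele_call": "text"},
    populate_query="SELECT ... FROM genotype_call ...",  # placeholder, tool-defined
    indices=[["variant_id"]],
    partition_by="project_id",
)

for statement in create_statements(example):
    print(statement)
```

The point of the sketch is that each tool supplies its own spec, so two tools can partition and index the same underlying data in completely different ways while sharing the code that creates and syncs the tables.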