VishaalChandrasekar0203/Data_Pedigree_Analysis

Data_Pedigree_Analysis

Tracking Data lineage / Origin

Metadata Analysis

Metadata analysis examines the metadata attached to data assets (for example schemas, timestamps, and ownership records) to determine their origins, transformations, and relationships[4].

  • Source Identification: Tools analyze metadata to trace data origins across multiple systems[4].
  • Transformation Mapping: Changes in data structure, format, or content are inferred from changes in the metadata and recorded[4].
  • End-to-end Tracking: Metadata is used to provide a comprehensive view of data from creation to final use[4].
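
The three points above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: the `AssetMetadata` record, its `derived_from` and `transformation` fields, and the asset names are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AssetMetadata:
    """Hypothetical metadata record for one data asset."""
    name: str
    system: str
    derived_from: Optional[str] = None    # parent asset name, None for a source
    transformation: Optional[str] = None  # how the asset was produced


def trace_origin(name, catalog):
    """Follow derived_from links back to the source; return steps source-first."""
    steps = []
    seen = set()
    current = name
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in metadata at {current!r}")
        seen.add(current)
        steps.append(catalog[current])
        current = catalog[current].derived_from
    return list(reversed(steps))


# Made-up catalog spanning three systems.
catalog = {
    m.name: m
    for m in [
        AssetMetadata("crm.customers", system="CRM"),
        AssetMetadata("staging.customers", system="warehouse",
                      derived_from="crm.customers",
                      transformation="nightly copy"),
        AssetMetadata("reports.active_customers", system="BI",
                      derived_from="staging.customers",
                      transformation="filter active = true"),
    ]
}

lineage = trace_origin("reports.active_customers", catalog)
for step in lineage:
    print(f"{step.system}: {step.name} ({step.transformation or 'source'})")
```

The returned list covers source identification (its first element), transformation mapping (the `transformation` field of each step), and the end-to-end view (the list as a whole).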

Graph-based Analysis

This method uses graph theory to represent data lineage as a network of interconnected nodes and edges.

  • Node Representation: Data elements, transformations, and systems are represented as nodes in the graph[6].
  • Edge Representation: Relationships and data flows between nodes are represented as edges[6].
  • Path Analysis: Graph algorithms are used to trace data paths and identify dependencies[6].
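
As a rough sketch of path analysis, the lineage graph can be stored as an adjacency map and searched breadth-first. The dataset names and the `reachable` and `reverse_graph` helpers are hypothetical, chosen only for this example.

```python
from collections import deque

# Directed edges: each upstream node maps to the nodes that consume it.
edges = {
    "orders_raw": ["clean_orders"],
    "customers_raw": ["clean_customers"],
    "clean_orders": ["orders_by_customer"],
    "clean_customers": ["orders_by_customer"],
    "orders_by_customer": ["revenue_dashboard"],
}


def reverse_graph(graph):
    """Flip every edge so the graph can be walked from consumer to producer."""
    reversed_edges = {}
    for source, targets in graph.items():
        for target in targets:
            reversed_edges.setdefault(target, []).append(source)
    return reversed_edges


def reachable(start, graph):
    """Breadth-first search: every node reachable from start, excluding start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


# Dependencies: everything the dashboard is built from.
upstream = reachable("revenue_dashboard", reverse_graph(edges))
# Impact: everything affected by a change to customers_raw.
downstream = reachable("customers_raw", edges)
print(sorted(upstream))
print(sorted(downstream))
```

Walking the reversed graph answers "where did this come from?", while walking the original graph answers "what breaks if this changes?".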

Pattern-based Lineage

This approach focuses on identifying recurring patterns in data transformations[2].

  • Pattern Recognition: Common data transformation patterns are identified and cataloged[2].
  • Pattern Matching: New data flows are analyzed to match known patterns[2].
  • Lineage Inference: Data lineage is inferred based on recognized patterns, allowing for efficient tracking of multiple datasets[2].
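
One simple way to picture this is a small catalog of regular expressions matched against transformation statements. The sketch below assumes the transformations are available as SQL-like strings. The three patterns, the table names, and the `infer_lineage` helper are illustrative assumptions, and a real system would need a proper SQL parser rather than regexes.

```python
import re

# Hypothetical pattern catalog: (pattern name, compiled regex with named groups).
PATTERNS = [
    ("aggregate", re.compile(
        r"CREATE TABLE (?P<target>[\w.]+) AS SELECT .+ "
        r"FROM (?P<source>[\w.]+) GROUP BY",
        re.IGNORECASE)),
    ("filter", re.compile(
        r"CREATE TABLE (?P<target>[\w.]+) AS SELECT .+ "
        r"FROM (?P<source>[\w.]+) WHERE",
        re.IGNORECASE)),
    ("copy", re.compile(
        r"INSERT INTO (?P<target>[\w.]+) SELECT \* FROM (?P<source>[\w.]+)",
        re.IGNORECASE)),
]


def infer_lineage(statement):
    """Return (source, target, pattern name) for the first matching pattern."""
    for name, pattern in PATTERNS:
        match = pattern.search(statement)
        if match:
            return (match.group("source"), match.group("target"), name)
    return None  # no known pattern; lineage cannot be inferred this way


statements = [
    "INSERT INTO staging.orders SELECT * FROM raw.orders",
    "CREATE TABLE mart.daily_sales AS SELECT day, SUM(amount) "
    "FROM staging.orders GROUP BY day",
    "DROP TABLE tmp",
]
inferred = [infer_lineage(s) for s in statements]
for statement, result in zip(statements, inferred):
    print(result, "<-", statement)
```

Because the catalog is just data, covering a new kind of transformation means adding one entry, which is what lets this approach scale across many datasets.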

Automated Lineage Tracking

This methodology relies on automated tools to continuously monitor and update data lineage information[4].

  • Real-time Monitoring: Tools track data flows and transformations as they occur[4].
  • Automated Documentation: Changes in data structure, location, or content are automatically recorded[4].
  • Integration with Data Pipelines: Lineage tracking is integrated directly into data processing workflows[4].
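
A minimal sketch of pipeline integration, assuming the pipeline steps are plain Python functions: a decorator records a lineage event each time a step runs. The `track_lineage` decorator, the in-memory `LINEAGE_LOG`, and the step names are hypothetical. A production tool would write to a lineage store rather than a list.

```python
import functools
from datetime import datetime, timezone

LINEAGE_LOG = []  # stand-in for a lineage store (assumption for this sketch)


def track_lineage(inputs, output):
    """Decorator that appends a lineage event after the wrapped step succeeds."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            LINEAGE_LOG.append({
                "step": func.__name__,
                "inputs": list(inputs),
                "output": output,
                "recorded_at": datetime.now(timezone.utc).isoformat(),
            })
            return result
        return wrapper
    return decorator


@track_lineage(inputs=["raw_orders"], output="clean_orders")
def clean_orders(rows):
    # Drop rows with non-positive amounts.
    return [row for row in rows if row["amount"] > 0]


@track_lineage(inputs=["clean_orders"], output="order_total")
def total_amount(rows):
    return sum(row["amount"] for row in rows)


raw = [{"amount": 10}, {"amount": -5}, {"amount": 7}]
total = total_amount(clean_orders(raw))

for event in LINEAGE_LOG:
    print(event["step"], event["inputs"], "->", event["output"])
```

The lineage is documented as a side effect of running the pipeline, so the record cannot drift from what the code does.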

Statistical Framework

This approach uses statistical methods to analyze and predict lineage, and is particularly useful in complex scenarios such as genetic pedigree studies[5].

  • Local Coverage: Uses inferred inheritance vectors to measure genotype-imputation ability in specific regions of interest[5].
  • Genome-wide Coverage: Utilizes pedigree structure to compute lineage metrics across the entire genome[5].
  • Subject Selection Optimization: Statistical methods are used to identify the most efficient subjects for sequencing in pedigree studies[5].
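
The coverage metrics from [5] are not reproduced here. As a simpler stand-in, the sketch below computes the standard kinship coefficient, which shows how a statistic can be derived from pedigree structure alone. The pedigree, the names, and the `depth` and `kinship` helpers are assumptions for illustration, and the code assumes each individual has either both parents recorded or neither.

```python
from functools import lru_cache

# Hypothetical pedigree: individual -> (father, mother); founders have no parents.
PEDIGREE = {
    "grandpa": (None, None),
    "grandma": (None, None),
    "mom": (None, None),
    "uncle": (None, None),
    "dad": ("grandpa", "grandma"),
    "aunt": ("grandpa", "grandma"),
    "child": ("dad", "mom"),
    "cousin": ("uncle", "aunt"),
}


@lru_cache(maxsize=None)
def depth(person):
    """Generation depth: 0 for founders, else 1 + the deeper parent's depth."""
    father, mother = PEDIGREE[person]
    if father is None:
        return 0
    return 1 + max(depth(father), depth(mother))


@lru_cache(maxsize=None)
def kinship(a, b):
    """Probability that random alleles from a and b are identical by descent."""
    # Recurse through the deeper individual, which cannot be an ancestor
    # of the other.
    if depth(a) < depth(b):
        a, b = b, a
    father, mother = PEDIGREE[a]
    if a == b:
        if father is None:
            return 0.5  # non-inbred founder
        return 0.5 * (1.0 + kinship(father, mother))
    if father is None:
        return 0.0  # two distinct founders are assumed unrelated
    return 0.5 * (kinship(father, b) + kinship(mother, b))


print("siblings:     ", kinship("dad", "aunt"))
print("parent-child: ", kinship("child", "dad"))
print("first cousins:", kinship("child", "cousin"))
```

Pairwise values like these are one possible input when deciding which relatives are most informative to sequence. The actual selection procedure in [5] may differ.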

Together, these techniques let an organization trace its data from origin to final use, which supports data quality checks, compliance audits, and decisions based on reliable information.

Citations:
[1] https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2013.00189/full
[2] https://segment.com/blog/data-lineage/
[3] https://www.bioconductor.org/packages/devel/bioc/vignettes/FamAgg/inst/doc/FamAgg.html
[4] https://www.acceldata.io/blog/how-to-use-data-lineage-tools-for-tracking-data-transformations
[5] https://pmc.ncbi.nlm.nih.gov/articles/PMC3928665/
[6] https://www.ardoq.com/knowledge-hub/data-lineage
[7] https://pmc.ncbi.nlm.nih.gov/articles/PMC4757949/
[8] https://www.softwareag.com/en_corporate/resources/data-integration/article/data-lineage.html
