The primary aim of these Python modules is to provide automated tests that validate source data against target data. That is, the modules automatically compare source data and target data and produce an anomaly report that summarizes the differences (or lack thereof) between the two. Note that these modules only cover structured data.
These modules are intended to be run on Google Cloud data sources such as Google Cloud Storage and Google BigQuery. However, the validator is not confined to GCP: it can be extended to Amazon Web Services (AWS), Microsoft Azure, and virtually any platform that provides open-source Python client libraries.
The following parameters are tested (see the illustrative sketch after this list):
- Row count - the number of rows in the table
- Schema - the name and data type of each column
- Duplicate rows - two or more rows that are exactly the same
- Diff - entry-by-entry comparison between the two tables
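As a rough illustration of what each check involves, the sketch below uses pandas DataFrames. The actual modules operate on GCS files and BigQuery tables, and the function names here are assumptions for illustration, not the project's API.

```python
# Illustrative sketch only: the repository's modules run these checks against
# GCS files and BigQuery tables; the pandas-based helpers and names below are
# assumptions for illustration, not the project's actual code.
import pandas as pd

def compare_row_count(source: pd.DataFrame, target: pd.DataFrame) -> bool:
    """Row count check: do both tables have the same number of rows?"""
    return len(source) == len(target)

def compare_schema(source: pd.DataFrame, target: pd.DataFrame, mode: str = "default") -> bool:
    """Schema check: 'default' compares column names only; 'strict' also compares data types."""
    if list(source.columns) != list(target.columns):
        return False
    if mode == "strict":
        return list(source.dtypes) == list(target.dtypes)
    return True

def find_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Duplicate check: return every row that appears more than once."""
    return df[df.duplicated(keep=False)]

def full_diff(source: pd.DataFrame, target: pd.DataFrame) -> pd.DataFrame:
    """Diff check: cell-by-cell comparison (assumes identically labeled tables)."""
    return source.compare(target)
```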
There are two ways to run the modules: (1) locally and (2) within a cloud environment.
- Running the modules locally
  a. Follow the appropriate authorization steps. For Google Cloud, the steps can be found here: https://cloud.google.com/docs/authentication/getting-started
  b. Save the authenticator JSON key in the same folder as your project.
  c. Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point at the key. In a PowerShell terminal: $env:GOOGLE_APPLICATION_CREDENTIALS="<filename_of_authenticator_key>.json"
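  Note that the $env: syntax in step c is PowerShell-specific; on a Linux or macOS shell the equivalent command is export GOOGLE_APPLICATION_CREDENTIALS="<filename_of_authenticator_key>.json"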
- Running the modules within a cloud environment
  a. Ensure that you have IAM permissions to run Python scripts via Cloud Shell. Note that these modules can also be run within Cloud Functions in the event that testing needs to be automated and scalable.
Currently, there are three modules:
- gcs_to_gcs - compares a source file on GCS with a target file on GCS
- gcs_to_bq - compares a source file on GCS with a target table on BQ
- bq_to_bq - compares a source table on BQ with a target table on BQ
To run a module, the command pattern is as follows:
python main.py [path to config file]
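For example, assuming the config file is saved as config.yaml in the working directory (the filename is arbitrary):

python main.py config.yaml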
The config file is a YAML file stored locally. It contains the list of jobs to be run. Each job should contain four parameters:
- module - the module to run (one of the three listed above)
- sources - a list of all the source data
- targets - a list of all the target data
- mode - either 'default' or 'strict'. This applies to the schema check (compare_schema): 'default' mode compares only the column names, while 'strict' mode also compares the data type of each column.
A sample config file is included as part of the repository.
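For illustration, a config along these lines might look as follows; the keys are inferred from the parameters listed above, and every bucket, dataset, and table name is a placeholder that may differ from the sample file shipped with the repository.

```yaml
# Illustrative only: keys follow the job parameters described above;
# all bucket, dataset, and table names are placeholders.
jobs:
  - module: gcs_to_bq
    sources:
      - gs://source-bucket/customers.csv
    targets:
      - my-project.my_dataset.customers
    mode: strict
  - module: bq_to_bq
    sources:
      - my-project.staging.orders
    targets:
      - my-project.warehouse.orders
    mode: default
```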
After a run completes, an anomaly report will be generated locally. It will be a .txt file with the following filename format: anomaly_report_[timestamp].txt
The anomaly report will contain a summary of all the aforementioned tests. It will also contain the full diff, i.e. the cell-by-cell comparison of the source data and the target data.
Information contained within the report includes:
- source data
- target data
- row count check results
- schema check results - including the mode used: 'default' mode checks only whether the column names match and ignores data types, while 'strict' mode also compares the data type of each column
- duplicate check results
- diff check results