This code assists in building CSV-schema-validating workflows. It is intended to validate CSVs that adhere to the source of truth requirements used in the Hydration and Rollout Manager (HRM) workflow. It relies on Pydantic to do the heavy lifting of schema validation, providing a few things out of the box:
- An opinionated Pydantic model structure to define and validate the schema of arbitrary user-defined CSV data AND HRM-required data
- A CLI workflow (for CI/CD and local dev)
- Ability to load external, user-provided schema models and check an arbitrary CSV file against them
- Optional coercion of data into a prescriptive, normalized structure, serialized and dumped as an output file (see the example run below)
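For example, the normalized output can be produced from the CLI via the -o flag shown in the usage output further below; the output file name here is purely illustrative:
validate_csv -m models/cluster_registry.py -o normalized_cluster_registry.csv sources_of_truth/cluster_registry/cluster_reg_sot_valid.csv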
In order to write models that integrate with this module and CLI, you should (ideally) have some experience with Python 3.12. You will also need to develop comfort writing Pydantic models by becoming familiar with the Pydantic docs - start with the introductory material and then read the concepts.
- Python 3.12+
- Pydantic
Ensure you're using Python 3.12.
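You can confirm which interpreter python3 resolves to before proceeding:
python3 --version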
The software (in src/csv_validator) is a module that may be installed using setuptools. To install the CLI to a
virtual environment, do the following:
# create virtualenv
python3 -m venv .
source bin/activate
# install requirements
python3 -m pip install -r requirements.txt
# alternatively - from root, install directly (if not developing further)
python3 -m pip install .
# then invoke the CLI one of two ways:
validate_csv --help
python3 -m csv_validator --help
Note: If you are developing this further, see Development below.
See the Dockerfile for installing this module into a container.
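If you only want a feel for what that looks like, a minimal sketch is below; it is illustrative only (the base image and layout are assumptions), and the repository's actual Dockerfile is authoritative:
FROM python:3.12-slim
WORKDIR /app
COPY . .
# install the module and its validate_csv entry point
RUN python3 -m pip install .
ENTRYPOINT ["validate_csv"]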
The main library code is in the csv_validator module. The model.py
file contains a basic BaseCluster model that all models may (or rather, should) subclass.
It does a few things for free:
- It checks required fields, such as cluster_name, cluster_group, and cluster_tags
- Provides (very) basic definitions of valid cluster groups and tags, which should (and in practice must) be extended
- Extension is necessary because the built-in definitions only provide an example list of cluster groups
- For an example of how to extend BaseCluster via subclassing, see the example model: models/example_model.py, class NewBase, line 82.
 
There are example models included that demonstrate how to consume and extend this tool:
Each of these models has a set of both valid and invalid CSVs which may be used to demonstrate the functionality of the
model. CSVs live in sources_of_truth. As an example, to validate a valid "cluster registry" CSV and to see how it
behaves with an invalid CSV, run the following:
Valid:
$ validate_csv -v -m models/cluster_registry.py sources_of_truth/cluster_registry/cluster_reg_sot_valid.csv
INFO    CSV is valid
Invalid:
$ validate_csv -v -m models/cluster_registry.py sources_of_truth/cluster_registry/cluster_reg_sot_invalid.csv
WARNING Optional field 'platform_repository_sync_interval' is not present in sources_of_truth/cluster_registry/cluster_reg_sot_invalid.csv
WARNING Optional field 'platform_repository_branch' is not present in sources_of_truth/cluster_registry/cluster_reg_sot_invalid.csv
WARNING Optional field 'workload_repository_branch' is not present in sources_of_truth/cluster_registry/cluster_reg_sot_invalid.csv
ERROR   Line 2, cluster US75911CLS01, column 'cluster_group', error: 'Input should be 'prod-us', 'nonprod-us', 'prod-au' or 'nonprod-au'', received 'prod-foobar'
ERROR   Line 3, cluster US41273, column 'workload_repository_sync_interval', error: 'String should match pattern '^[0-9]*[hms]$'', received '300foo'
ERROR   Line 4, cluster US64150CLS01, column 'cluster_group', error: 'Input should be 'prod-us', 'nonprod-us', 'prod-au' or 'nonprod-au'', received 'not-real'
ERROR   Line 5, cluster US21646CLS01   , column 'cluster_tags', error: 'Input should be '24/7', 'corp', 'drivethru', 'drivethruduallane' or 'donotupgrade'', received 'ThisIsInvalid'
ERROR   Line 8, cluster name unknown, column 'cluster_name', error: 'String should have at least 1 character', received ''
ERROR   Line 13, cluster AU98342CLS01, column 'cluster_group', error: 'Input should be 'prod-us', 'nonprod-us', 'prod-au' or 'nonprod-au'', received 'prod-nz'
ERROR   Line 15, cluster AU73291CLS01, column 'cluster_tags', error: 'Input should be '24/7', 'corp', 'drivethru', 'drivethruduallane' or 'donotupgrade'', received 'ThisTagIsInvalid'
Each of the models has a source of truth counterpart including both valid and invalid data for testing and demonstration purposes.
For a functional example of extending the base, see the example
model: models/example_model.py, class NewBase.
The base model, class BaseCluster, provides a reference implementation of a Pydantic model for the required columns
in any source of truth CSV file, including columns:
- cluster_name
- cluster_group
- cluster_tags
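For reference, a row providing only these required columns might look like the following (values are illustrative; valid groups and tags are governed by the enumerations discussed below):
cluster_name,cluster_group,cluster_tags
my-cluster,prod-us,"corp"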
The base model's valid groups and tags are stubbed and incomplete. To use them fully, an engineer must perform one of the following steps:
- Update the ValidClusterGroups and ValidTags enumerations in the csv_validator module in place, or
- Subclass BaseCluster, implementing new enumerations that fully articulate all the possible valid tags and cluster groups. An example of this is demonstrated in models/example_model.py, and a rough sketch follows below.
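A rough sketch of that second approach is shown below. It assumes the subclass can simply re-annotate the inherited columns with new enumerations; the added groups and tags are hypothetical, and the authoritative pattern for wiring this up is models/example_model.py.
import enum

from csv_validator.model import BaseCluster


class MyClusterGroups(enum.StrEnum):
    """ All cluster groups considered valid for this source of truth """
    prod_us = 'prod-us'
    nonprod_us = 'nonprod-us'
    prod_ca = 'prod-ca'  # hypothetical additional group


class MyTags(enum.StrEnum):
    """ All tags considered valid for this source of truth """
    Corp = 'corp'
    Franchise = 'franchise'  # hypothetical additional tag


class SourceOfTruthModel(BaseCluster):
    # Re-annotate the inherited columns so they validate against the new
    # enumerations; whether plain re-annotation is sufficient depends on how
    # BaseCluster declares these fields (see models/example_model.py)
    cluster_group: MyClusterGroups
    cluster_tags: MyTags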
There are numerous examples of how one may write their own models:
- models/example_model.py shows how to create a new model by subclassing BaseCluster to customize the valid groups and tags, then uses it to validate a number of example CSV columns, each with unique properties
- models/cluster_registry.py, which subclasses and builds on the BaseCluster model (defined in csv_validator/model.py)
- models/platform.py, which likewise subclasses and builds on the BaseCluster model
This CSV validator is a Python module that provides a CLI, a workflow, and components to import and use in model development. As the examples show, you may import from, extend, and update this library. The goal was (and is) to enable users to write extensible, flexible, consumable Python code, so use it in whatever way best suits your needs. To get the most out of this app, familiarity with Python is greatly beneficial.
Note: this presumes you have followed the installation instructions above and have the CLI (in the csv_validator
module) installed into your development environment.
- Start by updating csv_validator/model.py. The BaseCluster class refers to the enumerations ValidClusterGroups and ValidTags. These are stubbed. Enumerate (quite literally) the values that are valid for your use case.
class ValidClusterGroups(enum.StrEnum):
    """ Contains all values considered to be valid cluster group names """
    prod_us = 'prod-us'
    nonprod_us = 'nonprod-us'
    prod_au = 'prod-au'
    nonprod_au = 'nonprod-au'
    # adding groups
    prod_ca = 'prod-ca'
    prod_mx = 'prod-mx'
    prod_uk = 'prod-uk'
    ...
Do the same for tags:
class ValidTags(enum.StrEnum):
    """ Contains all values considered to be valid tags """
    TwentyFourSeven = '24/7'
    DriveThru = 'drivethru'
    Corp = 'corp'
    DoNotUpgrade = 'donotupgrade'
    # adding tags
    Franchise = 'franchise'
    ...
- Create a model Python file - in this example, custom.py. Import BaseCluster and pydantic.
import pydantic
from csv_validator.model import BaseCluster
- Create a custom model. Be sure it uses the class name SourceOfTruthModel and that it subclasses BaseCluster
class SourceOfTruthModel(BaseCluster):
    my_field: str
This creates a new required column called my_field of type str.
Validation schema models are simply Pydantic models. There is no magic to them - a model just needs to be valid Pydantic usage in Python code. For more information, refer to the Pydantic documentation on models.
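For illustration, a slightly richer model using ordinary Pydantic features might look like the sketch below - an optional column with a default plus a pattern-constrained string, similar in spirit to the sync-interval column validated in the cluster registry example above. The field names are hypothetical:
from typing import Optional

import pydantic

from csv_validator.model import BaseCluster


class SourceOfTruthModel(BaseCluster):
    # required column: every row must supply a non-empty value
    my_field: str = pydantic.Field(min_length=1)
    # optional column: the CLI reports a warning (rather than an error)
    # when an optional column is absent, as the invalid-CSV run above shows
    sync_interval: Optional[str] = pydantic.Field(
        default=None, pattern=r'^[0-9]*[hms]$'
    )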
- Provide a CSV that includes this column while still adhering to the BaseCluster model. Remember, we're using inheritance here to combine BaseCluster with your custom SourceOfTruthModel. Let's use custom.csv as the name.
cluster_name,cluster_group,cluster_tags,my_field
my-cluster,prod-us,"corp",this is my field
- Run the CLI importing the model (custom.py) against the CSV
validate_csv -m path/to/custom.py path/to/custom.csv
It should exit 0 if okay - you can run with -v for extra verbosity to verify:
$ validate_csv -m path/to/custom.py path/to/custom.csv
INFO    CSV is valid
Install the module in editable mode with dev dependencies:
pip install -e .[dev]
Need to wipe your virtualenv to install from scratch?
# ensure you're in the virtualenv
source bin/activate
pip uninstall -y csv_validator
pip uninstall -y -r <(pip freeze)
pip install -e .[dev]
Pylint and mypy checks are expected to pass:
pylint src
mypy src
Run unit tests from the repository root.
Note: Ensure tests are run with csv_validator installed in the current Python environment/virtualenv. See the installation section.
python3 -m unittest tests/*.py -v
Build the container:
docker build --pull --no-cache -t csv-validator .
Test the container:
$ docker run -it csv-validator --help
usage: validate_csv [-h] [-m MODULE_OR_PYTHON_FILE] [-o output_source_of_truth.csv] [-v] SOURCE_OF_TRUTH.CSV
Validate source of truth CSV schemas and data using built-in and dynamically-imported validation models
...
An error similar to the following may occur while executing python3 -m pip install if you have multiple Python versions installed.
ERROR: Package hydrate requires a different Python: 3.11.9 not in >=3.12
If so, execute the following command to point python3 to the correct version.
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.12 1