This Git repository contains the dbt code for the data quality course at Centrale Supélec.
├── analyses/ # Analytical SQL files
├── config/
│ └── profiles.yml # dbt profiles configuration
├── data/ # Raw data files
│ ├── departements.csv
│ ├── dictionnaire_donnees_naissances.csv
│ └── donnees_naissances_2021.csv
├── database/ # DuckDB database files (created at runtime)
├── macros/ # SQL macros for reusable code
├── models/ # dbt models
│ ├── departements.sql
│ ├── donnees_naissances_2021.sql
│ └── schema.yml
├── sample_data/ # Sample/test data
├── seeds/ # CSV files to be loaded as tables
├── snapshots/ # Snapshot models for SCD
├── tests/ # Custom data tests
├── dbt_project.yml # dbt project configuration
├── packages.yml # dbt package dependencies
└── pyproject.toml # Python dependencies managed by uv
This project uses uv for Python dependency management and virtual environment handling.
-
Install uv (if not already installed): Follow the instructions in the official documentation
-
Clone the repository and navigate to the project directory:
git clone <repository-url> cd cours_dq
-
Install Python dependencies:
uv sync
-
Install dbt packages:
uv run dbt deps
Create the database directory and output directory (CSV exports of created tables) with:
mkdir database
mkdir output_exportsThis project uses DuckDB as the data warehouse and the connection configuration is stored in config/profiles.yml. The database file will be created automatically in the database/ directory when you first run dbt. The output created when running a DBT model will be exported as a CSV file in output_exports.
Export the profiles directory configuration to avoid specifying it in every command:
export DBT_PROFILES_DIR=configIf you prefer not to set the environment variable, you can specify the profiles directory in each command:
uv run dbt run --profiles-dir config
uv run dbt test --profiles-dir config- Debug configuration:
uv run dbt debug- Validates your dbt configuration and database connection - Compile models:
uv run dbt compile- Compiles SQL templates without executing them - Run all models:
uv run dbt run- Executes all SQL models and creates/updates tables and views in the database - Test data quality:
uv run dbt test- Runs all defined tests to validate data quality and model assumptions - Generate documentation:
uv run dbt docs generate- Creates project documentation - Serve documentation:
uv run dbt docs serve- Serves documentation on a local web server
This project analyzes French birth data from 2021, focusing on data quality practices and techniques taught in the CentraleSupélec data quality course.
You can explore the data by :
- looking at the data sample :
sample_data/extrait_donnees_naissances_2021.csv - adapting the Python script querying the database :
uv run python analyses/explore_database.py
Task: Add data quality tests to ensure data integrity
- Add a
uniquetest for theIDcolumn in thebirth_data_2021source to ensure each record has a unique identifier - Add
not_nulltests for critical columns that should never be empty (e.g.,ID,SEXE,MNAIS)
Implementation: Add these tests to the models/schema.yml file under the appropriate column definitions.
Task: Add range validation tests for age columns using the dbt_utils.accepted_range test. See data dictionnary for accepted values.
Implementation: Use the dbt_utils.accepted_range test with min_value: <MIN> and max_value: <MAX> for each age column.
Task: Add tests to validate year columns (e.g., ANAIS, AMAR, ARECC, ARECM, ARECP) data type
Implementation: Use the appropriate data type test for a year column
Bonus: Validate year values using dbt_utils.accepted_range with appropriate min/max for year columns
Task: Verify referential integrity between birth data and department reference table
Implementation: Use the relationships test to ensure foreign key constraints are maintained between the birth data and department reference data.
Based on the results of the tests, create a new model donnees_naissances_2021_corrected.sql containing only correct data and run the same tests against it (they should all pass)
- Learn more about dbt in the docs
- Check out Discourse for commonly asked questions and answers
- Join the chat on Slack for live discussions and support
- Find dbt events near you
- Check out the blog for the latest news on dbt's development and best practices