
Data Quality Course - dbt Project

This Git repository contains the dbt code for the data quality course at CentraleSupélec.

Git repository structure

├── analyses/           # Analytical SQL files
├── config/
│   └── profiles.yml   # dbt profiles configuration
├── data/              # Raw data files
│   ├── departements.csv
│   ├── dictionnaire_donnees_naissances.csv
│   └── donnees_naissances_2021.csv
├── database/          # DuckDB database files (created at runtime)
├── macros/            # SQL macros for reusable code
├── models/            # dbt models
│   ├── departements.sql
│   ├── donnees_naissances_2021.sql
│   └── schema.yml
├── sample_data/       # Sample/test data
├── seeds/             # CSV files to be loaded as tables
├── snapshots/         # Snapshot models for SCD
├── tests/             # Custom data tests
├── dbt_project.yml    # dbt project configuration
├── packages.yml       # dbt package dependencies
└── pyproject.toml     # Python dependencies managed by uv

Installation and Setup

Prerequisites

This project uses uv for Python dependency management and virtual environment handling.

Setup Instructions

  1. Install uv (if not already installed): Follow the instructions in the official documentation

  2. Clone the repository and navigate to the project directory:

    git clone <repository-url>
    cd cours_dq
  3. Install Python dependencies:

    uv sync
  4. Install dbt packages:

    uv run dbt deps

Database Configuration

Create the database directory and the output directory (which will hold CSV exports of the created tables):

mkdir database
mkdir output_exports

This project uses DuckDB as the data warehouse; the connection configuration is stored in config/profiles.yml. The database file is created automatically in the database/ directory the first time you run dbt. The tables created when running a dbt model are also exported as CSV files to output_exports.

Usage

Environment Configuration

Export the profiles directory configuration to avoid specifying it in every command:

export DBT_PROFILES_DIR=config

Alternative Usage (with explicit profiles directory)

If you prefer not to set the environment variable, you can specify the profiles directory in each command:

uv run dbt run --profiles-dir config
uv run dbt test --profiles-dir config

Common dbt Commands

  • Debug configuration: uv run dbt debug - Validates your dbt configuration and database connection
  • Compile models: uv run dbt compile - Compiles SQL templates without executing them
  • Run all models: uv run dbt run - Executes all SQL models and creates/updates tables and views in the database
  • Test data quality: uv run dbt test - Runs all defined tests to validate data quality and model assumptions
  • Generate documentation: uv run dbt docs generate - Creates project documentation
  • Serve documentation: uv run dbt docs serve - Serves documentation on a local web server

Project Overview

This project analyzes French birth data from 2021, focusing on data quality practices and techniques taught in the CentraleSupélec data quality course.

You can explore the data by:

  • looking at the data sample: sample_data/extrait_donnees_naissances_2021.csv
  • adapting the Python script querying the database: uv run python analyses/explore_database.py

Exercises

Exercise 1 - Add uniqueness and not-null tests

Task: Add data quality tests to ensure data integrity

  • Add a unique test for the ID column in the birth_data_2021 source to ensure each record has a unique identifier
  • Add not_null tests for critical columns that should never be empty (e.g., ID, SEXE, MNAIS)

Implementation: Add these tests to the models/schema.yml file under the appropriate column definitions.
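A minimal sketch of what this could look like in models/schema.yml — the model name matches the file under models/, and the column names come from the exercise, but check them against the actual schema. If the table is instead declared as a dbt source (birth_data_2021), the same tests go under a sources: block:

```yaml
# Sketch for models/schema.yml -- unique and not_null are dbt's built-in generic tests
models:
  - name: donnees_naissances_2021
    columns:
      - name: ID
        tests:
          - unique      # every record has a distinct identifier
          - not_null
      - name: SEXE
        tests:
          - not_null
      - name: MNAIS
        tests:
          - not_null
```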

Exercise 2 - Add value range tests

Task: Add range validation tests for age columns using the dbt_utils.accepted_range test. See the data dictionary for accepted values.

Implementation: Use the dbt_utils.accepted_range test with min_value: <MIN> and max_value: <MAX> for each age column.
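One way this could look, assuming a hypothetical age column name (look up the real column names and bounds in the data dictionary — the <MIN>/<MAX> placeholders must be replaced with those values):

```yaml
# Sketch for models/schema.yml -- dbt_utils.accepted_range comes from the
# dbt_utils package installed via `uv run dbt deps`
models:
  - name: donnees_naissances_2021
    columns:
      - name: AGEMERE   # hypothetical age column; check the data dictionary
        tests:
          - dbt_utils.accepted_range:
              min_value: <MIN>   # from the data dictionary
              max_value: <MAX>   # from the data dictionary
```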

Exercise 3 - Add data type tests

Task: Add tests to validate the data type of year columns (e.g., ANAIS, AMAR, ARECC, ARECM, ARECP)

Implementation: Use the appropriate data type test for a year column

Bonus: Validate year values using dbt_utils.accepted_range with appropriate min/max for year columns
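One possible approach, assuming the dbt_expectations package is available (the project's packages.yml is not shown here, so this may need to be added as a dependency). Its expect_column_values_to_be_of_type test checks a column's database type:

```yaml
# Sketch for models/schema.yml -- requires dbt_expectations in packages.yml
models:
  - name: donnees_naissances_2021
    columns:
      - name: ANAIS
        tests:
          - dbt_expectations.expect_column_values_to_be_of_type:
              column_type: integer   # years should be stored as integers
```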

Exercise 4 - Add relationship test

Task: Verify referential integrity between birth data and department reference table

Implementation: Use the relationships test to ensure foreign key constraints are maintained between the birth data and department reference data.
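A sketch of the built-in relationships test; the column names here (DEPNAIS on the birth data, DEP in the department table) are assumptions — verify them against the actual files:

```yaml
# Sketch for models/schema.yml -- column names are hypothetical
models:
  - name: donnees_naissances_2021
    columns:
      - name: DEPNAIS   # assumed department-of-birth column
        tests:
          - relationships:
              to: ref('departements')
              field: DEP   # assumed department code column in the reference table
```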

Bonus

Based on the results of the tests, create a new model donnees_naissances_2021_corrected.sql containing only correct data and run the same tests against it (they should all pass)
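A minimal sketch of such a model, assuming the filter conditions mirror the tests added in the exercises (the exact conditions and column names depend on which tests failed):

```sql
-- Hypothetical models/donnees_naissances_2021_corrected.sql:
-- keep only rows that satisfy the tested constraints
select *
from {{ ref('donnees_naissances_2021') }}
where ID is not null
  and SEXE is not null
  and MNAIS is not null
```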

Resources:

  • Learn more about dbt in the docs
  • Check out Discourse for commonly asked questions and answers
  • Join the chat on Slack for live discussions and support
  • Find dbt events near you
  • Check out the blog for the latest news on dbt's development and best practices
