
feat: Add isolated devcontainer environments for Analytics Engineering module #790

Draft
lassebenni wants to merge 38 commits into DataTalksClub:main from lassebenni:feat/devcontainer

Conversation

@lassebenni
Contributor

Overview

This PR adds GitHub Codespaces devcontainer configurations for the Analytics Engineering module (Module 4), providing isolated workspace environments optimized for dbt development.

Features

Dual Environment Setup

  • DuckDB Environment: Local processing with pre-baked taxi data
  • BigQuery Environment: Cloud processing with BigQuery backend

Key Improvements

  1. Isolated Workspace Pattern: the student workspace is kept separate from the repo root
  2. Pre-baked Data: DuckDB environment includes NYC taxi data (2019-2020) baked into the Docker image
  3. Automated Setup: scripts handle file promotion and configuration
  4. Cross-Platform dbt: the project supports both DuckDB and BigQuery through conditional Jinja logic (see the sketch after this list)
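
For illustration, a minimal sketch of the conditional Jinja pattern in a dbt model. The model and column names (`stg_green_tripdata`, `duration_s`) are hypothetical, not taken from this PR:

```sql
select
    -- pick the engine-specific date function at compile time via target.type
    {% if target.type == 'bigquery' %}
    timestamp_diff(dropoff_datetime, pickup_datetime, second) as duration_s
    {% else %}
    date_diff('second', pickup_datetime, dropoff_datetime) as duration_s
    {% endif %}
from {{ ref('stg_green_tripdata') }}
```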

Files Added

  • .devcontainer/ - Devcontainer configurations for DuckDB and BigQuery
  • scripts/deploy_codespaces.sh - Non-interactive Codespace deployment
  • scripts/verify_codespace.sh - Health check and verification script

Documentation

  • Updated README with Codespaces integration guides
  • Added setup documentation for both environments

Testing

  • ✅ DuckDB environment tested with full 2019-2020 taxi dataset
  • ✅ dbt models build successfully in isolated workspace
  • ✅ BigQuery environment configured for cloud processing

Technical Notes

  • Uses a Docker multi-stage build with conditional data baking (sketched after this list)
  • Python virtual environment at /opt/venv with dbt pre-installed
  • Git initialized in isolated workspace for student work
  • VS Code settings automatically configured per environment
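
A minimal sketch of the conditional-baking pattern, assuming a hypothetical `load_data.sh` helper (the actual script names in this PR may differ):

```dockerfile
# build arg controls whether the taxi data gets baked into the image
ARG BAKE_DATA=false
COPY scripts/ /opt/devcontainer/scripts/
RUN if [ "$BAKE_DATA" = "true" ]; then \
        bash /opt/devcontainer/scripts/load_data.sh; \
    fi
```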

Lasse Benninga added 10 commits January 27, 2026 22:34
- Remove tracked DuckDB database files from repository
- Remove CLAUDE.md (AI guidance file)
- Update .gitignore to exclude:
  - DuckDB database files (*.duckdb, *.db)
  - dbt build artifacts (target/, dbt_packages/, logs/)
  - CLAUDE.md

These files should not be versioned as they are either:
- Generated artifacts (dbt)
- Local development data (DuckDB)
- AI-specific guidance (CLAUDE.md)
@alexeygrigorev
Member

Can you move these files to the 04 folder and add a readme describing how to use them?

In the past we already had experience with .devcontainer - one of the students contributed it, but in the end nobody used it. So we need to make sure it's clear what the benefits are and how to use it, without creating files in the root of the repo

@lassebenni lassebenni marked this pull request as draft January 29, 2026 11:09
@lassebenni
Contributor Author

lassebenni commented Jan 29, 2026

Can you move these files to the 04 folder and add a readme describing how to use them?

In the past we already had experience with .devcontainer - one of the students contributed it, but in the end nobody used it. So we need to make sure it's clear what the benefits are and how to use it, without creating files in the root of the repo

Hi Alexey, sorry, I meant to create this PR against my own fork first. I've put it in draft now.

But good to hear. Are you open to using devcontainers again? I will work on it and ensure it is of good quality before un-drafting. I can add a video on how to use it.

@alexeygrigorev
Member

A video would help

Lasse Benninga and others added 16 commits January 29, 2026 15:27
- Change VARCHAR to STRING data types in schema.yml (BigQuery requirement)
- Update on_schema_change from 'sync_all_columns' to 'append_new_columns'
- Add explicit CAST to STRING for store_and_fwd_flag field

Fixes "Type not found: varchar" error when building fct_trips on BigQuery.
All VARCHAR types replaced with STRING (6 occurrences in schema.yml).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
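
For illustration, a minimal sketch of the kind of model change this commit describes; the incremental config and `ref()` target are assumptions:

```sql
{{ config(
    materialized = 'incremental',
    on_schema_change = 'append_new_columns'
) }}

select
    -- BigQuery has no VARCHAR type; STRING is valid there and is also
    -- a VARCHAR alias in DuckDB, so the cast works on both engines
    cast(store_and_fwd_flag as string) as store_and_fwd_flag
from {{ ref('stg_green_tripdata') }}
```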
Add complete setup guide for BigQuery codespace:
- Part 1: Authentication setup with gcloud
- Part 2: Data loading (existing Module 3 data OR fresh load)
- Part 3: Running dbt with environment variable setup

New files:
- setup_guide.md: Complete user guide with troubleshooting
- create_external_tables.sh: Helper script for BigQuery external tables

Updates:
- postCreate.sh: Auto-open setup guide on codespace creation

Key features:
- Two data loading options (reuse existing OR load fresh)
- GCP_PROJECT_ID environment variable setup (critical for dbt)
- Expected build results (PASS=34, ERROR=1, SKIP=11)
- Comprehensive troubleshooting section
- Helper script for automated table creation

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
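
For illustration, a sketch of the kind of DDL a helper like create_external_tables.sh might run once GCP_PROJECT_ID is exported. The project, dataset, and bucket names are placeholders:

```sql
CREATE OR REPLACE EXTERNAL TABLE `my-project.trips_data_all.green_tripdata`
OPTIONS (
  format = 'CSV',
  compression = 'GZIP',
  uris = ['gs://my-bucket/green/green_tripdata_2019-*.csv.gz']
);
```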
Add hostRequirements to devcontainer configs for appropriate sizing:

BigQuery codespace:
- 2-core, 8GB (basicLinux32gb)
- Minimal resources needed (BigQuery processes data server-side)
- 50% cost reduction vs previous 4-core config

DuckDB codespace:
- 4-core, 16GB (standardLinux32gb)
- Higher resources needed for local data processing
- Prevents OOM errors with ~7GB local database

Rationale:
- BigQuery: dbt only compiles SQL and sends to BigQuery API
- DuckDB: Processes full dataset locally, needs more memory

This ensures each environment gets appropriate resources without
over-provisioning, reducing costs while maintaining performance.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
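
For reference, the devcontainer spec expresses these sizes via hostRequirements; a sketch for the DuckDB variant (the storage value is assumed from the 32 GB machine types named above):

```jsonc
// .devcontainer/duckdb/devcontainer.json (excerpt)
{
  "hostRequirements": {
    "cpus": 4,
    "memory": "16gb",
    "storage": "32gb"
  }
}
```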
Added comprehensive setup guide for DuckDB codespace:
- Part 1: Environment verification (dbt, database, connection)
- Part 2: Understanding data sources and project structure
- Part 3: Building base models with expected results
- Part 4: Homework questions roadmap (Q1-Q7 guidance)
- Part 5: Creating new models with SQL templates
- Part 6: FHV data setup guide (staging and core models)
- Troubleshooting section (5 common issues)
- Tips for success and quick reference commands

Features:
- Auto-opens on codespace creation
- Step-by-step instructions for all homework questions
- SQL examples for date extraction, window functions, percentiles
- Expected build performance metrics
- Data validation queries

Updated postCreate.sh to auto-copy and open setup guide.

Changed from baking data into Docker image to downloading pre-built database:
- Disabled BAKE_DATA in devcontainer.json (false instead of true)
- Updated postCreate.sh to download 3.3GB database from GitHub release
- Increased DuckDB memory_limit from 4GB to 8GB for better performance

Performance impact:
- Before: ~14 minutes (image rebuild with data baking)
- After:  ~4-6 minutes (fast image + database download)
- Improvement: 60% faster startup time

Benefits:
- Faster codespace provisioning (no 12-min image rebuild)
- int_trips model will now succeed with 8GB memory limit
- Easy to update data (just upload new release)
- Better student experience

Database will be uploaded to GitHub release v1.0.0:
- Contains: Green (8M), Yellow (109M), FHV (43M) records
- Size: 3.3GB
- Download time: 2-4 minutes on typical network

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
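
For illustration, the dbt-duckdb profile setting this commit refers to might look like the following sketch (the profile name and database path are hypothetical):

```yaml
taxi_rides_ny:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: /home/vscode/homework/taxi.duckdb
      settings:
        memory_limit: 8GB
```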
Switch from downloading 3.3GB pre-built database to downloading only
necessary Parquet files directly from DataTalksClub releases.

Changes:
- Add download_homework_data.sh script to download selective Parquet files
- Downloads Green/Yellow 2019-2020 (48 files) + FHV Nov 2019 (1 file)
- Uses DuckDB's read_parquet() to load directly from HTTPS URLs
- Update postCreate.sh to call selective download script

Benefits:
- Same data coverage for all homework questions (Q5-Q7)
- Correct answers guaranteed (includes 2019-2020 for YoY comparison)
- Potentially faster than 3.3GB download
- No need for pre-built database GitHub releases
- Students download only what's needed

Coverage:
- Q5: Quarterly revenue YoY (2019-2020 data ✅)
- Q6: Fare percentiles (April 2020 data ✅)
- Q7: FHV travel time (November 2019 data ✅)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The DataTalksClub/nyc-tlc-data releases contain CSV.gz files, not
Parquet files. Update download script to use read_csv() with
compression='gzip' and auto_detect=true.

Changes:
- Replace read_parquet() with read_csv()
- Update file extensions from .parquet to .csv.gz
- Add auto_detect and compression parameters

This fixes the 404 errors when downloading data files.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
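
For illustration, a single-file load using DuckDB's read_csv() as described. The release URL follows the DataTalksClub/nyc-tlc-data pattern; the target table and schema names are assumptions:

```sql
-- https URLs use DuckDB's httpfs extension, which is auto-loaded
CREATE OR REPLACE TABLE main.green_tripdata AS
SELECT *
FROM read_csv(
    'https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-01.csv.gz',
    auto_detect = true,
    compression = 'gzip'
);
```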
GitHub returns HTTP 403 errors when DuckDB tries to download multiple
files simultaneously. Change approach to:
1. Download CSV.gz files one-by-one using wget/curl
2. Load downloaded files from disk into DuckDB
3. Clean up files after loading to save space

This avoids GitHub rate limiting while keeping the selective download
approach for homework data.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
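
A minimal bash sketch of this download-then-load approach, with file layout and table names as assumptions:

```bash
#!/usr/bin/env bash
set -euo pipefail
BASE=https://github.com/DataTalksClub/nyc-tlc-data/releases/download

# 1. download CSV.gz files one-by-one to avoid GitHub rate limiting
for month in 2019-{01..12} 2020-{01..12}; do
    wget -q "${BASE}/green/green_tripdata_${month}.csv.gz"
done

# 2. load the local files into DuckDB in a single pass
duckdb taxi.duckdb "CREATE OR REPLACE TABLE main.green_tripdata AS
    SELECT * FROM read_csv('green_tripdata_*.csv.gz',
                           auto_detect=true, compression='gzip');"

# 3. clean up the raw files to save disk space
rm -f green_tripdata_*.csv.gz
```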
Update the download message to reflect the actual performance:
- Downloads 49 CSV files (not Parquet)
- Takes ~2.5 minutes (verified in test codespace)
- Achieves exact record counts matching homework requirements

Verified results:
- Green Taxi: 7,778,101 records ✅
- Yellow Taxi: 109,047,518 records ✅
- FHV: 1,879,137 records ✅
- Database size: 2.6G
- Total time: ~2min 41sec

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The source data is loaded in the prod schema, so all dbt commands
need to specify --target prod to work correctly.

Updated commands in:
- Part 1: Verify Environment Setup
- Part 3: Build the Base Models
- Part 5: Creating New Models for Homework
- Part 6: Working with FHV Data
- Quick Reference Commands
- Troubleshooting

Added explanation that users can either:
1. Always use --target prod (recommended for homework)
2. Or build models in dev first

This fixes the "Table does not exist" error users would encounter
when following the guide without --target prod.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
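
For reference, the updated commands in the guide take this shape:

```bash
# build everything against the prod target, where the raw data lives
dbt build --target prod

# or run a single model
dbt run --select fct_trips --target prod
```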
Since we're not baking data into the Docker image anymore (we download
it during postCreate instead), the BAKE_DATA build arg is always false
and doesn't need to be explicitly set in devcontainer.json.

The Dockerfile still has ARG BAKE_DATA=false as a default, so the
conditional logic is preserved if needed in the future, but the
configs are now cleaner.

Changes:
- Remove BAKE_DATA from duckdb/devcontainer.json
- Remove BAKE_DATA from bigquery/devcontainer.json

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add comprehensive homework guidance to BigQuery setup guide to match
DuckDB guide structure:

- Part 4: Homework Questions Guide with hints for all 7 questions
- Part 5: Creating New Models with SQL templates and examples
- Part 6: Working with FHV Data with step-by-step instructions
- Tips for Success section with best practices
- Quick Reference Commands section for common operations

BigQuery-specific SQL examples:
- PERCENTILE_CONT syntax (different from DuckDB)
- TIMESTAMP_DIFF for trip duration calculations
- Query execution patterns for large datasets

Both codespaces now have equally comprehensive guides that prepare
students for homework without giving away answers.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
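
For illustration, the BigQuery-specific functions mentioned above look like this; the table path and percentile value are placeholders, not homework answers:

```sql
SELECT
  -- trip duration in seconds
  TIMESTAMP_DIFF(dropoff_datetime, pickup_datetime, SECOND) AS duration_s,
  -- analytic percentile: note the OVER clause, unlike DuckDB's aggregate form
  PERCENTILE_CONT(fare_amount, 0.5) OVER () AS median_fare
FROM `my-project.dev.fct_trips`
```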
Simplifies the setup by using consistent dev target throughout:

SCHEMA CHANGES:
- Raw data now loads into 'main' schema (was 'prod')
- All dbt models use default 'dev' schema
- sources.yml updated to reference 'main' schema for DuckDB

SETUP GUIDE CHANGES:
- Removed all --target prod flags throughout guide
- Updated SQL examples to query main.green_tripdata (raw data)
- Updated SQL examples to query dev.fct_trips (dbt models)
- Removed schema='prod' from model config examples
- Added note that all models use dev schema by default

DOWNLOAD SCRIPT CHANGES:
- Changed prod.green_tripdata -> main.green_tripdata
- Changed prod.yellow_tripdata -> main.yellow_tripdata
- Changed prod.fhv_tripdata -> main.fhv_tripdata
- Updated final message to say "dbt build" (no --target flag)

DEVCONTAINER CHANGES:
- Created separate Dockerfiles for DuckDB and BigQuery
- DuckDB: Only dbt-duckdb, DuckDB CLI, no cloud tools
- BigQuery: Only dbt-bigquery, Google Cloud CLI, GCS packages
- Removed shared Dockerfile
- Updated devcontainer.json to reference local Dockerfiles

BENEFITS:
- Students don't need to remember --target prod everywhere
- No confusion about which schema has what data
- Simpler mental model: main = raw, dev = dbt models
- Faster Docker builds with environment-specific dependencies

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
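
A sketch of the updated sources.yml for DuckDB after this change (the source name is an assumption):

```yaml
version: 2

sources:
  - name: raw
    schema: main
    tables:
      - name: green_tripdata
      - name: yellow_tripdata
      - name: fhv_tripdata
```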
REMOVED COMPLETE SQL SOLUTIONS:
- Part 6 previously provided complete stg_fhv_tripdata.sql code
- Part 6 previously provided complete dim_fhv_trips.sql code
- These are exactly what students need to create for Question 7

REPLACED WITH GUIDANCE ONLY:
- High-level steps describing what to do
- Hints about which columns to include
- References to similar models for patterns
- Key SQL functions to use (EXTRACT, TIMESTAMP_DIFF)
- NO complete working SQL code

VERIFIED NO OTHER LEAKS:
- Questions 5, 6, 7 sections provide hints only
- SQL examples show function usage, not complete solutions
- No specific homework answers (numbers, locations) in guides
- Base models (stg_*, int_*, fct_trips) are course material, not homework

Both DuckDB and BigQuery guides updated consistently.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
ISSUE:
When we changed Dockerfile build context from '..' to '.', the COPY
command now copies from .devcontainer/duckdb/ to /opt/devcontainer/
instead of copying the entire .devcontainer/ directory.

This changed the script paths:
- OLD: /opt/devcontainer/duckdb/scripts/postCreate.sh
- NEW: /opt/devcontainer/scripts/postCreate.sh

RESULT:
postStartCommand was pointing to wrong path, so postCreate script
never ran, leaving /home/vscode/homework empty.

FIX:
Updated postStartCommand in both devcontainer.json files to use
correct path: /opt/devcontainer/scripts/postCreate.sh

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
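
For reference, a sketch of the corrected config; keys beyond these are omitted:

```jsonc
// .devcontainer/duckdb/devcontainer.json (excerpt)
{
  "build": { "dockerfile": "Dockerfile", "context": "." },
  // with context '.', the Dockerfile's COPY places the duckdb/ contents
  // directly under /opt/devcontainer/, hence the shorter script path
  "postStartCommand": "bash /opt/devcontainer/scripts/postCreate.sh"
}
```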
ISSUE:
Codespace creation failed with error:
"Command failed: docker inspect --type image
mcr.microsoft.com/devcontainers/python:3.11-bookworm"

The tag '3.11-bookworm' doesn't exist in MCR.

CORRECT TAG FORMAT:
Microsoft's devcontainer Python images use format: {version}-{python}-{os}
Example: 1-3.11-bookworm (version 1, Python 3.11, Debian Bookworm)

FIX:
Changed FROM statement in both Dockerfiles:
- FROM mcr.microsoft.com/devcontainers/python:3.11-bookworm
+ FROM mcr.microsoft.com/devcontainers/python:1-3.11-bookworm

This uses the latest major version 1.x of the devcontainer image.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Lasse Benninga and others added 12 commits February 2, 2026 12:30
ISSUE:
After changing build context from '..' to '.', the file structure
in /opt/devcontainer changed:
- OLD: /opt/devcontainer/duckdb/scripts/, /opt/devcontainer/bigquery/scripts/
- NEW: /opt/devcontainer/scripts/

Additionally, common_setup.sh was in parent .devcontainer/scripts/
which is outside the build context, so it wasn't copied into images.

RESULT:
postCreate script failed silently because:
1. common_setup.sh wasn't in the image
2. All file paths were wrong (still used /opt/devcontainer/duckdb/ prefix)

FIX:
1. Copied shared scripts (common_setup.sh, settings.json.template)
   into each variant's scripts/ directory
2. Updated all file paths in both postCreate.sh scripts:
   - /opt/devcontainer/duckdb/dbt/ → /opt/devcontainer/dbt/
   - /opt/devcontainer/duckdb/scripts/ → /opt/devcontainer/scripts/
   - /opt/devcontainer/duckdb/setup_guide.md → /opt/devcontainer/setup_guide.md
   (Same for BigQuery variant)

Now setup will work correctly with the new build context structure.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
ISSUE:
common_setup.sh was falling back to 2023 homework.md which contained
complete solutions at the bottom:
- Question 1: 61648442
- Question 2: 89.9/10.1
- Question 3: 43244696
- Question 4: 22998722
- Question 5: January

FIX:
Copied 2025 cohort homework.md to 04-analytics-engineering/HOMEWORK.md
which only says "Solution - To be published after deadline"

Now students get homework questions without seeing the answers!

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove extensive homework guidance from both DuckDB and BigQuery
setup guides to prevent solution leaks. Setup guides now focus on:
- Environment setup and verification
- Basic dbt usage
- Troubleshooting common issues
- Reference to HOMEWORK.md for actual homework questions

Removed sections:
- Part 4: Homework Questions Guide (detailed hints)
- Part 5: Creating New Models for Homework (SQL templates)
- Part 6: Working with FHV Data (step-by-step instructions)

Replaced with simple "Next Steps" section that refers students
to HOMEWORK.md for homework questions and instructions.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replace 2025 cohort homework with 2026 cohort homework from
DataTalksClub/data-engineering-zoomcamp to ensure students
have the correct and current homework questions.

Source: https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/cohorts/2026/04-analytics-engineering/homework.md

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Change FROM mcr.microsoft.com/devcontainers/python:1-3.11-bookworm
to FROM mcr.microsoft.com/devcontainers/python:3.11-bookworm

The 1-3.11-bookworm tag format is invalid and causes Docker
image pull failures in GitHub Codespaces. Using the standard
3.11-bookworm tag instead.

Fixes container creation error:
"Command failed: docker inspect --type image
mcr.microsoft.com/devcontainers/python:1-3.11-bookworm"

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Switch from mcr.microsoft.com/devcontainers/python to
python:3.11-slim-bookworm to avoid Docker registry issues
in GitHub Codespaces.

Changes:
- Use official python:3.11-slim-bookworm as base image
- Manually create vscode user with sudo privileges
- Maintains all functionality and compatibility

This resolves the persistent container creation errors:
"Command failed: docker inspect --type image
mcr.microsoft.com/devcontainers/python:*"

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
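
A minimal sketch of the base-image swap and manual user creation this commit describes:

```dockerfile
FROM python:3.11-slim-bookworm

# recreate the non-root 'vscode' user that the devcontainer images provide
RUN apt-get update \
    && apt-get install -y --no-install-recommends sudo \
    && useradd --create-home --shell /bin/bash vscode \
    && echo "vscode ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/vscode \
    && chmod 0440 /etc/sudoers.d/vscode \
    && rm -rf /var/lib/apt/lists/*

USER vscode
```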
Remove BigQuery codespace references from main documentation
to simplify the student experience. Updates:

- Main README: Remove BigQuery option, keep only DuckDB
- 04-analytics-engineering/README: Focus prerequisites on DuckDB
- Simplify setup section to single DuckDB option
- Add note that homework uses DuckDB (videos show both for learning)

BigQuery codespace files remain in repo but are not promoted
in student-facing documentation.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove BigQuery codespace option from setup documentation,
leaving only DuckDB option. This completes the removal of
BigQuery codespace references from all student-facing docs.

Files still containing BigQuery setup info (but not linked):
- .devcontainer/bigquery/ (kept for potential future use)
- setup/cloud_setup.md (BigQuery-specific guide, not linked)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove BigQuery devcontainer and setup files as we're focusing
solely on DuckDB for the homework environment.

Removed files:
- .devcontainer/bigquery/ (all files)
- 04-analytics-engineering/setup/cloud_setup.md

Backup available in branch: backup/bigquery-devcontainer

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
