feat: Add isolated devcontainer environments for Analytics Engineering module #790
lassebenni wants to merge 38 commits into DataTalksClub:main from
Conversation
- Remove tracked DuckDB database files from repository
- Remove CLAUDE.md (AI guidance file)
- Update .gitignore to exclude:
  - DuckDB database files (*.duckdb, *.db)
  - dbt build artifacts (target/, dbt_packages/, logs/)
  - CLAUDE.md

These files should not be versioned as they are either:
- Generated artifacts (dbt)
- Local development data (DuckDB)
- AI-specific guidance (CLAUDE.md)
Can you move these files to the 04 folder and add a readme describing how to use them? In the past we already had experience with .devcontainer - one of the students contributed it, but in the end nobody used it. So we need to make sure it's clear what the benefits are and how to use it - without creating files in the root of the repo.
Hi Alexey, sorry, I meant to create this PR against my own fork first. I've put it in draft now. But good to hear - are you open to using devcontainers again? I will work on it and make sure it is of good quality before un-drafting. I can add a video on how to use it.
A video would help
- Change VARCHAR to STRING data types in schema.yml (BigQuery requirement)
- Update on_schema_change from 'sync_all_columns' to 'append_new_columns'
- Add explicit CAST to STRING for the store_and_fwd_flag field

Fixes the "Type not found: varchar" error when building fct_trips on BigQuery. All VARCHAR types replaced with STRING (6 occurrences in schema.yml).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
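For illustration only: BigQuery's SQL dialect has no VARCHAR type, which is why the cast targets STRING. A minimal, hypothetical check with the bq CLI (a configured default project is assumed; the value is a placeholder):

```bash
# BigQuery accepts STRING; VARCHAR fails with "Type not found: varchar".
bq query --use_legacy_sql=false \
  "SELECT CAST('N' AS STRING) AS store_and_fwd_flag"
```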
Add complete setup guide for BigQuery codespace:
- Part 1: Authentication setup with gcloud
- Part 2: Data loading (existing Module 3 data OR fresh load)
- Part 3: Running dbt with environment variable setup

New files:
- setup_guide.md: Complete user guide with troubleshooting
- create_external_tables.sh: Helper script for BigQuery external tables

Updates:
- postCreate.sh: Auto-open setup guide on codespace creation

Key features:
- Two data loading options (reuse existing OR load fresh)
- GCP_PROJECT_ID environment variable setup (critical for dbt)
- Expected build results (PASS=34, ERROR=1, SKIP=11)
- Comprehensive troubleshooting section
- Helper script for automated table creation

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
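A minimal sketch of the authentication flow Parts 1 and 3 describe; PROJECT_ID is a placeholder and the exact dbt profile wiring is assumed:

```bash
# Authenticate and expose the project ID that the dbt profile reads.
gcloud auth login
gcloud auth application-default login   # credentials dbt-bigquery picks up
gcloud config set project PROJECT_ID
export GCP_PROJECT_ID=PROJECT_ID        # "critical for dbt", per the guide
dbt build                               # run from the dbt project directory
```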
Add hostRequirements to devcontainer configs for appropriate sizing.

BigQuery codespace:
- 2-core, 8GB (basicLinux32gb)
- Minimal resources needed (BigQuery processes data server-side)
- 50% cost reduction vs previous 4-core config

DuckDB codespace:
- 4-core, 16GB (standardLinux32gb)
- Higher resources needed for local data processing
- Prevents OOM errors with the ~7GB local database

Rationale:
- BigQuery: dbt only compiles SQL and sends it to the BigQuery API
- DuckDB: Processes the full dataset locally, needs more memory

This ensures each environment gets appropriate resources without over-provisioning, reducing costs while maintaining performance.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
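For reference, hostRequirements is the standard devcontainer.json field for this kind of sizing. A minimal sketch of what the DuckDB variant's keys might look like (contents assumed, not copied from the repo):

```bash
# Write an example sizing block to a scratch file for inspection.
cat > /tmp/devcontainer-sizing-example.json <<'EOF'
{
  "hostRequirements": {
    "cpus": 4,
    "memory": "16gb"
  }
}
EOF
```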
Added comprehensive setup guide for DuckDB codespace:
- Part 1: Environment verification (dbt, database, connection)
- Part 2: Understanding data sources and project structure
- Part 3: Building base models with expected results
- Part 4: Homework questions roadmap (Q1-Q7 guidance)
- Part 5: Creating new models with SQL templates
- Part 6: FHV data setup guide (staging and core models)
- Troubleshooting section (5 common issues)
- Tips for success and quick reference commands

Features:
- Auto-opens on codespace creation
- Step-by-step instructions for all homework questions
- SQL examples for date extraction, window functions, percentiles
- Expected build performance metrics
- Data validation queries

Updated postCreate.sh to auto-copy and open the setup guide.
…rtup

Changed from baking data into the Docker image to downloading a pre-built database:
- Disabled BAKE_DATA in devcontainer.json (false instead of true)
- Updated postCreate.sh to download the 3.3GB database from a GitHub release
- Increased DuckDB memory_limit from 4GB to 8GB for better performance

Performance impact:
- Before: ~14 minutes (image rebuild with data baking)
- After: ~4-6 minutes (fast image + database download)
- Improvement: 60% faster startup time

Benefits:
- Faster codespace provisioning (no 12-min image rebuild)
- int_trips model will now succeed with the 8GB memory limit
- Easy to update data (just upload a new release)
- Better student experience

Database will be uploaded to GitHub release v1.0.0:
- Contains: Green (8M), Yellow (109M), FHV (43M) records
- Size: 3.3GB
- Download time: 2-4 minutes on a typical network

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
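A hedged sketch of the startup flow this commit describes; the release URL, owner/repo, database filename, and table name are all placeholders, and memory_limit set this way applies only to the one CLI session (the real setting would live in the dbt/DuckDB config):

```bash
# Download the pre-built database, then raise DuckDB's memory ceiling.
wget -O /home/vscode/homework/homework.duckdb \
  "https://github.com/OWNER/REPO/releases/download/v1.0.0/homework.duckdb"
duckdb /home/vscode/homework/homework.duckdb \
  -c "SET memory_limit='8GB'; SELECT count(*) FROM green_tripdata;"
```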
Switch from downloading a 3.3GB pre-built database to downloading only the necessary Parquet files directly from DataTalksClub releases.

Changes:
- Add download_homework_data.sh script to download selective Parquet files
- Downloads Green/Yellow 2019-2020 (48 files) + FHV Nov 2019 (1 file)
- Uses DuckDB's read_parquet() to load directly from HTTPS URLs
- Update postCreate.sh to call the selective download script

Benefits:
- Same data coverage for all homework questions (Q5-Q7)
- Correct answers guaranteed (includes 2019-2020 for YoY comparison)
- Potentially faster than the 3.3GB download
- No need for pre-built database GitHub releases
- Students download only what's needed

Coverage:
- Q5: Quarterly revenue YoY (2019-2020 data ✅)
- Q6: Fare percentiles (April 2020 data ✅)
- Q7: FHV travel time (November 2019 data ✅)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
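The technique in question, sketched with an illustrative URL: DuckDB's read_parquet() can read HTTPS paths once the httpfs extension is loaded (it autoloads in recent DuckDB). Note the very next commit finds these particular releases are actually CSV.gz:

```bash
# Load a remote Parquet file straight into a table over HTTPS.
duckdb homework.duckdb -c "
CREATE TABLE green_tripdata AS
SELECT * FROM read_parquet('https://example.com/green_tripdata_2019-01.parquet');
"
```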
The DataTalksClub/nyc-tlc-data releases contain CSV.gz files, not Parquet files. Update the download script to use read_csv() with compression='gzip' and auto_detect=true.

Changes:
- Replace read_parquet() with read_csv()
- Update file extensions from .parquet to .csv.gz
- Add auto_detect and compression parameters

This fixes the 404 errors when downloading data files.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
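The corrected call shape, sketched against the release URL pattern the commit implies (filename and table name are illustrative):

```bash
# read_csv() with gzip decompression and schema auto-detection.
duckdb homework.duckdb -c "
CREATE TABLE green_tripdata AS
SELECT * FROM read_csv(
  'https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-01.csv.gz',
  auto_detect=true, compression='gzip');
"
```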
GitHub returns HTTP 403 errors when DuckDB tries to download multiple files simultaneously. Change the approach to:
1. Download CSV.gz files one-by-one using wget/curl
2. Load the downloaded files from disk into DuckDB
3. Clean up files after loading to save space

This avoids GitHub rate limiting while keeping the selective download approach for homework data.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
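A sketch of that download-then-load loop, trimmed to three months for brevity (the real script covers 2019-2020; it's assumed the green_tripdata table already exists from an initial load):

```bash
BASE="https://github.com/DataTalksClub/nyc-tlc-data/releases/download"
for m in 01 02 03; do
  f="green_tripdata_2019-${m}.csv.gz"
  wget -q "${BASE}/green/${f}"     # one file at a time, avoiding the 403s
  duckdb homework.duckdb -c \
    "INSERT INTO green_tripdata
     SELECT * FROM read_csv('${f}', auto_detect=true, compression='gzip');"
  rm "${f}"                        # clean up after loading to save space
done
```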
Update the download message to reflect the actual performance:
- Downloads 49 CSV files (not Parquet)
- Takes ~2.5 minutes (verified in a test codespace)
- Achieves exact record counts matching homework requirements

Verified results:
- Green Taxi: 7,778,101 records ✅
- Yellow Taxi: 109,047,518 records ✅
- FHV: 1,879,137 records ✅
- Database size: 2.6G
- Total time: ~2min 41sec

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The source data is loaded in the prod schema, so all dbt commands need to specify --target prod to work correctly.

Updated commands in:
- Part 1: Verify Environment Setup
- Part 3: Build the Base Models
- Part 5: Creating New Models for Homework
- Part 6: Working with FHV Data
- Quick Reference Commands
- Troubleshooting

Added an explanation that users can either:
1. Always use --target prod (recommended for homework)
2. Or build models in dev first

This fixes the "Table does not exist" error users would encounter when following the guide without --target prod.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
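For reference, the command shape the guide switched to (model name illustrative; a later commit removes the need for the flag entirely):

```bash
dbt build --target prod
dbt run --select stg_green_tripdata --target prod
dbt test --target prod
```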
Since we're no longer baking data into the Docker image (we download it during postCreate instead), the BAKE_DATA build arg is always false and doesn't need to be explicitly set in devcontainer.json.

The Dockerfile still has ARG BAKE_DATA=false as a default, so the conditional logic is preserved if needed in the future, but the configs are now cleaner.

Changes:
- Remove BAKE_DATA from duckdb/devcontainer.json
- Remove BAKE_DATA from bigquery/devcontainer.json

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add comprehensive homework guidance to the BigQuery setup guide to match the DuckDB guide structure:
- Part 4: Homework Questions Guide with hints for all 7 questions
- Part 5: Creating New Models with SQL templates and examples
- Part 6: Working with FHV Data with step-by-step instructions
- Tips for Success section with best practices
- Quick Reference Commands section for common operations

BigQuery-specific SQL examples:
- PERCENTILE_CONT syntax (different from DuckDB)
- TIMESTAMP_DIFF for trip duration calculations
- Query execution patterns for large datasets

Both codespaces now have equally comprehensive guides that prepare students for homework without giving away answers.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
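Illustrative BigQuery-dialect snippets for the two functions called out above; the project, dataset, table, and columns are placeholders, not homework answers:

```bash
# PERCENTILE_CONT is an analytic function in BigQuery (requires OVER()).
bq query --use_legacy_sql=false '
SELECT DISTINCT PERCENTILE_CONT(fare_amount, 0.95) OVER () AS p95_fare
FROM `my-project.my_dataset.trips`'

# TIMESTAMP_DIFF(later, earlier, part) for trip durations.
bq query --use_legacy_sql=false '
SELECT TIMESTAMP_DIFF(dropoff_datetime, pickup_datetime, SECOND) AS duration_s
FROM `my-project.my_dataset.trips` LIMIT 5'
```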
Simplifies the setup by using a consistent dev target throughout.

SCHEMA CHANGES:
- Raw data now loads into the 'main' schema (was 'prod')
- All dbt models use the default 'dev' schema
- sources.yml updated to reference the 'main' schema for DuckDB

SETUP GUIDE CHANGES:
- Removed all --target prod flags throughout the guide
- Updated SQL examples to query main.green_tripdata (raw data)
- Updated SQL examples to query dev.fct_trips (dbt models)
- Removed schema='prod' from model config examples
- Added a note that all models use the dev schema by default

DOWNLOAD SCRIPT CHANGES:
- Changed prod.green_tripdata -> main.green_tripdata
- Changed prod.yellow_tripdata -> main.yellow_tripdata
- Changed prod.fhv_tripdata -> main.fhv_tripdata
- Updated the final message to say "dbt build" (no --target flag)

DEVCONTAINER CHANGES:
- Created separate Dockerfiles for DuckDB and BigQuery
- DuckDB: Only dbt-duckdb, DuckDB CLI, no cloud tools
- BigQuery: Only dbt-bigquery, Google Cloud CLI, GCS packages
- Removed the shared Dockerfile
- Updated devcontainer.json to reference the local Dockerfiles

BENEFITS:
- Students don't need to remember --target prod everywhere
- No confusion about which schema has what data
- Simpler mental model: main = raw, dev = dbt models
- Faster Docker builds with environment-specific dependencies

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
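A quick sanity check of the simplified layout (database filename assumed; the schema and table names are from this commit):

```bash
duckdb homework.duckdb -c "
SELECT count(*) FROM main.green_tripdata;  -- raw data lives in main
SELECT count(*) FROM dev.fct_trips;        -- dbt builds into dev
"
```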
REMOVED COMPLETE SQL SOLUTIONS:
- Part 6 previously provided complete stg_fhv_tripdata.sql code
- Part 6 previously provided complete dim_fhv_trips.sql code
- These are exactly what students need to create for Question 7

REPLACED WITH GUIDANCE ONLY:
- High-level steps describing what to do
- Hints about which columns to include
- References to similar models for patterns
- Key SQL functions to use (EXTRACT, TIMESTAMP_DIFF)
- NO complete working SQL code

VERIFIED NO OTHER LEAKS:
- Questions 5, 6, 7 sections provide hints only
- SQL examples show function usage, not complete solutions
- No specific homework answers (numbers, locations) in the guides
- Base models (stg_*, int_*, fct_trips) are course material, not homework

Both DuckDB and BigQuery guides updated consistently.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
ISSUE: When we changed the Dockerfile build context from '..' to '.', the COPY command now copies from .devcontainer/duckdb/ to /opt/devcontainer/ instead of copying the entire .devcontainer/ directory. This changed the script paths:
- OLD: /opt/devcontainer/duckdb/scripts/postCreate.sh
- NEW: /opt/devcontainer/scripts/postCreate.sh

RESULT: postStartCommand was pointing to the wrong path, so the postCreate script never ran, leaving /home/vscode/homework empty.

FIX: Updated postStartCommand in both devcontainer.json files to use the correct path: /opt/devcontainer/scripts/postCreate.sh

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
ISSUE:
Codespace creation failed with error:
"Command failed: docker inspect --type image
mcr.microsoft.com/devcontainers/python:3.11-bookworm"
The tag '3.11-bookworm' doesn't exist in MCR.
CORRECT TAG FORMAT:
Microsoft's devcontainer Python images use format: {version}-{python}-{os}
Example: 1-3.11-bookworm (version 1, Python 3.11, Debian Bookworm)
FIX:
Changed FROM statement in both Dockerfiles:
- FROM mcr.microsoft.com/devcontainers/python:3.11-bookworm
+ FROM mcr.microsoft.com/devcontainers/python:1-3.11-bookworm
This uses the latest major version 1.x of the devcontainer image.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
ISSUE: After changing the build context from '..' to '.', the file structure in /opt/devcontainer changed:
- OLD: /opt/devcontainer/duckdb/scripts/, /opt/devcontainer/bigquery/scripts/
- NEW: /opt/devcontainer/scripts/

Additionally, common_setup.sh was in the parent .devcontainer/scripts/, which is outside the build context, so it wasn't copied into the images.

RESULT: The postCreate script failed silently because:
1. common_setup.sh wasn't in the image
2. All file paths were wrong (they still used the /opt/devcontainer/duckdb/ prefix)

FIX:
1. Copied the shared scripts (common_setup.sh, settings.json.template) into each variant's scripts/ directory
2. Updated all file paths in both postCreate.sh scripts:
   - /opt/devcontainer/duckdb/dbt/ → /opt/devcontainer/dbt/
   - /opt/devcontainer/duckdb/scripts/ → /opt/devcontainer/scripts/
   - /opt/devcontainer/duckdb/setup_guide.md → /opt/devcontainer/setup_guide.md
   (Same for the BigQuery variant)

Now setup will work correctly with the new build context structure.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
ISSUE: common_setup.sh was falling back to the 2023 homework.md, which contained complete solutions at the bottom:
- Question 1: 61648442
- Question 2: 89.9/10.1
- Question 3: 43244696
- Question 4: 22998722
- Question 5: January

FIX: Copied the 2025 cohort homework.md to 04-analytics-engineering/HOMEWORK.md, which only says "Solution - To be published after deadline".

Now students get the homework questions without seeing the answers!

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove extensive homework guidance from both the DuckDB and BigQuery setup guides to prevent solution leaks.

Setup guides now focus on:
- Environment setup and verification
- Basic dbt usage
- Troubleshooting common issues
- Reference to HOMEWORK.md for the actual homework questions

Removed sections:
- Part 4: Homework Questions Guide (detailed hints)
- Part 5: Creating New Models for Homework (SQL templates)
- Part 6: Working with FHV Data (step-by-step instructions)

Replaced with a simple "Next Steps" section that refers students to HOMEWORK.md for homework questions and instructions.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replace the 2025 cohort homework with the 2026 cohort homework from DataTalksClub/data-engineering-zoomcamp to ensure students have the correct and current homework questions.

Source: https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/cohorts/2026/04-analytics-engineering/homework.md

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Change FROM mcr.microsoft.com/devcontainers/python:1-3.11-bookworm to FROM mcr.microsoft.com/devcontainers/python:3.11-bookworm.

The 1-3.11-bookworm tag format is invalid and causes Docker image pull failures in GitHub Codespaces. Using the standard 3.11-bookworm tag instead.

Fixes container creation error:
"Command failed: docker inspect --type image mcr.microsoft.com/devcontainers/python:1-3.11-bookworm"

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Switch from mcr.microsoft.com/devcontainers/python to python:3.11-slim-bookworm to avoid Docker registry issues in GitHub Codespaces.

Changes:
- Use the official python:3.11-slim-bookworm as the base image
- Manually create the vscode user with sudo privileges
- Maintains all functionality and compatibility

This resolves the persistent container creation errors:
"Command failed: docker inspect --type image mcr.microsoft.com/devcontainers/python:*"

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
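The user-creation step, sketched as the shell commands a Dockerfile RUN instruction might execute (exact flags are an assumption, not the PR's code):

```bash
useradd --create-home --shell /bin/bash vscode
apt-get update && apt-get install -y sudo
echo 'vscode ALL=(ALL) NOPASSWD:ALL' > /etc/sudoers.d/vscode
chmod 0440 /etc/sudoers.d/vscode   # sudoers.d files must not be world-writable
```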
Remove BigQuery codespace references from the main documentation to simplify the student experience.

Updates:
- Main README: Remove the BigQuery option, keep only DuckDB
- 04-analytics-engineering/README: Focus prerequisites on DuckDB
- Simplify the setup section to a single DuckDB option
- Add a note that homework uses DuckDB (videos show both for learning)

BigQuery codespace files remain in the repo but are not promoted in student-facing documentation.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove the BigQuery codespace option from the setup documentation, leaving only the DuckDB option. This completes the removal of BigQuery codespace references from all student-facing docs.

Files still containing BigQuery setup info (but not linked):
- .devcontainer/bigquery/ (kept for potential future use)
- setup/cloud_setup.md (BigQuery-specific guide, not linked)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove the BigQuery devcontainer and setup files as we're focusing solely on DuckDB for the homework environment.

Removed files:
- .devcontainer/bigquery/ (all files)
- 04-analytics-engineering/setup/cloud_setup.md

Backup available in branch: backup/bigquery-devcontainer

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Overview
This PR adds GitHub Codespaces devcontainer configurations for the Analytics Engineering module (Module 4), providing isolated workspace environments optimized for dbt development.
Features
Dual Environment Setup
Key Improvements
Files Added
- `.devcontainer/` - Devcontainer configurations for DuckDB and BigQuery
- `scripts/deploy_codespaces.sh` - Non-interactive Codespace deployment
- `scripts/verify_codespace.sh` - Health check and verification script

Documentation
Testing
Technical Notes
`/opt/venv` with dbt pre-installed
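For example, inside the codespace (a standard venv layout at the path above is assumed):

```bash
source /opt/venv/bin/activate
dbt --version
```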