feat: add first-run hardening — download scripts, data tiers, UX detection

Shane Butler · claude · Shane Butler · commit d79e5e947819 · 2026-02-19T06:47:35.000-08:00
- Add scripts/download-data.sh (sample + full modes, SHA256 verification)
- Add scripts/build-duckdb.sh (Python or CLI, builds local .duckdb)
- Add scripts/setup.sh (venv creation, dependency install, verification)
- Add data/checksums.sha256 for download integrity verification
- Update data/novamart/README.md with tier documentation
- Add Tier 2 data detection to Knowledge Bootstrap skill
- Add MCP + settings detection to First-Run Welcome skill
- Create GitHub Release v1.0.0 with sample (~1MB) and full (~113MB) tarballs

Tests: 182 passed, 1 failed (sample-data referential integrity), 10 warnings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
diff --git a/.claude/skills/first-run-welcome/skill.md b/.claude/skills/first-run-welcome/skill.md
@@ -13,7 +13,7 @@ onboarding flow based on what data is available.
 
 ### Step 1: Detect environment
 
-Check three things:
+Check five things:
 
 1. **User profile exists?** → `.knowledge/user/profile.md`
    - If YES: This is a returning user. Skip this skill entirely.
@@ -27,6 +27,22 @@ Check three things:
    - If YES and not "novamart": User has their own data connected.
    - If NO: No dataset configured yet.
 
+4. **MCP settings configured?** → `.claude/mcp.json`
+   - If MISSING: Show setup hint:
+     ```
+     MCP not configured yet. To connect to MotherDuck:
+       cp .claude/mcp.json.example .claude/mcp.json
+     Then edit .claude/mcp.json and add your MotherDuck token.
+     See setup/mcp-config.md for details.
+     ```
+
+5. **Claude settings configured?** → `.claude/settings.local.json`
+   - If MISSING: Show setup hint:
+     ```
+     Tip: Copy the example settings to allow Marp slide rendering:
+       cp .claude/settings.local.json.example .claude/settings.local.json
+     ```
+
 ### Step 2: Present welcome based on scenario
 
 #### Scenario A: NovaMart present, no user data
diff --git a/.claude/skills/knowledge-bootstrap/skill.md b/.claude/skills/knowledge-bootstrap/skill.md
@@ -23,6 +23,29 @@ If the file is missing or empty:
 - If NovaMart data files exist in `data/novamart/`, set `active_dataset: novamart`
 - Otherwise, prompt: "No active dataset configured. Use `/connect-data` to add one."
 
+### Step 1b: Check for Tier 2 data files
+
+Check if the 5 large Tier 2 CSV files are present in `data/novamart/`:
+- `users.csv`, `orders.csv`, `events.csv`, `sessions.csv`, `support_tickets.csv`
+
+**If any are missing**, display this message before proceeding:
+
+```
+Some data files are not downloaded yet. The repo ships with 8 small
+reference tables but the 5 large analysis tables need to be downloaded.
+
+Run this from the repo root:
+  bash scripts/download-data.sh         # Sample data (~15MB, good for learning)
+  bash scripts/download-data.sh --full  # Full dataset (~200MB compressed)
+
+Missing files: {list of missing files}
+
+You can still query the Tier 1 tables (products, calendar, experiments,
+promotions, memberships, nps_responses, experiment_assignments, order_items).
+```
+
+Continue with bootstrap — don't halt. The system works with partial data.
+
 ### Step 2: Validate dataset brain
 
 Check that `.knowledge/datasets/{active}/` contains:
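Review note on Step 1b: the check amounts to comparing the five expected file names against what is on disk. Below is a minimal Python sketch, assuming the file names listed in the skill text. The function name `missing_tier2_files` is hypothetical and not part of the repo.

```python
from pathlib import Path

# File names come from Step 1b of the skill; the helper itself is illustrative.
TIER2_FILES = [
    "users.csv",
    "orders.csv",
    "events.csv",
    "sessions.csv",
    "support_tickets.csv",
]


def missing_tier2_files(data_dir="data/novamart"):
    """Return the Tier 2 file names absent from data_dir, in the listed order."""
    base = Path(data_dir)
    return [name for name in TIER2_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_tier2_files()
    if missing:
        # Warn but keep going, matching the "don't halt" instruction in the skill.
        print("Missing files: " + ", ".join(missing))
```

Returning a list rather than raising keeps the caller free to continue with partial data, which is what the skill asks for.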
diff --git a/data/checksums.sha256 b/data/checksums.sha256
@@ -0,0 +1,14 @@
+# SHA256 checksums for NovaMart Tier 2 data files
+# Used by scripts/download-data.sh to verify downloads
+# Generated: 2026-02-19
+
+# Individual CSV files (full dataset)
+2af4c0906a287f873f6edfbc6d89bf3be950150013d4842de920abb8c46dc2e5  events.csv
+3680f494b7b0a492264a2f61f0bdb75f24c0566024f6db5a20a889bc773973e9  sessions.csv
+ce358b367a0038e22c179b36733549f3ff2e9d66d9838ad2238a098eab1b0d36  orders.csv
+073dd4e53c18acf29b714bb4946ea6b40f573fb4238bc6575cb8ee04aaa3784e  users.csv
+06cdf487d931ebec65da1a7462e1e0f7a6bbcb0a7a0afabb6dcd3c6b06fea91b  support_tickets.csv
+
+# Release tarballs
+db3a4a6eac849ef9cc67c2d5a80891a7f77b9ab0dbb91af07e9fc753428f33f6  novamart-full.tar.gz
+10448ddd47a2e3e70fa266f0f60f81c0e43a3ee7aa1b7ee4680981b9b9c39bf8  novamart-sample.tar.gz
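Review note on the checksum file: the body of `scripts/download-data.sh` is not shown in this diff, so the following is not the script's implementation. It is a small Python sketch of how a file in this format (comment lines, blank lines, then `digest  filename` pairs) could be parsed and checked. The function names `parse_checksums`, `sha256_of`, and `verify` are hypothetical.

```python
import hashlib
from pathlib import Path


def parse_checksums(text):
    """Parse 'digest  filename' lines into {filename: digest}, skipping comments and blanks."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest, name = line.split(None, 1)
        entries[name.strip()] = digest.lower()
    return entries


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large CSV is never read into memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(checksum_file, data_dir):
    """Return {filename: bool} for every listed file that exists in data_dir."""
    expected = parse_checksums(Path(checksum_file).read_text())
    results = {}
    for name, digest in expected.items():
        target = Path(data_dir) / name
        if target.is_file():
            results[name] = sha256_of(target) == digest
    return results
```

Files listed in the checksum file but absent from the directory are skipped rather than reported as failures, so the same file can cover both the individual CSVs and the release tarballs.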
diff --git a/data/novamart/README.md b/data/novamart/README.md
@@ -381,9 +381,63 @@ ORDER BY 2 DESC;
 
 ---
 
+## Data Tiers
+
+The CSV data is split into two tiers to keep the repo small while still providing the full data when needed. A third tier, the DuckDB database, is generated locally.
+
+### Tier 1: Shipped with the repo (~4MB)
+
+These 8 small reference/dimension tables are included in git:
+
+| File | Rows | Size | Description |
+|------|------|------|-------------|
+| `calendar.csv` | 366 | 13K | 2024 calendar with holidays |
+| `experiments.csv` | 2 | <1K | A/B test definitions |
+| `promotions.csv` | 5 | <1K | Promotion definitions |
+| `products.csv` | 500 | 29K | Product catalog |
+| `nps_responses.csv` | ~8K | 335K | NPS survey responses |
+| `memberships.csv` | ~12K | 392K | Plus membership state changes |
+| `experiment_assignments.csv` | ~20K | 861K | A/B test assignments |
+| `order_items.csv` | ~120K | 2.4M | Order line items |
+
+### Tier 2: Downloaded separately (~690MB)
+
+These 5 large tables must be downloaded after cloning:
+
+| File | Rows | Size | Description |
+|------|------|------|-------------|
+| `users.csv` | ~50K | 3M | User dimension table |
+| `orders.csv` | ~50K | 4.6M | Order records |
+| `support_tickets.csv` | ~25K | 2.2M | Customer support tickets |
+| `sessions.csv` | ~1.4M | 130M | Session summaries |
+| `events.csv` | ~6.5M | 551M | Behavioral events (largest) |
+
+**Download Tier 2 data:**
+
+```bash
+# Sample data (10K-row subsets, ~15MB) — good for learning
+bash scripts/download-data.sh
+
+# Full dataset (~200MB compressed, ~690MB uncompressed)
+bash scripts/download-data.sh --full
+```
+
+### Tier 3: Generated locally
+
+| File | Description |
+|------|-------------|
+| `novamart.duckdb` | Pre-built DuckDB database (generated by `scripts/build-duckdb.sh`) |
+
+```bash
+# Build after downloading CSVs
+bash scripts/build-duckdb.sh
+```
+
+---
+
 ## File Inventory
 
-This directory contains 14 CSV files and 1 DuckDB database file:
+The complete dataset contains 13 CSV files:
 
 | File | Description |
 |------|-------------|
@@ -400,11 +454,10 @@ This directory contains 14 CSV files and 1 DuckDB database file:
 | `nps_responses.csv` | NPS survey responses |
 | `experiment_assignments.csv` | A/B test user assignments |
 | `calendar.csv` | 2024 calendar with holidays and day-of-week attributes |
-| `novamart_practice.duckdb` | Pre-built DuckDB database with all tables loaded |
 
-To query the DuckDB file directly:
+To query the DuckDB database (after building it with `scripts/build-duckdb.sh`):
 ```python
 import duckdb
-con = duckdb.connect('data/novamart/novamart_practice.duckdb', read_only=True)
+con = duckdb.connect('data/novamart/novamart.duckdb', read_only=True)
 con.sql("SELECT COUNT(*) FROM users").show()
 ```
diff --git a/scripts/build-duckdb.sh b/scripts/build-duckdb.sh
@@ -0,0 +1,118 @@
+#!/usr/bin/env bash
+# build-duckdb.sh — Build a local DuckDB database from NovaMart CSV files
+#
+# Usage:
+#   bash scripts/build-duckdb.sh            # Build from CSVs in data/novamart/
+#   bash scripts/build-duckdb.sh --help     # Show this help
+#
+# Creates data/novamart/novamart.duckdb with all tables loaded.
+# This is optional — Claude Code can query CSVs directly via DuckDB's
+# read_csv() function, but a pre-built .duckdb file is faster for
+# repeated queries.
+
+set -euo pipefail
+
+DATA_DIR="data/novamart"
+DB_FILE="${DATA_DIR}/novamart.duckdb"
+
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+NC='\033[0m'
+
+usage() {
+    echo "Usage: bash scripts/build-duckdb.sh [--help]"
+    echo ""
+    echo "Builds ${DB_FILE} from CSV files in ${DATA_DIR}/"
+    echo ""
+    echo "Prerequisites:"
+    echo "  - Python 3.9+ with duckdb package: pip install duckdb"
+    echo "  - OR DuckDB CLI: brew install duckdb (macOS)"
+}
+
+# --- Main ---
+
+if [[ "${1:-}" == "--help" ]] || [[ "${1:-}" == "-h" ]]; then
+    usage
+    exit 0
+fi
+
+# Ensure we're in the repo root
+if [ ! -f "CLAUDE.md" ]; then
+    echo -e "${RED}Error: Run this script from the AI Analyst repo root.${NC}"
+    echo "  cd ~/Desktop/ai-analyst && bash scripts/build-duckdb.sh"
+    exit 1
+fi
+
+# Check for CSV files
+if [ ! -d "$DATA_DIR" ] || [ -z "$(ls "$DATA_DIR"/*.csv 2>/dev/null)" ]; then
+    echo -e "${RED}Error: No CSV files found in ${DATA_DIR}/${NC}"
+    echo ""
+    echo "Run the download script first:"
+    echo "  bash scripts/download-data.sh"
+    exit 1
+fi
+
+# Remove existing DB if present
+if [ -f "$DB_FILE" ]; then
+    echo -e "${YELLOW}Removing existing ${DB_FILE}${NC}"
+    rm -f "$DB_FILE"
+fi
+
+echo "Building DuckDB database from CSV files..."
+echo ""
+
+# Try Python+duckdb first, fall back to DuckDB CLI
+if python3 -c "import duckdb" 2>/dev/null; then
+    python3 << 'PYEOF'
+import duckdb
+import os
+import glob
+
+data_dir = "data/novamart"
+db_file = os.path.join(data_dir, "novamart.duckdb")
+
+con = duckdb.connect(db_file)
+
+csv_files = sorted(glob.glob(os.path.join(data_dir, "*.csv")))
+loaded = 0
+
+for csv_path in csv_files:
+    table_name = os.path.splitext(os.path.basename(csv_path))[0]
+    print(f"  Loading {table_name}...", end="", flush=True)
+    con.execute(f"CREATE TABLE \"{table_name}\" AS SELECT * FROM read_csv_auto('{csv_path}')")
+    row_count = con.execute(f"SELECT COUNT(*) FROM \"{table_name}\"").fetchone()[0]
+    print(f" {row_count:,} rows")
+    loaded += 1
+
+con.close()
+print(f"\nLoaded {loaded} tables into {db_file}")
+PYEOF
+
+elif command -v duckdb &> /dev/null; then
+    for csv_file in "$DATA_DIR"/*.csv; do
+        table_name=$(basename "$csv_file" .csv)
+        echo "  Loading ${table_name}..."
+        duckdb "$DB_FILE" "CREATE TABLE \"${table_name}\" AS SELECT * FROM read_csv_auto('${csv_file}');"
+    done
+    echo ""
+    echo "Tables loaded into ${DB_FILE}"
+
+else
+    echo -e "${RED}Error: Neither Python duckdb package nor DuckDB CLI found.${NC}"
+    echo ""
+    echo "Install one of:"
+    echo "  pip install duckdb          # Python package"
+    echo "  brew install duckdb         # macOS CLI"
+    echo "  apt install duckdb          # Linux CLI"
+    exit 1
+fi
+
+# Report file size
+if [ -f "$DB_FILE" ]; then
+    size=$(ls -lh "$DB_FILE" | awk '{print $5}')
+    echo ""
+    echo -e "${GREEN}DuckDB database ready: ${DB_FILE} (${size})${NC}"
+    echo ""
+    echo "Claude Code will automatically use this database for faster queries."
+fi
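Review note on the table-loading loop: both branches derive the table name from the CSV file stem and interpolate it into the SQL string. Because the glob picks up every `*.csv` in the directory, a stray file whose stem is not a plain identifier (for example a hypothetical `my-data.csv`) would produce an invalid statement. One way to guard against that, sketched in Python with hypothetical helper names `table_name_for` and `create_table_sql`:

```python
import os
import re

# Plain SQL identifier: a letter or underscore, then letters, digits, underscores.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def table_name_for(csv_path):
    """Derive a table name from a CSV file name, rejecting stems that are not plain identifiers."""
    stem = os.path.splitext(os.path.basename(csv_path))[0]
    if not _IDENT.match(stem):
        raise ValueError(f"unsafe table name derived from {csv_path!r}: {stem!r}")
    return stem


def create_table_sql(csv_path):
    """Build the CREATE TABLE statement with a quoted identifier and an escaped path literal."""
    table = table_name_for(csv_path)
    escaped_path = csv_path.replace("'", "''")  # standard SQL single-quote escaping
    return f"CREATE TABLE \"{table}\" AS SELECT * FROM read_csv_auto('{escaped_path}')"
```

The returned string could be passed to `con.execute(...)` in place of the inline f-string. Failing fast on an unexpected file name is a design choice here; skipping such files with a warning would be equally reasonable.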
diff --git a/scripts/download-data.sh b/scripts/download-data.sh
diff --git a/scripts/setup.sh b/scripts/setup.sh