
Commit ae8f064

Reorganize example 2 and 3 data scripts and add untracked to .gitignore
1 parent 7ce2129 commit ae8f064

3 files changed: 27 additions & 183 deletions


.gitignore

Lines changed: 21 additions & 38 deletions
@@ -170,45 +170,28 @@ cython_debug/
 #.idea/
 *cache/
 
-/examples/1_Boulder/data/gis/mt_sid_boulder_gfid.cpg
-/examples/1_Boulder/data/gis/mt_sid_boulder_gfid.dbf
-/examples/1_Boulder/data/gis/mt_sid_boulder_gfid.json
-/examples/1_Boulder/data/gis/mt_sid_boulder_gfid.prj
-/examples/1_Boulder/data/gis/mt_sid_boulder_gfid.shp
-/examples/1_Boulder/data/gis/mt_sid_boulder_gfid.shx
-/examples/1_Boulder/data/plot_timeseries/
-/examples/1_Boulder/data/landsat/
-/examples/1_Boulder/data/met_timeseries/
-/examples/1_Boulder/data/properties/
-/examples/1_Boulder/data/snodas/
-/examples/1_Boulder/data/tutorial_properties.json
-/examples/1_Boulder/data/prepped_input.json
-/examples/2_Fort_Peck/data/gis/flux_fields_gfid.json
-/examples/2_Fort_Peck/data/landsat/
-/examples/2_Fort_Peck/data/met_timeseries/
-/examples/2_Fort_Peck/data/properties/
-/examples/2_Fort_Peck/data/snodas/
-/examples/2_Fort_Peck/data/plot_timeseries/
-/examples/2_Fort_Peck/data/US-FPe_daily_data.csv
-/examples/2_Fort_Peck/data/pestrun/
-/examples/3_Crane/data/gis/flux_fields_gfid.json
-/examples/3_Crane/data/landsat/
-/examples/3_Crane/data/met_timeseries/
-/examples/3_Crane/data/properties/
-/examples/3_Crane/data/snodas/
-/examples/3_Crane/data/plot_timeseries/
-/examples/3_Crane/data/US-FPe_daily_data.csv
-/examples/3_Crane/data/pestrun/
-/examples/4_Flux_Network/data/snodas/
-/examples/4_Flux_Network/data/properties/
-/examples/4_Flux_Network/data/met_timeseries/
-/examples/4_Flux_Network/data/landsat/
-/examples/4_Flux_Network/data/plot_timeseries/
-/examples/4_Flux_Network/data/bias_correction_tif/
-/examples/4_Flux_Network/data/pestrun/
+# Example data directories: ignore everything except tracked inputs
+/examples/1_Boulder/data/*
+!/examples/1_Boulder/data/gis/
+!/examples/1_Boulder/data/bias_correction_tif/
+
+/examples/2_Fort_Peck/data/*
+!/examples/2_Fort_Peck/data/gis/
+!/examples/2_Fort_Peck/data/prepped_input.zip
+!/examples/2_Fort_Peck/data/US-FPe_daily_data.zip
+
+/examples/3_Crane/data/*
+!/examples/3_Crane/data/gis/
+!/examples/3_Crane/data/prepped_input.zip
+
+/examples/4_Flux_Network/data/*
+!/examples/4_Flux_Network/data/gis/
+
+# Generated shapefiles and provenance in gis dirs
+/examples/*/data/gis/*_gfid*
+/examples/*/data/gis/shapefile_provenance.txt
+
 /examples/logs/
 
-examples/6_Flux_International/data/landsat/
-examples/6_Flux_International/data/ecostress/
 # Diagnostic scratch work
 examples/diagnostics/
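The new rules use git's whitelist idiom: ignore everything directly under each `data/` directory with `/*`, re-include specific entries with `!` negation, then re-ignore generated `*_gfid*` artifacts inside the re-included `gis/` directories. A minimal sketch of how the pattern behaves, probed with `git check-ignore` in a throwaway repository (requires git on PATH; the file paths queried are illustrative, not real repo contents):

```python
import os
import subprocess
import tempfile

# A subset of the rules added in this commit
rules = """\
/examples/3_Crane/data/*
!/examples/3_Crane/data/gis/
!/examples/3_Crane/data/prepped_input.zip
/examples/*/data/gis/*_gfid*
"""

def ignored(repo, path):
    """True if git would ignore `path` under the repo's .gitignore.

    `git check-ignore -q` exits 0 when the path is ignored, 1 when not.
    """
    return subprocess.run(["git", "-C", repo, "check-ignore", "-q", path]).returncode == 0

repo = tempfile.mkdtemp()
subprocess.run(["git", "init", "-q", repo], check=True)
with open(os.path.join(repo, ".gitignore"), "w") as f:
    f.write(rules)

# The wildcard ignores everything under data/ ...
assert ignored(repo, "examples/3_Crane/data/snodas/foo.csv")
# ... the negations re-include gis/ and the zipped inputs ...
assert not ignored(repo, "examples/3_Crane/data/gis/flux_fields.shp")
assert not ignored(repo, "examples/3_Crane/data/prepped_input.zip")
# ... and generated *_gfid* files inside gis/ are ignored again.
assert ignored(repo, "examples/3_Crane/data/gis/flux_fields_gfid.json")
```

Rule order matters here: negations must follow the broad `/*` pattern, and git cannot re-include a file whose parent directory is excluded, which is why the rules negate `data/gis/` itself before the `*_gfid*` patterns narrow it back down.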

examples/2_Fort_Peck/01_uncalibrated_model.ipynb

Lines changed: 3 additions & 80 deletions
@@ -4,51 +4,7 @@
 "cell_type": "markdown",
 "id": "cell-intro",
 "metadata": {},
-"source": [
-"# Calibration Tutorial - Fort Peck, MT - Unirrigated Flux Plot\n",
-"\n",
-"## Step 1: Uncalibrated Model Run\n",
-"\n",
-"This tutorial focuses on calibrating SWIM-RS for a single unirrigated plot: a 3-pixel buffer around FluxNet's US-FPe eddy covariance station from John Volk's Flux ET benchmark dataset. The flux station provides independent observations of both meteorology and ET flux, allowing us to validate our model.\n",
-"\n",
-"This notebook demonstrates:\n",
-"1. Loading pre-built model input data from a SwimContainer\n",
-"2. Running the uncalibrated SWIM model\n",
-"3. Comparing model output with flux tower observations\n",
-"\n",
-"**Reference:** This example is based on John Volk's flux footprint study:\n",
-"- Paper: https://www.sciencedirect.com/science/article/pii/S0168192323000011\n",
-"- Data: https://www.sciencedirect.com/science/article/pii/S2352340923003931\n",
-"\n",
-"---\n",
-"\n",
-"### Data Pipeline\n",
-"\n",
-"**Input Data:** The `data/2_Fort_Peck.swim/` container stores pre-computed input data, so you can get started right away. If you want to build or rebuild the data for this example, we have provided scripts for reproduction:\n",
-"\n",
-"1. **Extract data** from Earth Engine and GridMET:\n",
-" ```bash\n",
-" cd data/\n",
-" python extract_data.py # Extract US-FPe only (default)\n",
-" python extract_data.py --help # See all options\n",
-" ```\n",
-"\n",
-"2. **Sync from bucket** after EE tasks complete:\n",
-" ```bash\n",
-" gsutil -m rsync -r gs://wudr/2_Fort_Peck/ ./data/\n",
-" ```\n",
-"\n",
-"3. **Build model inputs** using the container:\n",
-" ```bash\n",
-" cd data/\n",
-" python build_inputs.py # Build container\n",
-" python build_inputs.py --rebuild # Force rebuild from scratch\n",
-" ```\n",
-"\n",
-"The container (`data/2_Fort_Peck.swim/`) stores all ingested data with provenance tracking.\n",
-"\n",
-"---"
-]
+"source": "# Calibration Tutorial - Fort Peck, MT - Unirrigated Flux Plot\n\n## Step 1: Uncalibrated Model Run\n\nThis tutorial focuses on calibrating SWIM-RS for a single unirrigated plot: a 3-pixel buffer around FluxNet's US-FPe eddy covariance station from John Volk's Flux ET benchmark dataset. The flux station provides independent observations of both meteorology and ET flux, allowing us to validate our model.\n\nThis notebook demonstrates:\n1. Loading pre-built model input data from a SwimContainer\n2. Running the uncalibrated SWIM model\n3. Comparing model output with flux tower observations\n\n**Reference:** This example is based on John Volk's flux footprint study:\n- Paper: https://www.sciencedirect.com/science/article/pii/S0168192323000011\n- Data: https://www.sciencedirect.com/science/article/pii/S2352340923003931\n\n---\n\n### Data Pipeline\n\n**Input Data:** The `data/2_Fort_Peck.swim/` container stores pre-computed input data, so you can get started right away. If you want to build or rebuild the data for this example, we have provided scripts for reproduction:\n\n1. **Extract data** from Earth Engine and GridMET:\n ```bash\n python extract_data.py # Extract US-FPe only (default)\n python extract_data.py --help # See all options\n ```\n\n2. **Sync from bucket** after EE tasks complete:\n ```bash\n gsutil -m rsync -r gs://wudr/2_Fort_Peck/ ./data/\n ```\n\n3. **Build model inputs** using the container:\n ```bash\n python build_inputs.py # Build container\n python build_inputs.py --rebuild # Force rebuild from scratch\n ```\n\nThe container (`data/2_Fort_Peck.swim/`) stores all ingested data with provenance tracking.\n\n---"
 },
 {
 "cell_type": "code",
@@ -166,40 +122,7 @@
 }
 },
 "outputs": [],
-"source": [
-"# Example: Query data directly from the SwimContainer\n",
-"\n",
-"container_path = os.path.join(data, \"2_Fort_Peck.swim\")\n",
-"\n",
-"if os.path.exists(container_path):\n",
-" container = SwimContainer.open(container_path, mode=\"r\")\n",
-"\n",
-" # List available fields\n",
-" print(f\"Fields in container: {container.field_uids}\")\n",
-"\n",
-" # Get all time series for a single field using field_timeseries\n",
-" ts_df = container.query.field_timeseries(\"US-FPe\")\n",
-" print(f\"\\nTime series shape: {ts_df.shape}\")\n",
-" print(f\"Variables: {list(ts_df.columns)[:10]}...\")\n",
-"\n",
-" # Query specific data using dataframe with zarr paths\n",
-" # Path structure: remote_sensing/{type}/{instrument}/{model}/{mask}\n",
-" ndvi_df = container.query.dataframe(\"remote_sensing/ndvi/landsat/inv_irr\", fields=[\"US-FPe\"])\n",
-" print(f\"\\nNDVI observations: {ndvi_df.notna().sum().values[0]}\")\n",
-"\n",
-" etf_df = container.query.dataframe(\n",
-" \"remote_sensing/etf/landsat/ssebop/inv_irr\", fields=[\"US-FPe\"]\n",
-" )\n",
-" print(f\"ETf observations: {etf_df.notna().sum().values[0]}\")\n",
-"\n",
-" # Show container status\n",
-" print(\"\\n\" + container.query.status())\n",
-"\n",
-" container.close()\n",
-"else:\n",
-" print(f\"Container not found at {container_path}\")\n",
-" print(\"Run: cd data && python build_inputs.py --rebuild\")"
-]
+"source": "# Example: Query data directly from the SwimContainer\n\ncontainer_path = os.path.join(data, \"2_Fort_Peck.swim\")\n\nif os.path.exists(container_path):\n    container = SwimContainer.open(container_path, mode=\"r\")\n\n    # List available fields\n    print(f\"Fields in container: {container.field_uids}\")\n\n    # Get all time series for a single field using field_timeseries\n    ts_df = container.query.field_timeseries(\"US-FPe\")\n    print(f\"\\nTime series shape: {ts_df.shape}\")\n    print(f\"Variables: {list(ts_df.columns)[:10]}...\")\n\n    # Query specific data using dataframe with zarr paths\n    # Path structure: remote_sensing/{type}/{instrument}/{model}/{mask}\n    ndvi_df = container.query.dataframe(\"remote_sensing/ndvi/landsat/inv_irr\", fields=[\"US-FPe\"])\n    print(f\"\\nNDVI observations: {ndvi_df.notna().sum().values[0]}\")\n\n    etf_df = container.query.dataframe(\n        \"remote_sensing/etf/landsat/ssebop/inv_irr\", fields=[\"US-FPe\"]\n    )\n    print(f\"ETf observations: {etf_df.notna().sum().values[0]}\")\n\n    # Show container status\n    print(\"\\n\" + container.query.status())\n\n    container.close()\nelse:\n    print(f\"Container not found at {container_path}\")\n    print(\"Run: python build_inputs.py --rebuild\")"
 },
 {
 "cell_type": "markdown",
@@ -936,4 +859,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 5
-}
+}
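Both notebook diffs in this commit collapse each cell's `source` from a list of line strings to a single JSON string. That is a representation change only: nbformat 4 allows `source` to be either form, and readers recover the cell text from the list form by plain concatenation (each element carries its own trailing `\n`). A stdlib-only sketch of the equivalence, using stand-in cell text rather than the real notebook content:

```python
import json

# Stand-in for a markdown cell's text (not the real notebook content)
source_as_list = ["# Title\n", "\n", "Body text"]
source_as_str = "# Title\n\nBody text"

# nbformat readers recover the cell text by concatenating the list form,
# so the two encodings carry identical content...
assert "".join(source_as_list) == source_as_str

# ...and the single-string form is the smaller JSON payload here.
assert len(json.dumps(source_as_str)) < len(json.dumps(source_as_list))
```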

examples/3_Crane/01_uncalibrated_model.ipynb

Lines changed: 3 additions & 65 deletions
@@ -4,38 +4,7 @@
 "cell_type": "markdown",
 "id": "cell-intro",
 "metadata": {},
-"source": [
-"# Calibration Tutorial - Crane, OR - Irrigated Flux Plot\n",
-"\n",
-"## Step 1: Uncalibrated Model Run\n",
-"\n",
-"This tutorial focuses on calibrating SWIM-RS for a single irrigated alfalfa plot at the S2 flux station in Crane, Oregon. Unlike the unirrigated Fort Peck example, this site is actively irrigated.\n",
-"\n",
-"This notebook demonstrates:\n",
-"1. Loading pre-built model input data from a SwimContainer\n",
-"2. Running the uncalibrated SWIM model\n",
-"3. Comparing model output with the open source members of the OpenET ensemble (PT-JPL, SIMS, SSEBop, geeSEBAL)\n",
-"4. Validation against flux tower observations using multiple metrics (R², r, RMSE, bias)\n",
-"\n",
-"### Data Pipeline\n",
-"\n",
-"**Input Data:** The `data/3_Crane.swim/` container stores pre-computed input data.\n",
-"\n",
-"The full data workflow uses two scripts and can be re-run if needed:\n",
-"\n",
-"1. **`extract_data.py`** - Extracts raw data from Earth Engine and GridMET to CSV/parquet files\n",
-"2. **`build_inputs.py`** - Processes extracted data through SwimContainer\n",
-"\n",
-"To reproduce the input data from scratch:\n",
-"\n",
-"```bash\n",
-"cd data\n",
-"python extract_data.py # Extract from EE/GridMET (requires authentication)\n",
-"python build_inputs.py # Build container\n",
-"```\n",
-"\n",
-"See `data/extract_data.py` for extraction options and `data/build_inputs.py` for container workflow details."
-]
+"source": "# Calibration Tutorial - Crane, OR - Irrigated Flux Plot\n\n## Step 1: Uncalibrated Model Run\n\nThis tutorial focuses on calibrating SWIM-RS for a single irrigated alfalfa plot at the S2 flux station in Crane, Oregon. Unlike the unirrigated Fort Peck example, this site is actively irrigated.\n\nThis notebook demonstrates:\n1. Loading pre-built model input data from a SwimContainer\n2. Running the uncalibrated SWIM model\n3. Comparing model output with the open source members of the OpenET ensemble (PT-JPL, SIMS, SSEBop, geeSEBAL)\n4. Validation against flux tower observations using multiple metrics (R², r, RMSE, bias)\n\n### Data Pipeline\n\n**Input Data:** The `data/3_Crane.swim/` container stores pre-computed input data.\n\nThe full data workflow uses two scripts and can be re-run if needed:\n\n1. **`extract_data.py`** - Extracts raw data from Earth Engine and GridMET to CSV/parquet files\n2. **`build_inputs.py`** - Processes extracted data through SwimContainer\n\nTo reproduce the input data from scratch:\n\n```bash\npython extract_data.py # Extract from EE/GridMET (requires authentication)\npython build_inputs.py # Build container\n```\n\nSee `extract_data.py` for extraction options and `build_inputs.py` for container workflow details."
 },
 {
 "cell_type": "code",
@@ -255,38 +224,7 @@
 }
 },
 "outputs": [],
-"source": [
-"# Query container data (optional - requires build_inputs.py to have been run)\n",
-"\n",
-"container_path = os.path.join(data, \"3_Crane.swim\")\n",
-"\n",
-"if os.path.exists(container_path):\n",
-" container = SwimContainer.open(container_path, mode=\"r\")\n",
-"\n",
-" # List available fields\n",
-" print(f\"Fields in container: {container.field_uids}\")\n",
-"\n",
-" # Get all time series for a single field using field_timeseries\n",
-" ts_df = container.query.field_timeseries(\"S2\")\n",
-" print(f\"\\nTime series shape: {ts_df.shape}\")\n",
-" print(f\"Variables: {list(ts_df.columns)[:10]}...\")\n",
-"\n",
-" # Query specific data using dataframe with zarr paths\n",
-" # Path structure: remote_sensing/{type}/{instrument}/{model}/{mask}\n",
-" ndvi_df = container.query.dataframe(\"remote_sensing/ndvi/landsat/irr\", fields=[\"S2\"])\n",
-" print(f\"\\nNDVI observations: {ndvi_df.notna().sum().values[0]}\")\n",
-"\n",
-" etf_df = container.query.dataframe(\"remote_sensing/etf/landsat/ssebop/irr\", fields=[\"S2\"])\n",
-" print(f\"ETf observations: {etf_df.notna().sum().values[0]}\")\n",
-"\n",
-" # Show container status\n",
-" print(\"\\n\" + container.query.status())\n",
-"\n",
-" container.close()\n",
-"else:\n",
-" print(f\"Container not found at {container_path}\")\n",
-" print(\"Run: cd data && python build_inputs.py\")"
-]
+"source": "# Query container data (optional - requires build_inputs.py to have been run)\n\ncontainer_path = os.path.join(data, \"3_Crane.swim\")\n\nif os.path.exists(container_path):\n    container = SwimContainer.open(container_path, mode=\"r\")\n\n    # List available fields\n    print(f\"Fields in container: {container.field_uids}\")\n\n    # Get all time series for a single field using field_timeseries\n    ts_df = container.query.field_timeseries(\"S2\")\n    print(f\"\\nTime series shape: {ts_df.shape}\")\n    print(f\"Variables: {list(ts_df.columns)[:10]}...\")\n\n    # Query specific data using dataframe with zarr paths\n    # Path structure: remote_sensing/{type}/{instrument}/{model}/{mask}\n    ndvi_df = container.query.dataframe(\"remote_sensing/ndvi/landsat/irr\", fields=[\"S2\"])\n    print(f\"\\nNDVI observations: {ndvi_df.notna().sum().values[0]}\")\n\n    etf_df = container.query.dataframe(\"remote_sensing/etf/landsat/ssebop/irr\", fields=[\"S2\"])\n    print(f\"ETf observations: {etf_df.notna().sum().values[0]}\")\n\n    # Show container status\n    print(\"\\n\" + container.query.status())\n\n    container.close()\nelse:\n    print(f\"Container not found at {container_path}\")\n    print(\"Run: python build_inputs.py\")"
 },
 {
 "cell_type": "markdown",
@@ -964,4 +902,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 5
-}
+}
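The query cells in both notebooks address container data with slash-delimited zarr paths following `remote_sensing/{type}/{instrument}/{model}/{mask}`, where the `{model}` segment is absent for instrument-level variables like NDVI. A hypothetical helper (not part of the SwimContainer API) that splits such paths, just to make the convention concrete:

```python
def parse_rs_path(path: str) -> dict:
    """Split a remote-sensing zarr path into its named segments.

    Convention (from the notebook comments):
        remote_sensing/{type}/{instrument}/{model}/{mask}
    The {model} segment is omitted for variables like ndvi.
    """
    parts = path.split("/")
    if parts[0] != "remote_sensing":
        raise ValueError(f"not a remote_sensing path: {path}")
    if len(parts) == 5:
        keys = ("group", "type", "instrument", "model", "mask")
    elif len(parts) == 4:
        keys = ("group", "type", "instrument", "mask")
    else:
        raise ValueError(f"unexpected path depth: {path}")
    return dict(zip(keys, parts))

# Paths taken from the notebook cells above
print(parse_rs_path("remote_sensing/etf/landsat/ssebop/irr")["model"])  # ssebop
print(parse_rs_path("remote_sensing/ndvi/landsat/inv_irr")["mask"])     # inv_irr
```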
