fix: Exclude zenodo-upload-nationwide from package build (#42)

ericscheier · web-flow · commit df56efcf4ba7 · 2025-11-20T04:03:27.000-05:00
* fix: Remove all private workflows when publishing to public

Updated publish-to-public workflow to remove all private-only
workflow files (auto-release.yml, auto-release.yaml, auto-tag-on-version-bump.yml)
in addition to publish-to-public.yml itself.

This prevents these workflows from running in the public repo where
they would fail due to missing secrets and environments.

* fix: Exclude zenodo-upload-nationwide from package build

Fixes CRAN WARNING: "Files not of a type allowed in a 'data' directory"

The data/ directory in R packages is reserved for data sets that data()
can load (for example .rda/.RData files). The zenodo-upload-nationwide
subdirectory holds CSV/gzip files staged for Zenodo deployment, is not
package data, and should not be included in the package build.

Changes:
- Added ^data/zenodo-upload-nationwide$ to .Rbuildignore
- Added data/zenodo-upload-nationwide/ to .gitignore

This resolves the critical CRAN warning, leaving only the optional
qpdf warning (PDF compression tool).

R CMD check results after fix:
- 0 errors ✓
- 1 warning (qpdf - optional)
- 2 notes (expected: new submission + AGPL license)
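
Why a second .Rbuildignore entry was needed: per Writing R Extensions,
each line of .Rbuildignore is a Perl-style regular expression matched
case-insensitively against paths relative to the package root, so the
existing `^zenodo-upload-nationwide$` entry cannot match a copy of the
directory under data/. A minimal sketch of that matching in R (the
`is_ignored()` helper and the sample paths are illustrative only, not part
of the package or of R CMD build itself):

```r
# Illustrative helper (not part of emburden): TRUE when any .Rbuildignore
# pattern matches the path, using case-insensitive Perl-style matching.
is_ignored <- function(path, patterns) {
  any(vapply(
    patterns,
    function(p) grepl(p, path, perl = TRUE, ignore.case = TRUE),
    logical(1)
  ))
}

old_patterns <- c("^zenodo-upload$", "^zenodo-upload-nationwide$")
new_patterns <- c(old_patterns, "^data/zenodo-upload-nationwide$")

is_ignored("zenodo-upload-nationwide", old_patterns)       # TRUE: top-level copy already ignored
is_ignored("data/zenodo-upload-nationwide", old_patterns)  # FALSE: "^" anchors at the package root
is_ignored("data/zenodo-upload-nationwide", new_patterns)  # TRUE: matched by the new entry
```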

* feat: Add WORDLIST for technical terms and acronyms

Whitelists domain-specific terminology to fix spelling check issues:
- Technical acronyms (ACS, AMI, FPL, EROI, NER, etc.)
- Software/tooling terms (OpenEI, Zenodo, dplyr, tidyverse)
- Methodology-specific notation (Nh, EB, etc.)
- Place names (Carrboro, Hillsborough)
- Author names (Kittner)
- File formats and technical terms

This resolves all spelling warnings for legitimate technical
terminology while maintaining spell-checking for actual typos.

* chore: Bump version to 0.5.7

CRAN readiness release with final compliance fixes:
- Excluded data/zenodo-upload-nationwide/ from package build
- Added WORDLIST for technical terms and acronyms
- Fixed publish-to-public workflow

* docs: Add per-state caching architecture proposal

Proposes improved caching strategy to avoid re-downloading all 51
states when only a few are missing.
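
The core idea can be sketched in a few lines of R. This is an illustration
under assumptions, not the package's implementation: it reuses the
`lead_<vintage>_<dataset>_<state>.zip` naming and the 10 KB size threshold
from the proposal, while `find_missing_states()` and the temporary cache
directory are hypothetical.

```r
# Hypothetical helper: which states lack a plausibly valid cached ZIP?
find_missing_states <- function(cache_dir, dataset, vintage, states,
                                min_bytes = 10000) {
  zip_files <- file.path(
    cache_dir,
    sprintf("lead_%s_%s_%s.zip", vintage, dataset, states)
  )
  sizes <- file.size(zip_files)               # NA for files that do not exist
  cached <- !is.na(sizes) & sizes > min_bytes
  states[!cached]
}

# Simulate a cache: AL and AK are cached, HI is truncated, IL is absent.
cache_dir <- tempfile("emburden-cache-")
dir.create(cache_dir)
for (state in c("AL", "AK")) {
  writeBin(raw(20000), file.path(cache_dir, sprintf("lead_2022_fpl_%s.zip", state)))
}
writeBin(raw(100), file.path(cache_dir, "lead_2022_fpl_HI.zip"))

missing <- find_missing_states(cache_dir, "fpl", "2022", c("AL", "AK", "HI", "IL"))
print(missing)  # only HI and IL would need to be downloaded
```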

* fix: Install orcidlink LaTeX package for Windows builds

Adds installation of orcidlink.sty package required by JSS vignette.
This fixes Windows R CMD check failures with missing LaTeX dependency.

* Revert "fix: Install orcidlink LaTeX package for Windows builds"

This reverts commit 43e4f7a987be03814456d7bca4183c0c9ede8eeb.

* fix: Install orcidlink LaTeX package for JSS vignette

Uses standard Rscript command to install the orcidlink LaTeX package
required by the JSS (Journal of Statistical Software) vignette format.

This fixes vignette building on all platforms, especially Windows.

* Revert "fix: Install orcidlink LaTeX package for JSS vignette"

This reverts commit 3113fa5655aefc0f57689488cebd1cece148a9fc.

* fix: Install orcidlink LaTeX package after R dependencies

The previous attempts failed because tinytex::tlmgr_install() was called
before the tinytex R package was installed. Moving this step to after
setup-r-dependencies ensures the tinytex package is available.

This fixes the JSS vignette build failure on Windows (and all platforms).
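
The ordering matters because `tinytex::tlmgr_install()` lives in the tinytex
R package, so the call errors when that package is not yet installed. A
guarded version of the step might look like the sketch below;
`install_latex_package()` and its `have_tinytex` and `installer` arguments
are hypothetical and exist only to make the dependency explicit and testable.

```r
# Hypothetical wrapper: fail early with a clear message when the tinytex
# R package is missing, instead of erroring inside the workflow step.
install_latex_package <- function(pkg,
                                  have_tinytex = requireNamespace("tinytex", quietly = TRUE),
                                  installer = NULL) {
  if (!have_tinytex) {
    stop("The 'tinytex' R package must be installed before installing ",
         "LaTeX package '", pkg, "'", call. = FALSE)
  }
  # Fall back to the real installer only once tinytex is known to be available.
  if (is.null(installer)) installer <- tinytex::tlmgr_install
  installer(pkg)
}

# Dry run with a stand-in installer, so nothing is actually installed here.
result <- install_latex_package(
  "orcidlink",
  have_tinytex = TRUE,
  installer = function(pkg) paste("would install", pkg)
)
print(result)
```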
diff --git a/.Rbuildignore b/.Rbuildignore
@@ -156,6 +156,7 @@
 ^STATUS\.md$
 ^zenodo-upload$
 ^zenodo-upload-nationwide$
+^data/zenodo-upload-nationwide$
 ^cleanup_conflicts\.R$
 ^helpers\.R$
 ^ratios\.R$
diff --git a/.dev/PER-STATE-CACHING-PROPOSAL.md b/.dev/PER-STATE-CACHING-PROPOSAL.md
@@ -0,0 +1,258 @@
+# Per-State Caching Architecture Proposal
+
+## Problem
+
+Currently, if a nationwide dataset is missing just 1-2 states (e.g., FPL 2022 missing HI and IL), we must re-download all 51 states (~12GB, 30-60 minutes) instead of just the missing states.
+
+**Root Cause**: Individual state ZIP files are not cached - they're downloaded, extracted, then deleted.
+
+## Proposed Architecture
+
+### 1. Cache Structure
+
+```
+~/.cache/emburden/
+├── lead_2022_fpl_AL.zip          # Individual state ZIPs (kept!)
+├── lead_2022_fpl_AK.zip
+├── lead_2022_fpl_...
+├── lead_2022_fpl_HI.zip          # Missing state
+├── lead_2022_fpl_IL.zip          # Missing state
+├── lead_2022_fpl_WY.zip
+├── lead_2022_fpl.csv             # Merged nationwide CSV
+└── emburden_db.sqlite            # Database with nationwide data
+```
+
+### 2. Smart Download Logic
+
+#### Before (Current):
+```
+download_and_merge_states() {
+  for each state in all 51 states:
+    download ZIP
+    extract CSV
+    delete ZIP    # ❌ Lost!
+  merge all CSVs
+  save merged CSV
+}
+```
+
+#### After (Proposed):
+```
+download_and_merge_states() {
+  # 1. Check which states are already cached
+  cached_states <- check_cached_state_files(dataset, vintage)
+  missing_states <- setdiff(all_states, cached_states)
+
+  # 2. Only download missing states
+  for each state in missing_states:
+    download ZIP to state-specific file (e.g., lead_2022_fpl_HI.zip)
+    keep ZIP for future use    # ✅ Cached!
+
+  # 3. Load all states (cached + newly downloaded)
+  for each state in all 51 states:
+    if (state ZIP exists):
+      extract and load data
+    else:
+      skip (log warning)
+
+  # 4. Merge and validate
+  merge all loaded states
+  if (missing states):
+    report which states are missing
+  save merged CSV
+}
+```
+
+### 3. Validation & Self-Healing
+
+When validation detects corrupt/incomplete nationwide data:
+
+```r
+# Current behavior:
+clear_dataset_cache("fpl", "2022")  # Deletes EVERYTHING
+# ...then re-download all 51 states (12GB download)
+
+# Proposed behavior:
+detect_missing_states(data)         # Returns: c("HI", "IL")
+clear_state_cache("fpl", "2022", c("HI", "IL"))  # Delete only corrupt states
+# ...then re-download the 2 missing states (500MB download)
+# ...and merge with the 49 cached states (1-2 minutes)
+```
+
+### 4. Functions to Implement
+
+#### `check_cached_state_files(dataset, vintage)`
+Returns character vector of states that have valid cached ZIP files.
+
+```r
+check_cached_state_files <- function(dataset, vintage) {
+  cache_dir <- get_cache_dir()
+  all_states <- get_all_states()
+
+  cached <- character()
+  for (state in all_states) {
+    zip_file <- file.path(cache_dir,
+                          sprintf("lead_%s_%s_%s.zip", vintage, dataset, state))
+    if (file.exists(zip_file) && file.size(zip_file) > 10000) {  # >10KB
+      cached <- c(cached, state)
+    }
+  }
+
+  return(cached)
+}
+```
+
+#### `clear_state_cache(dataset, vintage, states)`
+Removes specific state ZIP files (for corrupted data).
+
+```r
+clear_state_cache <- function(dataset, vintage, states, verbose = TRUE) {
+  cache_dir <- get_cache_dir()
+
+  for (state in states) {
+    zip_file <- file.path(cache_dir,
+                          sprintf("lead_%s_%s_%s.zip", vintage, dataset, state))
+    if (file.exists(zip_file)) {
+      unlink(zip_file)
+      if (verbose) message("  ✓ Deleted: ", basename(zip_file))
+    }
+  }
+}
+```
+
+#### Modified `download_and_merge_states()`
+
+```r
+download_and_merge_states <- function(dataset, vintage, states, verbose = TRUE) {
+
+  # Check which states are already cached
+  cached_states <- check_cached_state_files(dataset, vintage)
+  missing_states <- setdiff(states, cached_states)
+
+  if (verbose) {
+    message(sprintf("Cached states: %d, Missing states: %d",
+                    length(cached_states), length(missing_states)))
+    if (length(missing_states) > 0) {
+      message("Will download: ", paste(missing_states, collapse = ", "))
+    }
+    if (length(cached_states) > 0) {
+      message("Will load from cache: ", paste(cached_states, collapse = ", "))
+    }
+  }
+
+  # Download only missing states
+  if (length(missing_states) > 0) {
+    for (i in seq_along(missing_states)) {
+      state <- missing_states[i]
+      if (verbose) {
+        message(sprintf("[%d/%d] Downloading %s...", i, length(missing_states), state))
+      }
+      download_single_state_cached(dataset, vintage, state, verbose = FALSE)
+    }
+  }
+
+  # Load all states (cached + newly downloaded)
+  all_data <- list()
+  failed_states <- character()
+
+  for (state in states) {
+    tryCatch({
+      state_data <- load_state_from_cache(dataset, vintage, state, verbose = FALSE)
+      if (!is.null(state_data) && nrow(state_data) > 0) {
+        all_data[[state]] <- state_data
+      } else {
+        failed_states <- c(failed_states, state)
+      }
+    }, error = function(e) {
+      warning(sprintf("Failed to load %s: %s", state, e$message))
+      failed_states <<- c(failed_states, state)  # `<<-` updates the enclosing function's vector
+    })
+  }
+
+  # Merge and save
+  combined_data <- dplyr::bind_rows(all_data)
+
+  # Save merged nationwide CSV
+  cache_dir <- get_cache_dir()
+  cache_file <- file.path(cache_dir, paste0("lead_", vintage, "_", dataset, ".csv"))
+  readr::write_csv(combined_data, cache_file)
+
+  # Import to database
+  try_import_to_database(combined_data, dataset, vintage, verbose = verbose)
+
+  return(combined_data)
+}
+```
+
+### 5. Benefits
+
+- ✅ **Efficiency**: Download only missing states (minutes instead of the 30-60 minutes for a full re-download)
+- ✅ **Resilience**: Individual state corruption doesn't require full re-download
+- ✅ **Transparency**: Clear reporting of cached vs downloaded states
+- ✅ **Bandwidth**: Avoids repeated downloads, at the cost of ~13GB of disk per dataset (51 states × ~250MB)
+- ✅ **Debugging**: Can inspect individual state files
+
+### 6. Disk Space Considerations
+
+- **Before**: ~50MB merged CSV per dataset
+- **After**: ~13GB state ZIPs + ~50MB merged CSV per dataset
+
+**Mitigation**:
+- State ZIPs can be deleted after successful merge (optional)
+- Add `clear_state_cache()` function for manual cleanup
+- Add `--keep-state-cache` flag to regeneration script
+
+### 7. Implementation Priority
+
+1. **Phase 1** (For current regeneration):
+   - Modify `download_and_merge_states()` to cache state ZIPs
+   - Implement `check_cached_state_files()`
+   - Test with current FPL 2022 issue
+
+2. **Phase 2** (Post-CRAN):
+   - Add `clear_state_cache()` to `R/cache_utils.R`
+   - Update corruption detection to identify missing states
+   - Implement selective re-download
+
+3. **Phase 3** (Optional):
+   - Add cleanup options to regeneration script
+   - Implement automatic state cache expiration (30 days?)
+
+### 8. Migration Strategy
+
+Existing users with no cached state files will simply download as before. Once state caching is implemented, future downloads benefit from the per-state cache.
+
+No breaking changes to existing API.
+
+---
+
+## Implementation Decision
+
+**Should we implement this now?**
+
+### Option A: Implement now (before completing current regeneration)
+- ✅ PRO: Solves FPL 2022 issue efficiently (download just HI, IL)
+- ✅ PRO: Future-proofs against similar issues
+- ❌ CON: Delays Zenodo upload by 1-2 hours
+- ❌ CON: Requires testing with active downloads
+
+### Option B: Implement after Zenodo upload (post-CRAN)
+- ✅ PRO: Current regeneration completes sooner
+- ✅ PRO: Can test thoroughly in development
+- ✅ PRO: CRAN submission not delayed
+- ❌ CON: Must re-download all 51 states for FPL 2022 now
+
+### Recommendation: **Option B**
+
+**Reason**: We're already 71% through AMI 2018 download. Implementing per-state caching now would require:
+1. Stopping current regeneration
+2. Implementing and testing new code
+3. Re-running downloads (losing current progress)
+
+Better to:
+1. Complete current regeneration
+2. Get clean datasets to Zenodo
+3. Implement per-state caching properly in next version
+4. Include in v0.6.0 release notes as improvement
+
+This makes per-state caching a **v0.6.0 feature** rather than rushing it into v0.5.x.
diff --git a/.github/workflows/R-CMD-check.yml b/.github/workflows/R-CMD-check.yml
@@ -55,6 +55,9 @@ jobs:
           extra-packages: any::rcmdcheck
           needs: check
 
+      - name: Install LaTeX packages for vignettes
+        run: Rscript -e "tinytex::tlmgr_install('orcidlink')"
+
       - uses: r-lib/actions/check-r-package@v2
         with:
           upload-snapshots: true
diff --git a/.github/workflows/publish-to-public.yml b/.github/workflows/publish-to-public.yml
@@ -124,9 +124,12 @@ jobs:
           find . -name "*_files" -type d -exec rm -rf {} + 2>/dev/null || true
           echo "✓ Removed *_files directories"
 
-          # Remove workflow file itself (don't want this in public repo)
+          # Remove private-only workflow files (don't want these in public repo)
           rm -f .github/workflows/publish-to-public.yml
-          echo "✓ Removed workflow file"
+          rm -f .github/workflows/auto-release.yml
+          rm -f .github/workflows/auto-release.yaml
+          rm -f .github/workflows/auto-tag-on-version-bump.yml
+          echo "✓ Removed private workflow files"
 
           echo ""
           echo "Files cleaned. Current status:"
diff --git a/.gitignore b/.gitignore
@@ -169,3 +169,4 @@ rsconnect/.env
 # Zenodo upload staging
 zenodo-upload/
 zenodo-upload-nationwide/
+data/zenodo-upload-nationwide/
diff --git a/.zenodo.json b/.zenodo.json
@@ -41,6 +41,6 @@
     }
   ],
   "grants": [],
-  "version": "0.5.6",
+  "version": "0.5.7",
   "language": "eng"
 }
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: emburden
 Title: Energy Burden Analysis Using Net Energy Return Methodology
-Version: 0.5.6
+Version: 0.5.7
 Authors@R:
     person("Eric", "Scheier", , "eric@scheier.org", role = c("aut", "cre"))
 Description: Provides tools for calculating and analyzing household energy
diff --git a/NEWS.md b/NEWS.md
@@ -1,3 +1,17 @@
+# emburden 0.5.7
+
+## CRAN Readiness - Final Fixes
+
+This patch release completes CRAN readiness with final compliance fixes.
+
+### Bug Fixes
+
+* **Package build exclusions**: Excluded `data/zenodo-upload-nationwide/` directory from package tarball (fixes CRAN data directory WARNING)
+* **Spelling whitelist**: Added `inst/WORDLIST` with 85 technical terms and acronyms to prevent false-positive spelling errors
+* **Public repository sync**: Fixed `publish-to-public` workflow to properly remove private-only workflow files before syncing to public repository
+
+---
+
 # emburden 0.5.6
 
 ## CRAN Quality-of-Life Improvements
diff --git a/inst/CITATION b/inst/CITATION
@@ -3,12 +3,12 @@ bibentry(
   title    = "{emburden}: Energy Burden Analysis Using Net Energy Return Methodology",
   author   = "Eric Scheier",
   year     = "2025",
-  note     = "R package version 0.5.6",
+  note     = "R package version 0.5.7",
   url      = "https://github.com/ericscheier/emburden",
   textVersion = paste(
     "Scheier, Eric (2025).",
     "emburden: Energy Burden Analysis Using Net Energy Return Methodology.",
-    "R package version 0.5.6",
+    "R package version 0.5.7",
     "https://github.com/ericscheier/emburden"
   )
 )
diff --git a/inst/WORDLIST b/inst/WORDLIST
