
Commit df56efc

fix: Exclude zenodo-upload-nationwide from package build (#42)
* fix: Remove all private workflows when publishing to public

  Updated publish-to-public workflow to remove all private-only workflow files (auto-release.yml, auto-release.yaml, auto-tag-on-version-bump.yml) in addition to publish-to-public.yml itself. This prevents these workflows from running in the public repo where they would fail due to missing secrets and environments.

* fix: Exclude zenodo-upload-nationwide from package build

  Fixes CRAN WARNING: "Files not of a type allowed in a 'data' directory"

  The data/ directory in R packages is reserved for R data objects (.rda, .RData). The zenodo-upload-nationwide directory contains CSV/gzip files for Zenodo deployment and should not be included in the package build.

  Changes:
  - Added ^data/zenodo-upload-nationwide$ to .Rbuildignore
  - Added data/zenodo-upload-nationwide/ to .gitignore

  This resolves the critical CRAN warning, leaving only the optional qpdf warning (PDF compression tool).

  R CMD check results after fix:
  - 0 errors ✓
  - 1 warning (qpdf - optional)
  - 2 notes (expected: new submission + AGPL license)

* feat: Add WORDLIST for technical terms and acronyms

  Whitelists domain-specific terminology to fix spelling check issues:
  - Technical acronyms (ACS, AMI, FPL, EROI, NER, etc.)
  - Software/tooling terms (OpenEI, Zenodo, dplyr, tidyverse)
  - Methodology-specific notation (Nh, EB, etc.)
  - Place names (Carrboro, Hillsborough)
  - Author names (Kittner)
  - File formats and technical terms

  This resolves all spelling warnings for legitimate technical terminology while maintaining spell-checking for actual typos.

* chore: Bump version to 0.5.7

  CRAN readiness release with final compliance fixes:
  - Excluded data/zenodo-upload-nationwide/ from package build
  - Added WORDLIST for technical terms and acronyms
  - Fixed publish-to-public workflow

* docs: Add per-state caching architecture proposal

  Proposes improved caching strategy to avoid re-downloading all 51 states when only a few are missing.

* fix: Install orcidlink LaTeX package for Windows builds

  Adds installation of orcidlink.sty package required by JSS vignette. This fixes Windows R CMD check failures with missing LaTeX dependency.

* Revert "fix: Install orcidlink LaTeX package for Windows builds"

  This reverts commit 43e4f7a987be03814456d7bca4183c0c9ede8eeb.

* fix: Install orcidlink LaTeX package for JSS vignette

  Uses standard Rscript command to install the orcidlink LaTeX package required by the JSS (Journal of Statistical Software) vignette format. This fixes vignette building on all platforms, especially Windows.

* Revert "fix: Install orcidlink LaTeX package for JSS vignette"

  This reverts commit 3113fa5655aefc0f57689488cebd1cece148a9fc.

* fix: Install orcidlink LaTeX package after R dependencies

  The previous attempts failed because tinytex::tlmgr_install() was called before the tinytex R package was installed. Moving this step to after setup-r-dependencies ensures the tinytex package is available. This fixes the JSS vignette build failure on Windows (and all platforms).
1 parent 007e770 commit df56efc

10 files changed

Lines changed: 371 additions & 6 deletions


.Rbuildignore

Lines changed: 1 addition & 0 deletions
@@ -156,6 +156,7 @@
 ^STATUS\.md$
 ^zenodo-upload$
 ^zenodo-upload-nationwide$
+^data/zenodo-upload-nationwide$
 ^cleanup_conflicts\.R$
 ^helpers\.R$
 ^ratios\.R$
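
Why the extra anchored pattern is needed: `R CMD build` treats each `.Rbuildignore` line as a Perl-like regular expression matched, case-insensitively, against paths relative to the package root. The existing `^zenodo-upload-nationwide$` entry is anchored at the root, so it cannot match the copy under `data/`. The snippet below only approximates that matching with `grepl()`; it is not the code `R CMD build` runs, and the `.rda` file name is made up for illustration.

```r
# Patterns taken from the .Rbuildignore hunk above
patterns <- c("^zenodo-upload-nationwide$", "^data/zenodo-upload-nationwide$")

# Paths relative to the package root ("data/example.rda" is a made-up name)
paths <- c("zenodo-upload-nationwide",
           "data/zenodo-upload-nationwide",
           "data/example.rda")

# Approximation: a path is ignored if any pattern matches it
is_ignored <- vapply(paths, function(p) {
  any(vapply(patterns, grepl, logical(1), x = p, perl = TRUE, ignore.case = TRUE))
}, logical(1))

print(is_ignored)
```

With only the first pattern, the `data/zenodo-upload-nationwide` path would not match, which is consistent with the CRAN warning described in the commit message.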

.dev/PER-STATE-CACHING-PROPOSAL.md

Lines changed: 258 additions & 0 deletions
@@ -0,0 +1,258 @@
# Per-State Caching Architecture Proposal

## Problem

Currently, if a nationwide dataset is missing just 1-2 states (e.g., FPL 2022 missing HI and IL), we must re-download all 51 states (~12GB, 30-60 minutes) instead of just the missing states.

**Root Cause**: Individual state ZIP files are not cached - they're downloaded, extracted, then deleted.

## Proposed Architecture

### 1. Cache Structure

```
~/.cache/emburden/
├── lead_2022_fpl_AL.zip          # Individual state ZIPs (kept!)
├── lead_2022_fpl_AK.zip
├── lead_2022_fpl_...
├── lead_2022_fpl_HI.zip          # Missing state
├── lead_2022_fpl_IL.zip          # Missing state
├── lead_2022_fpl_WY.zip
├── lead_2022_fpl.csv             # Merged nationwide CSV
└── emburden_db.sqlite            # Database with nationwide data
```
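
The naming convention in the tree above can be captured in one small helper so that every function builds state ZIP paths the same way. This is a minimal sketch: `state_zip_path()` is a hypothetical name that does not appear in the proposal, and the cache directory is passed in explicitly rather than looked up.

```r
# Hypothetical helper (not part of the proposal): build the cache path for
# one or more state ZIPs, following lead_<vintage>_<dataset>_<state>.zip.
state_zip_path <- function(cache_dir, dataset, vintage, state) {
  file.path(cache_dir, sprintf("lead_%s_%s_%s.zip", vintage, dataset, state))
}

# Vectorised over `state`, so one call covers several states
paths <- state_zip_path("~/.cache/emburden", "fpl", "2022", c("HI", "IL"))
print(basename(paths))
```

Centralising the pattern would avoid repeating the same `sprintf()` call in each cache function.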

### 2. Smart Download Logic

#### Before (Current):
```r
download_and_merge_states() {
  for each state in all 51 states:
    download ZIP
    extract CSV
    delete ZIP  # ❌ Lost!
  merge all CSVs
  save merged CSV
}
```

#### After (Proposed):
```r
download_and_merge_states() {
  # 1. Check which states are already cached
  cached_states <- check_cached_state_files(dataset, vintage)
  missing_states <- setdiff(all_states, cached_states)

  # 2. Only download missing states
  for each state in missing_states:
    download ZIP to state-specific file (e.g., lead_2022_fpl_HI.zip)
    keep ZIP for future use  # ✅ Cached!

  # 3. Load all states (cached + newly downloaded)
  for each state in all 51 states:
    if (state ZIP exists):
      extract and load data
    else:
      skip (log warning)

  # 4. Merge and validate
  merge all loaded states
  if (missing states):
    report which states are missing
  save merged CSV
}
```

### 3. Validation & Self-Healing

When validation detects corrupt/incomplete nationwide data:

```r
# Current behavior:
clear_dataset_cache("fpl", "2022")               # Deletes EVERYTHING
# ...then re-download all 51 states              (12GB download)

# Proposed behavior:
detect_missing_states(data)                      # Returns: c("HI", "IL")
clear_state_cache("fpl", "2022", c("HI", "IL"))  # Delete only corrupt states
# ...then re-download the 2 missing states       (500MB download)
# ...and merge with the 49 cached states         (1-2 minutes)
```
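
The proposal names `detect_missing_states()` but does not define it. One possible shape is sketched below, under two loudly stated assumptions: the merged data has a column of state abbreviations (called `state_abbr` here, a made-up name), and the list of expected states is passed in explicitly, whereas the call above takes only `data`.

```r
# Hypothetical sketch of detect_missing_states(). The `expected_states`
# argument and the `state_abbr` column name are assumptions.
detect_missing_states <- function(data, expected_states, state_col = "state_abbr") {
  if (!state_col %in% names(data)) {
    stop("Column not found: ", state_col)
  }
  present <- unique(as.character(data[[state_col]]))
  # States we expected but never saw in the merged data
  sort(setdiff(expected_states, present))
}

# Toy merged data that lacks HI and IL
merged <- data.frame(state_abbr = c("AL", "AK", "WY"), households = c(10, 20, 30))
missing <- detect_missing_states(merged, c("AL", "AK", "HI", "IL", "WY"))
print(missing)
```

The returned vector can be passed straight to `clear_state_cache()` as its `states` argument.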

### 4. Functions to Implement

#### `check_cached_state_files(dataset, vintage)`
Returns character vector of states that have valid cached ZIP files.

```r
check_cached_state_files <- function(dataset, vintage) {
  cache_dir <- get_cache_dir()
  all_states <- get_all_states()

  cached <- character()
  for (state in all_states) {
    zip_file <- file.path(cache_dir,
                          sprintf("lead_%s_%s_%s.zip", vintage, dataset, state))
    if (file.exists(zip_file) && file.size(zip_file) > 10000) {  # >10KB
      cached <- c(cached, state)
    }
  }

  return(cached)
}
```

#### `clear_state_cache(dataset, vintage, states)`
Removes specific state ZIP files (for corrupted data).

```r
clear_state_cache <- function(dataset, vintage, states, verbose = TRUE) {
  cache_dir <- get_cache_dir()

  for (state in states) {
    zip_file <- file.path(cache_dir,
                          sprintf("lead_%s_%s_%s.zip", vintage, dataset, state))
    if (file.exists(zip_file)) {
      unlink(zip_file)
      if (verbose) message("  ✓ Deleted: ", basename(zip_file))
    }
  }
}
```

#### Modified `download_and_merge_states()`

```r
download_and_merge_states <- function(dataset, vintage, states, verbose = TRUE) {

  # Check which of the requested states are already cached
  cached_states <- intersect(states, check_cached_state_files(dataset, vintage))
  missing_states <- setdiff(states, cached_states)

  if (verbose) {
    message(sprintf("Cached states: %d, Missing states: %d",
                    length(cached_states), length(missing_states)))
    if (length(missing_states) > 0) {
      message("Will download: ", paste(missing_states, collapse = ", "))
    }
    if (length(cached_states) > 0) {
      message("Will load from cache: ", paste(cached_states, collapse = ", "))
    }
  }

  # Download only missing states
  if (length(missing_states) > 0) {
    for (i in seq_along(missing_states)) {
      state <- missing_states[i]
      if (verbose) {
        message(sprintf("[%d/%d] Downloading %s...", i, length(missing_states), state))
      }
      download_single_state_cached(dataset, vintage, state, verbose = FALSE)
    }
  }

  # Load all states (cached + newly downloaded)
  all_data <- list()
  failed_states <- character()

  for (state in states) {
    tryCatch({
      state_data <- load_state_from_cache(dataset, vintage, state, verbose = FALSE)
      if (!is.null(state_data) && nrow(state_data) > 0) {
        all_data[[state]] <- state_data
      } else {
        failed_states <- c(failed_states, state)
      }
    }, error = function(e) {
      warning(sprintf("Failed to load %s: %s", state, e$message))
      # `<<-` is required here: a plain `<-` would only create a copy
      # local to this handler and the failure would be lost
      failed_states <<- c(failed_states, state)
    })
  }

  # Report states that could not be loaded (step 4 of the outline above)
  if (length(failed_states) > 0) {
    warning("States missing from merged data: ", paste(failed_states, collapse = ", "))
  }

  # Merge and save
  combined_data <- dplyr::bind_rows(all_data)

  # Save merged nationwide CSV
  cache_dir <- get_cache_dir()
  cache_file <- file.path(cache_dir, paste0("lead_", vintage, "_", dataset, ".csv"))
  readr::write_csv(combined_data, cache_file)

  # Import to database
  try_import_to_database(combined_data, dataset, vintage, verbose = verbose)

  return(combined_data)
}
```
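
`load_state_from_cache()` is called above but not shown. The sketch below illustrates the kind of ZIP-reading step it might wrap. `read_state_zip()` is a hypothetical name, the cache directory is passed explicitly instead of coming from `get_cache_dir()`, and the assumption that the first CSV in the archive holds the state's data is a guess about the archive layout.

```r
# Hypothetical sketch: read one state's data from its cached ZIP.
# Returns NULL when the ZIP is absent or contains no CSV.
read_state_zip <- function(dataset, vintage, state, cache_dir) {
  zip_file <- file.path(cache_dir,
                        sprintf("lead_%s_%s_%s.zip", vintage, dataset, state))
  if (!file.exists(zip_file)) return(NULL)

  # Extract into a throwaway directory so the cached ZIP stays untouched
  exdir <- tempfile("lead_extract_")
  dir.create(exdir)
  on.exit(unlink(exdir, recursive = TRUE), add = TRUE)

  extracted <- utils::unzip(zip_file, exdir = exdir)
  csv_files <- extracted[grepl("\\.csv$", extracted, ignore.case = TRUE)]
  if (length(csv_files) == 0) return(NULL)

  # Assumption: the first CSV in the archive holds the state's data
  utils::read.csv(csv_files[1], stringsAsFactors = FALSE)
}

# With an empty cache directory the function reports "nothing cached"
empty_cache <- tempfile("emburden_cache_")
dir.create(empty_cache)
result <- read_state_zip("fpl", "2022", "HI", cache_dir = empty_cache)
print(is.null(result))
```

Returning `NULL` rather than raising an error fits the `!is.null(state_data)` check in the loading loop, which records the state as failed and moves on.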

### 5. Benefits

- **Efficiency**: Download only missing states (a few minutes instead of the 30-60 minutes a full re-download takes)
- **Resilience**: Individual state corruption doesn't require full re-download
- **Transparency**: Clear reporting of cached vs downloaded states
- **Storage**: Costs ~13GB per dataset (51 states × ~250MB), but saves bandwidth
- **Debugging**: Can inspect individual state files

### 6. Disk Space Considerations

- **Before**: ~50MB merged CSV per dataset
- **After**: ~13GB state ZIPs + ~50MB merged CSV per dataset

**Mitigation**:
- State ZIPs can be deleted after successful merge (optional)
- Add `clear_state_cache()` function for manual cleanup
- Add `--keep-state-cache` flag to regeneration script
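
To decide when cleanup is worth it, it helps to measure how much space the per-state ZIPs actually occupy. A minimal sketch follows; `state_cache_size_mb()` is a hypothetical helper that is not part of the proposal, and it assumes two-letter uppercase state codes in the file names.

```r
# Hypothetical helper: total size (in MB) of the per-state ZIPs for one dataset.
state_cache_size_mb <- function(cache_dir, dataset, vintage) {
  # Match lead_<vintage>_<dataset>_<two-letter state>.zip only, so the merged
  # CSV and the SQLite database are not counted
  pattern <- sprintf("^lead_%s_%s_[A-Z]{2}\\.zip$", vintage, dataset)
  zips <- list.files(cache_dir, pattern = pattern, full.names = TRUE)
  sum(file.size(zips)) / 1024^2
}

# Demo with two 1 MB placeholder "ZIPs" and a small merged CSV
demo_dir <- tempfile("emburden_size_")
dir.create(demo_dir)
writeBin(raw(1024^2), file.path(demo_dir, "lead_2022_fpl_HI.zip"))
writeBin(raw(1024^2), file.path(demo_dir, "lead_2022_fpl_IL.zip"))
writeBin(raw(10), file.path(demo_dir, "lead_2022_fpl.csv"))

size_mb <- state_cache_size_mb(demo_dir, "fpl", "2022")
print(size_mb)
```

A regeneration script could print this figure after a successful merge and leave the choice to delete the ZIPs to the user.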

### 7. Implementation Priority

1. **Phase 1** (For current regeneration):
   - Modify `download_and_merge_states()` to cache state ZIPs
   - Implement `check_cached_state_files()`
   - Test with current FPL 2022 issue

2. **Phase 2** (Post-CRAN):
   - Add `clear_state_cache()` to `R/cache_utils.R`
   - Update corruption detection to identify missing states
   - Implement selective re-download

3. **Phase 3** (Optional):
   - Add cleanup options to regeneration script
   - Implement automatic state cache expiration (30 days?)
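
The Phase 3 expiration idea could be based on file modification times. The following is only a sketch under assumptions: `find_expired_state_zips()` is a hypothetical name, the 30-day default mirrors the tentative figure above, and the function only lists candidates rather than deleting them.

```r
# Hypothetical sketch: list per-state ZIPs older than `max_age_days`.
find_expired_state_zips <- function(cache_dir, max_age_days = 30, now = Sys.time()) {
  zips <- list.files(cache_dir, pattern = "^lead_.*_[A-Z]{2}\\.zip$",
                     full.names = TRUE)
  age_days <- as.numeric(difftime(now, file.mtime(zips), units = "days"))
  zips[age_days > max_age_days]
}

# Demo: one file back-dated by 40 days, one fresh file
demo_dir <- tempfile("emburden_expiry_")
dir.create(demo_dir)
old_zip <- file.path(demo_dir, "lead_2022_fpl_HI.zip")
new_zip <- file.path(demo_dir, "lead_2022_fpl_IL.zip")
file.create(old_zip, new_zip)
Sys.setFileTime(old_zip, Sys.time() - 40 * 24 * 60 * 60)

expired <- find_expired_state_zips(demo_dir)
print(basename(expired))
```

Keeping detection separate from deletion lets the caller log or confirm the list before passing it to `unlink()`.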

### 8. Migration Strategy

Existing users with no cached state files will simply download as before. Once state caching is implemented, future downloads benefit from the per-state cache.

No breaking changes to existing API.

---

## Implementation Decision

**Should we implement this now?**

### Option A: Implement now (before completing current regeneration)
- ✅ PRO: Solves FPL 2022 issue efficiently (download just HI, IL)
- ✅ PRO: Future-proofs against similar issues
- ❌ CON: Delays Zenodo upload by 1-2 hours
- ❌ CON: Requires testing with active downloads

### Option B: Implement after Zenodo upload (post-CRAN)
- ✅ PRO: Current regeneration completes sooner
- ✅ PRO: Can test thoroughly in development
- ✅ PRO: CRAN submission not delayed
- ❌ CON: Must re-download all 51 states for FPL 2022 now

### Recommendation: **Option B**

**Reason**: We're already 71% through AMI 2018 download. Implementing per-state caching now would require:
1. Stopping current regeneration
2. Implementing and testing new code
3. Re-running downloads (losing current progress)

Better to:
1. Complete current regeneration
2. Get clean datasets to Zenodo
3. Implement per-state caching properly in next version
4. Include in v0.6.0 release notes as improvement

This makes per-state caching a **v0.6.0 feature** rather than rushing it into v0.5.x.

.github/workflows/R-CMD-check.yml

Lines changed: 3 additions & 0 deletions
@@ -55,6 +55,9 @@ jobs:
           extra-packages: any::rcmdcheck
           needs: check

+      - name: Install LaTeX packages for vignettes
+        run: Rscript -e "tinytex::tlmgr_install('orcidlink')"
+
       - uses: r-lib/actions/check-r-package@v2
         with:
           upload-snapshots: true

.github/workflows/publish-to-public.yml

Lines changed: 5 additions & 2 deletions
@@ -124,9 +124,12 @@ jobs:
 find . -name "*_files" -type d -exec rm -rf {} + 2>/dev/null || true
 echo "✓ Removed *_files directories"

-# Remove workflow file itself (don't want this in public repo)
+# Remove private-only workflow files (don't want these in public repo)
 rm -f .github/workflows/publish-to-public.yml
-echo "✓ Removed workflow file"
+rm -f .github/workflows/auto-release.yml
+rm -f .github/workflows/auto-release.yaml
+rm -f .github/workflows/auto-tag-on-version-bump.yml
+echo "✓ Removed private workflow files"

 echo ""
 echo "Files cleaned. Current status:"

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -169,3 +169,4 @@ rsconnect/.env
 # Zenodo upload staging
 zenodo-upload/
 zenodo-upload-nationwide/
+data/zenodo-upload-nationwide/

.zenodo.json

Lines changed: 1 addition & 1 deletion
@@ -41,6 +41,6 @@
 }
 ],
 "grants": [],
-"version": "0.5.6",
+"version": "0.5.7",
 "language": "eng"
 }

DESCRIPTION

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 Package: emburden
 Title: Energy Burden Analysis Using Net Energy Return Methodology
-Version: 0.5.6
+Version: 0.5.7
 Authors@R:
     person("Eric", "Scheier", , "eric@scheier.org", role = c("aut", "cre"))
 Description: Provides tools for calculating and analyzing household energy

NEWS.md

Lines changed: 14 additions & 0 deletions
@@ -1,3 +1,17 @@
+# emburden 0.5.7
+
+## CRAN Readiness - Final Fixes
+
+This patch release completes CRAN readiness with final compliance fixes.
+
+### Bug Fixes
+
+* **Package build exclusions**: Excluded `data/zenodo-upload-nationwide/` directory from package tarball (fixes CRAN data directory WARNING)
+* **Spelling whitelist**: Added `inst/WORDLIST` with 85 technical terms and acronyms to prevent false-positive spelling errors
+* **Public repository sync**: Fixed `publish-to-public` workflow to properly remove private-only workflow files before syncing to public repository
+
+---
+
 # emburden 0.5.6

 ## CRAN Quality-of-Life Improvements

inst/CITATION

Lines changed: 2 additions & 2 deletions
@@ -3,12 +3,12 @@ bibentry(
   title = "{emburden}: Energy Burden Analysis Using Net Energy Return Methodology",
   author = "Eric Scheier",
   year = "2025",
-  note = "R package version 0.5.6",
+  note = "R package version 0.5.7",
   url = "https://github.com/ericscheier/emburden",
   textVersion = paste(
     "Scheier, Eric (2025).",
     "emburden: Energy Burden Analysis Using Net Energy Return Methodology.",
-    "R package version 0.5.6",
+    "R package version 0.5.7",
     "https://github.com/ericscheier/emburden"
   )
 )
