Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
25 changes: 16 additions & 9 deletions .github/workflows/R-CMD-check.yaml
Original file line number Diff line number Diff line change
@@ -1,12 +1,13 @@
# Workflow derived from https://github.com/r-lib/actions/tree/master/examples
# Workflow derived from https://github.com/r-lib/actions/tree/v2/examples
# Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help
on:
push:
branches: [main, master]
pull_request:
branches: [main, master]

name: R-CMD-check
name: R-CMD-check.yaml

permissions: read-all

jobs:
R-CMD-check:
Expand All @@ -18,7 +19,7 @@ jobs:
fail-fast: false
matrix:
config:
- {os: macOS-latest, r: 'release'}
- {os: macos-latest, r: 'release'}
- {os: windows-latest, r: 'release'}
- {os: ubuntu-latest, r: 'devel', http-user-agent: 'release'}
- {os: ubuntu-latest, r: 'release'}
Expand All @@ -29,16 +30,22 @@ jobs:
R_KEEP_PKG_SOURCE: yes

steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4

- uses: r-lib/actions/setup-pandoc@v2

- uses: r-lib/actions/setup-r@v2
with:
r-version: ${{ matrix.config.r }}
http-user-agent: ${{ matrix.config.http-user-agent }}
use-public-rspm: true
- uses: r-lib/actions/setup-pandoc@v2

- uses: r-lib/actions/setup-r-dependencies@v1
- uses: r-lib/actions/setup-r-dependencies@v2
with:
extra-packages: rcmdcheck
extra-packages: any::rcmdcheck
needs: check

- uses: r-lib/actions/check-r-package@v1
- uses: r-lib/actions/check-r-package@v2
with:
upload-snapshots: true
build_args: 'c("--no-manual","--compact-vignettes=gs+qpdf")'
6 changes: 3 additions & 3 deletions DESCRIPTION
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
Package: tongfen
Type: Package
Title: Make Data Based on Different Geographies Comparable
Version: 0.3.5
Version: 0.3.6
Authors@R: c(
person("Jens", "von Bergmann", email = "jens@mountainmath.ca", role = c("aut", "cre"), comment = "creator and maintainer"))
Description: Several functions to allow comparisons of data across different geographies, in particular for Canadian census data from different censuses.
Expand All @@ -21,7 +21,7 @@ Imports:
readr,
utils,
lifecycle
RoxygenNote: 7.1.2
RoxygenNote: 7.3.1
Suggests:
knitr,
rmarkdown,
Expand All @@ -39,4 +39,4 @@ BugReports: https://github.com/mountainMath/tongfen/issues
Language: en-US
RdMacros: lifecycle
Depends:
R (>= 2.10)
R (>= 4.1)
9 changes: 9 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,12 @@
# tongfen v.0.3.6
## Major changs
- better downsampling that can also accommodate averages
- performance improvements
## Minor changes
- better documentation
- allow for datasets vartiables by census year for canadian data
- fix issue where some metadata might get duplicated

# tongfen 0.3.2
- Fix compatibility issue with changes in {sf} package
- More reliable GitHub action CRAN checks
Expand Down
10 changes: 10 additions & 0 deletions R/helpers.R
Original file line number Diff line number Diff line change
Expand Up @@ -97,6 +97,16 @@ aggregate_correspondences <- function(correspondences){
}


normalize_datasets <- function(geo_datasets) {
geo_datasets <- as.character(geo_datasets)
dataset_translation <- setNames(
c("CA21","CA16","CA11","CA06","CA01") %>% rev(),
as.character(seq(2001,2021,5)))
geo_datasets <- geo_datasets %>% dplyr::recode(!!!dataset_translation)
geo_datasets
}


ensure_names <- function(list,default_names=seq(1,length(list))){
nn <- names(list)
if (is.null(nn)) {
Expand Down
107 changes: 78 additions & 29 deletions R/tongfen.R
Original file line number Diff line number Diff line change
Expand Up @@ -216,7 +216,7 @@ tongfen_aggregate <- function(data,correspondence,meta=NULL, base_geo = NULL){
d <- d %>% sf::st_drop_geometry()
}
match_column <- intersect(names(d),names(correspondence))
if (length(match_column)==0) stop("Did not found matching geographic identifiers.")
if (length(match_column)==0) stop("Did not find matching geographic identifiers.")
if (length(match_column)>1) warning(paste0("Matching over several geographic identifiers: ",paste0(match_column,collapse=", ")))
c <- correspondence %>%
select_at(c(match_column,"TongfenID","TongfenUID")) %>%
Expand Down Expand Up @@ -283,7 +283,8 @@ tongfen_aggregate <- function(data,correspondence,meta=NULL, base_geo = NULL){
#' @param parent_data Higher level geographic data
#' @param geo_match A named string informing on what column names to match data and parent_data
#' @param categories Vector of column names to re-aggregate
#' @param base Column name to use for proportional weighting when re-aggregating
#' @param base Column name to use for proportional weighting when re-aggregating, or named vector with column name for each category.
#' Categries that should be re-aggregated as means should be set to NA and will only be reaggregated if the base data has NA values.
#' @return dataframe with downsampled variables from parent_data
#' @keywords reaggregate proportionally wrt base variable
#' @export
Expand All @@ -307,35 +308,83 @@ tongfen_aggregate <- function(data,correspondence,meta=NULL, base_geo = NULL){
proportional_reaggregate <- function(data,parent_data,geo_match,categories,base="Population"){
# create zero categories if we don't have them on base (for example DB geo)
for (v in setdiff(categories,names(data))) {
data <- data %>% mutate(!!v := 0)
data <- data %>% mutate(!!v := NA_real_)
}
## join and compute the weights
## maybe should be left join, but then have to worry about what happens if there is no match. For hierarchial data should always have higher level geo!
d1 <- inner_join(data %>% mutate(!!base:=tidyr::replace_na(!!as.name(base),0)),
select(parent_data %>% as.data.frame,c(categories,c(as.character(geo_match)))),
by=geo_match) %>%
group_by(!!as.name(names(geo_match))) %>%
mutate(weight=!!as.name(base)/sum(!!as.name(base),na.rm=TRUE)) %>%
ungroup() %>%
mutate(weight=tidyr::replace_na(.data$weight,0))
## aggregate variables up and down
## lower level geography counts might have been suppressed, reaggregating these makes sure that the total number of
## dots on the map are given by more accurate higher level geo counts, difference is distributed proportionally by *base*
for (v in categories) {
vss=paste(v,'s',sep=".")
vs=as.name(vss)
vx=as.name(paste(v,'x',sep="."))
vy=as.name(paste(v,'y',sep="."))
d1 <- d1 %>%
mutate(!!vx:=tidyr::replace_na(!!vx,0)) %>%
group_by(!!as.name(names(geo_match))) %>%
mutate(!!vss := sum(!!vx,na.rm=TRUE)) %>%
ungroup() %>%
mutate(!!v := !!quo(UQ(vx) + .data$weight * (UQ(vy) - UQ(vs))))

if (length(base) == 1 && length(categories)>1) {
base=setNames(rep_len(base,length(categories)),categories)
} else if (length(base) == 1 && is.null(names(base))) {
base <- setNames(base,categories)
}

if (length(base) != length(categories)) {
stop("Base and categories must be of the same length")
}

if (is.null(names(base))) {
# ensure that base is named, f
base <- setNames(base,categories)
}

id <- "...id"
while (id %in% names(data)) {
id <- paste0("...",id)
}
## clean up and return
d1 %>%
select(-one_of(c(paste0(categories,".s"),paste0(categories,".x"),paste0(categories,".y"))))

na_base <- "...na_base"
while (na_base %in% names(data)) {
na_base <- paste0(na_base)
}


na_weight_cats <- names(base[is.na(base)])
if (length(na_weight_cats) >0) {
data <- data %>%
mutate(!!na_base := NA_real_)
base[na_weight_cats] <- na_base
}

unique_base_vars <- unique(base)

data <- data %>%
mutate(!!id:=as.character(row_number()))

d_base <- data %>%
st_drop_geometry() %>%
select(any_of(c(id,na_base,names(geo_match),unique_base_vars,categories))) %>%
tidyr::pivot_longer(cols=all_of(categories),
names_to="category",
values_to="value") %>%
mutate(weight=select(.,base[.data$category])[[1]]) %>%
mutate(agg_type=ifelse(.data$category %in% na_weight_cats,"na_weight","additive")) %>%
mutate(weight=.data$weight/sum(.data$weight,na.rm=TRUE),.by=c(names(geo_match),"category")) %>%
mutate(weight=coalesce(.data$weight,0)) %>%
mutate(weight=if_else(.data$agg_type=="na_weight",NA_real_,.data$weight)) %>%
select(-any_of(unique_base_vars))

d_parent <- parent_data %>%
st_drop_geometry() %>%
select(any_of(c(as.character(geo_match),categories))) %>%
tidyr::pivot_longer(cols=all_of(categories),
names_to="category",
values_to="p_value")

d_combined <- full_join(d_base,d_parent,by=c(geo_match,"category"="category")) %>%
mutate(s_value=sum(.data$value),.by=names(geo_match)) %>%
mutate(across(any_of(c("p_value","s_value")),\(x)coalesce(x,0))) %>%
mutate(value=case_when(.data$agg_type=="additive" ~ coalesce(.data$value,0) + .data$weight*(.data$p_value-.data$s_value),
is.na(.data$value) ~ .data$p_value,
TRUE ~ .data$value))

d_result <- d_combined %>%
select(any_of(id),"category","value") %>%
tidyr::pivot_wider(names_from="category",
values_from="value")

data %>%
select(-any_of(c(categories,na_base))) %>%
left_join(d_result,by=id) %>%
select(-any_of(id))
}

#' Generate togfen correspondence for two geographies
Expand Down
22 changes: 15 additions & 7 deletions R/tongfen_ca.R
Original file line number Diff line number Diff line change
Expand Up @@ -22,10 +22,12 @@ years_from_datasets <- function(ds) {


datasets_from_vectors <- function(vs){
vs %>%
ds<-vs %>%
stringr::str_split("_") %>%
purrr::map(function(v)v[[2]]) %>%
unlist()
ds[grepl("^\\d{4}$",ds)]<-geo_dataset_for_years(ds[grepl("^\\d{4}$",ds)])
ds
}

GEO_DATASET_LOOKUP <- c(
Expand All @@ -51,7 +53,7 @@ geo_dataset_for_years <- function(years){

geo_dataset_from_dataset <- function(datasets){
if (TRUE) { # legacy until cancensus updates
datasets <- datasets %>% gsub("^CA11[NF]$","CA11",.)
datasets <- datasets %>% gsub("^CA11[NF]$","CA11",.) %>% gsub("\\d{4}x","",.)
dataset_list <- cancensus::list_census_datasets()
lapply(datasets, function(ds){
dataset_list %>%
Expand Down Expand Up @@ -132,16 +134,19 @@ meta_for_ca_census_vectors <- function(vectors){
extras <- meta %>%
select(variable=.data$parent,.data$dataset) %>%
mutate(type="Extra",aggregation="Additive",rule="Additive") %>%
filter(!is.na(.data$variable),!.data$variable %in% meta$variable)
filter(!is.na(.data$variable),!.data$variable %in% meta$variable) %>%
filter(!duplicated(.data$variable,.data$dataset)) %>%
mutate(label=.data$variable)

if (nrow(extras)>0) {
meta <- meta %>% bind_rows(extras)
meta <- meta %>%
bind_rows(extras) %>%
filter(!duplicated(.data$variable,.data$dataset))
}

meta <- meta %>%
mutate(geo_dataset=geo_dataset_from_dataset(.data$dataset),
year=years_from_datasets(.data$dataset)) %>%
mutate(label = coalesce(.data$label,.data$variable))
year=years_from_datasets(.data$dataset))
meta
}

Expand Down Expand Up @@ -261,9 +266,12 @@ get_single_correspondence_ca_census_for <- function(year,level=c("DA","DB"),refr
get_tongfen_correspondence_ca_census <- function(geo_datasets, regions, level="CT", method="statcan",
tolerance = 50, area_mismatch_cutoff = 0.1,
quiet = FALSE, refresh = FALSE) {

geo_datasets <- normalize_datasets(geo_datasets)
if (method=="statcan") {
assert(level %in% c("DB","DA","CT"),"Level has to be one of DB, DA, or CT when using method = 'statcan'.")
assert(length(setdiff(geo_datasets, c("CA21","CA16","CA11","CA06","CA01")))==0,"Method 'statcan' only works for census years 2001 through 2016.")
assert(length(setdiff(geo_datasets, c("CA21","CA16","CA11","CA06","CA01")))==0,
"Method 'statcan' only works for census years 2001 through 2021.")
} else if (method=="estimate") {

} else if (method=="identifier") {
Expand Down
8 changes: 4 additions & 4 deletions R/tongfen_ca_estimate.R
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,9 @@
#' \lifecycle{maturing}
#'
#' Estimates values for the given census vectors for the given geometry using
#' data from the specified level range
#' data from the specified level range. This is a wrapper around `cancensus::get_intersecting_geometries` and `tongfen_estimate`,
#' optionally with downsampling via `proportional_reaggregate`,
#' to streamline estimating Canadian census data on custom geographies.
#'
#' @param geometry geometry
#' @param meta metadata for the census variables to aggregate, for example as returned by `meta_for_ca_census_vectors`.
Expand Down Expand Up @@ -70,9 +72,7 @@ tongfen_estimate_ca_census <- function(geometry, meta, level,

if (!is.null(downsample_level)){
if (!("downsample" %in% names(meta))) stop("The downsample column in meta needs to be set")
base_vars <- meta %>% pull(.data$downsample) %>% stats::na.omit() %>% unique()
if (length(base_vars)==0) stop("The downsample column in meta needs to be set")
if (length(base_vars)!=1) stop(paste0("The downsample column has to have a unique variable, you provided ",paste0(base_vars, collapse = ", ")))
base_vars <- setNames(meta$downsample, meta$label)
dg <- cancensus::get_census(dataset=datasets,regions = regions,
level = downsample_level, quiet = quiet, geo_format = "sf") %>%
sf::st_transform(st_crs(geometry))
Expand Down
9 changes: 6 additions & 3 deletions R/tongfen_estimate.R
Original file line number Diff line number Diff line change
Expand Up @@ -4,17 +4,20 @@
#' @description
#' \lifecycle{maturing}
#'
#' Estimates data from source geometry onto target geometry
#' Estimates data from source geometry onto target geometry using area-weighted interpolation. The metadata specifies how data
#' should be aggregated, "additive" data like population counts are summed up proportionally to the area of the intersection, "averages"
#' need further additive "parent" count variables to estimate weighted averages.
#'
#' @param target custom geography to estimate values for
#' @param source input geography with values
#' @param meta metadata for variable aggregation
#' @param meta metadata for variable aggregation, see `meta_for_additive_variables` and `meta_for_ca_census_vectors` for more information
#' on how to construct metadata.
#' @param na.rm remove NA values when aggregating, default is FALSE
#' @return `target` with estimated quantities from `source` as specified by `meta`
#' @export
#'
#' @examples
#' # Estimate 2006 Populatino in the City of Vancouver dissemination ares on 2016 census geoographies
#' # Estimate 2006 Population in the City of Vancouver dissemination ares on 2016 census geographies
#' \dontrun{
#' geo1 <- cancensus::get_census("CA06",regions=list(CSD="5915022"),geo_format='sf',level='DA')
#' geo2 <- cancensus::get_census("CA16",regions=list(CSD="5915022"),geo_format='sf',level='DA')
Expand Down
3 changes: 2 additions & 1 deletion R/tongfen_us.R
Original file line number Diff line number Diff line change
Expand Up @@ -168,7 +168,8 @@ get_tongfen_us_census <- function(regions,meta,level='tract',survey="census",
m <- meta %>% filter(.data$dataset==ds)
year=as.numeric(gsub("dec", "", ds))
short_year <- substr(as.character(year),3,4)
tidycensus::get_decennial(geography=level, state=state, variables = m$variable, year = year,
tidycensus::get_decennial(geography=level, state=state, county=regions$county,
variables = m$variable, year = year,
geometry = base_geo==ds, output="wide") %>%
rename(!!paste0("GEOID",short_year):=.data$GEOID)
}) %>%
Expand Down
12 changes: 7 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
[![R-CMD-check](https://github.com/mountainMath/tongfen/workflows/R-CMD-check/badge.svg)](https://github.com/mountainMath/tongfen/actions)
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/tongfen)](https://cran.r-project.org/package=tongfen)
[![CRAN_Downloads_Badge](https://cranlogs.r-pkg.org/badges/tongfen)](https://cranlogs.r-pkg.org/badges/tongfen)
[![R-CMD-check](https://github.com/mountainMath/tongfen/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/mountainMath/tongfen/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->

<a href="https://mountainmath.github.io/tongfen/index.html"><img src="https://raw.githubusercontent.com/mountainMath/tongfen/master/images/tongfen-sticker.png" alt="tongfen logo" align="right" width = "25%" height = "25%"/></a>
Expand Down Expand Up @@ -84,17 +85,18 @@ Methods to facilitate this are still under active development.

If you wish to cite tongfen:

von Bergmann, J. (2021). tongfen: R package to
Make Data Based on Different Geographies Comparable. v0.3.3.
von Bergmann, J. (2024). tongfen: R package to
Make Data Based on Different Geographies Comparable. v0.3.6.


A BibTeX entry for LaTeX users is
```
@Manual{,
@Manual{tongfen,
author = {Jens {von Bergmann}},
title = {tongfen: R package to Make Data Based on Different Geographies Comparable},
year = {2021},
note = {R package version 0.3.3},
year = {2024},
doi = {10.32614/CRAN.package.tongfen},
note = {R package version 0.3.6},
url = {https://mountainmath.github.io/tongfen/},
}
```
Loading