
An error occurred while fitting the spatial model: Not yet able to subset general weights lists: SAC SAR LM and lw matrixes #54

Description

@Sixto233

I've been working on fitting a spatial autoregressive model (sacsarlm) to property price data in Córdoba, Argentina. The model runs smoothly on smaller datasets, but I've encountered an error when using larger samples, specifically a 4K-observation dataset.

Here's a breakdown of what happened, what I suspect the issue is, and some potential fixes.

The Datasets
For context, I’m working with three versions of the Cordoba dataset:

Cordoba Full Dataset: The complete dataset with all observations.
Cordoba Non-Working Sample (4K observations): A subset that consistently throws an error.
Cordoba Working Sample (1.5K observations): A smaller subset that runs without issues.

Here are the versions: https://drive.google.com/drive/folders/1VoJ-Rbb9uwh68q-lHQajh7qubeD9olAH?usp=sharing

str() of Cordoba

Geometry:

Type: sfc_POINT
Description: Each entry in the dataset is represented by a point geometry, capturing the precise location of properties in Córdoba.
CRS: WGS 84

The Error
The issue arises when I attempt to fit the sacsarlm model with the 4K-observation subset.

The libraries loaded for this code:

library("tidyverse")     # For data manipulation and visualization
library("ggplot2")       # For advanced plotting
library("sf")            # For handling spatial data
library("spatialreg")    # For spatial regression models
library("spdep")         # For spatial weights and dependencies
library("car")           # For Variance Inflation Factor (VIF) and other diagnostics
library("lmtest")        # For diagnostic tests on linear models
library("mapview")       # For interactive mapping and visualization

The function below first fits a linear regression, then builds a spatial weights list from a 0–500 neighbour catchment area, weighting neighbours by inverse distance, and finally passes everything to sacsarlm.
The error appears on the call sac <- sacsarlm(model, listw = lw, data = data, zero.policy = TRUE).
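One thing worth flagging about the 0–500 catchment: the points are in WGS 84, so `st_coordinates()` returns longitude/latitude in degrees, and `dnearneigh()` on a plain coordinate matrix interprets the distance band in those same units unless `longlat = TRUE` is set. A hedged sketch of projecting first, so that 500 unambiguously means 500 metres (EPSG:5346, POSGAR 2007 / Argentina 4, is an assumed choice for Córdoba; any local projected CRS would do):

```r
# Sketch, not part of the original code: project before building neighbours
# so the dnearneigh() distance band is in metres rather than degrees.
library(sf)
library(spdep)

data_proj <- st_transform(data, 5346)   # assumed CRS: POSGAR 2007 / Argentina 4
cord <- st_coordinates(data_proj)
d <- dnearneigh(cord, 0, 500)           # 500 now means 500 m in the projected CRS
```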

fit_spatial_model <- function(data, has_river = FALSE, has_coastline = FALSE) {
    # Initialize the base formula with the main predictors for the model.
    base_formula <- "log(pm2) ~ property_type + log(surface_total) + log(dist_hospital) + 
    log(dist_school) + log(dist_university) + 
    log(dist_parque) + count_restaurants + 
    nearest_park_area + log(dist_alcaldia) + log(dist_alcaldia):before_after_covid + 
    before_after_covid:log(dist_parque) + before_after_covid:count_restaurants + 
    before_after_covid:nearest_park_area"
  
    # If the city has a river, add the distance to the river as a predictor in the formula.
    if (has_river) {
        base_formula <- paste(base_formula, "+ log(dist_rio)")
    }
  
    # If the city has a coastline, add the distance to the coast as a predictor in the formula.
    if (has_coastline) {
        base_formula <- paste(base_formula, "+ log(dist_costa)")
    }
    
    # Convert the base formula (string) to an actual formula object in R.
    formula_obj <- as.formula(base_formula)
  
    # Fit a linear model using the specified formula and dataset.
    model <- lm(formula = formula_obj, data = data)
    
    # Print the summary of the linear model, which includes coefficients, standard errors, etc.
    print(summary(model))
  
    # Calculate Variance Inflation Factor (VIF) values to check for multicollinearity among predictors.
    vif_values <- vif(model)
    
    # Print the VIF values to identify potential multicollinearity issues.
    print(vif_values)
  
    # Extract the coordinates from the spatial data's geometry column.
    cord <- st_coordinates(data$geometry)
    
    # Create a neighborhood structure (spatial neighbors) for points within 500 units of distance.
    d <- dnearneigh(cord, 0, 500)
    
    # Calculate distances between neighboring points.
    dlist <- nbdists(d, cord)  # cord is already a coordinate matrix, so no coordinates() call is needed
    
    # Create inverse distance weights for each neighbor relationship.
    idlist <- lapply(dlist, function(x) 1/x)
  
    # Convert the neighborhood structure and weights into a spatial weights list object.
    lw <- nb2listw(d, glist=idlist, style="W", zero.policy = TRUE)
    
    # Fit a SAC/SARAR spatial autoregressive model using the linear model and the spatial weights.
    sac <- sacsarlm(model, listw = lw, data = data, zero.policy = TRUE)
  
    # Generate a summary of the SAR model, including the Nagelkerke pseudo R-squared.
    sac_summary <- summary(sac, Nagelkerke = TRUE)
    
    # Print the summary of the SAR model to review model performance and diagnostics.
    print(sac_summary)
  
    # Calculate and print the impacts (direct, indirect, total) of the SAR model.
    impacts <- impacts(sac, listw = lw, zero.policy = TRUE)
    print(impacts)
  
    # Return a list containing summaries of the linear model, SAR model, and impacts.
    return(list(model_summary = summary(model), sac_summary = sac_summary, impacts = impacts))
}
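Before calling `sacsarlm()`, it may be worth checking whether any points have no neighbours at all within the 500-unit band: with `zero.policy = TRUE`, `sacsarlm()` tries to subset the weights list for those observations, and `subset.listw()` refuses general (glist-based) weights. A sketch, assuming the same `data` and neighbour definition as above:

```r
# Sketch: detect and drop isolated points (no neighbours within the band)
# before building the inverse-distance weights, so sacsarlm() never needs
# to subset the general weights list.
library(sf)
library(spdep)

cord <- st_coordinates(data$geometry)
d <- dnearneigh(cord, 0, 500)
isolates <- which(card(d) == 0)       # card() counts neighbours per point
if (length(isolates) > 0) {
  data <- data[-isolates, ]           # drop isolates, then rebuild neighbours
  cord <- st_coordinates(data$geometry)
  d <- dnearneigh(cord, 0, 500)
}
```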

The error message I get is:

Error in subset.listw(listw, subset, zero.policy = zero.policy) :
  Not yet able to subset general weights lists
5: stop("Not yet able to subset general weights lists")
4: subset.listw(listw, subset, zero.policy = zero.policy)
3: subset(listw, subset, zero.policy = zero.policy)
2: sacsarlm(model, listw = lw, data = data, zero.policy = TRUE)
1: fit_spatial_model(Cordoba, has_river = TRUE)
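For what it's worth, the stop is specifically about *general* weights: `subset.listw()` can handle plain binary or row-standardised weights, but not ones built with a `glist`. A hedged workaround sketch, which trades away the inverse-distance weighting:

```r
# Workaround sketch: build the listw without the inverse-distance glist.
# Row-standardised binary weights can be subset, so the call no longer
# stops, at the cost of losing the 1/distance weighting of neighbours.
lw_bin <- nb2listw(d, style = "W", zero.policy = TRUE)
sac <- sacsarlm(model, listw = lw_bin, data = data, zero.policy = TRUE)
```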

Cordoba Full is itself created by subsetting a bigger, Argentina-wide dataset, as follows.

analyze_location <- function(lat, lon, dataset, distance_threshold = 55000, sample_size = 4000, jitter_amount = 2 / 111320, pm2_min = 1, pm2_max = 8000, Coastline1 = FALSE) {
  
  # Create Central Business District (CBD) point
  location_cbd <- st_as_sf(data.frame(lat = lat, lon = lon), coords = c("lon", "lat"), crs = 4326)
  
  # Filter dataset by distance from CBD
  location_data <- dataset %>%
    mutate(dist_location = as.numeric(st_distance(geometry, location_cbd))) %>%
    filter(dist_location < distance_threshold)
  
  # Randomly sample the filtered data
  location_sample <- sample_n(location_data, sample_size)
  
  # Calculate bounding box with buffer for the sampled data
  location_sample_bb <- st_bbox(st_buffer(st_as_sf(st_as_sfc(st_bbox(location_sample))), 3000))
  
  # Calculate price per square meter (pm2)
  location_sample <- location_sample %>%
    mutate(price = as.numeric(price), 
           surface_covered = as.numeric(surface_covered),
           surface_total = as.numeric(surface_total)) %>%
    filter(property_type %in% c("Casa", "Lote", "Departamento")) %>%
    mutate(pm2 = case_when(
      property_type %in% c("Casa", "Departamento") ~ price / if_else(is.na(surface_covered) | surface_covered == 0, surface_total, surface_covered),
      property_type == "Lote" ~ price / surface_total,
      TRUE ~ NA_real_)) %>%
    distinct(geometry, pm2, .keep_all = TRUE)
  
  # Calculate distances and apply jitter
  location_sample <- calculadora(location_sample, alcaldia = location_cbd, Coastline = Coastline1) %>%
    st_jitter(amount = jitter_amount)
  
  # Filter by pm2 range and non-empty surface_total
  location_sample <- location_sample %>%
    filter(pm2 > pm2_min, pm2 < pm2_max, surface_total != "")
  
  return(location_sample)
}

# Example usage:
Cordoba <- analyze_location(lon = -64.188441, lat = -31.419912, dataset = Argentina, jitter_amount = 2 / 111320, sample_size = 1500)

One can then make a smaller Cordoba subset by replacing dataset = Argentina with dataset = Cordoba:

Cordoba_1500 <- analyze_location(lon = -64.188441, lat = -31.419912, dataset = Cordoba, jitter_amount = 2 / 111320, sample_size = 1500)

Key steps
Removed NAs: Filters remove rows with NA values in pm2, surface_total, and geometry, which would otherwise break the spatial model fitting.
Jittering: Jitter is applied to the coordinates so that duplicate points do not produce zero distances, and hence infinite values, in the inverse-distance weight matrices.
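The jitter step matters for the inverse-distance weights in particular: two listings at identical coordinates give a neighbour distance of zero, and 1/0 is Inf. A minimal base-R illustration:

```r
# Duplicate coordinates give a zero neighbour distance; inverse-distance
# weighting then produces Inf, which corrupts the weights matrix.
# Jittering the points by a couple of metres avoids the zero distance.
dists <- c(0, 120, 350)   # distances to three neighbours; 0 = duplicate point
weights <- 1 / dists
weights                    # the Inf in the first slot is the problem
```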

So the pipeline is as follows: first analyze the location, then run the regression:

Cordoba <- analyze_location(lon = -64.188441, lat = -31.419912, dataset = Cordoba_Full, jitter_amount = 2 / 111320, sample_size = 1500)
Cordoba_spatial_Model <- fit_spatial_model(Cordoba, has_river = TRUE, has_coastline = FALSE)

If anybody could help, I would be really thankful.
