Skip to content

Commit 9ee756a

Browse files
dshkolclaude
andcommitted
Add comprehensive R equivalence improvements and new functions
This major update significantly enhances pycancensus with new functions and comprehensive cross-validation against R cancensus: New Functions Added: - dataset_attribution(): Merge attribution text for multiple datasets - label_vectors(): Extract vector metadata from census DataFrames - get_intersecting_geometries(): Find regions intersecting geometries (framework ready) Enhanced Core Features: - Improved vector metadata extraction and storage in DataFrame.attrs - Better column name handling with automatic trimming - Enhanced CSV response processing with R-compatible formatting Comprehensive Testing Framework: - 15+ cross-validation tests comparing Python vs R results - New unit tests for all added functions with edge cases - Updated existing tests to reflect improvements - Real-time R execution bridge for live equivalence testing Cross-Validation Results: - dataset_attribution(): 100% equivalent with R cancensus - list_census_datasets(): 100% equivalent with R cancensus - list_census_vectors(): 100% equivalent with R cancensus - get_census(): Core functionality equivalent, Python includes extra metadata Documentation: - Added CLAUDE.md with comprehensive implementation guidance - Detailed cross-validation results and known differences - Function usage patterns and best practices This update establishes a robust foundation for continuous R equivalence validation and significantly improves the package's compatibility with the R cancensus ecosystem. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
1 parent b51fce7 commit 9ee756a

11 files changed

Lines changed: 1510 additions & 23 deletions

CLAUDE.md

Lines changed: 164 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,164 @@
1+
# CLAUDE.md
2+
3+
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
4+
5+
## Commands
6+
7+
### Development Setup
8+
```bash
9+
# Install for development with all dependencies
10+
pip install -e .[dev]
11+
12+
# Install with specific extras
13+
pip install -e .[docs] # Documentation dependencies
14+
pip install -e .[cross-validation] # R comparison tools
15+
```
16+
17+
### Testing
18+
```bash
19+
# Run all tests
20+
pytest
21+
22+
# Run with coverage
23+
pytest --cov=pycancensus --cov-report=xml
24+
25+
# Run specific test categories
26+
pytest tests/test_basic.py
27+
pytest tests/integration/
28+
pytest tests/performance/
29+
30+
# Run R cross-validation tests (requires R and rpy2)
31+
pytest tests/cross_validation/test_r_equivalence.py --ignore-pytest-ini
32+
```
33+
34+
### Code Quality
35+
```bash
36+
# Format code
37+
black pycancensus
38+
39+
# Check formatting without changing files
40+
black --check pycancensus
41+
42+
# Lint code
43+
flake8 pycancensus --count --select=E9,F63,F7,F82 --show-source --statistics
44+
flake8 pycancensus --count --exit-zero --max-complexity=10 --max-line-length=88 --statistics
45+
```
46+
47+
### Documentation
48+
```bash
49+
# Build documentation
50+
cd docs && make html
51+
52+
# Check for broken links
53+
cd docs && make linkcheck
54+
55+
# Clean and rebuild
56+
cd docs && make clean html
57+
```
58+
59+
### API Key Setup
60+
```bash
61+
# Set CensusMapper API key
62+
export CANCENSUS_API_KEY="your_api_key_here"
63+
```
64+
65+
## Architecture
66+
67+
### Core Design Principles
68+
- **R Compatibility**: Mirror the R cancensus library's function signatures and behavior
69+
- **Production Grade**: Enterprise-level error handling, retries, and rate limiting
70+
- **Analysis Ready**: Return pandas/GeoPandas DataFrames ready for analysis
71+
- **Performance**: Connection pooling, caching, and progress indicators for large operations
72+
73+
### Module Structure
74+
- `core.py`: Main `get_census()` function that retrieves census data
75+
- `datasets.py`: List available census datasets (CA21, CA16, etc.)
76+
- `regions.py`: List and search geographic regions
77+
- `vectors.py`: List, search, and find census variables
78+
- `hierarchy.py`: Navigate parent/child relationships between variables
79+
- `geometry.py`: Handle geographic data and spatial operations
80+
- `cache.py`: Caching system to minimize API calls
81+
- `settings.py`: API key and configuration management
82+
- `resilience.py`: Error handling, retry logic, rate limiting
83+
- `progress.py`: Progress bars for long-running operations
84+
- `utils.py`: Shared utility functions
85+
- `cli.py`: Command-line interface
86+
87+
### Key Technical Details
88+
89+
#### API Integration
90+
- All API calls go through `resilience.py` for error handling
91+
- Automatic retry with exponential backoff on failures
92+
- Rate limiting respects CensusMapper API constraints
93+
- Connection pooling for improved performance
94+
95+
#### Data Processing
96+
- Census data returned as pandas DataFrames
97+
- Geographic data returned as GeoPandas GeoDataFrames
98+
- Handles census-specific NA values correctly
99+
- Column naming matches R cancensus for compatibility
100+
101+
#### Caching Strategy
102+
- File-based caching in `~/.pycancensus/cache/`
103+
- Cache keys based on request parameters
104+
- Automatic cache invalidation after 30 days
105+
- Option to disable caching per request
106+
107+
#### Error Handling
108+
- Custom exception hierarchy in `resilience.py`
109+
- Helpful error messages with suggestions
110+
- Special handling for rate limits with retry-after headers
111+
- Connection error resilience with retry logic
112+
113+
### Testing Approach
114+
- Unit tests mock API responses for reliability
115+
- Integration tests use real API calls (requires API key)
116+
- Cross-validation tests ensure R equivalence
117+
- Performance tests verify handling of large datasets
118+
119+
### Important Patterns
120+
- Always check for existing API key before making requests
121+
- Use progress bars for operations over 100 items
122+
- Return consistent DataFrame structures across all functions
123+
- Maintain exact R function signatures for compatibility
124+
125+
## New Functions Added
126+
127+
### dataset_attribution()
128+
- **Location**: `pycancensus/datasets.py`
129+
- **Purpose**: Get combined attribution text for multiple datasets, merging similar attributions that only differ by year
130+
- **Usage**: `pc.dataset_attribution(['CA16', 'CA21'])`
131+
- **Testing**: Comprehensive cross-validation with R cancensus shows perfect equivalence
132+
133+
### label_vectors()
134+
- **Location**: `pycancensus/vectors.py`
135+
- **Purpose**: Extract census vector metadata from DataFrames returned by get_census()
136+
- **Usage**: `pc.label_vectors(census_data)` where census_data was retrieved with vectors
137+
- **Implementation**: Stores metadata in DataFrame.attrs['census_vectors'] attribute
138+
- **Testing**: Works with both regular and GeoDataFrames, handles short/detailed labels
139+
140+
### get_intersecting_geometries() (Partial Implementation)
141+
- **Location**: `pycancensus/intersect_geometry.py`
142+
- **Purpose**: Find census regions that intersect with given geometries
143+
- **Status**: Implementation complete but API endpoint requires premium access
144+
- **Note**: Function framework ready for when API access is available
145+
146+
## Cross-Validation Results
147+
148+
The comprehensive cross-validation test suite reveals:
149+
150+
### ✅ Perfect Equivalence
151+
- `list_census_datasets()`: 100% equivalent with R
152+
- `dataset_attribution()`: 100% equivalent with R
153+
- `list_census_vectors()`: 100% equivalent with R
154+
155+
### ⚠️ Known Differences
156+
- `search_census_vectors()`: Python returns more results (broader search)
157+
- `get_census()`: Python includes additional metadata columns
158+
- `list_census_regions()`: API endpoint differences between implementations
159+
160+
### Test Coverage
161+
- **15 comprehensive cross-validation tests** covering major functions
162+
- **Unit tests** for new functions with edge cases
163+
- **Integration tests** ensuring functions work together
164+
- **Performance tests** for large datasets

pycancensus/__init__.py

Lines changed: 6 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -11,8 +11,8 @@
1111

1212
from .core import get_census
1313
from .regions import list_census_regions, search_census_regions
14-
from .vectors import list_census_vectors, search_census_vectors
15-
from .datasets import list_census_datasets
14+
from .vectors import list_census_vectors, search_census_vectors, label_vectors
15+
from .datasets import list_census_datasets, dataset_attribution
1616
from .settings import (
1717
set_api_key,
1818
get_api_key,
@@ -24,14 +24,17 @@
2424
from .geometry import get_census_geometry
2525
from .cache import list_cache, remove_from_cache, clear_cache
2626
from .hierarchy import parent_census_vectors, child_census_vectors, find_census_vectors
27+
from .intersect_geometry import get_intersecting_geometries
2728

2829
__all__ = [
2930
"get_census",
3031
"list_census_regions",
3132
"search_census_regions",
3233
"list_census_vectors",
3334
"search_census_vectors",
35+
"label_vectors",
3436
"list_census_datasets",
37+
"dataset_attribution",
3538
"set_api_key",
3639
"get_api_key",
3740
"remove_api_key",
@@ -45,4 +48,5 @@
4548
"parent_census_vectors",
4649
"child_census_vectors",
4750
"find_census_vectors",
51+
"get_intersecting_geometries",
4852
]

pycancensus/core.py

Lines changed: 59 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -310,6 +310,59 @@ def _generate_cache_key(dataset, regions, vectors, level, geo_format):
310310
return hashlib.md5(params_str.encode()).hexdigest()
311311

312312

313+
def _extract_vector_metadata(df, vectors, labels):
314+
"""Extract vector metadata from column names and store as attribute."""
315+
if not vectors:
316+
return df
317+
318+
# Find vector columns - they have format "v_DATASET_NUM: Description"
319+
vector_cols = [col for col in df.columns if col.startswith("v_")]
320+
321+
if not vector_cols:
322+
return df
323+
324+
# Build metadata DataFrame
325+
metadata_rows = []
326+
rename_dict = {}
327+
328+
for col in vector_cols:
329+
if ": " in col:
330+
# Column has format "v_CA21_1: Total - Population"
331+
parts = col.split(": ", 1)
332+
vector_code = parts[0]
333+
detail = parts[1] if len(parts) > 1 else ""
334+
335+
metadata_rows.append({
336+
"Vector": vector_code,
337+
"Detail": detail
338+
})
339+
340+
# For short labels, rename column to just the vector code
341+
if labels == "short":
342+
rename_dict[col] = vector_code
343+
else:
344+
# Column is already just the vector code
345+
vector_code = col
346+
# Try to get detail from vector list if available
347+
metadata_rows.append({
348+
"Vector": vector_code,
349+
"Detail": ""
350+
})
351+
352+
# Create metadata DataFrame
353+
if metadata_rows:
354+
metadata_df = pd.DataFrame(metadata_rows)
355+
356+
# Rename columns if using short labels
357+
if rename_dict:
358+
df = df.rename(columns=rename_dict)
359+
360+
# Store metadata as attribute (always store, but mainly useful with short labels)
361+
df.attrs['census_vectors'] = metadata_df
362+
363+
return df
364+
365+
313366
def _process_csv_response(csv_text, vectors, labels):
314367
"""Process CSV API response into a pandas DataFrame."""
315368
import io
@@ -370,8 +423,8 @@ def _process_csv_response(csv_text, vectors, labels):
370423
df[actual_col] = df[actual_col].astype("category")
371424
break
372425

373-
# TODO: Add label processing based on labels parameter
374-
# TODO: Add vector name mapping
426+
# Extract vector metadata and handle labels
427+
df = _extract_vector_metadata(df, vectors, labels)
375428

376429
return df
377430

@@ -383,8 +436,8 @@ def _process_json_response(data, vectors, labels):
383436

384437
df = pd.DataFrame(data["data"])
385438

386-
# TODO: Add label processing based on labels parameter
387-
# TODO: Add vector name mapping
439+
# Extract vector metadata and handle labels
440+
df = _extract_vector_metadata(df, vectors, labels)
388441

389442
return df
390443

@@ -459,7 +512,7 @@ def _process_geojson_response(data, vectors, labels):
459512
gdf[actual_col] = gdf[actual_col].astype("category")
460513
break
461514

462-
# TODO: Add label processing based on labels parameter
463-
# TODO: Add vector name mapping
515+
# Extract vector metadata and handle labels
516+
gdf = _extract_vector_metadata(gdf, vectors, labels)
464517

465518
return gdf

pycancensus/datasets.py

Lines changed: 81 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -134,3 +134,84 @@ def get_dataset_attribution(dataset: str) -> str:
134134
)
135135

136136
return attribution
137+
138+
139+
def dataset_attribution(datasets):
140+
"""
141+
Get combined attribution text for multiple datasets.
142+
143+
This function combines attribution text for multiple datasets, merging
144+
similar attributions that only differ by year.
145+
146+
Parameters
147+
----------
148+
datasets : list of str
149+
List of dataset identifiers (e.g., ['CA06', 'CA16']).
150+
151+
Returns
152+
-------
153+
list of str
154+
List of attribution strings, with similar attributions merged.
155+
156+
Examples
157+
--------
158+
>>> import pycancensus as pc
159+
>>> # Get attribution for multiple census years
160+
>>> attributions = pc.dataset_attribution(['CA06', 'CA16'])
161+
>>> for attr in attributions:
162+
... print(attr)
163+
"""
164+
import re
165+
166+
# Get all datasets info
167+
datasets_df = list_census_datasets(quiet=True)
168+
169+
# Filter for requested datasets
170+
datasets = [d.upper() for d in datasets]
171+
dataset_rows = datasets_df[datasets_df["dataset"].isin(datasets)]
172+
173+
if len(dataset_rows) == 0:
174+
raise ValueError(f"No valid datasets found in {datasets}")
175+
176+
# Get attribution texts
177+
attributions = dataset_rows["attribution"].tolist()
178+
179+
# Group similar attributions that differ only by year
180+
# Create a mapping of pattern to actual attributions
181+
pattern_map = {}
182+
183+
for attr in attributions:
184+
# Replace 4-digit years with placeholder to create pattern
185+
pattern = re.sub(r'\d{4}', '{{YEAR}}', attr)
186+
187+
if pattern not in pattern_map:
188+
pattern_map[pattern] = []
189+
pattern_map[pattern].append(attr)
190+
191+
# For each pattern, merge the years
192+
result = []
193+
for pattern, attr_list in pattern_map.items():
194+
if len(attr_list) == 1:
195+
# Only one attribution with this pattern
196+
result.append(attr_list[0])
197+
else:
198+
# Multiple attributions with same pattern - merge years
199+
# Extract all years from the attributions
200+
all_years = []
201+
for attr in attr_list:
202+
years = re.findall(r'\d{4}', attr)
203+
all_years.extend(years)
204+
205+
# Remove duplicates and sort
206+
unique_years = sorted(list(set(all_years)))
207+
208+
# Replace {{YEAR}} placeholder with merged years
209+
if len(unique_years) > 0:
210+
year_string = ', '.join(unique_years)
211+
merged = pattern.replace('{{YEAR}}', year_string)
212+
result.append(merged)
213+
else:
214+
# No years found, just use first attribution
215+
result.append(attr_list[0])
216+
217+
return result

0 commit comments

Comments
 (0)