-
Notifications
You must be signed in to change notification settings - Fork 1
Geocoding Cron Job (#76) #77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
* Add geocoding refresh cron job and related fields - Add route_archive_after field to admin_info table (default: 30 days) - Add geocoded_at field to locations table to track when address was geocoded - Set geocoded_at timestamp when creating locations with coordinates - Create daily cron job to refresh geocoding for locations older than threshold - Fix scheduler service to handle event loop isolation for async jobs - Add database migration for new fields The cron job runs daily at 2 AM and refreshes lat/lon for locations where geocoded_at is NULL or older than route_archive_after days from admin settings. * Update README to include Geocoding Update section and add screenshots - Added a new section detailing the automated geocoding refresh cron job. - Included features of the geocoding update, such as automatic refresh and manual trigger. - Added test data instructions and before/after screenshots to illustrate the geocoding process. * Resolved commits on PR * removed appropriate migrations. * Refactor geocoding refresh logic into helper functions - Introduced `_get_archive_threshold`, `_get_locations_to_refresh`, and `_refresh_locations` to modularize the geocoding refresh process. - Updated `refresh_geocoding` to utilize these new functions for improved readability and maintainability. - Removed redundant code and streamlined the process of fetching and refreshing location data.
Pull Request Review: Geocoding Refresh Cron JobThank you for this contribution! This is a well-structured implementation of an automated geocoding refresh system. Below is my detailed feedback: ✅ Strengths
🔴 Critical Issues1. Missing Database MigrationSeverity: CRITICAL The PR adds two new database fields but doesn't include an Alembic migration:
Impact: This will cause runtime errors in production when the code tries to access fields that don't exist in the database schema. Required Action: Create an Alembic migration file: cd backend/python
alembic revision -m "add_geocoding_fields"Then add upgrade/downgrade logic for both fields. Files to check: 2. Missing Seed Test Data FileSeverity: HIGH The PR description references a seed file for testing: docker-compose exec backend python -w /app app/seed_geocoding_test_data.pyHowever, this file is not included in the PR. The file Required Action: Either:
3. Timezone-Naive datetime.now()Severity: MEDIUM-HIGH Location:
location.geocoded_at = datetime.now() # Timezone-naive\!
cutoff_date = datetime.now() - timedelta(days=archive_threshold_days) # Timezone-naive\!Problem: Using
Recommended Fix: Use timezone-aware datetimes: from datetime import datetime, timezone
# Option 1: UTC (recommended for storage)
location.geocoded_at = datetime.now(timezone.utc)
# Option 2: Application timezone (if you need local time)
from zoneinfo import ZoneInfo
from app.config import settings
location.geocoded_at = datetime.now(ZoneInfo(settings.scheduler_timezone))Note: The scheduler correctly uses 🟡 Medium Priority Issues4. Redundant Database QueryLocation: async with async_session_maker_instance() as session:
admin_statement = select(Admin)
admin_result = await session.execute(admin_statement)
admin_record = admin_result.scalars().first() # ❌ Unused variable
archive_threshold_days = await _get_archive_threshold(session) # Queries Admin againThe code queries the Admin table twice:
Fix: Remove lines 100-102 entirely. 5. No Manual Trigger EndpointSeverity: MEDIUM The PR description mentions:
But no API endpoint is included in the PR. The job can only run on schedule. Recommendation: Add an endpoint in a routes file (e.g., @router.post("/geocoding/refresh")
async def trigger_geocoding_refresh():
await refresh_geocoding()
return {"status": "success"}6. Cron Schedule MismatchSeverity: LOW PR Description says: "runs daily at midnight EST" Actual implementation ( scheduler_service.add_cron_job(
refresh_geocoding,
job_id="daily_geocoding_refresh",
hour=2, # ❌ Runs at 2 AM, not midnight
minute=0,
)Fix: Either update the description or change 🟢 Minor Issues / Suggestions7. Missing Type HintLocation: async def _get_archive_threshold(session) -> int: # ❌ Missing type hint for 'session'Better: from sqlalchemy.ext.asyncio import AsyncSession
async def _get_archive_threshold(session: AsyncSession) -> int:Same applies to 8. Inconsistent Docstring FormattingThe docstring in """Refresh geocoding for locations - runs daily
This job:
- Gets the route_archive_after setting from admin_info (default 30 days)
- Finds all locations that need refreshing
- Refreshes lat/lon for those locations using Google Geocoding API
- Updates geocoded_at timestamp for refreshed locations
"""9. Google API Rate LimitingConsideration: Google Maps Geocoding API has rate limits (typically 50 requests/second). The current implementation uses Potential Issue: If there are 1000+ stale locations, this could hit rate limits. Suggestion (optional): Consider adding batch size limits or rate limiting: # Process in chunks of 50
for i in range(0, len(addresses), 50):
chunk = addresses[i:i+50]
results = await geocode_addresses(chunk)
# ... process resultsHowever, this may not be necessary if your location count is manageable. Just something to monitor. 10. Location.geocoded_at Not Set on Manual UpdatesThe 🧪 Testing ConcernsMissing TestsNo test files were added for:
Recommendation: Add tests covering:
📋 Summary of Required Changes
🎯 RecommendationStatus: This PR needs critical fixes before merging:
Once these are addressed, this will be a solid addition to the codebase. The overall architecture is sound and follows existing patterns well. Great work on implementing this feature! Let me know if you have questions about any of these points. Happy to help with the migration file or other fixes. 🚀 |
| from app.models import async_session_maker_instance | ||
| from app.models.admin import Admin | ||
| from app.models.location import Location | ||
| from app.utilities.geocoding import geocode_addresses |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use the app.utilities.google_maps_client instead of the raw functions here :)
Use app.utilities.google_maps_client instead of raw geocode_addresses function: - Import and use get_google_maps_client from dependencies - Update result handling to use GeocodeResult objects instead of dicts - Follow existing pattern used in location_service.py Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Code Review - PR #77: Geocoding Refresh Cron JobThank you for this contribution! This PR implements automated geocoding refresh functionality. I've conducted a thorough review and have feedback organized by priority. 🔴 Critical Issues1. Missing Database MigrationSeverity: BLOCKING The PR adds two new database fields but does not include an Alembic migration:
Impact: This will cause runtime errors when deployed, as the application code expects fields that don't exist in the database schema. Required Action: Create and include an Alembic migration: cd backend/python
alembic revision -m "add geocoded_at and route_archive_after fields"Example migration content: def upgrade() -> None:
op.add_column('locations', sa.Column('geocoded_at', sa.DateTime(), nullable=True))
op.add_column('admin', sa.Column('route_archive_after', sa.Integer(), nullable=False, server_default='30'))
def downgrade() -> None:
op.drop_column('locations', 'geocoded_at')
op.drop_column('admin', 'route_archive_after')2. Timezone-Naive datetime UsageSeverity: HIGH Locations:
Problem: Using
Note: I see Recommended Fix: from datetime import datetime, timezone
# In geocoding_refresh_jobs.py and location_service.py
location.geocoded_at = datetime.now(timezone.utc)
cutoff_date = datetime.now(timezone.utc) - timedelta(days=archive_threshold_days)Alternative: If you must use naive datetimes for database compatibility, at least be explicit about UTC: location.geocoded_at = datetime.now(timezone.utc).replace(tzinfo=None)3. Missing Seed Test FileSeverity: MEDIUM-HIGH The PR description references: docker-compose exec backend python -w /app app/seed_geocoding_test_data.pyHowever, Required Action: Either add the seed file or update the testing instructions to use an alternative approach. 🟡 Medium Priority Issues4. Redundant Database QueryLocation: async with async_session_maker_instance() as session:
admin_statement = select(Admin)
admin_result = await session.execute(admin_statement)
admin_record = admin_result.scalars().first() # ❌ Result never used
archive_threshold_days = await _get_archive_threshold(session) # Queries Admin againThe Admin table is queried twice: once on lines 100-103 (result unused), and again inside Fix: Remove lines 100-103 entirely. 5. Missing Type HintsLocations: async def _get_archive_threshold(session) -> int: # Missing type hint
async def _get_locations_to_refresh(session, cutoff_date) -> list[Location]: # Missing types
async def _refresh_locations(session, locations: list[Location]) -> int: # Missing session typeRecommended Fix: from sqlalchemy.ext.asyncio import AsyncSession
async def _get_archive_threshold(session: AsyncSession) -> int:
async def _get_locations_to_refresh(session: AsyncSession, cutoff_date: datetime) -> list[Location]:
async def _refresh_locations(session: AsyncSession, locations: list[Location]) -> int:6. Documentation InconsistencyLocation: PR description vs. code PR Description states: "runs daily at midnight EST" Actual code ( scheduler_service.add_cron_job(
refresh_geocoding,
job_id="daily_geocoding_refresh",
hour=2, # ❌ Runs at 2 AM, not midnight
minute=0,
)Fix: Update the PR description to say "2 AM EST" or change 7. Manual Trigger Endpoint Not ImplementedSeverity: LOW The PR description lists:
But no API endpoint is included. The job can only be triggered by the scheduler. Recommendation: Either:
🟢 Minor Issues & Suggestions8. Potential Google API Rate LimitingThe Google Maps Geocoding API has rate limits (typically 50 requests/second for standard plans). The current implementation processes all locations concurrently via Consideration: If you have 100+ stale locations, you might hit rate limits. Optional Enhancement (for future consideration): # Process in batches of 50
BATCH_SIZE = 50
for i in range(0, len(addresses), BATCH_SIZE):
batch = addresses[i:i+BATCH_SIZE]
results = await google_maps_client.geocode_addresses(batch)
# ... process resultsThis may not be urgent for current scale but worth monitoring. 9. geocoded_at Not Updated on Manual Coordinate ChangesThe Suggestion: Consider adding logic in 10. Missing TestsNo test files were added for the new functionality. Recommended Test Coverage:
Consider adding ✅ Strengths
📋 Summary & Recommendation
Overall Assessment: The implementation is architecturally sound and follows good patterns, but requires critical fixes before merging:
Once these issues are addressed, this will be a solid addition to the codebase. Happy to help with any questions or the migration file! |
…TQ/geocoding-refresh-cron-job
JIRA ticket link
Geocoding refresh cron job
Implementation description
Geocoding Update
The system includes an automated geocoding refresh cron job that updates location coordinates using the Google Maps Geocoding API.
Features
route_archive_aftersetting in admin_info (default: 30 days)Steps to test
Seed the database with test locations for geocoding:
docker-compose exec backend python -w /app app/seed_geocoding_test_data.pyBefore Update
Locations with NULL coordinates or stale geocoding data:
After Update
Locations with refreshed coordinates from Google Geocoding API:
What should reviewers focus on?
Checklist