Conversation


@tqmsh tqmsh commented Jan 12, 2026

JIRA ticket link

Geocoding refresh cron job

Implementation description

  • Geocoding Update

The system includes an automated geocoding refresh cron job that updates location coordinates using the Google Maps Geocoding API.

Features

  • Automatic refresh of location coordinates that are NULL or older than 30 days (see the selection sketch after this list)
  • Respects route_archive_after setting in admin_info (default: 30 days)
  • Scheduled job that runs daily at midnight EST
  • Manual trigger via API endpoint
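
A rough sketch of the selection criteria behind the first two bullets, assuming SQLAlchemy and the Location.geocoded_at field added in this PR (the helper name and timezone handling here are illustrative only; the PR's actual helper may differ):

from datetime import datetime, timedelta, timezone

from sqlalchemy import or_, select

from app.models.location import Location


def stale_locations_query(archive_threshold_days: int):
    """Select locations that were never geocoded or whose geocoding is older than the threshold."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=archive_threshold_days)
    return select(Location).where(
        or_(Location.geocoded_at.is_(None), Location.geocoded_at < cutoff)
    )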

Steps to test

Seed the database with test locations for geocoding:

docker-compose exec -w /app backend python app/seed_geocoding_test_data.py

Before Update

Locations with NULL coordinates or stale geocoding data:

[screenshot: before_test]

After Update

Locations with refreshed coordinates from Google Geocoding API:

[screenshot: after_test]

What should reviewers focus on?

Checklist

  • My PR name is descriptive and in imperative tense
  • My commit messages are descriptive and in imperative tense. My commits are atomic and trivial commits are squashed or fixup'd into non-trivial commits
  • I have requested a review from the PL, as well as other devs who have background knowledge on this PR or who will be building on top of this PR

* Add geocoding refresh cron job and related fields

- Add route_archive_after field to admin_info table (default: 30 days)
- Add geocoded_at field to locations table to track when address was geocoded
- Set geocoded_at timestamp when creating locations with coordinates
- Create daily cron job to refresh geocoding for locations older than threshold
- Fix scheduler service to handle event loop isolation for async jobs
- Add database migration for new fields

The cron job runs daily at 2 AM and refreshes lat/lon for locations where
geocoded_at is NULL or older than route_archive_after days from admin settings.

* Update README to include Geocoding Update section and add screenshots

- Added a new section detailing the automated geocoding refresh cron job.
- Included features of the geocoding update, such as automatic refresh and manual trigger.
- Added test data instructions and before/after screenshots to illustrate the geocoding process.

* Resolve review comments on PR

* Remove appropriate migrations

* Refactor geocoding refresh logic into helper functions

- Introduced `_get_archive_threshold`, `_get_locations_to_refresh`, and `_refresh_locations` to modularize the geocoding refresh process.
- Updated `refresh_geocoding` to utilize these new functions for improved readability and maintainability.
- Removed redundant code and streamlined the process of fetching and refreshing location data.
@tqmsh tqmsh requested review from PranavGopinath and ludavidca and removed request for PranavGopinath January 12, 2026 22:45

claude bot commented Jan 12, 2026

Pull Request Review: Geocoding Refresh Cron Job

Thank you for this contribution! This is a well-structured implementation of an automated geocoding refresh system. Below is my detailed feedback:


✅ Strengths

  1. Good code structure: The job is well-organized with clear separation of concerns using helper functions
  2. Proper error handling: Exception handling with logging is in place
  3. Consistent with codebase patterns: Follows the same structure as existing jobs (driver_history_jobs.py, email_reminder_jobs.py)
  4. Batch processing: Uses geocode_addresses for efficient bulk geocoding
  5. Configurable threshold: Respects route_archive_after setting from admin_info

🔴 Critical Issues

1. Missing Database Migration

Severity: CRITICAL

The PR adds two new database fields but doesn't include an Alembic migration:

  • Location.geocoded_at (datetime field)
  • Admin.route_archive_after (int field with default=30)

Impact: This will cause runtime errors in production when the code tries to access fields that don't exist in the database schema.

Required Action: Create an Alembic migration file:

cd backend/python
alembic revision -m "add_geocoding_fields"

Then add upgrade/downgrade logic for both fields.

Files to check: backend/python/migrations/versions/ - no migration file exists for these schema changes.


2. Missing Seed Test Data File

Severity: HIGH

The PR description references a seed file for testing:

docker-compose exec -w /app backend python app/seed_geocoding_test_data.py

However, this file is not included in the PR. The file backend/python/app/seed_geocoding_test_data.py does not exist.

Required Action: Either:

  • Add the seed file to the PR, or
  • Update the testing instructions to use a different method

3. Timezone-Naive datetime.now()

Severity: MEDIUM-HIGH

Location:

  • backend/python/app/services/jobs/geocoding_refresh_jobs.py:67
  • backend/python/app/services/jobs/geocoding_refresh_jobs.py:105
  • backend/python/app/services/implementations/location_service.py:77
location.geocoded_at = datetime.now()  # Timezone-naive!
cutoff_date = datetime.now() - timedelta(days=archive_threshold_days)  # Timezone-naive!

Problem: Using datetime.now() creates timezone-naive datetime objects. This can cause issues with:

  • Daylight saving time transitions
  • Consistency across different server timezones
  • Comparison with timezone-aware datetime objects from other sources

Recommended Fix: Use timezone-aware datetimes:

from datetime import datetime, timezone

# Option 1: UTC (recommended for storage)
location.geocoded_at = datetime.now(timezone.utc)

# Option 2: Application timezone (if you need local time)
from zoneinfo import ZoneInfo
from app.config import settings
location.geocoded_at = datetime.now(ZoneInfo(settings.scheduler_timezone))

Note: The scheduler correctly uses America/New_York timezone (config.py:61), but the datetime comparisons don't leverage this.


🟡 Medium Priority Issues

4. Redundant Database Query

Location: backend/python/app/services/jobs/geocoding_refresh_jobs.py:100-102

async with async_session_maker_instance() as session:
    admin_statement = select(Admin)
    admin_result = await session.execute(admin_statement)
    admin_record = admin_result.scalars().first()  # ❌ Unused variable

    archive_threshold_days = await _get_archive_threshold(session)  # Queries Admin again

The code queries the Admin table twice:

  1. Lines 100-102 (result unused)
  2. Inside _get_archive_threshold() (lines 23-25)

Fix: Remove lines 100-102 entirely.
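
With the redundant query removed, the top of refresh_geocoding() would reduce to something like this (sketch only; it also applies the timezone-aware fix recommended in issue 3):

async with async_session_maker_instance() as session:
    archive_threshold_days = await _get_archive_threshold(session)
    cutoff_date = datetime.now(timezone.utc) - timedelta(days=archive_threshold_days)
    locations = await _get_locations_to_refresh(session, cutoff_date)
    refreshed_count = await _refresh_locations(session, locations)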


5. No Manual Trigger Endpoint

Severity: MEDIUM

The PR description mentions:

Manual trigger via API endpoint

But no API endpoint is included in the PR. The job can only run on schedule.

Recommendation: Add an endpoint in a routes file (e.g., admin_routes.py):

@router.post("/geocoding/refresh")
async def trigger_geocoding_refresh():
    await refresh_geocoding()
    return {"status": "success"}

6. Cron Schedule Mismatch

Severity: LOW

PR Description says: "runs daily at midnight EST"

Actual implementation (backend/python/app/services/jobs/__init__.py:35-40):

scheduler_service.add_cron_job(
    refresh_geocoding,
    job_id="daily_geocoding_refresh",
    hour=2,  # ❌ Runs at 2 AM, not midnight
    minute=0,
)

Fix: Either update the description or change hour=2 to hour=0.


🟢 Minor Issues / Suggestions

7. Missing Type Hint

Location: backend/python/app/services/jobs/geocoding_refresh_jobs.py:21

async def _get_archive_threshold(session) -> int:  # ❌ Missing type hint for 'session'

Better:

from sqlalchemy.ext.asyncio import AsyncSession

async def _get_archive_threshold(session: AsyncSession) -> int:

Same applies to _get_locations_to_refresh (line 37) and _refresh_locations (line 52).


8. Inconsistent Docstring Formatting

The docstring in refresh_geocoding() uses numbered list format, while other jobs use bullet points. For consistency with driver_history_jobs.py:

"""Refresh geocoding for locations - runs daily

This job:
- Gets the route_archive_after setting from admin_info (default 30 days)
- Finds all locations that need refreshing
- Refreshes lat/lon for those locations using Google Geocoding API
- Updates geocoded_at timestamp for refreshed locations
"""

9. Google API Rate Limiting

Consideration: Google Maps Geocoding API has rate limits (typically 50 requests/second). The current implementation uses asyncio.gather(*tasks) which fires all requests concurrently.

Potential Issue: If there are 1000+ stale locations, this could hit rate limits.

Suggestion (optional): Consider adding batch size limits or rate limiting:

# Process in chunks of 50
for i in range(0, len(addresses), 50):
    chunk = addresses[i:i+50]
    results = await geocode_addresses(chunk)
    # ... process results

However, this may not be necessary if your location count is manageable. Just something to monitor.


10. Location.geocoded_at Not Set on Manual Updates

The LocationUpdate model includes geocoded_at, but there's no validation ensuring it's updated when lat/long are manually changed. Consider adding logic to update geocoded_at automatically when coordinates are modified through the API.
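
A minimal sketch of what that could look like inside the location update path (assuming updates arrive as a dict of changed fields; the lat/lon field names are taken from the "lat/long" wording above and may differ in the model):

from datetime import datetime, timezone


def apply_location_update(location, updates: dict) -> None:
    """Apply field updates and bump geocoded_at whenever coordinates change."""
    coordinates_changed = any(
        field in updates and updates[field] != getattr(location, field)
        for field in ("lat", "lon")  # field names assumed
    )
    for field, value in updates.items():
        setattr(location, field, value)
    if coordinates_changed:
        # Treat manually supplied coordinates as freshly geocoded
        location.geocoded_at = datetime.now(timezone.utc)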


🧪 Testing Concerns

Missing Tests

No test files were added for:

  • geocoding_refresh_jobs.py functionality
  • New database fields
  • Scheduler integration

Recommendation: Add tests covering:

  1. Locations with NULL geocoded_at are selected
  2. Locations older than threshold are selected
  3. Recently geocoded locations are skipped
  4. Failed geocoding doesn't crash the job
  5. Admin threshold setting is respected

📋 Summary of Required Changes

| Priority | Item | Status |
| --- | --- | --- |
| 🔴 CRITICAL | Add Alembic database migration | ❌ Missing |
| 🔴 HIGH | Add/fix seed_geocoding_test_data.py | ❌ Missing |
| 🔴 MEDIUM-HIGH | Use timezone-aware datetime.now() | ⚠️ Needs fix |
| 🟡 MEDIUM | Remove redundant Admin query | ⚠️ Code smell |
| 🟡 MEDIUM | Add manual trigger endpoint (or remove from description) | ⚠️ Incomplete |
| 🟡 LOW | Fix cron schedule documentation mismatch | ⚠️ Incorrect docs |
| 🟢 MINOR | Add type hints for session parameters | ✅ Nice to have |
| 🟢 MINOR | Consistent docstring formatting | ✅ Nice to have |

🎯 Recommendation

Status: ⚠️ REQUEST CHANGES

This PR needs critical fixes before merging:

  1. Add database migration (blocking)
  2. Fix or add seed test file (blocking)
  3. Fix timezone-naive datetime usage (strongly recommended)

Once these are addressed, this will be a solid addition to the codebase. The overall architecture is sound and follows existing patterns well.


Great work on implementing this feature! Let me know if you have questions about any of these points. Happy to help with the migration file or other fixes. 🚀

from app.models import async_session_maker_instance
from app.models.admin import Admin
from app.models.location import Location
from app.utilities.geocoding import geocode_addresses

use the app.utilities.google_maps_client instead of the raw functions here :)
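
For illustration, the requested change might look roughly like this (the import path for get_google_maps_client and the GeocodeResult attribute names are assumptions beyond what this comment and the follow-up commit state):

from datetime import datetime, timezone

from app.dependencies import get_google_maps_client  # import path assumed ("from dependencies" per the commit below)


async def _refresh_locations(session, locations):
    client = get_google_maps_client()
    addresses = [location.address for location in locations]  # address field name assumed
    results = await client.geocode_addresses(addresses)
    refreshed = 0
    for location, result in zip(locations, results):
        if result is None:
            continue
        location.lat, location.lon = result.lat, result.lon  # GeocodeResult attributes assumed
        location.geocoded_at = datetime.now(timezone.utc)
        refreshed += 1
    return refreshed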

Use app.utilities.google_maps_client instead of raw geocode_addresses function:
- Import and use get_google_maps_client from dependencies
- Update result handling to use GeocodeResult objects instead of dicts
- Follow existing pattern used in location_service.py

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

claude bot commented Jan 17, 2026

Code Review - PR #77: Geocoding Refresh Cron Job

Thank you for this contribution! This PR implements automated geocoding refresh functionality. I've conducted a thorough review and have feedback organized by priority.


🔴 Critical Issues

1. Missing Database Migration

Severity: BLOCKING

The PR adds two new database fields but does not include an Alembic migration:

  • Location.geocoded_at (datetime, nullable)
  • Admin.route_archive_after (int, default=30, non-nullable)

Impact: This will cause runtime errors when deployed, as the application code expects fields that don't exist in the database schema.

Required Action: Create and include an Alembic migration:

cd backend/python
alembic revision -m "add geocoded_at and route_archive_after fields"

Example migration content:

from alembic import op
import sqlalchemy as sa


def upgrade() -> None:
    op.add_column('locations', sa.Column('geocoded_at', sa.DateTime(), nullable=True))
    op.add_column('admin', sa.Column('route_archive_after', sa.Integer(), nullable=False, server_default='30'))


def downgrade() -> None:
    op.drop_column('locations', 'geocoded_at')
    op.drop_column('admin', 'route_archive_after')
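
After generating the revision and filling in the upgrade/downgrade logic, applying it locally follows the usual Alembic flow (assuming the project's standard Alembic setup):

cd backend/python
alembic upgrade head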

2. Timezone-Naive datetime Usage

Severity: HIGH

Locations:

  • geocoding_refresh_jobs.py:68 - location.geocoded_at = datetime.now()
  • geocoding_refresh_jobs.py:107 - cutoff_date = datetime.now() - timedelta(...)
  • location_service.py:77 - geocoded_at=datetime.now()

Problem: Using datetime.now() creates timezone-naive datetime objects, which can cause:

  • Incorrect comparisons during DST transitions
  • Inconsistencies if the server timezone changes
  • Subtle bugs when comparing with timezone-aware datetimes

Note: I see job_service.py:28-29 has a utc_now_naive() helper that uses datetime.now(timezone.utc).replace(tzinfo=None). While this provides UTC consistency, the codebase should ideally move toward timezone-aware datetimes.

Recommended Fix:

from datetime import datetime, timezone

# In geocoding_refresh_jobs.py and location_service.py
location.geocoded_at = datetime.now(timezone.utc)
cutoff_date = datetime.now(timezone.utc) - timedelta(days=archive_threshold_days)

Alternative: If you must use naive datetimes for database compatibility, at least be explicit about UTC:

location.geocoded_at = datetime.now(timezone.utc).replace(tzinfo=None)

3. Missing Seed Test File

Severity: MEDIUM-HIGH

The PR description references:

docker-compose exec -w /app backend python app/seed_geocoding_test_data.py

However, backend/python/app/seed_geocoding_test_data.py is not included in this PR.

Required Action: Either add the seed file or update the testing instructions to use an alternative approach.
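
If the seed file is added, a minimal version could look like this (sketch only; the Location constructor fields other than geocoded_at are assumptions about the model):

import asyncio
from datetime import datetime, timedelta, timezone

from app.models import async_session_maker_instance
from app.models.location import Location


async def seed() -> None:
    async with async_session_maker_instance() as session:
        session.add_all(
            [
                # Never geocoded: should be picked up by the refresh job
                Location(address="200 University Ave W, Waterloo, ON", geocoded_at=None),
                # Stale: geocoded well past the 30-day default threshold
                Location(
                    address="100 Queen St W, Toronto, ON",
                    geocoded_at=datetime.now(timezone.utc) - timedelta(days=90),
                ),
            ]
        )
        await session.commit()


if __name__ == "__main__":
    asyncio.run(seed())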


🟡 Medium Priority Issues

4. Redundant Database Query

Location: geocoding_refresh_jobs.py:100-103

async with async_session_maker_instance() as session:
    admin_statement = select(Admin)
    admin_result = await session.execute(admin_statement)
    admin_record = admin_result.scalars().first()  # ❌ Result never used

    archive_threshold_days = await _get_archive_threshold(session)  # Queries Admin again

The Admin table is queried twice: once on lines 100-103 (result unused), and again inside _get_archive_threshold().

Fix: Remove lines 100-103 entirely.


5. Missing Type Hints

Locations: geocoding_refresh_jobs.py:21, 37, 52

async def _get_archive_threshold(session) -> int:  # Missing type hint
async def _get_locations_to_refresh(session, cutoff_date) -> list[Location]:  # Missing types
async def _refresh_locations(session, locations: list[Location]) -> int:  # Missing session type

Recommended Fix:

from sqlalchemy.ext.asyncio import AsyncSession

async def _get_archive_threshold(session: AsyncSession) -> int:
async def _get_locations_to_refresh(session: AsyncSession, cutoff_date: datetime) -> list[Location]:
async def _refresh_locations(session: AsyncSession, locations: list[Location]) -> int:

6. Documentation Inconsistency

Location: PR description vs. code

PR Description states: "runs daily at midnight EST"

Actual code (jobs/__init__.py:35-40):

scheduler_service.add_cron_job(
    refresh_geocoding,
    job_id="daily_geocoding_refresh",
    hour=2,  # ❌ Runs at 2 AM, not midnight
    minute=0,
)

Fix: Update the PR description to say "2 AM EST" or change hour=2 to hour=0.


7. Manual Trigger Endpoint Not Implemented

Severity: LOW

The PR description lists:

Manual trigger via API endpoint

But no API endpoint is included. The job can only be triggered by the scheduler.

Recommendation: Either:

  1. Add a manual trigger endpoint (e.g., in admin_routes.py)
  2. Or remove this claim from the PR description.

🟢 Minor Issues & Suggestions

8. Potential Google API Rate Limiting

The Google Maps Geocoding API has rate limits (typically 50 requests/second for standard plans). The current implementation processes all locations concurrently via geocode_addresses(), which fires all requests at once.

Consideration: If you have 100+ stale locations, you might hit rate limits.

Optional Enhancement (for future consideration):

# Process in batches of 50
BATCH_SIZE = 50
for i in range(0, len(addresses), BATCH_SIZE):
    batch = addresses[i:i+BATCH_SIZE]
    results = await google_maps_client.geocode_addresses(batch)
    # ... process results

This may not be urgent for current scale but worth monitoring.


9. geocoded_at Not Updated on Manual Coordinate Changes

The LocationUpdate model includes geocoded_at, but there's no validation ensuring it's automatically updated when lat/long are manually changed through the API.

Suggestion: Consider adding logic in update_location() to automatically update geocoded_at when coordinates are modified.


10. Missing Tests

No test files were added for the new functionality.

Recommended Test Coverage:

  1. Locations with NULL geocoded_at are selected
  2. Locations older than threshold are selected
  3. Recently geocoded locations (< threshold) are skipped
  4. Failed geocoding doesn't crash the entire job
  5. Admin route_archive_after setting is respected
  6. Default threshold (30 days) used when admin record doesn't exist

Consider adding tests/test_geocoding_refresh_jobs.py following the pattern in tests/test_driver_assignment_service.py.
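
A skeleton for the first few cases might look like this (sketch only; the db_session fixture, pytest-asyncio setup, and Location fields other than geocoded_at are assumptions about the existing test harness):

from datetime import datetime, timedelta, timezone

import pytest

from app.models.location import Location
from app.services.jobs.geocoding_refresh_jobs import _get_locations_to_refresh


@pytest.mark.asyncio
async def test_null_and_stale_locations_are_selected(db_session):
    cutoff = datetime.now(timezone.utc) - timedelta(days=30)
    db_session.add_all(
        [
            Location(address="A", geocoded_at=None),                        # never geocoded
            Location(address="B", geocoded_at=cutoff - timedelta(days=1)),  # stale
            Location(address="C", geocoded_at=datetime.now(timezone.utc)),  # fresh
        ]
    )
    await db_session.commit()

    selected = await _get_locations_to_refresh(db_session, cutoff)

    assert {location.address for location in selected} == {"A", "B"}  # fresh "C" is skipped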


✅ Strengths

  1. Well-structured code: Clean separation of concerns with helper functions
  2. Consistent with codebase patterns: Follows the same structure as existing jobs
  3. Proper error handling: Comprehensive logging and exception handling
  4. Efficient batch processing: Uses geocode_addresses() for bulk operations
  5. Good scheduler integration: Properly integrated with the event loop management

📋 Summary & Recommendation

| Priority | Issue | Status |
| --- | --- | --- |
| 🔴 CRITICAL | Add Alembic database migration | BLOCKING |
| 🔴 HIGH | Use timezone-aware datetimes | ⚠️ Needs fix |
| 🟡 MEDIUM | Add missing seed test file | ⚠️ Missing |
| 🟡 MEDIUM | Remove redundant Admin query | ⚠️ Code smell |
| 🟡 MEDIUM | Fix documentation mismatch (2 AM vs midnight) | ⚠️ Inconsistent |
| 🟡 MEDIUM | Add type hints for session parameters | ⚠️ Missing |
| 🟢 LOW | Implement manual trigger endpoint or update docs | ⚠️ Incomplete feature |
| 🟢 LOW | Add test coverage | ⚠️ No tests |

Overall Assessment: ⚠️ REQUEST CHANGES

The implementation is architecturally sound and follows good patterns, but requires critical fixes before merging:

  1. Database migration (blocking - will break in production)
  2. Timezone-aware datetimes (strongly recommended for data integrity)
  3. Seed test file (needed for testing instructions to work)

Once these issues are addressed, this will be a solid addition to the codebase. Happy to help with any questions or the migration file!

@tqmsh tqmsh requested a review from landont168 January 20, 2026 20:31
@ludavidca ludavidca requested review from PranavGopinath and ludavidca and removed request for PranavGopinath and ludavidca February 2, 2026 05:10
@ludavidca ludavidca requested review from eddywang4340 and removed request for landont168 and ludavidca February 11, 2026 00:15