Skip to content

Conversation

@reybahl
Copy link
Member

@reybahl reybahl commented Oct 10, 2025

more efficient data processing / loading, specifically during the importing courses step during sync

@sentry
Copy link

sentry bot commented Oct 10, 2025

🔍 Existing Issues For Review

Your pull request is modifying functions with the following pre-existing issues:

📄 File: ferry/transform/import_courses.py

Function Unhandled Issue
import_courses KeyError: 11423 ferry.transform.import_courses in...
Event Count: 3
resolve_cross_listings IndexError: single positional indexer is out-of-bounds ferry.transform.import_courses in resolve_cross_li...
Event Count: 1

@reybahl reybahl requested a review from Copilot October 10, 2025 23:57
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements memory optimizations to reduce memory usage during data transformation processes. The changes focus on reducing intermediate data structures, explicit garbage collection, and more efficient data processing patterns.

Key changes:

  • Implemented explicit garbage collection at strategic points in the transformation pipeline
  • Optimized professor and course rating computations to avoid large intermediate DataFrames
  • Added immediate cleanup of temporary data structures with explicit del statements

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
ferry/transform/init.py Added garbage collection calls after major transformation steps
ferry/transform/transform_compute.py Refactored rating computations to use dictionaries and direct aggregation instead of lambda functions and intermediate lists
ferry/transform/import_courses.py Added cleanup of source data immediately after concatenation

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

# Pre-compute aggregated ratings for each same_course group to avoid repeated list creation
logging.debug("Pre-computing same-course rating aggregates")

def compute_aggregate_rating(course_ids: list[int], rating_dict: dict) -> tuple[float | None, int]:
Copy link

Copilot AI Oct 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rating_dict parameter should have a more specific type annotation. Consider using dict[int, float] or dict[int, float | None] to better document the expected key-value types.

Suggested change
def compute_aggregate_rating(course_ids: list[int], rating_dict: dict) -> tuple[float | None, int]:
def compute_aggregate_rating(course_ids: list[int], rating_dict: dict[int, float | None]) -> tuple[float | None, int]:

Copilot uses AI. Check for mistakes.
@reybahl reybahl changed the title another memory enhancement attempt fix: memory enhancement in course import Oct 11, 2025
@reybahl reybahl changed the title fix: memory enhancement in course import fix: memory enhancement in import courses Oct 11, 2025
@reybahl reybahl merged commit a542f8e into master Oct 11, 2025
6 checks passed
@reybahl reybahl deleted the memory-enhancements-v2 branch October 11, 2025 01:38
@bearsyankees
Copy link
Member

lfg

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants