Skip to content

Conversation

@beatrizsaldana
Copy link
Member

Problem

The immunohisto transform was missing y-axis maximum values needed for proper visualization of data. Without these values, frontend applications couldn't properly scale figures, leading to poor user experience when viewing biomarker and pathology data.

Solution

This PR implements a comprehensive solution for calculating and rounding y-axis maximum values in the immunohisto transform:

Key Features Added:

  1. Y-axis Maximum Calculation: Added logic to calculate the maximum value across all ages for each combination of (name, evidence_type, tissue)
  2. Smart Rounding Algorithm: Implemented round_y_axis_max() function with sophisticated rounding logic that:
    • Returns 10 for zero values
    • Rounds UP to the next "nice" number where the second digit is 0 or 5
    • Handles different magnitudes correctly (e.g., 0.0021 → 0.0025, 1094 → 1500)
    • Maintains monotonicity and handles floating-point precision issues
  3. Comprehensive Error Handling: Handles edge cases like negative values, conversion errors, and missing data

Rounding Logic Details:
The rounding algorithm follows a specific pattern with key design principles:

  1. Always Round UP: Ensures the y-axis scale accommodates all data points
  2. "Nice Numbers": Creates visually appealing scales with 0 or 5 as second digit
  3. Magnitude Preservation: Maintains the order of magnitude while rounding
  4. Floating-Point Safety: Uses string manipulation to avoid precision issues
  5. Monotonicity: Larger inputs always produce larger or equal outputs

Tests

Added comprehensive test coverage including:

  • Unit tests for round_y_axis_max() function with 15+ test cases covering:
    • All Jira ticket examples
    • Edge cases (zero, negative values, very small/large numbers)
    • Second digit rounding logic
    • Magnitude handling
    • Floating-point precision issues
    • Monotonicity verification
    • Parametrized tests for main examples
  • Integration tests updated to include expected y_axis_max field in all test output files
  • Test assets updated with new expected output containing y-axis max values

Files Modified:

  • src/agoradatatools/etl/transform/immunohisto_transform.py - Core implementation
  • tests/transform/test_immunohisto_transform.py - Comprehensive test suite
  • All test output JSON files updated with expected y_axis_max values

@beatrizsaldana beatrizsaldana requested a review from a team as a code owner September 8, 2025 20:54
PyYAML~=6.0
pyarrow~=14.0.1
typer~=0.7.0
click<8.3.0
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To prevent CI failure

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds comprehensive y-axis maximum calculation functionality to the immunohisto transform to support proper data visualization scaling in frontend applications. The key changes include implementing a sophisticated rounding algorithm, calculating y-axis maximums across age groups, and maintaining data completeness.

  • Added round_y_axis_max() function with smart rounding logic that always rounds UP to "nice numbers"
  • Implemented y-axis maximum calculation across all ages for each (name, evidence_type, tissue) combination
  • Refactored transform logic into modular helper functions and updated all test output files with expected y_axis_max values

Reviewed Changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.

File Description
src/agoradatatools/etl/transform/immunohisto_transform.py Core implementation of y-axis max calculation and rounding logic
tests/transform/test_immunohisto_transform.py Comprehensive test suite with 150+ test cases for new functionality
tests/test_assets//output/.json Updated expected test outputs to include y_axis_max field
setup.cfg Added click version constraint for dependency management

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Copy link
Contributor

@jaclynbeck-sage jaclynbeck-sage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a clever solution to making sure things get rounded correctly. I do think this code can be simplified a lot and I made some suggestions on how to do that with pandas instead of lists of dictionaries.

entry[extra_column_name] = group[extra_columns].to_dict("records")
data_rows.append(entry)

return data_rows
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A suggestion for simplifying this function. You might even be able to delete this a function and move this to the main code since it's short:

# Add nest_fields from agoradatatools.etl.utils to the imports at the top of the file

grouped = nest_fields(dataset.copy(), grouping=group_columns, new_column="data", drop_columns=group_columns)

grouped["y_axis_max"] = grouped.apply(
  lambda entry: 
    round_y_axis_max(y_axis_max_map.get((entry["name"], entry["evidence_type"], entry["tissue"]), 0)), axis=1)
  # Though as above, I really think the rounding should already be in the lookup table
))

return grouped.to_dict("records")

}
)

return data_rows
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A suggestion for simplifying this function. If you use my suggestion for _create_data_rows but leave data_rows as a data frame, you can play some games with pandas to fill in missing ages. This removes the need to pass in the original data set or create a y_max lookup table and you can just return the data frame and I don't think you need to call convert_numpy_types either.

# add import numpy as np to imports at top of function

available_ages = list(data_rows["age"].drop_duplicates())
# All unique combinations of groups
fill_df = data_rows[["name", "evidence_type", "tissue", "y_axis_max"]].copy().drop_duplicates()
fill_df["age"] = [available_ages] * fill_df.shape[0] # Make an "age" column where each entry is a list of all possible ages

# "explode" makes one row per age + group. Then merge back into the data to create new rows for missing ages
fill_df = fill_df.explode("age").merge(data_rows, how="outer", validate="one_to_one") 

# Fill NA values for units. Can't use fillna to make an empty list so we add an extra line
fill_df = fill_df.fillna({"units": ""})
fill_df["data"] = fill_df["data"].apply(lambda x: [] if x is np.NaN else x)

# sort by age
# Adjust _extract_age_num to work with a single string from a pd.Series, no need to return a tuple
def _extract_age_num(entry: str) -> float:
  try/catch but return float

fill_df["age_numeric"] = fill_df["age"].apply(_extract_age_num)

# Sort
fill_df = fill_df.sort_values(["age_numeric", "age", "evidence_type"]).drop(columns="age_numeric") 

import os

import pandas as pd
import pytest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll review the tests after you simplify. You might be able to remove some of the helper functions and delete/simplify those tests.

@sonarqubecloud
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants