DiscrimTwoSample output depends on input order #422

Open
@ameliecr

Description

When running DiscrimTwoSample.test, I get different results depending on the order of the x1 and x2 inputs. Please see the code and output below.
After looking into the code, I think the problem lies in the removal of isolates. It seems that when two matrices are provided, the isolates are removed only from the first matrix. When I remove the isolates myself before passing the matrices to DiscrimTwoSample.test, I get the expected results.
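The workaround described above can be sketched roughly as follows. This is a minimal illustration, not hyppo's internal implementation: it assumes an "isolate" is a sample whose label occurs only once, and the helper name drop_isolates is hypothetical. The point is that the same mask is applied to both distance matrices and to the labels, so the two inputs stay aligned.

```python
import numpy as np


def drop_isolates(dist_1, dist_2, labels):
    """Remove samples whose label occurs only once from both distance matrices.

    Hypothetical helper: assumes an isolate is a label with a single sample.
    """
    labels = np.asarray(labels)
    uniques, counts = np.unique(labels, return_counts=True)
    # True for every sample whose label has at least two measurements
    keep = np.isin(labels, uniques[counts > 1])
    # np.ix_ accepts boolean masks and selects the matching rows and columns
    sel = np.ix_(keep, keep)
    return dist_1[sel], dist_2[sel], labels[keep]
```

The filtered matrices and labels can then be passed to DiscrimTwoSample.test in place of the unfiltered ones.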

Reproducing code example:

import pandas as pd
import numpy as np
import re
from hyppo.discrim import DiscrimTwoSample

def get_discrim_two_sample(dice_1: str, dice_2: str):
    dices_1 = pd.read_csv(dice_1, index_col=0, na_values=[""]).values
    distances_1 = 1 - dices_1
    subject_ids_1 = pd.read_csv(dice_1, nrows=0).columns.values[1:]
    subject_ids_1 = np.array(
        [re.sub(r"sub-", "", col) for col in subject_ids_1]
    )

    dices_2 = pd.read_csv(dice_2, index_col=0, na_values=[""]).values
    distances_2 = 1 - dices_2
    subject_ids_2 = pd.read_csv(dice_2, nrows=0).columns.values[1:]
    subject_ids_2 = np.array(
        [re.sub(r"sub-", "", col) for col in subject_ids_2]
    )

    # Remove rows and columns from the distance matrix that only contain NaNs
    # This will exclude runs that the bundle of interest couldn't be reconstructed for
    rows_to_keep_1 = ~np.isnan(distances_1).all(axis=1)
    distances_1 = distances_1[rows_to_keep_1]
    distances_1 = distances_1[:, rows_to_keep_1]
    subject_ids_1 = subject_ids_1[rows_to_keep_1]
    rows_to_keep_2 = ~np.isnan(distances_2).all(axis=1)
    distances_2 = distances_2[rows_to_keep_2]
    distances_2 = distances_2[:, rows_to_keep_2]
    subject_ids_2 = subject_ids_2[rows_to_keep_2]

    # Remove all subID-run combos that are not present in both distance matrices
    # Step 1: Find and sort the common subject-run combinations for stable indexing
    common_subids = np.intersect1d(subject_ids_1, subject_ids_2)
    common_subids.sort()  # Sort to ensure consistency in ordering

    # Step 2: Find indices of common combinations in both vectors using sorted common_subids
    # Note: np.searchsorted assumes subject_ids_1 and subject_ids_2 are themselves sorted
    indices1 = np.searchsorted(subject_ids_1, common_subids)
    indices2 = np.searchsorted(subject_ids_2, common_subids)

    # Step 3: Filter matrices accordingly, ensuring consistent ordering
    filtered_distances_1 = distances_1[np.ix_(indices1, indices1)]
    filtered_distances_2 = distances_2[np.ix_(indices2, indices2)]

    # Remove the run suffix from the subject IDs so that they can be converted to float
    common_subids = np.array(
        [re.sub(r"_run-\d+", "", col) for col in common_subids]
    )

    # Same test twice; only the order of the two distance matrices is swapped
    two_sample_output = DiscrimTwoSample(is_dist=True, remove_isolates=True).test(
        filtered_distances_1, filtered_distances_2, common_subids, workers=1
    )
    print(two_sample_output)
    two_sample_output = DiscrimTwoSample(is_dist=True, remove_isolates=True).test(
        filtered_distances_2, filtered_distances_1, common_subids, workers=1
    )
    print(two_sample_output)

    return two_sample_output


dice1 = '/Users/amelier/Code/dice_GQI/ProjectionBrainstemDentatorubrothalamicTractlr.csv'
dice2 = "/Users/amelier/Code/dice_SS3T/ProjectionBrainstemDentatorubrothalamicTractlr.csv"

get_discrim_two_sample(dice1, dice2)

Output:

DiscrimTwoSampleTestOutput(d1=0.6447122262572907, d2=0.5462221671985621, pvalue=0.001)
DiscrimTwoSampleTestOutput(d1=0.6224689116320018, d2=0.5630685778217968, pvalue=2.002002002002002e-06)
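For comparison, the expectation behind this report is that swapping the inputs should only swap d1 and d2. A rough sketch of such a check is below; the function name is_order_symmetric and the two_sample_stats callable are hypothetical, and the callable is assumed to return the pair (d1, d2) for the given input order.

```python
import numpy as np


def is_order_symmetric(two_sample_stats, x1, x2, y, atol=1e-12):
    """Check that swapping x1 and x2 only swaps the two statistics.

    two_sample_stats is a hypothetical callable (x1, x2, y) -> (d1, d2).
    """
    d1_a, d2_a = two_sample_stats(x1, x2, y)
    d1_b, d2_b = two_sample_stats(x2, x1, y)
    # After the swap, the first statistic should equal the old second one
    # and the second statistic should equal the old first one
    return bool(
        np.isclose(d1_a, d2_b, atol=atol) and np.isclose(d2_a, d1_b, atol=atol)
    )
```

With hyppo, the callable could wrap the test call from the reproducing code and return the d1 and d2 fields of its output. For the two outputs above, this check would fail.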

Version information

  • OS: macOS
  • Python version: 3.12.5
  • Package version: 0.5.1



Labels

bug (Something isn't working)
