Dev/spark backend new comparator#213

Open
TonyKatkov89 wants to merge 36 commits into dev/spark_backend from dev/spark_backend_new_comparator

@TonyKatkov89
Collaborator

Stats-based comparator for Spark-efficient hypothesis testing

Summary

  • Refactor comparator hierarchy: extract BaseComparator as a shared root, keep GroupsComparator (formerly Comparator) for raw-data comparisons, and add the new StatsComparator for aggregation-based comparisons.
  • Add StatsComparator: a two-phase abstract comparator that operates on pre-aggregated sufficient statistics instead of raw data slices. Phase 1 issues a single .agg() call across all target columns and groups; Phase 2 runs analytical tests on the returned scalar dicts, entirely on the driver. This reduces the number of Spark jobs from O(columns × groups) to a constant: one job per executor.
  • Add AggTTest: a concrete StatsComparator implementing Welch's t-test from {mean, var, count} statistics. Produces the same output shape as TTest and is a drop-in replacement in pipelines where raw data transfer is expensive.
  • Fix GroupedDataset.agg with list input: flatten the Pandas MultiIndex produced by list-style aggregation into {col}┆{stat} column names, which StatsComparator.execute relies on.
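
The two phases and the column flattening described above can be sketched in plain pandas. This is only an illustration of the pattern, not the HypEx implementation: the data frame, the column names, and the `SEP` constant are assumptions made for the example, and `DataFrame.groupby(...).agg` stands in for `GroupedDataset.agg`.

```python
import pandas as pd

SEP = "┆"  # separator taken from the {col}┆{stat} naming in this PR

# Toy data (hypothetical): one grouping column, two target columns.
df = pd.DataFrame({
    "group": ["control", "control", "control", "test", "test", "test"],
    "revenue": [10.0, 12.0, 14.0, 11.0, 15.0, 19.0],
    "visits": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
})

# Phase 1: a single aggregation over all target columns and all groups.
stats = df.groupby("group")[["revenue", "visits"]].agg(["mean", "var", "count"])

# List-style aggregation yields MultiIndex columns such as ("revenue", "mean");
# flatten them into single-level names such as "revenue┆mean".
stats.columns = [f"{col}{SEP}{stat}" for col, stat in stats.columns]

# Phase 2 input: one small dict of scalars per group.
stats_by_group = stats.to_dict(orient="index")
print(stats_by_group["control"][f"revenue{SEP}mean"])
```

After this step every later computation only touches `stats_by_group`, so its cost no longer depends on the number of rows in the original data.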

Motivation

The existing Comparator/GroupsComparator pattern pulls raw group data to the driver before running statistical tests. On a Spark backend this causes one separate distributed job per (group pair × column), which is prohibitively slow for wide datasets. StatsComparator + AggTTest solve this by aggregating in one distributed pass and doing the math locally on small scalar dicts.
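
To make "doing the math locally on small scalar dicts" concrete, here is a minimal sketch of Welch's t-test computed from {mean, var, count} only. The function name `welch_from_stats`, the dict keys, and the output keys are hypothetical and are not taken from AggTTest; the sketch assumes `var` is the unbiased sample variance (ddof=1) and uses SciPy's Student t distribution for the two-sided p-value.

```python
import math

from scipy import stats as sps


def welch_from_stats(a, b):
    """Welch's t-test from two dicts with keys 'mean', 'var', 'count'.

    Assumes 'var' is the unbiased sample variance (ddof=1).
    """
    # Squared standard error of each group mean.
    se_a = a["var"] / a["count"]
    se_b = b["var"] / b["count"]
    t_stat = (a["mean"] - b["mean"]) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se_a + se_b) ** 2 / (
        se_a ** 2 / (a["count"] - 1) + se_b ** 2 / (b["count"] - 1)
    )
    # Two-sided p-value from the Student t survival function.
    p_value = 2.0 * float(sps.t.sf(abs(t_stat), dof))
    return {"statistic": t_stat, "dof": dof, "p-value": p_value}


# Hypothetical per-group statistics, as an aggregation step might return them.
control = {"mean": 12.0, "var": 4.0, "count": 3}
test = {"mean": 15.0, "var": 16.0, "count": 3}
result = welch_from_stats(control, test)
print(result)
```

Only six scalars enter the computation per column, which is why this phase can run on the driver regardless of dataset size.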

Files changed

| File | Change |
| --- | --- |
| hypex/comparators/abstract.py | New BaseComparator, GroupsComparator (renamed from Comparator), StatsComparator |
| hypex/comparators/stats_hypothesis_testing.py | New AggTTest |
| hypex/comparators/__init__.py | Export new public classes |
| hypex/dataset/groupby_dataset.py | Fix MultiIndex flattening in agg() |
| hypex/utils/constants.py / __init__.py | Remove duplicate constant definitions |

Test plan

  • Run existing comparator tests to verify GroupsComparator (formerly Comparator) behaviour is unchanged
  • Verify AggTTest p-values match TTest on the same dataset
  • Test that GroupedDataset.agg with a list of stat functions produces flat col┆stat column names
  • Smoke-test StatsComparator/AggTTest against a Spark-backed ExperimentData
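
The p-value equivalence check in this plan could look roughly like the sketch below. It does not call TTest or AggTTest; SciPy's `ttest_ind` (raw data) and `ttest_ind_from_stats` (summary statistics) are used as stand-ins, and the synthetic samples, seed, and sizes are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats as sps

# Synthetic groups with unequal variances and sizes (hypothetical data).
rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=2.0, size=200)
test = rng.normal(loc=10.5, scale=3.0, size=150)

# Reference: Welch's t-test on the raw samples.
raw = sps.ttest_ind(control, test, equal_var=False)

# Same test from sufficient statistics only: mean, sample variance, count.
var_c, var_t = control.var(ddof=1), test.var(ddof=1)
agg = sps.ttest_ind_from_stats(
    mean1=control.mean(), std1=np.sqrt(var_c), nobs1=control.size,
    mean2=test.mean(), std2=np.sqrt(var_t), nobs2=test.size,
    equal_var=False,
)

# The two routes should agree up to floating-point error.
assert np.isclose(raw.statistic, agg.statistic)
assert np.isclose(raw.pvalue, agg.pvalue)
```

A test along these lines depends on both paths using the same variance convention (ddof=1); a mismatch there is a likely source of small p-value differences.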

yurashku and others added 30 commits February 11, 2026 17:46
fixed set_value
…icate code get_values and iget_values from PandasDataset
- Fix __init__ to conditionally initialize physical index based on flag
- Implement iloc with physical_index_actual_flag checking
- Add physical_index_actual_flag=False to loc, sort_values, dropna
- Exclude utility columns from fillna, drop, rename operations
- Use _public_columns in agg, mode, log to avoid utility column processing
- Add warnings when user attempts to modify utility columns

BREAKING CHANGE: iloc now requires physical_index_actual_flag to be True
…ues, from_dict, to_dict, to_records, index.

Testing and limitation required.
…park' into dev/spark_backend_new_comparator

# Conflicts:
#	hypex/dataset/backends/pandas_backend.py
#	hypex/dataset/backends/spark_backend.py
# Conflicts:
#	hypex/dataset/abstract.py
#	hypex/dataset/backends/pandas_backend.py
#	hypex/dataset/backends/spark_backend.py
#	hypex/dataset/dataset.py
#	hypex/utils/__init__.py
#	hypex/utils/typings.py
#	tests/test_spark_backend.ipynb
@TonyKatkov89 TonyKatkov89 added this to the 1.1 milestone Mar 13, 2026
@TonyKatkov89 TonyKatkov89 requested a review from Mkrie March 13, 2026 13:49
@TonyKatkov89 TonyKatkov89 self-assigned this Mar 13, 2026
@TonyKatkov89 TonyKatkov89 added the enhancement New feature or request label Mar 13, 2026

2 participants