arsenelupin14/rubin-sampling

Rubin Sampling

Public baseline pipeline for variable-star period recovery and alias analysis under Rubin-like cadence constraints.

Overview

This repository tracks an end-to-end workflow for:

  • building Gaia DR3 truth-set samples
  • ingesting real public-survey light curves
  • standardizing raw photometry into a project schema
  • running baseline period search
  • evaluating recovery, alias modes, and failure cases

The current live-data workflow is centered on RR Lyrae objects cross-matched from Gaia DR3 to ZTF.

Current Status

As of 2026-03-09, the project has completed:

  • Gaia truth tables for the RR Lyrae, Cepheid, and eclipsing-binary classes
  • a successful live RR Lyrae pilot run on real ZTF data
  • a first larger RR Lyrae baseline batch with 30 usable objects
  • end-to-end baseline period search and evaluation on that batch

Current Baseline Snapshot

| Item | Value |
|---|---|
| Survey proxy | ZTF |
| Truth set | Gaia DR3 |
| Current live class | RR Lyrae |
| Usable baseline objects | 30 |
| Ingest success rate | 30/43 (69.8%) |
| Period recovery | 24/30 (80%) |
| Median relative period error | ~1.4e-5 |
| Main failure modes | P/2, 2P, other mismatches |
| Most stable live provider | ALeRCE |
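
The recovery and failure-mode entries above can be illustrated with a minimal sketch. It labels each recovered period as a match, a P/2 alias, a 2P alias, or another mismatch, then derives a recovery rate and a median relative error. The function names, the 1% tolerance, and the example periods are assumptions made for illustration; they are not taken from rubin_sampling.evaluate.

```python
from statistics import median


def relative_error(p_found, p_true):
    """Relative period error |P_found - P_true| / P_true."""
    return abs(p_found - p_true) / p_true


def classify_period(p_found, p_true, tol=0.01):
    """Label a recovered period against the reference period.

    tol is a hypothetical relative tolerance, not the project's value.
    """
    for label, factor in (("recovered", 1.0), ("P/2", 0.5), ("2P", 2.0)):
        if relative_error(p_found, factor * p_true) <= tol:
            return label
    return "other"


# Made-up (recovered, reference) period pairs in days.
pairs = [(0.50001, 0.5), (0.2500, 0.5), (1.2002, 0.6), (0.41, 0.55)]

labels = [classify_period(found, true) for found, true in pairs]
recovery_rate = labels.count("recovered") / len(labels)
median_error = median(
    relative_error(found, true)
    for (found, true), label in zip(pairs, labels)
    if label == "recovered"
)
```

Whether the median error is taken over all objects or only over recovered ones is a design choice; this sketch uses recovered objects only.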

Repository Structure

Rubin Sampling/
  configs/
  data/
    catalogs/
    lc_raw/
    lc_resampled/
    truth/
  logs/
  notebooks/
  results/
    figures/
    tables/
  src/
    rubin_sampling/
  CHECKLIST.md
  CHECKLIST_NEXT.md

Tracked Public Artifacts

This public repository now includes generated science artifacts directly in git:

  • Gaia truth-set parquet files in data/truth/
  • pilot and baseline ZTF light-curve parquet files in data/lc_raw/ztf/
  • period-result tables in results/tables/
  • evaluation tables in results/**/tables/
  • figure bundles in results/**/figures/

Main Artifacts

Truth sets:

  • data/truth/truth_gaia_rrlyrae.parquet
  • data/truth/truth_gaia_cepheid.parquet
  • data/truth/truth_gaia_eb.parquet
  • data/truth/truth_gaia_summary.parquet

Pilot RR Lyrae:

  • standardized batch: data/lc_raw/ztf/pilot_standardized.parquet
  • ingest summary: data/lc_raw/ztf/pilot_summary.parquet
  • period results: results/tables/pilot_period_results.parquet
  • evaluation bundle: results/pilot/

Baseline RR Lyrae batch:

  • raw per-object files: data/lc_raw/ztf/rrlyrae_baseline/
  • standardized batch: data/lc_raw/ztf/rrlyrae_baseline_standardized.parquet
  • ingest summary: data/lc_raw/ztf/rrlyrae_baseline_summary.parquet
  • period results: results/tables/rrlyrae_baseline_period_results.parquet
  • evaluation bundle: results/rrlyrae_baseline/

Core Pipeline

The baseline pipeline currently covers:

  1. Gaia truth-set download via rubin_sampling.download_gaia_truth
  2. live ZTF ingest via rubin_sampling.ingest_ztf_pilot
  3. period search via rubin_sampling.period_pipeline
  4. feature extraction via rubin_sampling.features
  5. evaluation and figure generation via rubin_sampling.evaluate
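
As an illustration of step 3, the sketch below runs a classical Lomb-Scargle periodogram over a frequency grid and returns the period with the highest power. This README does not say which algorithm rubin_sampling.period_pipeline uses, so Lomb-Scargle, the function names, the grid limits, and the synthetic light curve are all assumptions.

```python
import math
import random


def lomb_scargle_power(times, values, freq):
    """Classical Lomb-Scargle power at one frequency (cycles per time unit)."""
    mean = sum(values) / len(values)
    y = [v - mean for v in values]
    omega = 2.0 * math.pi * freq
    # The time offset tau makes the sine and cosine terms orthogonal.
    s2 = sum(math.sin(2.0 * omega * t) for t in times)
    c2 = sum(math.cos(2.0 * omega * t) for t in times)
    tau = math.atan2(s2, c2) / (2.0 * omega)
    cos_t = [math.cos(omega * (t - tau)) for t in times]
    sin_t = [math.sin(omega * (t - tau)) for t in times]
    yc = sum(a * b for a, b in zip(y, cos_t))
    ys = sum(a * b for a, b in zip(y, sin_t))
    cc = sum(c * c for c in cos_t)
    ss = sum(s * s for s in sin_t)
    return 0.5 * (yc * yc / cc + ys * ys / ss)


def best_period(times, values, f_min, f_max, n_freq):
    """Return the period at the highest-power frequency on a linear grid."""
    step = (f_max - f_min) / (n_freq - 1)
    freqs = [f_min + i * step for i in range(n_freq)]
    best = max(freqs, key=lambda f: lomb_scargle_power(times, values, f))
    return 1.0 / best


# Synthetic, noise-free sinusoid sampled at irregular times (illustrative only).
rng = random.Random(0)
true_period = 0.55  # days
times = sorted(rng.uniform(0.0, 60.0) for _ in range(150))
mags = [15.0 + 0.3 * math.sin(2.0 * math.pi * t / true_period) for t in times]

period = best_period(times, mags, f_min=0.5, f_max=5.0, n_freq=2251)
```

For real light curves, a library implementation such as astropy.timeseries.LombScargle would normally replace this pure-Python loop.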

Data Contract

Minimum light-curve schema:

  • object_id
  • time
  • mag_or_flux
  • err
  • band

Optional metadata:

  • class_label
  • period_ref
  • ra
  • dec
  • survey

Primary storage format: parquet
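
A minimal sketch of checking a table against this contract is shown below. The column names come from the lists above; the helper name check_schema and its return shape are hypothetical and not part of the rubin_sampling package.

```python
REQUIRED_COLUMNS = ("object_id", "time", "mag_or_flux", "err", "band")
OPTIONAL_COLUMNS = ("class_label", "period_ref", "ra", "dec", "survey")


def check_schema(columns):
    """Return (missing required columns, columns outside the contract)."""
    cols = set(columns)
    missing = [c for c in REQUIRED_COLUMNS if c not in cols]
    known = set(REQUIRED_COLUMNS) | set(OPTIONAL_COLUMNS)
    extra = sorted(cols - known)
    return missing, extra


# Example: "err" is absent and "snr" is not part of the contract.
missing, extra = check_schema(
    ["object_id", "time", "mag_or_flux", "band", "ra", "snr"]
)
```

With a pandas DataFrame loaded from one of the parquet files, the same check would be called as check_schema(df.columns).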

Quickstart

Run from the project root:

python -m venv .venv
.venv/bin/python -m pip install -r requirements.txt
export PYTHONPATH=src

Main Commands

Download Gaia truth sets:

.venv/bin/python -m rubin_sampling.download_gaia_truth \
  --output-dir data/truth \
  --limit-per-class 1000

Run a live ZTF ingest:

.venv/bin/python -m rubin_sampling.ingest_ztf_pilot \
  --truth data/truth/truth_gaia_rrlyrae.parquet \
  --target-count 30 \
  --candidate-pool 200

Run baseline period search:

.venv/bin/python -m rubin_sampling.period_pipeline \
  --input data/lc_raw/ztf/rrlyrae_baseline_standardized.parquet \
  --output results/tables/rrlyrae_baseline_period_results.parquet \
  --config configs/baseline.yaml

Run evaluation:

.venv/bin/python -m rubin_sampling.evaluate \
  --period-results results/tables/rrlyrae_baseline_period_results.parquet \
  --lightcurves data/lc_raw/ztf/rrlyrae_baseline_standardized.parquet \
  --ingest-summary data/lc_raw/ztf/rrlyrae_baseline_summary.parquet \
  --output-dir results/rrlyrae_baseline

Ingest Notes

ingest_ztf_pilot supports three live-provider settings: auto, irsa, and alerce.

  • auto tries IRSA first, then falls back to ALeRCE
  • alerce is currently the fastest stable option for scaling the real RR Lyrae batch
  • --resume reuses an existing summary file
  • --flush-every writes incremental progress during long runs
  • successful raw per-object files can rebuild the standardized batch output without refetching
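
The auto behaviour described above (IRSA first, then ALeRCE) can be sketched as a simple ordered fallback. The function fetch_with_fallback, the fetcher callables, and the object ID are hypothetical stand-ins; the real ingest code may be structured differently.

```python
PROVIDERS = ("irsa", "alerce")


def fetch_with_fallback(object_id, provider, fetchers):
    """Try providers in order and return (provider_name, light_curve).

    fetchers maps a provider name to a callable taking object_id.
    """
    if provider == "auto":
        order = list(PROVIDERS)  # IRSA first, then ALeRCE
    elif provider in PROVIDERS:
        order = [provider]
    else:
        raise ValueError(f"unknown provider: {provider!r}")

    errors = {}
    for name in order:
        try:
            return name, fetchers[name](object_id)
        except Exception as exc:  # record the failure and try the next provider
            errors[name] = str(exc)
    raise RuntimeError(f"all providers failed for {object_id}: {errors}")


# Stub fetchers standing in for real network calls.
def failing_irsa(object_id):
    raise TimeoutError("IRSA request timed out")


def working_alerce(object_id):
    return [{"object_id": object_id, "time": 59000.1, "mag_or_flux": 15.2}]


stubs = {"irsa": failing_irsa, "alerce": working_alerce}
used, light_curve = fetch_with_fallback("example-object", "auto", stubs)
```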

Evaluation Outputs

evaluate writes:

  • tables/period_evaluation.parquet
  • tables/metrics_summary.parquet
  • tables/ingest_status_counts.parquet
  • tables/ingest_failure_cases.parquet

Known Gaps

  • the real-data baseline sample is still below the 100-300 usable-object target
  • extreme photometric outlier filtering is not yet explicit in ingest validation
  • some period solutions still collapse to P/2 or 2P
  • failure cases still need manual review before cadence-emulation work begins
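
One possible shape for the missing outlier filter is a clip based on the median absolute deviation (MAD), sketched below. The repository does not implement this yet; the function name, the 5-sigma threshold, and the example magnitudes are assumptions.

```python
from statistics import median


def mad_clip_mask(values, n_sigma=5.0):
    """Return a keep-mask flagging points within n_sigma robust sigmas of the median."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    # 1.4826 scales the MAD to a Gaussian-equivalent standard deviation.
    scale = 1.4826 * mad
    if scale == 0.0:
        return [True] * len(values)  # no spread to clip against
    return [abs(v - med) <= n_sigma * scale for v in values]


# Made-up magnitudes with one extreme point.
mags = [15.1, 15.3, 15.2, 15.4, 15.0, 15.25, 22.0]
keep = mad_clip_mask(mags)
clean = [m for m, k in zip(mags, keep) if k]
```

A global clip like this suits low-amplitude noise but could remove real signal in high-amplitude variables if the threshold is too tight, so the threshold would need tuning per class.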

Next Steps

  1. scale RR Lyrae from 30 usable objects to 100+
  2. review the 6 current period misses in the baseline batch
  3. add explicit outlier hardening before the next scaling run
  4. extend the stabilized workflow to the next variable-star class

See CHECKLIST_NEXT.md for the active execution plan and CHECKLIST.md for the broader roadmap.

About

Baseline workflow for ZTF/Gaia-linked time-series ingestion, standardization, and period-recovery evaluation.