
euclidkit


A comprehensive Python package for Euclid archival data analysis, designed for use within the ESA Datalabs environment.

Overview

euclidkit facilitates advanced data exploration and visualization for Euclid Q1/(I)DR1 archival releases, including:

  • Data Access: Query and crossmatch sources with the Euclid MER catalogue
  • Spectroscopic Analysis: Access, download, and combine NISP spectra of archival sources
  • Unified Workflow: Streamlined tools for researchers working with Euclid spectroscopic data

The package is designed for efficient archive querying and Euclid spectrum compilation workflows.

Installation

Requirements

  • Python 3.11+
  • Access to ESA Datalabs environment (for data volumes)
  • COSMOS credentials for Euclid archive access

Basic Installation

pip install euclidkit

Development Installation

git clone https://github.com/rudolffu/euclidkit.git
cd euclidkit
pip install -e .

Quick Start

Setup Credentials

Store credentials in a private file under your home directory and restrict permissions:

mkdir -p ~/.euclidkit
touch ~/.euclidkit/.cred.txt
chmod 600 ~/.euclidkit/.cred.txt

Edit ~/.euclidkit/.cred.txt manually with your preferred editor (do not put credentials in shell history).

Use two lines:

  1. COSMOS username
  2. COSMOS password
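Reading such a two-line file takes only standard-library Python; the `read_credentials` helper below is a sketch for illustration, not part of euclidkit's API:

```python
from pathlib import Path

def read_credentials(path="~/.euclidkit/.cred.txt"):
    """Parse a two-line credential file: username on line 1, password on line 2."""
    lines = Path(path).expanduser().read_text().splitlines()
    if len(lines) < 2:
        raise ValueError(f"Expected two lines (username, password) in {path}")
    return lines[0].strip(), lines[1].strip()
```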

Configuration

Create and edit the user config file:

euclidkit init-config --output ~/.euclidkit/euclidkit_config.yaml --template basic

Then edit ~/.euclidkit/euclidkit_config.yaml and set the credential path:

data:
  credentials_file: /home/<user>/.euclidkit/.cred.txt
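Assuming PyYAML is installed, loading this config and expanding the credential path can be sketched as follows (`load_config` is a hypothetical helper; euclidkit reads the config itself):

```python
from pathlib import Path

import yaml  # PyYAML

def load_config(path="~/.euclidkit/euclidkit_config.yaml"):
    """Load the user config and expand '~' in the credentials path."""
    cfg = yaml.safe_load(Path(path).expanduser().read_text())
    cred_path = Path(cfg["data"]["credentials_file"]).expanduser()
    return cfg, cred_path
```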

Basic Usage

# Note: the Python import path is currently still `euclidkit`.
from euclidkit.core.data_access import EuclidArchive

# Initialize archive connection
archive = EuclidArchive(environment='PDR')
archive.login()

# Crossmatch your sources with Euclid MER catalogue
results = archive.crossmatch_sources(
    user_table="my_sources.csv",
    radius=1.0,  # arcseconds
    output_file="crossmatch_results.fits"
)

# Query for available spectra
spectra_table = archive.query_spectra_sources(
    crossmatch_table=results,
    output_file="spectra_sources.fits"
)

# Combine spectra into a single FITS file
combined_file = archive.combine_spectra_to_fits(
    spectra_table=spectra_table,
    output_file="my_combined_spectra.fits"
)

Command Line Interface

Crossmatching Sources

# Crossmatch user table with Euclid MER catalogue
euclidkit crossmatch \
    --input my_sources.csv \
    --output crossmatch_results.fits \
    --radius 1.0 \
    --verbose

# Run the crossmatch in async TAP mode (no synchronous batching). Smaller inputs
# go through a single async job; very large tables are split into async chunks.
euclidkit crossmatch \
    --input my_sources.csv \
    --output crossmatch_results.fits \
    --full-async \
    --async-chunk-size 500000

# When using the IDR environment the command defaults to the WIDE field and
# writes results to wide_<filename>. Use --idr-field DEEP to query the deep stack:
euclidkit crossmatch \
    --input my_sources.csv \
    --output crossmatch_results.fits \
    --environment IDR \
    --idr-field DEEP

# Crossmatch an already-uploaded archive user table (no local upload needed)
euclidkit crossmatch \
    --user-table-name my_table \
    --output crossmatch_results.fits \
    --match-mode object-id \
    --environment IDR \
    --idr-field WIDE

--full-async behavior:

  • For smaller inputs, euclidkit submits one async TAP job, downloads the result to the requested output file, and then removes the remote job.
  • For large local input tables (--input), euclidkit splits the upload into async chunks, saves each chunk to <output>_part_####.fits, removes each remote job after the chunk is saved, writes <output>.manifest.json, and merges the chunk files into the requested final output.
  • For large archive user tables (--user-table-name), euclidkit uses the same on-disk chunking pattern and final merge.

Matching mode recommendation:

  • Prefer --match-mode object-id whenever the input already contains Euclid object_id values, or source_id values that should be joined to MER object_id. This avoids positional matching and is usually faster and more robust for large tables.

--max-sources vs --async-chunk-size:

  • --max-sources: limits how many rows from the input table are processed in total.
  • --async-chunk-size: controls rows per async TAP job when --full-async is enabled.
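The interaction of the two options reduces to simple arithmetic; `n_async_jobs` below is a hypothetical helper, not a euclidkit function:

```python
import math

def n_async_jobs(n_rows, max_sources=None, chunk_size=500_000):
    """--max-sources caps how many rows are processed; each async TAP job
    then handles at most --async-chunk-size of the capped rows."""
    processed = n_rows if max_sources is None else min(n_rows, max_sources)
    return math.ceil(processed / chunk_size)
```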

Uploading Tables

# Upload a FITS table to your Euclid TAP workspace
euclidkit upload-table \
    --input my_sources.fits \
    --table-name my_workspace_table \
    --description "Sources awaiting deep crossmatch" \
    --overwrite

# Upload CSV data as-is (format inferred automatically)
euclidkit upload-table \
    --input trimmed_sources.csv \
    --table-name trimmed_sources

Querying Spectra

# Query spectra from crossmatch results
euclidkit query-spectra \
    --crossmatch crossmatch_results.fits \
    --output spectra_sources.fits \
    --environment IDR \
    --idr-field WIDE \
    --verbose

# Query spectra by object IDs and auto-combine
euclidkit query-spectra \
    --object-ids 123456,789012,345678 \
    --output spectra_sources.fits \
    --combine-output my_spectra.fits \
    --max-spectra 100 \
    --verbose

Building Cutana Input

# Build Cutana CSV from a source table with object_id or ra/dec columns
euclidkit query-cutana \
    --sources my_sources.fits \
    --output cutana_input.csv \
    --instrument VIS \
    --cutout-size arcsec \
    --cutout-size-value 15

# NISP example with explicit filters
euclidkit query-cutana \
    --sources my_sources.fits \
    --output cutana_input_nisp.csv \
    --instrument NISP \
    --nisp-filters NIR_Y,NIR_H \
    --environment IDR \
    --idr-field DEEP \
    --cutout-size arcsec \
    --cutout-size-value 15

Compiling Spectra

# Compile individual spectra into chunked FITS files
euclidkit compile-spectra \
    --spectra-table spectra_sources.fits \
    --output-dir ./output \
    --prefix compiled_spectra \
    --max-extensions 1000 \
    --verbose

# IDR DEEP canonical mode: choose arm(s) using XML LambdaRange
euclidkit compile-spectra \
    --spectra-table spectra_sources.fits \
    --output-dir ./output \
    --prefix compiled_deep \
    --environment IDR \
    --idr-field DEEP \
    -L BOTH

# Datalink mode: compile BOTH arms into separate _rgs / _bgs outputs
euclidkit compile-spectra \
    --spectra-table spectra_sources.fits \
    --output-dir ./output \
    --prefix compiled_dl \
    --use-datalink \
    --environment IDR \
    --schema sedm \
    -L BOTH

Notes:

  • For canonical compilation from local Datalabs FITS volumes, --workers 2 is often not faster due to shared-storage I/O contention. Prefer --workers 1 unless benchmarking on your setup shows a clear gain.
  • -L/--lambda-range is the unified arm selector. In datalink mode, RGS/BGS map to the corresponding retrieval types, and BOTH runs two passes and writes separate _rgs and _bgs files. --retrieval-type is kept for backward compatibility.

Key Features

Data Archive Integration

  • Multiple Environments: Support for PDR, IDR, OTF, and REG archive environments
  • Efficient Queries: Batch processing with TAP table uploads for large datasets
  • Crossmatching: Position-based matching with configurable search radius

Spectroscopic Tools

  • Spectrum Access: Direct access to Euclid data volumes on ESA Datalabs
  • FITS Compilation: Combine individual spectra into multi-extension FITS files
  • Metadata Preservation: Maintain source IDs, coordinates, and provenance information

Analysis Pipeline

  • Quality Control: Spectrum validation and quality assessment

Data Environment

ESA Datalabs Integration

This package is optimized for the ESA Datalabs environment with direct access to:

  • Euclid Q1 Data: /data/euclid_q1/ (35 TB volume)

API Reference

Core Classes

EuclidArchive

Main interface to the Euclid science archive.

archive = EuclidArchive(environment='PDR')
archive.login(credentials_file='~/.euclidkit/.cred.txt')

# Crossmatch sources
results = archive.crossmatch_sources(
    user_table="sources.csv",
    radius=1.0,
    output_file="results.fits"
)

# Query spectra
spectra = archive.query_spectra_sources(
    crossmatch_table=results,
    output_file="spectra.fits"
)

# Get individual spectrum
spectrum_hdu = archive.get_individual_spectrum(
    datalabs_path="/data/euclid_q1/path",
    file_name="spectrum_file.fits", 
    hdu_index=42
)

# Combine spectra
combined = archive.combine_spectra_to_fits(
    spectra_table=spectra,
    output_file="combined.fits",
    max_spectra=1000
)

SpectrumCompiler

Advanced spectrum compilation with chunking support.

from euclidkit.core.spectra import SpectrumCompiler

compiler = SpectrumCompiler(max_extensions=1000)

# Compile into chunked files
output_files = compiler.compile_spectra(
    spectra_table=spectra_table,
    output_dir="./output",
    output_prefix="compiled_spectra"
)

# Create single FITS file
single_file = compiler.compile_single_fits(
    spectra_table=spectra_table,
    output_file="all_spectra.fits"
)

# Generate metadata table
metadata = compiler.create_metadata_table(
    spectra_table=spectra_table,
    output_files=output_files,
    output_dir="./output"
)

Workflow Examples

Complete Spectroscopic Analysis Pipeline

from euclidkit.core.data_access import EuclidArchive
from euclidkit.core.spectra import SpectrumCompiler
import pandas as pd

# 1. Initialize archive
archive = EuclidArchive(environment='PDR')
archive.login()

# 2. Load your QSO candidates
qso_candidates = pd.read_csv('qso_candidates.csv')

# 3. Crossmatch with Euclid MER catalogue
crossmatches = archive.crossmatch_sources(
    user_table=qso_candidates,
    radius=2.0,  # 2 arcsecond radius
    output_file='qso_crossmatches.fits'
)

# 4. Find available spectra
spectra_sources = archive.query_spectra_sources(
    crossmatch_table=crossmatches,
    output_file='qso_spectra_sources.fits'
)

print(f"Found {len(spectra_sources)} spectra for {len(crossmatches)} crossmatches")

# 5. Create combined FITS file (for small samples)
if len(spectra_sources) <= 1000:
    combined_spectra = archive.combine_spectra_to_fits(
        spectra_table=spectra_sources,
        output_file='qso_combined_spectra.fits'
    )
    print(f"Combined spectra saved to: {combined_spectra}")

# 6. Or use chunked compilation for large samples
else:
    compiler = SpectrumCompiler(max_extensions=2000)
    output_files = compiler.compile_spectra(
        spectra_table=spectra_sources,
        output_dir='./spectra_chunks',
        output_prefix='qso_spectra'
    )
    print(f"Created {len(output_files)} chunked files")

archive.logout()

Diagnostics

Check your installation and environment:

# Check all components
euclidkit diagnostics

# Check specific components
euclidkit diagnostics --check-deps --check-data

Archive Environments

Use --environment (CLI) or environment=... (Python API) to select the archive backend:

  • PDR: Public Data Release archive.
  • IDR: Internal Data Release archive (consortium access).
  • OTF: On-the-fly archive environment.
  • REG: Regression/testing archive environment.

For IDR, you can also select the field with --idr-field:

  • WIDE: Uses the IDR WIDE MER catalogue.
  • DEEP: Uses the IDR DEEP MER catalogue.

Examples:

# IDR WIDE (default IDR field)
euclidkit crossmatch \
  --input my_sources.fits \
  --output xmatch_wide.fits \
  --environment IDR \
  --idr-field WIDE

# IDR DEEP
euclidkit crossmatch \
  --input my_sources.fits \
  --output xmatch_deep.fits \
  --environment IDR \
  --idr-field DEEP

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Documentation

For detailed documentation and examples, see the project documentation on Read the Docs.

Author

Yuming Fu (@rudolffu)

License

This project is licensed under the BSD 3-Clause License - see the LICENSE file for details.

Acknowledgments

  • ESA Euclid Mission and Euclid Consortium
  • ESA Datalabs and Euclid Data Space infrastructure team
  • Astropy and astroquery communities

Changelog

Latest Changes

  • Spectroscopic Pipeline: Complete pipeline for accessing and combining Euclid spectra
  • CLI Integration: Added --combine-output option to query-spectra command
  • TAP Upload: Improved query performance using TAP table uploads
  • FITS Compilation: Efficient multi-extension FITS file creation
  • Error Handling: Robust handling of long filenames and missing data

See CHANGELOG.md for detailed version history.
