@BenjaminDEMAILLE

Add native CRAM output support to STAR

Overview

This contribution adds CRAM output to STAR for both unsorted and coordinate-sorted alignments, using HTSlib’s native APIs. The implementation is backward compatible and preserves all existing BAM/SAM functionality.


Motivation

  • CRAM is a reference-based, highly compressed format for storing sequencing alignments. Its specification is maintained by GA4GH, and it is supported by major genomics tools.
  • STAR previously supported only SAM and BAM output. Adding CRAM output lets users save disk space and feed STAR results directly into pipelines that expect CRAM.

Features

  • New output format:
    Users can now specify --outSAMtype CRAM Unsorted or --outSAMtype CRAM SortedByCoordinate to produce CRAM files directly from STAR.

  • HTSlib integration:
    Output is written through HTSlib’s htsFile handle and sam_write1, so the files follow the CRAM specification and the same code path can serve other formats that HTSlib writes.

  • Automatic format detection:
    The output format is read from the --outSAMtype parameter, and the matching write mode is passed to HTSlib when the output file is opened.

  • Unsorted and sorted output:

    • Unsorted CRAM output is written directly during alignment.
    • For coordinate-sorted output, STAR merges the sorted BAM bins and writes the result as CRAM using sam_write1.
  • Backward compatibility:
    Existing BAM/SAM output logic is preserved. If CRAM is not requested, STAR behaves as before.
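
The format detection described above can be sketched as a small lookup. This is an illustration only, not the code from this PR: the helper name htsWriteMode is hypothetical, and it assumes the first word of --outSAMtype arrives as a plain string. The mode strings themselves ("w", "wb", "wc") are the write modes HTSlib's hts_open accepts for SAM, BAM and CRAM.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper (not taken from the PR): map the first word of
// --outSAMtype to the mode string that would be handed to hts_open().
// HTSlib write modes: "w" = SAM text, "wb" = BAM, "wc" = CRAM.
std::string htsWriteMode(const std::string& outSAMtype) {
    if (outSAMtype == "SAM")  return "w";
    if (outSAMtype == "BAM")  return "wb";
    if (outSAMtype == "CRAM") return "wc";
    // Reject anything else rather than silently falling back to a default.
    throw std::invalid_argument("unsupported --outSAMtype: " + outSAMtype);
}
```

With such a helper, the open call would look roughly like hts_open(path, htsWriteMode(type).c_str()). Throwing on unknown values keeps a typo in the parameter from producing output in an unintended format.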


Implementation Details

  • The BAMoutput class now supports both legacy BGZF output and htslib-based output (htsFile), with logic to route records to the appropriate backend.
  • Output file opening in ReadAlignChunk.cpp and bamSortByCoordinate.cpp is updated to use hts_open with the correct mode for CRAM/SAM/BAM.
  • The merging logic for coordinate-sorted output is updated:
    • If CRAM (or SAM/BAM via htslib) is requested, each sorted bin is read and records are written to the final output using sam_write1.
    • If BAM is requested and htslib is not used, the legacy bam_cat logic is preserved.
  • All changes are guarded to ensure no impact on users who do not request CRAM output.
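
The choice between the two merge paths for coordinate-sorted output can be sketched as follows. This is a simplified model under stated assumptions, not the PR's actual code: MergeStrategy, chooseMergeStrategy and the useHtslib flag are hypothetical names introduced for illustration. Only the decision rule comes from the description above.

```cpp
#include <cassert>
#include <string>

// Hypothetical names for the two merge paths described in the PR text.
enum class MergeStrategy {
    LegacyBamCat,      // concatenate sorted BAM bins with the existing bam_cat logic
    HtslibRecordWrite  // read each sorted bin and write records via sam_write1
};

// Decide how the sorted bins are merged into the final output file.
// format:    first word of --outSAMtype ("SAM", "BAM" or "CRAM")
// useHtslib: assumed flag meaning the htslib backend was selected for SAM/BAM
MergeStrategy chooseMergeStrategy(const std::string& format, bool useHtslib) {
    // CRAM can only be produced by the htslib path; SAM/BAM take it on request.
    if (format == "CRAM" || useHtslib) {
        return MergeStrategy::HtslibRecordWrite;
    }
    // Everything else keeps the pre-existing behavior untouched.
    return MergeStrategy::LegacyBamCat;
}
```

Keeping the decision in one function makes the backward-compatibility claim easy to check: a plain BAM request without the htslib backend always reaches the legacy branch.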

Usage

  • Unsorted CRAM output:
    --outSAMtype CRAM Unsorted

  • Coordinate-sorted CRAM output:
    --outSAMtype CRAM SortedByCoordinate
