
[ENH] Add sequence_statistics utility for RNA/DNA sequence analysis#579

Open
Vaishnav88sk wants to merge 1 commit into gc-os-ai:main from Vaishnav88sk:feature/sequence-statistics-utility



Vaishnav88sk commented Apr 29, 2026

Contributor

Add gc_content(), nucleotide_composition(), and sequence_summary() utility functions for aptamer sequence profiling and screening.

These functions provide essential metrics (GC content, nucleotide breakdown, batch statistics) that researchers need during aptamer design workflows.

Reference Issues/PRs

Fixes #576

What does this implement/fix? Explain your changes.

Adds a new pyaptamer/utils/_sequence_stats.py module with three utility functions for aptamer sequence profiling:

  • gc_content(sequence): computes GC content as a float in the range 0.0–1.0. Case-insensitive; supports both DNA (ACGT) and RNA (ACGU).
  • nucleotide_composition(sequence): returns a dict with per-nucleotide counts and frequencies. Unknown characters are grouped under "other".
  • sequence_summary(sequences): batch analysis that returns a pd.DataFrame with the columns sequence, length, gc_content, A, C, G, T, U.

All three are exported from pyaptamer.utils:

    from pyaptamer.utils import gc_content, nucleotide_composition, sequence_summary
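
For readers who want a feel for the behaviour described above, here is a minimal standalone sketch. It is not the code from this PR. Several details are assumptions made for illustration: an empty sequence gives 0.0, unknown characters stay in the denominator, the composition dict uses "counts" and "frequencies" keys, and the A/C/G/T/U columns hold counts. The helper summary_rows is hypothetical; it only builds the row dicts that sequence_summary would presumably wrap in a pd.DataFrame.

```python
from collections import Counter

# Assumption: only these symbols count as nucleotides; anything else is "other".
NUCLEOTIDES = ("A", "C", "G", "T", "U")


def _check_str(sequence):
    # The PR's test list mentions a TypeError for non-string inputs.
    if not isinstance(sequence, str):
        raise TypeError(f"sequence must be a str, got {type(sequence).__name__}")
    return sequence.upper()  # case-insensitive handling


def gc_content(sequence):
    """Fraction of G and C characters, between 0.0 and 1.0."""
    seq = _check_str(sequence)
    if not seq:
        return 0.0  # assumption: empty input yields 0.0 instead of raising
    return (seq.count("G") + seq.count("C")) / len(seq)


def nucleotide_composition(sequence):
    """Per-nucleotide counts and frequencies; unknown characters go to "other"."""
    seq = _check_str(sequence)
    tally = Counter(seq)
    counts = {n: tally.get(n, 0) for n in NUCLEOTIDES}
    counts["other"] = len(seq) - sum(counts.values())
    total = len(seq)
    freqs = {k: (v / total if total else 0.0) for k, v in counts.items()}
    return {"counts": counts, "frequencies": freqs}  # hypothetical layout


def summary_rows(sequences):
    """One dict per sequence, keyed by the column names listed in the PR."""
    rows = []
    for seq in sequences:
        counts = nucleotide_composition(seq)["counts"]
        row = {"sequence": seq, "length": len(seq), "gc_content": gc_content(seq)}
        row.update({n: counts[n] for n in NUCLEOTIDES})  # assumption: counts
        rows.append(row)
    return rows


print(gc_content("acgu"))
print(nucleotide_composition("ACGTN")["counts"])
print(summary_rows(["GGCC", "AU"]))
```

With pandas installed, pd.DataFrame(summary_rows(seqs)) would give a table with the same column names as the one described for sequence_summary.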

What should a reviewer concentrate their feedback on?

Did you add any tests for the change?

Yes - pyaptamer/utils/tests/test_sequence_stats.py with 15 tests covering:

  • Normal DNA/RNA inputs, edge cases (empty string, single char)
  • Case insensitivity
  • Unknown character handling
  • Type validation (TypeError for non-string inputs)
  • Batch summary with variable-length sequences
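
As a rough illustration of what tests in these categories might look like, the sketch below uses plain assert-style test functions that pytest would collect. It is not taken from test_sequence_stats.py. The gc_content defined here is a stand-in so the snippet runs on its own; in the real test file the function would presumably be imported from pyaptamer.utils. The expected values for empty input and unknown characters are assumptions.

```python
def gc_content(sequence):
    # Stand-in for illustration only; not the implementation from this PR.
    if not isinstance(sequence, str):
        raise TypeError("sequence must be a str")
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def test_empty_string():
    # Assumed convention: an empty sequence has GC content 0.0.
    assert gc_content("") == 0.0


def test_single_char():
    assert gc_content("G") == 1.0
    assert gc_content("A") == 0.0


def test_case_insensitive():
    assert gc_content("gcAT") == gc_content("GCAT") == 0.5


def test_unknown_characters():
    # Assumption: unknown symbols such as N still count toward the length.
    assert gc_content("GCNN") == 0.5


def test_rejects_non_string():
    try:
        gc_content(1234)
    except TypeError:
        return
    raise AssertionError("expected TypeError for non-string input")
```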

Any other comments?

PR checklist

  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.
  • Added/modified tests
  • Used pre-commit hooks when committing to ensure that code is compliant with hooks. Install hooks with pre-commit install.
    To run hooks independent of commit, execute pre-commit run --all-files

@Vaishnav88sk

Contributor Author

@fkiraly @NennoMP

@Vaishnav88sk

Contributor Author

@SimonBlanke


Development

Successfully merging this pull request may close these issues.

[ENH] Add sequence statistics utility for RNA/DNA sequence analysis
