This skill calculates statistical metrics for sequence alignments including identity, conservation, entropy, and substitution patterns. These metrics are essential for assessing alignment quality, identifying conserved regions, and understanding evolutionary relationships.
Install the dependencies with `pip install biopython numpy`.

Tell your AI agent what you want to do:
- "Calculate pairwise identity between all sequences in this alignment"
- "Show me the conservation score for each column"
- "What is the average sequence identity in this alignment?"

Pairwise identity:
- "Create a pairwise identity matrix for this alignment"
- "What is the percent identity between sequence A and sequence B?"
- "Find the most similar pair of sequences in the alignment"

Conservation:
- "Calculate the conservation score at each position"
- "Which columns are most conserved?"
- "Plot a conservation profile across the alignment"

Entropy and information content:
- "Calculate Shannon entropy for each column"
- "What is the information content at each position?"
- "Find the most variable positions in the alignment"

Gaps:
- "What fraction of the alignment is gaps?"
- "Which sequences have the most gaps?"
- "How many gap-free columns are there?"

Substitutions:
- "Count the substitutions between all pairs of sequences"
- "What are the most common substitution types?"
- "Build a substitution matrix from this alignment"

The agent will then:
1. Load the alignment file
2. Calculate the requested metrics (identity, conservation, entropy, etc.)
3. Summarize the results (averages, distributions, extremes)
4. Identify notable patterns (highly conserved or highly variable regions)
5. Output tables, matrices, or profiles as appropriate
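
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the agent's actual implementation: `read_fasta_alignment` and `compare_pair` are hypothetical helpers written for this example, the toy sequences are made up, and skipping gapped columns is an assumed convention. In practice the loading step would more likely use Biopython's `Bio.AlignIO.read`; a tiny stdlib parser is used here only to keep the sketch dependency-free.

```python
import io
from collections import Counter
from itertools import combinations


def read_fasta_alignment(handle):
    """Hypothetical helper: parse aligned FASTA into {name: sequence}."""
    seqs, name = {}, None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = ""
        elif line and name is not None:
            seqs[name] += line
    if len({len(s) for s in seqs.values()}) != 1:
        raise ValueError("expected equal-length aligned sequences")
    return seqs


def compare_pair(a, b, gap="-"):
    """Percent identity and substitution counts for two aligned rows.

    Assumption: columns where either row has a gap are skipped.
    """
    matches = compared = 0
    substitutions = Counter()
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue
        compared += 1
        if x == y:
            matches += 1
        else:
            # Unordered pair, so A/G and G/A are tallied together.
            substitutions[tuple(sorted((x, y)))] += 1
    identity = 100.0 * matches / compared if compared else 0.0
    return identity, substitutions


# Step 1: load (toy alignment, for illustration only).
fasta = ">seqA\nACGT-ACGT\n>seqB\nACGTTACGA\n>seqC\nACCTTACGA\n"
alignment = read_fasta_alignment(io.StringIO(fasta))

# Step 2: calculate pairwise identity and substitution counts.
identity = {}
all_subs = Counter()
for n1, n2 in combinations(alignment, 2):
    pct, subs = compare_pair(alignment[n1], alignment[n2])
    identity[(n1, n2)] = pct
    all_subs += subs

# Steps 3-4: summarize and pick out extremes.
mean_identity = sum(identity.values()) / len(identity)
best_pair = max(identity, key=identity.get)

# Step 5: output a simple table.
for (n1, n2), pct in identity.items():
    print(f"{n1}\t{n2}\t{pct:.1f}")
print("mean identity:", round(mean_identity, 1), "most similar:", best_pair)
print("substitution counts:", dict(all_subs))
```

Note that the substitution tallies here are raw counts; turning them into a scoring matrix (for example log-odds scores) is a further step this sketch does not attempt.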
| Metric | What It Measures | Interpretation |
|---|---|---|
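| Identity | Fraction of positions with exactly matching residues | Higher = more similar |
| Conservation | Frequency of the most common residue in a column | Higher = less variable |
| Entropy | Variability of residues in a column (Shannon entropy) | Lower = more conserved |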
| Information Content | Constraint level | Higher = more constrained |
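
As a rough sketch of how the per-column metrics in the table relate, the example below computes all three for one column at a time. Several things here are assumptions rather than the skill's confirmed method: gaps are excluded from the frequencies, entropy is in bits, and information content is taken as the common definition `log2(alphabet_size) - entropy` with no small-sample correction. The function name `column_metrics` and the toy rows are illustrative only.

```python
import math
from collections import Counter


def column_metrics(column, alphabet_size=4, gap="-"):
    """Conservation, Shannon entropy (bits), and information content.

    Assumptions: gaps are ignored; alphabet_size is 4 for DNA, 20 for protein.
    Returns None for an all-gap column.
    """
    residues = [c for c in column if c != gap]
    if not residues:
        return None
    counts = Counter(residues)
    total = len(residues)
    freqs = [n / total for n in counts.values()]
    entropy = sum(-p * math.log2(p) for p in freqs)
    return {
        "conservation": max(freqs),            # most common residue frequency
        "entropy": entropy,                     # 0 = fully conserved
        "information": math.log2(alphabet_size) - entropy,
    }


rows = ["ACGT", "ACGA", "ACTC", "AC-G"]  # toy DNA alignment
profile = [column_metrics(col) for col in zip(*rows)]

for i, m in enumerate(profile, start=1):
    print(i, {k: round(v, 3) for k, v in m.items()})
```

A fully conserved column gets conservation 1.0 and entropy 0, while a column with all four bases equally represented reaches the maximum entropy of 2 bits and an information content of 0.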
- Identity can be calculated in different ways (counting gapped columns or ignoring them); specify which you need
- Conservation and entropy are inversely related
- For proteins, use BLOSUM62 for scoring; for DNA, use simple match/mismatch
- Gap-rich columns often indicate alignment uncertainty
- Compare metrics across columns to find functional regions
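
Two of the tips above concern gaps, and a short sketch may make them concrete. The first function shows how the same pair of sequences gets a different percent identity depending on the gap convention; the second summarizes gap content and flags gap-rich columns. The function names, the dictionary keys, and the 0.5 threshold for "gap-rich" are all assumptions chosen for illustration, not fixed definitions.

```python
def identity_variants(a, b, gap="-"):
    """Percent identity under two gap conventions (labels are illustrative)."""
    # Columns where both rows have a gap carry no information; drop them.
    pairs = [(x, y) for x, y in zip(a, b) if not (x == gap and y == gap)]
    matches = sum(1 for x, y in pairs if x == y)
    ungapped = sum(1 for x, y in pairs if x != gap and y != gap)
    return {
        "gaps_as_mismatches": 100.0 * matches / len(pairs) if pairs else 0.0,
        "gaps_ignored": 100.0 * matches / ungapped if ungapped else 0.0,
    }


def gap_report(rows, gap="-", threshold=0.5):
    """Overall gap fraction, gap-free column count, and gap-rich columns.

    Assumption: a column is 'gap-rich' when at least `threshold` of its
    cells are gaps. Column indices are 0-based.
    """
    columns = list(zip(*rows))
    col_fraction = [col.count(gap) / len(col) for col in columns]
    return {
        "gap_fraction": sum(r.count(gap) for r in rows) / (len(rows) * len(columns)),
        "gap_free_columns": sum(1 for f in col_fraction if f == 0),
        "gap_rich_columns": [i for i, f in enumerate(col_fraction) if f >= threshold],
    }


rows = ["MK-LV", "MKALV", "M--LV"]  # toy protein alignment
variants = identity_variants(rows[0], rows[2])
report = gap_report(rows)

print(variants)
print(report)
```

The gap between the two identity values grows with the number of gapped columns, which is why it helps to state the convention when reporting a figure.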