Translate BED/GFF to nodes #153

Merged
Chris7 merged 9 commits into main from bed-gff-translate on Mar 11, 2025

@Chris7 Chris7 commented Feb 24, 2025

This translates GFF/BED entries so that they are node-centric.
Here's how simple.gff gets translated for sample foo (simple.fa + simple.vcf):

m123	HAVANA	gene	1	20	.	-	.	ID=ENSG00000294541.1
m123	HAVANA	transcript	1	20	.	-	.	ID=ENST00000724296.1;Parent=ENSG00000294541.1
m123	HAVANA	exon	4	8	.	-	.	ID=exon:ENST00000724296.1:1;Parent=ENST00000724296.1
m123	HAVANA	exon	10	14	.	-	.	ID=exon:ENST00000724296.1:2;Parent=ENST00000724296.1
m123	HAVANA	exon	16	19	.	-	.	ID=exon:ENST00000724296.1:3;Parent=ENST00000724296.1
m123	ENSEMBL	gene	3	15	.	-	.	ID=ENSG00000277248.1
m123	ENSEMBL	transcript	3	15	.	-	.	ID=ENST00000615943.1;Parent=ENSG00000277248.1
m123	ENSEMBL	exon	3	15	.	-	.	ID=exon:ENST00000615943.1:1;Parent=ENST00000615943.1
3	HAVANA	gene	1	3	.	-	.	ID=ENSG00000294541.1
3	HAVANA	gene	4	5	.	-	.	ID=ENSG00000294541.1
3	HAVANA	gene	6	20	.	-	.	ID=ENSG00000294541.1
3	HAVANA	transcript	1	3	.	-	.	ID=ENST00000724296.1;Parent=ENSG00000294541.1
3	HAVANA	transcript	4	5	.	-	.	ID=ENST00000724296.1;Parent=ENSG00000294541.1
3	HAVANA	transcript	6	20	.	-	.	ID=ENST00000724296.1;Parent=ENSG00000294541.1
3	HAVANA	exon	4	5	.	-	.	ID=exon:ENST00000724296.1:1;Parent=ENST00000724296.1
3	HAVANA	exon	6	8	.	-	.	ID=exon:ENST00000724296.1:1;Parent=ENST00000724296.1
3	HAVANA	exon	10	14	.	-	.	ID=exon:ENST00000724296.1:2;Parent=ENST00000724296.1
3	HAVANA	exon	16	19	.	-	.	ID=exon:ENST00000724296.1:3;Parent=ENST00000724296.1
3	ENSEMBL	gene	4	5	.	-	.	ID=ENSG00000277248.1
3	ENSEMBL	gene	6	15	.	-	.	ID=ENSG00000277248.1
3	ENSEMBL	transcript	4	5	.	-	.	ID=ENST00000615943.1;Parent=ENSG00000277248.1
3	ENSEMBL	transcript	6	15	.	-	.	ID=ENST00000615943.1;Parent=ENSG00000277248.1
3	ENSEMBL	exon	4	5	.	-	.	ID=exon:ENST00000615943.1:1;Parent=ENST00000615943.1
3	ENSEMBL	exon	6	15	.	-	.	ID=exon:ENST00000615943.1:1;Parent=ENST00000615943.1
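One way to read the listing above is that a feature spanning several nodes is cut into one line per node it overlaps, for example gene 1-20 becoming 1-3, 4-5, and 6-20. The following is a minimal sketch of that splitting step only, assuming 1-based inclusive coordinates and a hypothetical list of node end positions; `split_interval` and its boundary convention are illustrative and are not taken from this PR's code. The listing may reflect additional rules (for example, the 3-15 features appear only as 4-5 and 6-15), which this sketch does not model.

```rust
/// Split a 1-based, inclusive interval `[start, end]` at the given boundaries.
///
/// Hypothetical convention: a boundary `b` means "a node ends at position b",
/// so the next piece starts at `b + 1`. Assumes `start <= end`.
fn split_interval(start: u64, end: u64, boundaries: &[u64]) -> Vec<(u64, u64)> {
    // Keep only boundaries that actually cut the interval. A boundary equal
    // to `end` is skipped because it would leave an empty trailing piece.
    let mut cuts: Vec<u64> = boundaries
        .iter()
        .copied()
        .filter(|&b| b >= start && b < end)
        .collect();
    cuts.sort_unstable();
    cuts.dedup();

    let mut pieces = Vec::with_capacity(cuts.len() + 1);
    let mut piece_start = start;
    for cut in cuts {
        pieces.push((piece_start, cut));
        piece_start = cut + 1;
    }
    // The last piece always runs to the end of the original interval.
    pieces.push((piece_start, end));
    pieces
}
```

With boundaries inferred from the listing (nodes ending at positions 3 and 5), `split_interval(1, 20, &[3, 5])` returns `[(1, 3), (4, 5), (6, 20)]`, and `split_interval(10, 14, &[3, 5])` returns the interval unchanged.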

Notably, repeated IDs are allowed by the spec:



ID
    Indicates the ID of the feature. The ID attribute is required for features that have children (e.g. gene and mRNAs), or for those that span multiple lines, but are optional for other features. IDs for each feature must be unique within the scope of the GFF file. In the case of discontinuous features (i.e. a single feature that exists over multiple genomic locations) the same ID may appear on multiple lines. All lines that share an ID must collectively represent a single feature.
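Because all lines that share an ID represent a single feature, a consumer of the translated output can reassemble a split feature by grouping lines on the ID attribute. The sketch below assumes tab-separated nine-column GFF lines with `;`-separated attributes; `feature_id` and `group_by_id` are hypothetical helper names, not functions from this PR.

```rust
use std::collections::BTreeMap;

/// Extract the value of the `ID` attribute from a GFF attributes column,
/// e.g. "ID=x;Parent=y" -> Some("x").
fn feature_id(attributes: &str) -> Option<&str> {
    attributes
        .split(';')
        .find_map(|pair| pair.trim().strip_prefix("ID="))
}

/// Group the (start, end) ranges of GFF lines by their ID attribute.
/// Comment lines, short lines, and lines with unparsable coordinates
/// or no ID are skipped.
fn group_by_id(gff: &str) -> BTreeMap<String, Vec<(u64, u64)>> {
    let mut groups: BTreeMap<String, Vec<(u64, u64)>> = BTreeMap::new();
    for line in gff.lines() {
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let cols: Vec<&str> = line.split('\t').collect();
        if cols.len() < 9 {
            continue;
        }
        // Columns 4 and 5 (indices 3 and 4) hold the start and end positions.
        if let (Ok(start), Ok(end)) = (cols[3].parse::<u64>(), cols[4].parse::<u64>()) {
            if let Some(id) = feature_id(cols[8]) {
                groups.entry(id.to_string()).or_default().push((start, end));
            }
        }
    }
    groups
}
```

Applied to the three translated `gene` lines for ENSG00000294541.1, this collects the ranges 1-3, 4-5, and 6-20 under one key, which together cover the original 1-20 feature.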

@Chris7 Chris7 requested a review from dkhofer March 7, 2025 12:45
translate_bed(
    conn,
    &collection,
    "foo",

This is minor, but it might be good to add a comment here and in the GFF code saying this is a sample from simple.vcf; it took me a couple of minutes to work my way through the data flow.

@dkhofer dkhofer left a comment

LGTM!

@Chris7 Chris7 merged commit a4a91bb into main Mar 11, 2025
@Chris7 Chris7 deleted the bed-gff-translate branch March 11, 2025 14:55