Commit c12d420

Merge pull request #205 from puja-trivedi/generate_linkml_ai
Generate linkml ai
2 parents e0baee8 + 749e079 commit c12d420

6 files changed

+1782
-1
lines changed

linkml-schema-ai-tools/README.md

Lines changed: 226 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -39,13 +39,238 @@ Reuse: linkml-schema/bican_biolink.yaml (a biolink subset commonly used in BICAN
3939
### Model
4040
- openai/gpt-5
4141

42-
## Run 3
42+
## Run 3
4343
### User Prompt
4444
Task: create a linkml model
4545
Background: You are an expert in data modeling and the tool LinkML.
4646
Goal: Given the 'linkml-schema-ai-tools/gff3.md' file, create a linkml model to represent the data present in a gff3 file. We are only interested in representing feature types that are 'genes' and the associated information stored in the 'attributes' column.
4747
Example Data: 'data/GCF_000003025.6_Sscrofa11.1_genomic.gff' is an example of how data is represented in a gff3 file. Use this file to help refine the model.
4848
Reuse: linkml-schema/bican_biolink.yaml (a biolink subset commonly used in BICAN) and linkml-schema/bican_core.yaml (BICAN core metadata such as versioning and checksums)
4949
Testing: Test the generated linkml model by running 2 commands. First run 'linkml lint' and then run 'linkml generate pydantic'.
50+
### Model
51+
- openai/gpt-5
52+
53+
## Run 4
54+
### User Prompt
55+
Task:
56+
Design a LinkML schema that models the metadata of a GFF3 file, focusing specifically on gene-level features and their associated attributes.
57+
58+
Background:
59+
You are an expert in data modeling and the LinkML framework. The goal is to capture the essential structure and semantics of gene features as represented in GFF3 files.
60+
61+
Inputs:
62+
Use the following reference files to understand the GFF3 structure and attribute conventions:
63+
64+
linkml-schema-ai-tools/gff3.md — general GFF3 specification summary
65+
66+
linkml-schema-ai-tools/ncbi_gff3.txt — documentation from NCBI GFF3 format
67+
68+
linkml-schema-ai-tools/ensembl_gff3.md — documentation from Ensembl GFF3 format
69+
70+
71+
Goal:
72+
73+
Create a LinkML schema that represents the GFF3 file structure, but only for features of type “gene”.
74+
75+
Model the core GFF3 columns (e.g., seqid, source, type, start, end, score, strand, phase) as appropriate.
76+
77+
Focus primarily on modeling the attributes column, unifying attribute definitions and types between NCBI and Ensembl conventions. For unified attributes, also provide a mapping to the original attribute names from NCBI and Ensembl.
78+
79+
Include clear descriptions, slot ranges, and enums where applicable.
80+
81+
Reuse existing entities, mixins, and patterns from:
82+
83+
linkml-schema/bican_biolink.yaml — Biolink subset used in BICAN
84+
85+
linkml-schema/bican_core.yaml — Core BICAN metadata (e.g., versioning, provenance, checksums)
86+
87+
DO NOT READ FROM ANY OTHER FILES.
88+
89+
Deliverables:
90+
91+
A complete LinkML schema file (YAML) defining the GFF3Gene class (or similar), associated slots, and unified attribute representation. Place schema in a new file in the linkml-schema-ai-tools directory.
92+
93+
Include appropriate schema metadata (e.g., id, name, description, version, prefixes, imports).
94+
95+
Ensure semantic reuse of terms consistent with Biolink and BICAN naming conventions.
96+
97+
Testing:
98+
After creating the schema, validate it by running the following commands:
99+
100+
linkml lint
101+
linkml generate pydantic
102+
103+
104+
Ensure both commands execute successfully without errors or warnings.
105+
### Model
106+
- openai/gpt-5
107+
108+
## Run 5
109+
### User Prompt
110+
Task: Design a LinkML schema that models gene annotations derived from GFF3 gene features and organizes them into a reusable “genome annotation” package. The schema should:
111+
112+
1. Represent individual genes (derived from GFF3 rows where type=gene)
113+
2. Represent a genome annotation dataset that the genes are referenced from
114+
3. Represent the reference genome assembly used by the dataset
115+
4. Provide a top-level collection to hold multiple gene annotations, genome annotations, and assemblies
116+
117+
Background: You are an expert in data modeling and the LinkML framework. The goal is to capture the essential structure and semantics of gene features from GFF3, while structuring the schema into a gene record class that references a dataset class (the genome annotation) with assembly context and minimal, clear controlled vocabularies.
118+
119+
Inputs: Use ONLY the following reference files to understand GFF3 structure and attribute conventions:
120+
121+
- linkml-schema-ai-tools/gff3.md — general GFF3 specification summary
122+
- linkml-schema-ai-tools/ncbi_gff3.txt — documentation from NCBI GFF3 format
123+
- linkml-schema-ai-tools/ensembl_gff3.md — documentation from Ensembl GFF3 format
124+
125+
Goal and required structure:
126+
127+
- Provide the following classes and relationships:
128+
129+
1. GeneAnnotation (is_a: gene)
130+
131+
- Represents a single gene derived from a GFF3 row with type=gene.
132+
- Must include: a small set of unifying attributes from GFF3’s attributes column, minimal biotype-like classification, and an explicit reference to the genome annotation dataset it came from.
133+
- Should include gene identity and optional location context; coordinates are allowed but keep them minimal and gene-appropriate.
134+
135+
2. GenomeAnnotation (is_a: genome)
136+
137+
- Represents a genome annotation dataset (e.g., an authority’s release of gene annotations).
138+
- Must include dataset-level metadata such as version, digest, content_url, and authority.
139+
- Must reference a genome assembly.
140+
141+
3. GenomeAssembly (is_a: named thing; mixin: thing with taxon)
142+
143+
- Represents the genome assembly used by the genome annotation.
144+
- Include minimal assembly metadata (e.g., version, strain).
145+
146+
4. AnnotationCollection (tree_root: true)
147+
148+
- A root container class for transporting sets of GeneAnnotation, GenomeAnnotation, and GenomeAssembly instances.
149+
- Provide list attributes for each of these item types, using inlined_as_list: true.
150+
151+
Class details:
152+
153+
1. GeneAnnotation (is_a: gene)
154+
155+
- Core identity and provenance:
156+
157+
- slots: source_id, molecular_type
158+
- attributes:
159+
- referenced_in: required, inlined, any_of: [GenomeAnnotation, string] (the genome annotation dataset this gene came from)
160+
161+
- GFF3 gene columns (keep minimal and appropriate for “gene”):
162+
163+
- seqid, source, type (constrain to “gene”), start, end, score, strand
164+
- Do NOT include “phase” for the gene feature (phase is for CDS)
165+
- Use a one-based integer type for start and end
166+
167+
- Unified attributes (Column 9 harmonization across NCBI/Ensembl):
168+
169+
- molecular_type: any_of: [BioType, string] with small controlled vocabulary (see enums)
170+
171+
- source_id: schema:identifier for the authority-specific identifier for this gene
172+
173+
- Optional harmonized attributes with annotations for original provenance:
174+
175+
- ensembl_gene_id (annotations: ensembl_attr: gene_id)
176+
- ncbi_gene_id (range: uriorcurie; annotations: ncbi_attr: Dbxref(GeneID))
177+
- biotype or a richer “gene_biotype” value (if provided), but map/roll-up to molecular_type for a simple classification
178+
- symbol, name, description, synonym, xref, version
179+
- locus_tag, pseudo (boolean), pseudogene_subtype, note
180+
181+
- Provide slot_usage with exact_mappings/narrow_mappings to GFF3 attributes where applicable (e.g., ID, Name, Alias, Dbxref) and to the GFF3 columns (seqid, source, type, start, end, score, strand).
182+
183+
- Constraints:
184+
185+
- type should be constrained to “gene” via an enum (e.g., GeneFeatureType with permissible value “gene”)
186+
- start/end should use a custom integer type with minimum_value: 1
187+
188+
2. GenomeAnnotation (is_a: genome)
189+
190+
- Dataset-level metadata:
191+
- slots: version, digest, content_url, authority
192+
- Link to assembly:
193+
- attributes:
194+
- reference_assembly: required, inlined, any_of: [GenomeAssembly, string]
195+
- Authority should be controlled by a small enum (e.g., AuthorityType with ENSEMBL, NCBI)
196+
197+
3. GenomeAssembly (is_a: named thing; mixins: [thing with taxon])
198+
199+
- slots: version, strain
200+
- The mixin ensures taxon context is available for the assembly
201+
202+
4. AnnotationCollection (tree_root: true)
203+
204+
- attributes:
205+
206+
- annotations: multivalued, inlined_as_list: true, range: GeneAnnotation
207+
- genome_annotations: multivalued, inlined_as_list: true, range: GenomeAnnotation
208+
- genome_assemblies: multivalued, inlined_as_list: true, range: GenomeAssembly
209+
210+
Slots to define (non-exhaustive; use clear descriptions and ranges):
211+
212+
- seqid (maps to GFF3 Column 1; consider accession.version semantics)
213+
- source (GFF3 Column 2)
214+
- type (GFF3 Column 3; constrained to gene)
215+
- start (GFF3 Column 4; range: one_based_int with minimum_value: 1)
216+
- end (GFF3 Column 5; range: one_based_int with minimum_value: 1)
217+
- score (GFF3 Column 6; range: float)
218+
- strand (GFF3 Column 7; use StrandEnum with +, -, ., ?)
219+
- molecular_type (any_of: [BioType, string])
220+
- authority (range: AuthorityType)
221+
- source_id (slot_uri: schema:identifier)
222+
- strain (string)
223+
- referenced_in (as described above)
224+
- reference_assembly (as described above)
225+
226+
Types:
227+
228+
- one_based_int: typeof: integer; minimum_value: 1
229+
230+
Enums:
231+
232+
- StrandEnum: +, -, ., ?
233+
- GeneFeatureType: permissible value: gene
234+
- BioType: permissible values: protein_coding, noncoding (keep intentionally small and simple)
235+
- AuthorityType: permissible values: ENSEMBL, NCBI
236+
237+
Prefixes and imports:
238+
239+
- prefixes: linkml, bican, schema, NCBIGene, NCBIAssembly, NCBITaxon (add others as needed for GFF3 or Ensembl identifiers)
240+
- imports: linkml:types, bican_biolink, bican_core
241+
- default_prefix: bican
242+
- default_range: string
243+
244+
Attribute harmonization guidance:
245+
246+
- For each unifying gene attribute, add annotations indicating original source keys (e.g., ncbi_attr: gene_biotype; ensembl_attr: biotype).
247+
- For identifiers, ensure ncbi_gene_id is uriorcurie (e.g., NCBIGene:####) and ensembl_gene_id is a string (e.g., ENSG..., ENSMUSG...).
248+
- Use exact_mappings or narrow_mappings to GFF3 keys (e.g., gff3:ID, gff3:Name, gff3:Dbxref) for clarity.
249+
250+
Deliverables:
251+
252+
- A single LinkML YAML file placed in linkml-schema-ai-tools defining:
253+
254+
- The classes: GeneAnnotation, GenomeAnnotation, GenomeAssembly, AnnotationCollection
255+
- The slots, types, and enums listed above
256+
- Clear descriptions for classes/slots and mapping annotations to GFF3
257+
258+
- Include top-level schema metadata (id, name, description, version, prefixes, imports)
259+
260+
- Ensure semantic reuse of terms consistent with Biolink/BICAN naming where appropriate
261+
262+
Testing:
263+
264+
- Run:
265+
266+
- linkml lint
267+
- linkml generate pydantic
268+
269+
- Both commands must execute successfully without errors or warnings.
270+
271+
Constraints:
272+
273+
- DO NOT READ FROM ANY OTHER FILES beyond the three Inputs specified above.
274+
50275
### Model
51276
- openai/gpt-5
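
The column constraints requested in the Run 5 prompt (type fixed to "gene", one-based start/end, a four-value strand enum, no phase) can be sketched in plain Python. This is only an illustration of the rules, not the schema or the pydantic classes that LinkML would generate; the names `GeneRow` and `Strand` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Strand(Enum):
    """Mirrors the StrandEnum values named in the prompt: +, -, ., ?"""
    PLUS = "+"
    MINUS = "-"
    UNSTRANDED = "."
    UNKNOWN = "?"


@dataclass(frozen=True)
class GeneRow:
    """Hypothetical stand-in for the GFF3 columns kept on GeneAnnotation.

    There is deliberately no 'phase' field: the prompt excludes it for genes.
    """
    seqid: str
    source: str
    type: str
    start: int
    end: int
    strand: Strand
    score: Optional[float] = None

    def __post_init__(self) -> None:
        # GeneFeatureType: the only permissible value is "gene".
        if self.type != "gene":
            raise ValueError(f"type must be 'gene', got {self.type!r}")
        # one_based_int: integer with minimum_value 1.
        for name in ("start", "end"):
            value = getattr(self, name)
            if value < 1:
                raise ValueError(f"{name} must be >= 1, got {value}")


row = GeneRow("GL476399", "ensembl", "gene", 2596494, 2601138, Strand("+"))
```

Constructing a `GeneRow` with `type="exon"` or `start=0` raises `ValueError`, which is the behaviour the enum and the custom integer type are meant to enforce in the schema.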
Lines changed: 127 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,127 @@
1+
#### README ####
2+
3+
-----------------------
4+
GFF FLATFILE DUMPS
5+
-----------------------
6+
Gene annotation is provided in GFF3 format. Detailed specification of
7+
the format is maintained by the Sequence Ontology:
8+
https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
9+
10+
GFF3 files are validated using GenomeTools: http://genometools.org
11+
12+
For chromosomal assemblies, in addition to a file containing all
13+
genes, there are per-chromosome files. If a predicted geneset is
14+
available (generated by Genscan and other ab initio tools), these
15+
genes are in a separate 'abinitio' file.
16+
17+
18+
The 'type' of gene features is:
19+
* "gene" for protein-coding genes
20+
* "ncRNA_gene" for RNA genes
21+
* "pseudogene" for pseudogenes
22+
The 'type' of transcript features is:
23+
* "mRNA" for protein-coding transcripts
24+
* a specific type of RNA transcript such as "snoRNA" or "lnc_RNA"
25+
* "pseudogenic_transcript" for pseudogenes
26+
All transcripts are linked to "exon" features.
27+
Protein-coding transcripts are linked to "CDS", "five_prime_UTR", and
28+
"three_prime_UTR" features.
29+
30+
Attributes for feature types:
31+
(square brackets indicate data which is not available for all features)
32+
* region types:
33+
* ID: Unique identifier, format "<region_type>:<region_name>"
34+
* [Alias]: A comma-separated list of aliases, usually including the
35+
INSDC accession
36+
* [Is_circular]: Flag to indicate circular regions
37+
* gene types:
38+
* ID: Unique identifier, format "gene:<gene_stable_id>"
39+
* biotype: Ensembl biotype, e.g. "protein_coding", "pseudogene"
40+
* gene_id: Ensembl gene stable ID
41+
* version: Ensembl gene version
42+
* [Name]: Gene name
43+
* [description]: Gene description
44+
* transcript types:
45+
* ID: Unique identifier, format "transcript:<transcript_stable_id>"
46+
* Parent: Gene identifier, format "gene:<gene_stable_id>"
47+
* biotype: Ensembl biotype, e.g. "protein_coding", "pseudogene"
48+
* transcript_id: Ensembl transcript stable ID
49+
* version: Ensembl transcript version
50+
* [Note]: If the transcript sequence has been edited (i.e. differs
51+
from the genomic sequence), the edits are described in a note.
52+
* exon
53+
* Parent: Transcript identifier, format "transcript:<transcript_stable_id>"
54+
* exon_id: Ensembl exon stable ID
55+
* version: Ensembl exon version
56+
* constitutive: Flag to indicate if exon is present in all
57+
transcripts
58+
* rank: Integer that shows the 5'->3' ordering of exons
59+
* CDS
60+
* ID: Unique identifier, format "CDS:<protein_stable_id>"
61+
* Parent: Transcript identifier, format "transcript:<transcript_stable_id>"
62+
* protein_id: Ensembl protein stable ID
63+
* version: Ensembl protein version
64+
65+
Metadata:
66+
* genome-build - Build identifier of the assembly e.g. GRCh37.p11
67+
* genome-version - Version of this assembly e.g. GRCh37
68+
* genome-date - The date of the release of this assembly e.g. 2009-02
69+
* genome-build-accession - Genome accession e.g. GCA_000001405.14
70+
* genebuild-last-updated - Date of the last genebuild update e.g. 2013-09
71+
72+
-----------
73+
FILE NAMES
74+
------------
75+
The files are consistently named following this pattern:
76+
<species>.<assembly>.<version>.gff3.gz
77+
78+
<species>: The systematic name of the species.
79+
<assembly>: The assembly build name.
80+
<version>: The version of Ensembl from which the data was exported.
81+
gff3 : All files in these directories are in GFF3 format
82+
gz : All files are compacted with GNU Zip for storage efficiency.
83+
84+
e.g.
85+
Homo_sapiens.GRCh38.81.gff3.gz
86+
87+
For the predicted gene set, an additional abinitio flag is added to the file name.
88+
<species>.<assembly>.<version>.abinitio.gff3.gz
89+
90+
e.g.
91+
Homo_sapiens.GRCh38.81.abinitio.gff3.gz
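
The naming pattern above can be split into its parts with a regular expression. The sketch below is a minimal illustration, not an Ensembl tool: the function name `parse_ensembl_gff3_name` is hypothetical, and it assumes the species part contains no dots, the version is numeric, and the assembly name may itself contain dots. Per-chromosome file names are not handled.

```python
import re
from typing import Dict, Optional

# <species>.<assembly>.<version>[.abinitio].gff3.gz
# Assumptions: species has no dots, version is digits only,
# and the assembly name may contain dots.
_NAME_RE = re.compile(
    r"(?P<species>[^.]+)\."
    r"(?P<assembly>.+)\."
    r"(?P<version>\d+)"
    r"(?P<abinitio>\.abinitio)?"
    r"\.gff3\.gz"
)


def parse_ensembl_gff3_name(filename: str) -> Optional[Dict[str, object]]:
    """Return the name components, or None if the pattern does not match."""
    match = _NAME_RE.fullmatch(filename)
    if match is None:
        return None
    return {
        "species": match.group("species"),
        "assembly": match.group("assembly"),
        "version": int(match.group("version")),
        "abinitio": match.group("abinitio") is not None,
    }


plain = parse_ensembl_gff3_name("Homo_sapiens.GRCh38.81.gff3.gz")
predicted = parse_ensembl_gff3_name("Homo_sapiens.GRCh38.81.abinitio.gff3.gz")
```

For the two example names, `plain` and `predicted` carry the same species, assembly, and version and differ only in the `abinitio` flag.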
92+
93+
------------------
94+
Example GFF3 output
95+
------------------
96+
97+
##gff-version 3
98+
#!genome-build Pmarinus_7.0
99+
#!genome-version Pmarinus_7.0
100+
#!genome-date 2011-01
101+
#!genebuild-last-updated 2013-04
102+
103+
GL476399 Pmarinus_7.0 supercontig 1 4695893 . . . ID=supercontig:GL476399;Alias=scaffold_71
104+
GL476399 ensembl gene 2596494 2601138 . + . ID=gene:ENSPMAG00000009070;Name=TRYPA3;biotype=protein_coding;description=Trypsinogen A1%3B Trypsinogen a3%3B Uncharacterized protein [Source:UniProtKB/TrEMBL%3BAcc:O42608];logic_name=ensembl;version=1
105+
GL476399 ensembl transcript 2596494 2601138 . + . ID=transcript:ENSPMAT00000010026;Name=TRYPA3-201;Parent=gene:ENSPMAG00000009070;biotype=protein_coding;version=1
106+
GL476399 ensembl exon 2596494 2596538 . + . Name=ENSPMAE00000087923;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=1;ensembl_phase=-1;rank=1;version=1
107+
GL476399 ensembl exon 2598202 2598361 . + . Name=ENSPMAE00000087929;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=2;ensembl_phase=1;rank=2;version=1
108+
GL476399 ensembl exon 2599023 2599282 . + . Name=ENSPMAE00000087937;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=1;ensembl_phase=2;rank=3;version=1
109+
GL476399 ensembl exon 2599814 2599947 . + . Name=ENSPMAE00000087952;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=0;ensembl_phase=1;rank=4;version=1
110+
GL476399 ensembl exon 2600895 2601138 . + . Name=ENSPMAE00000087966;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=-1;ensembl_phase=0;rank=5;version=1
111+
GL476399 ensembl CDS 2596499 2596538 . + 0 ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
112+
GL476399 ensembl CDS 2598202 2598361 . + 2 ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
113+
GL476399 ensembl CDS 2599023 2599282 . + 1 ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
114+
GL476399 ensembl CDS 2599814 2599947 . + 2 ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
115+
GL476399 ensembl CDS 2600895 2601044 . + 0 ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
116+
GL476399 ensembl five_prime_UTR 2596494 2596498 . + . Parent=transcript:ENSPMAT00000010026
117+
GL476399 ensembl three_prime_UTR 2601045 2601138 . + . Parent=transcript:ENSPMAT00000010026
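
As a rough illustration of how the example output can be read, the sketch below parses a `#!` metadata line and a gene row, including the percent-encoded attributes column. It assumes tab-separated columns as required by the GFF3 specification (the listing above shows them collapsed to spaces); the function names are hypothetical, and comma-separated multi-valued attributes are left unsplit.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end", "score", "strand", "phase")


def parse_pragma(line: str) -> Optional[Tuple[str, str]]:
    """Split a '#!key value' metadata line into (key, value)."""
    if not line.startswith("#!"):
        return None
    key, _, value = line[2:].strip().partition(" ")
    return key, value.strip()


def parse_attributes(column9: str) -> Dict[str, str]:
    """Split 'key=value;key=value' pairs, then percent-decode each part."""
    attributes = {}
    for pair in column9.split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attributes[unquote(key)] = unquote(value)
    return attributes


def parse_gene_line(line: str) -> Optional[dict]:
    """Return a record for rows whose type column is 'gene', else None."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != "gene":
        return None
    record = dict(zip(GFF3_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    record["attributes"] = parse_attributes(fields[8])
    return record


# The gene row from the example output, with tabs restored between columns.
EXAMPLE = "\t".join([
    "GL476399", "ensembl", "gene", "2596494", "2601138", ".", "+", ".",
    "ID=gene:ENSPMAG00000009070;Name=TRYPA3;biotype=protein_coding;"
    "description=Trypsinogen A1%3B Trypsinogen a3%3B Uncharacterized protein "
    "[Source:UniProtKB/TrEMBL%3BAcc:O42608];logic_name=ensembl;version=1",
])

gene = parse_gene_line(EXAMPLE)
pragma = parse_pragma("#!genome-build Pmarinus_7.0")
```

Splitting on `;` before decoding matters here: the description contains `%3B`, an encoded semicolon, which would otherwise be mistaken for an attribute separator.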
118+
119+
120+
--------------------------------------
121+
Locus Reference Genomic Sequence (LRG)
122+
--------------------------------------
123+
This is a manually curated project that contains stable and un-versioned reference sequences designed specifically for reporting sequence variants with clinical implications.
124+
The sequences of each locus (also called LRG) are chosen in collaboration with research and diagnostic laboratories, LSDB (locus specific database) curators and mutation consortia with expertise in the region of interest.
125+
LRG website: http://www.lrg-sequence.org
126+
LRG data are freely available in several formats (FASTA, BED, XML, Tabulated) at this address: http://www.lrg-sequence.org/downloads
127+
