227 changes: 226 additions & 1 deletion linkml-schema-ai-tools/README.md
@@ -39,13 +39,238 @@ Reuse: linkml-schema/bican_biolink.yaml (a biolink subset commonly used in BICAN
### Model
- openai/gpt-5

## Run 3
### User Prompt
Task: create a linkml model
Background: You are an expert in data modeling and the tool LinkML.
Goal: Given the 'linkml-schema-ai-tools/gff3.md' file, create a linkml model to represent the data present in a gff3 file. We are only interested in representing feature types that are 'genes' and the associated information stored in the 'attributes' column.
Example Data: 'data/GCF_000003025.6_Sscrofa11.1_genomic.gff' is an example of how data is represented in a gff3 file. Use this file to help refine the model.
Reuse: linkml-schema/bican_biolink.yaml (a biolink subset commonly used in BICAN) and linkml-schema/bican_core.yaml (BICAN core metadata such as versioning and checksums)
Testing: Test the generated linkml model by running 2 commands. First run 'linkml lint', then run 'linkml generate pydantic'.
### Model
- openai/gpt-5

## Run 4
### User Prompt
Task:
Design a LinkML schema that models the metadata of a GFF3 file, focusing specifically on gene-level features and their associated attributes.

Background:
You are an expert in data modeling and the LinkML framework. The goal is to capture the essential structure and semantics of gene features as represented in GFF3 files.

Inputs:
Use the following reference files to understand the GFF3 structure and attribute conventions:

linkml-schema-ai-tools/gff3.md — general GFF3 specification summary

linkml-schema-ai-tools/ncbi_gff3.txt — documentation from NCBI GFF3 format

linkml-schema-ai-tools/ensembl_gff3.md — documentation from Ensembl GFF3 format


Goal:

Create a LinkML schema that represents the GFF3 file structure, but only for features of type “gene”.

Model the core GFF3 columns (e.g., seqid, source, type, start, end, score, strand, phase) as appropriate.

Focus primarily on modeling the attributes column, unifying attribute definitions and types between NCBI and Ensembl conventions. For unified attributes, also provide a mapping to the original attribute names from NCBI and Ensembl.

Include clear descriptions, slot ranges, and enums where applicable.

Reuse existing entities, mixins, and patterns from:

linkml-schema/bican_biolink.yaml — Biolink subset used in BICAN

linkml-schema/bican_core.yaml — Core BICAN metadata (e.g., versioning, provenance, checksums)

DO NOT READ FROM ANY OTHER FILES.

Deliverables:

A complete LinkML schema file (YAML) defining the GFF3Gene class (or similar), associated slots, and unified attribute representation. Place schema in a new file in the linkml-schema-ai-tools directory.

Include appropriate schema metadata (e.g., id, name, description, version, prefixes, imports).

Ensure semantic reuse of terms consistent with Biolink and BICAN naming conventions.

Testing:
After creating the schema, validate it by running the following commands:

linkml lint
linkml generate pydantic


Ensure both commands execute successfully without errors or warnings.
### Model
- openai/gpt-5

## Run 5
### User Prompt
Task: Design a LinkML schema that models gene annotations derived from GFF3 gene features and organizes them into a reusable “genome annotation” package. The schema should:

1. Represent individual genes (derived from GFF3 rows where type=gene)
2. Represent a genome annotation dataset that the genes are referenced from
3. Represent the reference genome assembly used by the dataset
4. Provide a top-level collection to hold multiple gene annotations, genome annotations, and assemblies

Background: You are an expert in data modeling and the LinkML framework. The goal is to capture the essential structure and semantics of gene features from GFF3, while structuring the schema into a gene record class that references a dataset class (the genome annotation) with assembly context and minimal, clear controlled vocabularies.

Inputs: Use ONLY the following reference files to understand GFF3 structure and attribute conventions:

- linkml-schema-ai-tools/gff3.md — general GFF3 specification summary
- linkml-schema-ai-tools/ncbi_gff3.txt — documentation from NCBI GFF3 format
- linkml-schema-ai-tools/ensembl_gff3.md — documentation from Ensembl GFF3 format

Goal and required structure:

- Provide the following classes and relationships:

1. GeneAnnotation (is_a: gene)

   - Represents a single gene derived from a GFF3 row with type=gene.
   - Must include: a small set of unifying attributes from GFF3’s attributes column, minimal biotype-like classification, and an explicit reference to the genome annotation dataset it came from.
   - Should include gene identity and optional location context; coordinates are allowed but keep them minimal and gene-appropriate.

2. GenomeAnnotation (is_a: genome)

   - Represents a genome annotation dataset (e.g., an authority’s release of gene annotations).
   - Must include dataset-level metadata such as version, digest, content_url, and authority.
   - Must reference a genome assembly.

3. GenomeAssembly (is_a: named thing; mixin: thing with taxon)

   - Represents the genome assembly used by the genome annotation.
   - Include minimal assembly metadata (e.g., version, strain).

4. AnnotationCollection (tree_root: true)

   - A root container class for transporting sets of GeneAnnotation, GenomeAnnotation, and GenomeAssembly instances.
   - Provide list attributes for each of these item types, using inlined_as_list: true.

Class details:

1. GeneAnnotation (is_a: gene)

   - Core identity and provenance:
     - slots: source_id, molecular_type
     - attributes:
       - referenced_in: required, inlined, any_of: [GenomeAnnotation, string] (the genome annotation dataset this gene came from)

   - GFF3 gene columns (keep minimal and appropriate for “gene”):
     - seqid, source, type (constrain to “gene”), start, end, score, strand
     - Do NOT include “phase” for the gene feature (phase is for CDS)
     - Use a one-based integer type for start and end

   - Unified attributes (Column 9 harmonization across NCBI/Ensembl):
     - molecular_type: any_of: [BioType, string] with small controlled vocabulary (see enums)
     - source_id: schema:identifier for the authority-specific identifier for this gene
     - Optional harmonized attributes with annotations for original provenance:
       - ensembl_gene_id (annotations: ensembl_attr: gene_id)
       - ncbi_gene_id (range: uriorcurie; annotations: ncbi_attr: Dbxref(GeneID))
       - biotype or a richer “gene_biotype” value (if provided), but map/roll-up to molecular_type for a simple classification
       - symbol, name, description, synonym, xref, version
       - locus_tag, pseudo (boolean), pseudogene_subtype, note

     - Provide slot_usage with exact_mappings/narrow_mappings to GFF3 attributes where applicable (e.g., ID, Name, Alias, Dbxref) and to the GFF3 columns (seqid, source, type, start, end, score, strand).

   - Constraints:
     - type should be constrained to “gene” via an enum (e.g., GeneFeatureType with permissible value “gene”)
     - start/end should use a custom integer type with minimum_value: 1

2. GenomeAnnotation (is_a: genome)

   - Dataset-level metadata:
     - slots: version, digest, content_url, authority
   - Link to assembly:
     - attributes:
       - reference_assembly: required, inlined, any_of: [GenomeAssembly, string]
   - Authority should be controlled by a small enum (e.g., AuthorityType with ENSEMBL, NCBI)

3. GenomeAssembly (is_a: named thing; mixins: [thing with taxon])

   - slots: version, strain
   - The mixin ensures taxon context is available for the assembly

4. AnnotationCollection (tree_root: true)

   - attributes:
     - annotations: multivalued, inlined_as_list: true, range: GeneAnnotation
     - genome_annotations: multivalued, inlined_as_list: true, range: GenomeAnnotation
     - genome_assemblies: multivalued, inlined_as_list: true, range: GenomeAssembly

Slots to define (non-exhaustive; use clear descriptions and ranges):

- seqid (maps to GFF3 Column 1; consider accession.version semantics)
- source (GFF3 Column 2)
- type (GFF3 Column 3; constrained to gene)
- start (GFF3 Column 4; range: one_based_int with minimum_value: 1)
- end (GFF3 Column 5; range: one_based_int with minimum_value: 1)
- score (GFF3 Column 6; range: float)
- strand (GFF3 Column 7; use StrandEnum with +, -, ., ?)
- molecular_type (any_of: [BioType, string])
- authority (range: AuthorityType)
- source_id (slot_uri: schema:identifier)
- strain (string)
- referenced_in (as described above)
- reference_assembly (as described above)

Types:

- one_based_int: typeof: integer; minimum_value: 1

Enums:

- StrandEnum: +, -, ., ?
- GeneFeatureType: permissible value: gene
- BioType: permissible values: protein_coding, noncoding (keep intentionally small and simple)
- AuthorityType: permissible values: ENSEMBL, NCBI

Prefixes and imports:

- prefixes: linkml, bican, schema, NCBIGene, NCBIAssembly, NCBITaxon (add others as needed for GFF3 or Ensembl identifiers)
- imports: linkml:types, bican_biolink, bican_core
- default_prefix: bican
- default_range: string

Attribute harmonization guidance:

- For each unifying gene attribute, add annotations indicating original source keys (e.g., ncbi_attr: gene_biotype; ensembl_attr: biotype).
- For identifiers, ensure ncbi_gene_id is uriorcurie (e.g., NCBIGene:####) and ensembl_gene_id is a string (e.g., ENSG..., ENSMUSG...).
- Use exact_mappings or narrow_mappings to GFF3 keys (e.g., gff3:ID, gff3:Name, gff3:Dbxref) for clarity.

Deliverables:

- A single LinkML YAML file placed in linkml-schema-ai-tools defining:
  - The classes: GeneAnnotation, GenomeAnnotation, GenomeAssembly, AnnotationCollection
  - The slots, types, and enums listed above
  - Clear descriptions for classes/slots and mapping annotations to GFF3

- Include top-level schema metadata (id, name, description, version, prefixes, imports)

- Ensure semantic reuse of terms consistent with Biolink/BICAN naming where appropriate

Testing:

- Run:
  - linkml lint
  - linkml generate pydantic

- Both commands must execute successfully without errors or warnings.

Constraints:

- DO NOT READ FROM ANY OTHER FILES beyond the three Inputs specified above.

### Model
- openai/gpt-5
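
#### Illustration of the Run 5 constraints
The Run 5 prompt asks for three constraints on the GFF3 columns: `type` limited to `gene`, `start`/`end` as one-based integers, and `strand` limited to `+`, `-`, `.`, `?`. A minimal plain-Python sketch of what those constraints mean is given here. It is not the output of `linkml generate pydantic`; the class name `GeneLocation` and the enum member names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class StrandEnum(Enum):
    """Permissible strand values from the prompt (member names are illustrative)."""
    PLUS = "+"
    MINUS = "-"
    UNSTRANDED = "."
    UNKNOWN = "?"


class GeneFeatureType(Enum):
    """Single permissible value, so any other feature type is rejected."""
    GENE = "gene"


@dataclass(frozen=True)
class GeneLocation:
    """Hypothetical holder for the GFF3 columns kept for a gene (no phase)."""
    seqid: str
    type: GeneFeatureType
    start: int
    end: int
    strand: StrandEnum
    score: Optional[float] = None

    def __post_init__(self):
        # Mirrors the one_based_int type: minimum_value of 1 for start and end.
        for name in ("start", "end"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be >= 1, got {getattr(self, name)}")


location = GeneLocation(
    seqid="GL476399",
    type=GeneFeatureType("gene"),
    start=2596494,
    end=2601138,
    strand=StrandEnum("+"),
)
print(location)
```

Looking an enum up by value (`StrandEnum("+")`) raises `ValueError` for anything outside the permissible values, which is roughly the behaviour an enum-ranged slot is meant to enforce.
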
127 changes: 127 additions & 0 deletions linkml-schema-ai-tools/ensembl_gff3.md
@@ -0,0 +1,127 @@
#### README ####

-----------------------
GFF FLATFILE DUMPS
-----------------------
Gene annotation is provided in GFF3 format. Detailed specification of
the format is maintained by the Sequence Ontology:
https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

GFF3 files are validated using GenomeTools: http://genometools.org

For chromosomal assemblies, in addition to a file containing all
genes, there are per-chromosome files. If a predicted geneset is
available (generated by Genscan and other ab initio tools), these
genes are in a separate 'abinitio' file.


The 'type' of gene features is:
* "gene" for protein-coding genes
* "ncRNA_gene" for RNA genes
* "pseudogene" for pseudogenes
The 'type' of transcript features is:
* "mRNA" for protein-coding transcripts
* a specific type of RNA transcript such as "snoRNA" or "lnc_RNA"
* "pseudogenic_transcript" for pseudogenes
All transcripts are linked to "exon" features.
Protein-coding transcripts are linked to "CDS", "five_prime_UTR", and
"three_prime_UTR" features.

Attributes for feature types:
(square brackets indicate data which is not available for all features)
  * region types:
    * ID: Unique identifier, format "<region_type>:<region_name>"
    * [Alias]: A comma-separated list of aliases, usually including the
      INSDC accession
    * [Is_circular]: Flag to indicate circular regions
  * gene types:
    * ID: Unique identifier, format "gene:<gene_stable_id>"
    * biotype: Ensembl biotype, e.g. "protein_coding", "pseudogene"
    * gene_id: Ensembl gene stable ID
    * version: Ensembl gene version
    * [Name]: Gene name
    * [description]: Gene description
  * transcript types:
    * ID: Unique identifier, format "transcript:<transcript_stable_id>"
    * Parent: Gene identifier, format "gene:<gene_stable_id>"
    * biotype: Ensembl biotype, e.g. "protein_coding", "pseudogene"
    * transcript_id: Ensembl transcript stable ID
    * version: Ensembl transcript version
    * [Note]: If the transcript sequence has been edited (i.e. differs
      from the genomic sequence), the edits are described in a note.
  * exon
    * Parent: Transcript identifier, format "transcript:<transcript_stable_id>"
    * exon_id: Ensembl exon stable ID
    * version: Ensembl exon version
    * constitutive: Flag to indicate if exon is present in all
      transcripts
    * rank: Integer that shows the 5'->3' ordering of exons
  * CDS
    * ID: Unique identifier, format "CDS:<protein_stable_id>"
    * Parent: Transcript identifier, format "transcript:<transcript_stable_id>"
    * protein_id: Ensembl protein stable ID
    * version: Ensembl protein version
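
The attribute lists above describe the contents of column 9, which holds `tag=value` pairs separated by semicolons, with reserved characters such as `;` percent-encoded (`%3B`). A minimal Python sketch of reading that column is given here. The function name `parse_attributes` is hypothetical, and the sketch deliberately does not split comma-separated multi-value attributes such as `Alias`.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Split a GFF3 attributes column into a tag -> value dict."""
    attributes = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing or doubled semicolon
        tag, _, value = pair.partition("=")
        # Percent-decoding turns %3B back into ';', %2C into ',' and so on.
        attributes[unquote(tag)] = unquote(value)
    return attributes


# Attribute string abridged from the gene row in the example output.
gene_attrs = parse_attributes(
    "ID=gene:ENSPMAG00000009070;Name=TRYPA3;biotype=protein_coding;"
    "logic_name=ensembl;version=1"
)

# The ID has the documented form "gene:<gene_stable_id>".
gene_id = gene_attrs["ID"].partition(":")[2]
print(gene_id, gene_attrs["biotype"], gene_attrs["version"])
```

Decoding is done after splitting on `;` and `=`, so an encoded semicolon inside a description does not break a value into two attributes.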

Metadata:
* genome-build - Build identifier of the assembly e.g. GRCh37.p11
* genome-version - Version of this assembly e.g. GRCh37
* genome-date - The date of the release of this assembly e.g. 2009-02
* genome-build-accession - Genome accession e.g. GCA_000001405.14
* genebuild-last-updated - Date of the last genebuild update e.g. 2013-09
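
In the file itself these metadata keys appear as header lines starting with `#!`, as in the example output further down in this README. A short Python sketch of collecting them is given here; the function name `parse_pragmas` is hypothetical, and the sketch assumes the key and value are separated by whitespace, as in that example.

```python
def parse_pragmas(lines):
    """Collect '#!key value' header lines into a dict, ignoring everything else."""
    metadata = {}
    for line in lines:
        if not line.startswith("#!"):
            continue  # skips '##gff-version 3', feature rows, and blank lines
        parts = line[2:].split(None, 1)
        if not parts:
            continue
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        metadata[key] = value
    return metadata


# Header lines copied from the example GFF3 output in this README.
header = [
    "##gff-version 3",
    "#!genome-build Pmarinus_7.0",
    "#!genome-version Pmarinus_7.0",
    "#!genome-date 2011-01",
    "#!genebuild-last-updated 2013-04",
]
metadata = parse_pragmas(header)
print(metadata["genome-build"], metadata["genome-date"])
```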

-----------
FILE NAMES
-----------
The files are consistently named following this pattern:
<species>.<assembly>.<version>.gff3.gz

<species>: The systematic name of the species.
<assembly>: The assembly build name.
<version>: The version of Ensembl from which the data was exported.
gff3 : All files in these directories are in GFF3 format
gz : All files are compacted with GNU Zip for storage efficiency.

e.g.
Homo_sapiens.GRCh38.81.gff3.gz

For the predicted gene set, an additional abinitio flag is added to the file name:
<species>.<assembly>.<version>.abinitio.gff3.gz

e.g.
Homo_sapiens.GRCh38.81.abinitio.gff3.gz
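
The naming pattern can be taken apart with a regular expression. The Python sketch below handles only the two forms shown above (with and without the abinitio flag) and assumes the Ensembl version is purely numeric; other file name variants, such as the per-chromosome files, are not covered. The names `FILENAME_RE` and `parse_dump_name` are hypothetical.

```python
import re

# <species>.<assembly>.<version>[.abinitio].gff3.gz
FILENAME_RE = re.compile(
    r"^(?P<species>[^.]+)\."      # species has no dots, e.g. Homo_sapiens
    r"(?P<assembly>.+)\."         # assembly name may itself contain dots
    r"(?P<version>\d+)"           # assumed numeric Ensembl release
    r"(?P<abinitio>\.abinitio)?"  # optional flag for the predicted gene set
    r"\.gff3\.gz$"
)


def parse_dump_name(filename):
    """Return the components of a GFF3 dump file name as a dict."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognised GFF3 dump name: {filename!r}")
    return {
        "species": match["species"],
        "assembly": match["assembly"],
        "version": int(match["version"]),
        "abinitio": match["abinitio"] is not None,
    }


print(parse_dump_name("Homo_sapiens.GRCh38.81.gff3.gz"))
print(parse_dump_name("Homo_sapiens.GRCh38.81.abinitio.gff3.gz"))
```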

------------------
Example GFF3 output
------------------

##gff-version 3
#!genome-build Pmarinus_7.0
#!genome-version Pmarinus_7.0
#!genome-date 2011-01
#!genebuild-last-updated 2013-04

GL476399 Pmarinus_7.0 supercontig 1 4695893 . . . ID=supercontig:GL476399;Alias=scaffold_71
GL476399 ensembl gene 2596494 2601138 . + . ID=gene:ENSPMAG00000009070;Name=TRYPA3;biotype=protein_coding;description=Trypsinogen A1%3B Trypsinogen a3%3B Uncharacterized protein [Source:UniProtKB/TrEMBL%3BAcc:O42608];logic_name=ensembl;version=1
GL476399 ensembl transcript 2596494 2601138 . + . ID=transcript:ENSPMAT00000010026;Name=TRYPA3-201;Parent=gene:ENSPMAG00000009070;biotype=protein_coding;version=1
GL476399 ensembl exon 2596494 2596538 . + . Name=ENSPMAE00000087923;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=1;ensembl_phase=-1;rank=1;version=1
GL476399 ensembl exon 2598202 2598361 . + . Name=ENSPMAE00000087929;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=2;ensembl_phase=1;rank=2;version=1
GL476399 ensembl exon 2599023 2599282 . + . Name=ENSPMAE00000087937;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=1;ensembl_phase=2;rank=3;version=1
GL476399 ensembl exon 2599814 2599947 . + . Name=ENSPMAE00000087952;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=0;ensembl_phase=1;rank=4;version=1
GL476399 ensembl exon 2600895 2601138 . + . Name=ENSPMAE00000087966;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=-1;ensembl_phase=0;rank=5;version=1
GL476399 ensembl CDS 2596499 2596538 . + 0 ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399 ensembl CDS 2598202 2598361 . + 2 ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399 ensembl CDS 2599023 2599282 . + 1 ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399 ensembl CDS 2599814 2599947 . + 2 ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399 ensembl CDS 2600895 2601044 . + 0 ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399 ensembl five_prime_UTR 2596494 2596498 . + . Parent=transcript:ENSPMAT00000010026
GL476399 ensembl three_prime_UTR 2601045 2601138 . + . Parent=transcript:ENSPMAT00000010026
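
Rows like those above are tab-separated in the actual file, with nine columns per feature. The Python sketch below shows one way to keep only gene-level rows; the function name `iter_gene_rows` is hypothetical, and the default set of types follows the gene feature types listed earlier in this README ("gene", "ncRNA_gene", "pseudogene"). The sample attributes are abridged from the example output.

```python
GENE_LEVEL_TYPES = frozenset({"gene", "ncRNA_gene", "pseudogene"})


def iter_gene_rows(lines, wanted_types=GENE_LEVEL_TYPES):
    """Yield one dict per feature row whose type is in wanted_types."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, '##' directives and '#!' metadata
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            raise ValueError(f"expected 9 tab-separated columns, got {len(columns)}")
        seqid, source, ftype, start, end, score, strand, _phase, attrs = columns
        if ftype not in wanted_types:
            continue
        yield {
            "seqid": seqid,
            "source": source,
            "type": ftype,
            "start": int(start),  # 1-based coordinates
            "end": int(end),
            "score": None if score == "." else float(score),
            "strand": strand,
            "attributes": attrs,  # left unparsed in this sketch
        }


sample = [
    "##gff-version 3",
    "GL476399\tensembl\tgene\t2596494\t2601138\t.\t+\t.\t"
    "ID=gene:ENSPMAG00000009070;Name=TRYPA3;biotype=protein_coding;version=1",
    "GL476399\tensembl\ttranscript\t2596494\t2601138\t.\t+\t.\t"
    "ID=transcript:ENSPMAT00000010026;Parent=gene:ENSPMAG00000009070",
]
genes = list(iter_gene_rows(sample))
print(len(genes), genes[0]["seqid"], genes[0]["start"], genes[0]["end"])
```

Passing `wanted_types={"gene"}` restricts the result to protein-coding gene rows only.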


--------------------------------------
Locus Reference Genomic Sequence (LRG)
--------------------------------------
This is a manually curated project that contains stable and un-versioned reference sequences designed specifically for reporting sequence variants with clinical implications.
The sequences of each locus (also called LRG) are chosen in collaboration with research and diagnostic laboratories, LSDB (locus specific database) curators and mutation consortia with expertise in the region of interest.
LRG website: http://www.lrg-sequence.org
LRG data are freely available in several formats (FASTA, BED, XML, Tabulated) at this address: http://www.lrg-sequence.org/downloads
