brain-bican · puja-trivedi · Nov 16, 2025 · Oct 9, 2025 · Oct 9, 2025 · Oct 9, 2025
diff --git a/linkml-schema-ai-tools/README.md b/linkml-schema-ai-tools/README.md
@@ -39,13 +39,238 @@ Reuse: linkml-schema/bican_biolink.yaml (a biolink subset commonly used in BICAN
 ### Model 
  - openai/gpt-5
 
- ## Run 3
+## Run 3
 ### User Prompt
 Task: create a linkml model 
 Background: You are an expert in data modeling and the tool LinkML. 
 Goal: Given the 'linkml-schema-ai-tools/gff3.md' file, create a linkml model to represent the data present in a gff3 file. We are only interested in representing feature types that are 'genes' and the associated information stored in the 'attributes' column. 
 Example Data: 'data/GCF_000003025.6_Sscrofa11.1_genomic.gff' is an example of how data is represented in a gff3 file. Use this file to help refine the model. 
 Reuse: linkml-schema/bican_biolink.yaml (a biolink subset commonly used in BICAN) and linkml-schema/bican_core.yaml (BICAN core metadata such as versioning and checksums)
 Testing: Test the generated linkml model by running 2 commands. First run : 'linkml lint' and then run 'linkml generate pydantic'.
+### Model 
+ - openai/gpt-5
+
+## Run 4
+### User Prompt
+Task:
+Design a LinkML schema that models the metadata of a GFF3 file, focusing specifically on gene-level features and their associated attributes.
+
+Background:
+You are an expert in data modeling and the LinkML framework. The goal is to capture the essential structure and semantics of gene features as represented in GFF3 files.
+
+Inputs:
+Use the following reference files to understand the GFF3 structure and attribute conventions:
+
+linkml-schema-ai-tools/gff3.md — general GFF3 specification summary
+
+linkml-schema-ai-tools/ncbi_gff3.txt —  documentation from NCBI GFF3 format
+
+linkml-schema-ai-tools/ensembl_gff3.md —  documentation from Ensembl GFF3 format
+
+
+Goal:
+
+Create a LinkML schema that represents the GFF3 file structure, but only for features of type “gene”.
+
+Model the core GFF3 columns (e.g., seqid, source, type, start, end, score, strand, phase) as appropriate.
+
+Focus primarily on modeling the attributes column, unifying attribute definitions and types between NCBI and Ensembl conventions. For unified attributes, also provide a mapping to the original attribute names from NCBI and Ensembl.
+
+Include clear descriptions, slot ranges, and enums where applicable.
+
+Reuse existing entities, mixins, and patterns from:
+
+linkml-schema/bican_biolink.yaml — Biolink subset used in BICAN
+
+linkml-schema/bican_core.yaml — Core BICAN metadata (e.g., versioning, provenance, checksums)
+
+DO NOT READ FROM ANY OTHER FILES.
+
+Deliverables:
+
+A complete LinkML schema file (YAML) defining the GFF3Gene class (or similar), associated slots, and unified attribute representation. Place schema in a new file in the linkml-schema-ai-tools directory.
+
+Include appropriate schema metadata (e.g., id, name, description, version, prefixes, imports).
+
+Ensure semantic reuse of terms consistent with Biolink and BICAN naming conventions.
+
+Testing:
+After creating the schema, validate it by running the following commands:
+
+linkml lint
+linkml generate pydantic
+
+
+Ensure both commands execute successfully without errors or warnings.
+### Model 
+ - openai/gpt-5
+
+## Run 5
+### User Prompt
+Task: Design a LinkML schema that models gene annotations derived from GFF3 gene features and organizes them into a reusable “genome annotation” package. The schema should:
+
+1. Represent individual genes (derived from GFF3 rows where type=gene)
+2. Represent a genome annotation dataset that the genes are referenced from
+3. Represent the reference genome assembly used by the dataset
+4. Provide a top-level collection to hold multiple gene annotations, genome annotations, and assemblies
+
+Background: You are an expert in data modeling and the LinkML framework. The goal is to capture the essential structure and semantics of gene features from GFF3, while structuring the schema into a gene record class that references a dataset class (the genome annotation) with assembly context and minimal, clear controlled vocabularies.
+
+Inputs: Use ONLY the following reference files to understand GFF3 structure and attribute conventions:
+
+- linkml-schema-ai-tools/gff3.md — general GFF3 specification summary
+- linkml-schema-ai-tools/ncbi_gff3.txt — documentation from NCBI GFF3 format
+- linkml-schema-ai-tools/ensembl_gff3.md — documentation from Ensembl GFF3 format
+
+Goal and required structure:
+
+- Provide the following classes and relationships:
+
+  1. GeneAnnotation (is_a: gene)
+
+     - Represents a single gene derived from a GFF3 row with type=gene.
+     - Must include: a small set of unifying attributes from GFF3’s attributes column, minimal biotype-like classification, and an explicit reference to the genome annotation dataset it came from.
+     - Should include gene identity and optional location context; coordinates are allowed but keep them minimal and gene-appropriate.
+
+  2. GenomeAnnotation (is_a: genome)
+
+     - Represents a genome annotation dataset (e.g., an authority’s release of gene annotations).
+     - Must include dataset-level metadata such as version, digest, content_url, and authority.
+     - Must reference a genome assembly.
+
+  3. GenomeAssembly (is_a: named thing; mixin: thing with taxon)
+
+     - Represents the genome assembly used by the genome annotation.
+     - Include minimal assembly metadata (e.g., version, strain).
+
+  4. AnnotationCollection (tree_root: true)
+
+     - A root container class for transporting sets of GeneAnnotation, GenomeAnnotation, and GenomeAssembly instances.
+     - Provide list attributes for each of these item types, using inlined_as_list: true.
+
+Class details:
+
+1. GeneAnnotation (is_a: gene)
+
+- Core identity and provenance:
+
+  - slots: source_id, molecular_type
+  - attributes:
+    - referenced_in: required, inlined, any_of: [GenomeAnnotation, string] (the genome annotation dataset this gene came from)
+
+- GFF3 gene columns (keep minimal and appropriate for “gene”):
+
+  - seqid, source, type (constrain to “gene”), start, end, score, strand
+  - Do NOT include “phase” for the gene feature (phase is for CDS)
+  - Use a one-based integer type for start and end
+
+- Unified attributes (Column 9 harmonization across NCBI/Ensembl):
+
+  - molecular_type: any_of: [BioType, string] with small controlled vocabulary (see enums)
+
+  - source_id: schema:identifier for the authority-specific identifier for this gene
+
+  - Optional harmonized attributes with annotations for original provenance:
+
+    - ensembl_gene_id (annotations: ensembl_attr: gene_id)
+    - ncbi_gene_id (range: uriorcurie; annotations: ncbi_attr: Dbxref(GeneID))
+    - biotype or a richer “gene_biotype” value (if provided), but map/roll-up to molecular_type for a simple classification
+    - symbol, name, description, synonym, xref, version
+    - locus_tag, pseudo (boolean), pseudogene_subtype, note
+
+  - Provide slot_usage with exact_mappings/narrow_mappings to GFF3 attributes where applicable (e.g., ID, Name, Alias, Dbxref) and to the GFF3 columns (seqid, source, type, start, end, score, strand).
+
+- Constraints:
+
+  - type should be constrained to “gene” via an enum (e.g., GeneFeatureType with permissible value “gene”)
+  - start/end should use a custom integer type with minimum_value: 1
+
+2. GenomeAnnotation (is_a: genome)
+
+- Dataset-level metadata:
+  - slots: version, digest, content_url, authority
+- Link to assembly:
+  - attributes:
+    - reference_assembly: required, inlined, any_of: [GenomeAssembly, string]
+- Authority should be controlled by a small enum (e.g., AuthorityType with ENSEMBL, NCBI)
+
+3. GenomeAssembly (is_a: named thing; mixins: [thing with taxon])
+
+- slots: version, strain
+- The mixin ensures taxon context is available for the assembly
+
+4. AnnotationCollection (tree_root: true)
+
+- attributes:
+
+  - annotations: multivalued, inlined_as_list: true, range: GeneAnnotation
+  - genome_annotations: multivalued, inlined_as_list: true, range: GenomeAnnotation
+  - genome_assemblies: multivalued, inlined_as_list: true, range: GenomeAssembly
+
+Slots to define (non-exhaustive; use clear descriptions and ranges):
+
+- seqid (maps to GFF3 Column 1; consider accession.version semantics)
+- source (GFF3 Column 2)
+- type (GFF3 Column 3; constrained to gene)
+- start (GFF3 Column 4; range: one_based_int with minimum_value: 1)
+- end (GFF3 Column 5; range: one_based_int with minimum_value: 1)
+- score (GFF3 Column 6; range: float)
+- strand (GFF3 Column 7; use StrandEnum with +, -, ., ?)
+- molecular_type (any_of: [BioType, string])
+- authority (range: AuthorityType)
+- source_id (slot_uri: schema:identifier)
+- strain (string)
+- referenced_in (as described above)
+- reference_assembly (as described above)
+
+Types:
+
+- one_based_int: typeof: integer; minimum_value: 1
+
+Enums:
+
+- StrandEnum: +, -, ., ?
+- GeneFeatureType: permissible value: gene
+- BioType: permissible values: protein_coding, noncoding (keep intentionally small and simple)
+- AuthorityType: permissible values: ENSEMBL, NCBI
+
+Prefixes and imports:
+
+- prefixes: linkml, bican, schema, NCBIGene, NCBIAssembly, NCBITaxon (add others as needed for GFF3 or Ensembl identifiers)
+- imports: linkml:types, bican_biolink, bican_core
+- default_prefix: bican
+- default_range: string
+
+Attribute harmonization guidance:
+
+- For each unifying gene attribute, add annotations indicating original source keys (e.g., ncbi_attr: gene_biotype; ensembl_attr: biotype).
+- For identifiers, ensure ncbi_gene_id is uriorcurie (e.g., NCBIGene:####) and ensembl_gene_id is a string (e.g., ENSG..., ENSMUSG...).
+- Use exact_mappings or narrow_mappings to GFF3 keys (e.g., gff3:ID, gff3:Name, gff3:Dbxref) for clarity.
+
+Deliverables:
+
+- A single LinkML YAML file placed in linkml-schema-ai-tools defining:
+
+  - The classes: GeneAnnotation, GenomeAnnotation, GenomeAssembly, AnnotationCollection
+  - The slots, types, and enums listed above
+  - Clear descriptions for classes/slots and mapping annotations to GFF3
+
+- Include top-level schema metadata (id, name, description, version, prefixes, imports)
+
+- Ensure semantic reuse of terms consistent with Biolink/BICAN naming where appropriate
+
+Testing:
+
+- Run:
+
+  - linkml lint
+  - linkml generate pydantic
+
+- Both commands must execute successfully without errors or warnings.
+
+Constraints:
+
+- DO NOT READ FROM ANY OTHER FILES beyond the three Inputs specified above.
+
 ### Model 
  - openai/gpt-5
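
The Run 5 prompt constrains the GFF3 columns of a gene record: `type` must be "gene", `start` and `end` use a one-based integer with `minimum_value: 1`, `strand` is one of `+`, `-`, `.`, `?`, and `phase` is left out. The following is a minimal Python sketch of those constraints, useful for checking example rows by hand. It is not the LinkML-generated Pydantic model; `GeneRow` and `parse_gene_row` are hypothetical names, and the `start <= end` check comes from the GFF3 specification rather than from the prompt.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class StrandEnum(Enum):
    """Strand values listed in the Run 5 prompt."""
    PLUS = "+"
    MINUS = "-"
    UNSTRANDED = "."
    UNKNOWN = "?"


@dataclass(frozen=True)
class GeneRow:
    """Hypothetical stand-in for the GFF3 column slots of GeneAnnotation."""
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: Optional[float]
    strand: StrandEnum

    def __post_init__(self) -> None:
        # Mirrors GeneFeatureType: the only permissible value is "gene".
        if self.type != "gene":
            raise ValueError(f"type must be 'gene', got {self.type!r}")
        # Mirrors one_based_int (minimum_value: 1).
        if self.start < 1 or self.end < 1:
            raise ValueError("start and end must be >= 1")
        # Assumption taken from the GFF3 specification, not from the prompt.
        if self.start > self.end:
            raise ValueError("start must be <= end")


def parse_gene_row(line: str) -> GeneRow:
    """Parse one tab-separated GFF3 line; phase and attributes are ignored here."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    seqid, source, type_, start, end, score, strand, _phase, _attrs = cols
    return GeneRow(
        seqid=seqid,
        source=source,
        type=type_,
        start=int(start),
        end=int(end),
        score=None if score == "." else float(score),  # "." means no score
        strand=StrandEnum(strand),  # raises ValueError for unknown symbols
    )
```

Rows with any other feature type, a zero coordinate, or an unknown strand symbol raise `ValueError`, which is roughly what the enum and `minimum_value` constraints would enforce during schema validation.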
diff --git a/linkml-schema-ai-tools/ensembl_gff3.md b/linkml-schema-ai-tools/ensembl_gff3.md
@@ -0,0 +1,127 @@
+#### README ####
+
+-----------------------
+GFF FLATFILE DUMPS
+-----------------------
+Gene annotation is provided in GFF3 format. Detailed specification of
+the format is maintained by the Sequence Ontology:
+https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
+
+GFF3 files are validated using GenomeTools: http://genometools.org
+
+For chromosomal assemblies, in addition to a file containing all
+genes, there are per-chromosome files. If a predicted geneset is
+available (generated by Genscan and other ab initio tools), these
+genes are in a separate 'abinitio' file.
+
+
+The 'type' of gene features is:
+ * "gene" for protein-coding genes
+ * "ncRNA_gene" for RNA genes
+ * "pseudogene" for pseudogenes
+The 'type' of transcript features is:
+ * "mRNA" for protein-coding transcripts
+ * a specific type of RNA transcript such as "snoRNA" or "lnc_RNA"
+ * "pseudogenic_transcript" for pseudogenes
+All transcripts are linked to "exon" features.
+Protein-coding transcripts are linked to "CDS", "five_prime_UTR", and
+"three_prime_UTR" features.
+
+Attributes for feature types:
+(square brackets indicate data which is not available for all features)
+ * region types:
+    * ID: Unique identifier, format "<region_type>:<region_name>"
+    * [Alias]: A comma-separated list of aliases, usually including the
+      INSDC accession
+    * [Is_circular]: Flag to indicate circular regions
+ * gene types:
+    * ID: Unique identifier, format "gene:<gene_stable_id>"
+    * biotype: Ensembl biotype, e.g. "protein_coding", "pseudogene"
+    * gene_id: Ensembl gene stable ID
+    * version: Ensembl gene version
+    * [Name]: Gene name
+    * [description]: Gene description
+ * transcript types:
+    * ID: Unique identifier, format "transcript:<transcript_stable_id>"
+    * Parent: Gene identifier, format "gene:<gene_stable_id>"
+    * biotype: Ensembl biotype, e.g. "protein_coding", "pseudogene"
+    * transcript_id: Ensembl transcript stable ID
+    * version: Ensembl transcript version
+    * [Note]: If the transcript sequence has been edited (i.e. differs
+      from the genomic sequence), the edits are described in a note.
+ * exon
+    * Parent: Transcript identifier, format "transcript:<transcript_stable_id>"
+    * exon_id: Ensembl exon stable ID
+    * version: Ensembl exon version
+    * constitutive: Flag to indicate if exon is present in all
+      transcripts
+    * rank: Integer that shows the 5'->3' ordering of exons
+ * CDS
+    * ID: Unique identifier, format "CDS:<protein_stable_id>"
+    * Parent: Transcript identifier, format "transcript:<transcript_stable_id>"
+    * protein_id: Ensembl protein stable ID
+    * version: Ensembl protein version
+
+Metadata:
+ * genome-build - Build identifier of the assembly e.g. GRCh37.p11
+ * genome-version - Version of this assembly e.g. GRCh37
+ * genome-date - The date of the release of this assembly e.g. 2009-02
+ * genome-build-accession - Genome accession e.g. GCA_000001405.14
+ * genebuild-last-updated - Date of the last genebuild update e.g. 2013-09
+
+-----------
+FILE NAMES
+------------
+The files are consistently named following this pattern:
+   <species>.<assembly>.<version>.gff3.gz
+
+<species>:       The systematic name of the species. 
+<assembly>:      The assembly build name.
+<version>:       The version of Ensembl from which the data was exported.
+gff3 : All files in these directories are in GFF3 format
+gz : All files are compacted with GNU Zip for storage efficiency.
+
+e.g. 
+Homo_sapiens.GRCh38.81.gff3.gz
+
+For the predicted gene set, an additional abinitio flag is added to the name file.
+<species>.<assembly>.<version>.abinitio.gff3.gz
+
+e.g.
+Homo_sapiens.GRCh38.81.abinitio.gff3.gz
+
+------------------
+Example GFF3 output
+------------------
+
+##gff-version 3
+#!genome-build  Pmarinus_7.0
+#!genome-version Pmarinus_7.0
+#!genome-date 2011-01
+#!genebuild-last-updated 2013-04
+
+GL476399        Pmarinus_7.0    supercontig     1       4695893 .       .       .       ID=supercontig:GL476399;Alias=scaffold_71
+GL476399        ensembl gene    2596494 2601138 .       +       .       ID=gene:ENSPMAG00000009070;Name=TRYPA3;biotype=protein_coding;description=Trypsinogen A1%3B Trypsinogen a3%3B Uncharacterized protein  [Source:UniProtKB/TrEMBL%3BAcc:O42608];logic_name=ensembl;version=1
+GL476399        ensembl transcript      2596494 2601138 .       +       .       ID=transcript:ENSPMAT00000010026;Name=TRYPA3-201;Parent=gene:ENSPMAG00000009070;biotype=protein_coding;version=1
+GL476399        ensembl exon    2596494 2596538 .       +       .       Name=ENSPMAE00000087923;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=1;ensembl_phase=-1;rank=1;version=1
+GL476399        ensembl exon    2598202 2598361 .       +       .       Name=ENSPMAE00000087929;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=2;ensembl_phase=1;rank=2;version=1
+GL476399        ensembl exon    2599023 2599282 .       +       .       Name=ENSPMAE00000087937;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=1;ensembl_phase=2;rank=3;version=1
+GL476399        ensembl exon    2599814 2599947 .       +       .       Name=ENSPMAE00000087952;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=0;ensembl_phase=1;rank=4;version=1
+GL476399        ensembl exon    2600895 2601138 .       +       .       Name=ENSPMAE00000087966;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=-1;ensembl_phase=0;rank=5;version=1
+GL476399        ensembl CDS     2596499 2596538 .       +       0       ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
+GL476399        ensembl CDS     2598202 2598361 .       +       2       ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
+GL476399        ensembl CDS     2599023 2599282 .       +       1       ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
+GL476399        ensembl CDS     2599814 2599947 .       +       2       ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
+GL476399        ensembl CDS     2600895 2601044 .       +       0       ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
+GL476399        ensembl five_prime_UTR  2596494 2596498 .       +       .       Parent=transcript:ENSPMAT00000010026
+GL476399        ensembl three_prime_UTR 2601045 2601138 .       +       .       Parent=transcript:ENSPMAT00000010026
+
+
+--------------------------------------
+Locus Reference Genomic Sequence (LRG)
+--------------------------------------
+This is a manually curated project that contains stable and un-versioned reference sequences designed specifically for reporting sequence variants with clinical implications.
+The sequences of each locus (also called LRG) are chosen in collaboration with research and diagnostic laboratories, LSDB (locus specific database) curators and mutation consortia with expertise in the region of interest.
+LRG website: http://www.lrg-sequence.org
+LRG data are freely available in several formats (FASTA, BED, XML, Tabulated) at this address: http://www.lrg-sequence.org/downloads
+
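
The Ensembl README above lists the attributes carried by gene features (`ID` in the form "gene:<gene_stable_id>", `biotype`, `version`, optional `Name` and `description`), and its example row shows semicolons inside a value escaped as `%3B`. A minimal Python sketch of reading that attributes column follows. `parse_attributes` and `gene_stable_id` are hypothetical helper names, and the sketch assumes each key appears once per row and that values are plain percent-encoded strings; comma-separated multi-value attributes such as `Alias` are not split.

```python
from urllib.parse import unquote


def parse_attributes(column9: str) -> dict[str, str]:
    """Split a GFF3 attributes column into a key -> decoded value mapping."""
    attrs: dict[str, str] = {}
    for field in column9.strip().split(";"):
        if not field:
            continue  # tolerate a trailing or doubled semicolon
        # Split on the first "=" only; escaped characters stay inside the value.
        key, _, value = field.partition("=")
        # Decode after splitting so an escaped ";" (%3B) cannot break a field.
        attrs[unquote(key)] = unquote(value)
    return attrs


def gene_stable_id(attrs: dict[str, str]) -> str:
    """Extract <gene_stable_id> from an Ensembl-style ID of the form gene:<id>."""
    prefix, _, stable_id = attrs["ID"].partition(":")
    if prefix != "gene" or not stable_id:
        raise ValueError(f"not an Ensembl gene ID: {attrs['ID']!r}")
    return stable_id


if __name__ == "__main__":
    example = (
        "ID=gene:ENSPMAG00000009070;Name=TRYPA3;biotype=protein_coding;"
        "description=Trypsinogen A1%3B Trypsinogen a3;version=1"
    )
    parsed = parse_attributes(example)
    print(gene_stable_id(parsed), parsed["biotype"], parsed["description"])
```

Decoding happens only after the split on `;` and `=`, so the escaped separators in `description` survive as literal characters in the value instead of producing spurious fields.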