linkml-schema-ai-tools/README.md (226 additions, 1 deletion)
@@ -39,13 +39,238 @@ Reuse: linkml-schema/bican_biolink.yaml (a biolink subset commonly used in BICAN
### Model
- openai/gpt-5

## Run 3
### User Prompt
Task: create a linkml model
Background: You are an expert in data modeling and the tool LinkML.
Goal: Given the 'linkml-schema-ai-tools/gff3.md' file, create a linkml model to represent the data present in a gff3 file. We are only interested in representing feature types that are 'genes' and the associated information stored in the 'attributes' column.
Example Data: 'data/GCF_000003025.6_Sscrofa11.1_genomic.gff' is an example of how data is represented in a gff3 file. Use this file to help refine the model.
Reuse: linkml-schema/bican_biolink.yaml (a biolink subset commonly used in BICAN) and linkml-schema/bican_core.yaml (BICAN core metadata such as versioning and checksums)
Testing: Test the generated linkml model by running 2 commands. First run : 'linkml lint' and then run 'linkml generate pydantic'.
### Model
- openai/gpt-5

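To make the Run 3 goal concrete, here is a minimal sketch of the input the model is meant to describe: GFF3 rows are filtered to `type == "gene"` and the `attributes` column is split into tag/value pairs. It assumes the standard nine tab-separated GFF3 columns, `;`-separated `tag=value` attributes, `,`-separated multiple values, and URL-style escapes. The function names and the sample rows are hypothetical and are not taken from the example data file.

```python
from __future__ import annotations

from urllib.parse import unquote

# The nine GFF3 columns, in file order.
GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_attributes(column9: str) -> dict[str, list[str]]:
    """Split a GFF3 attributes column into {tag: [values]}."""
    attributes: dict[str, list[str]] = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        # Split on commas first, then decode URL-style escapes in each value.
        attributes[unquote(tag)] = [unquote(v) for v in value.split(",")]
    return attributes


def iter_gene_rows(lines):
    """Yield one dict per GFF3 row whose type column is exactly 'gene'."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue
        row = dict(zip(GFF3_COLUMNS, fields))
        row["attributes"] = parse_attributes(row["attributes"])
        yield row


# Invented rows for illustration only.
SAMPLE_LINES = [
    "##gff-version 3",
    "chr1\texample\tgene\t100\t200\t.\t+\t.\tID=gene-EX1;Name=EX1;Dbxref=GeneID:1,HGNC:2",
    "chr1\texample\tmRNA\t100\t200\t.\t+\t.\tID=rna-EX1;Parent=gene-EX1",
]

GENE_ROWS = list(iter_gene_rows(SAMPLE_LINES))
```

Only the `gene` row survives the filter, and each attribute value is kept as a list so that multi-valued tags such as `Dbxref` are not flattened.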
## Run 4
### User Prompt
Task:
Design a LinkML schema that models the metadata of a GFF3 file, focusing specifically on gene-level features and their associated attributes.

Background:
You are an expert in data modeling and the LinkML framework. The goal is to capture the essential structure and semantics of gene features as represented in GFF3 files.

Inputs:
Use the following reference files to understand the GFF3 structure and attribute conventions:

linkml-schema-ai-tools/gff3.md — general GFF3 specification summary

linkml-schema-ai-tools/ncbi_gff3.txt — documentation from NCBI GFF3 format

linkml-schema-ai-tools/ensembl_gff3.md — documentation from Ensembl GFF3 format


Goal:

Create a LinkML schema that represents the GFF3 file structure, but only for features of type “gene”.

Model the core GFF3 columns (e.g., seqid, source, type, start, end, score, strand, phase) as appropriate.

Focus primarily on modeling the attributes column, unifying attribute definitions and types between NCBI and Ensembl conventions. For unified attributes, also provide a mapping to the original attribute name in NCBI and Ensembl.

Include clear descriptions, slot ranges, and enums where applicable.

Reuse existing entities, mixins, and patterns from:

linkml-schema/bican_biolink.yaml — Biolink subset used in BICAN

A complete LinkML schema file (YAML) defining the GFF3Gene class (or similar), associated slots, and unified attribute representation. Place schema in a new file in the linkml-schema-ai-tools directory.
Ensure semantic reuse of terms consistent with Biolink and BICAN naming conventions.
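
The attribute unification asked for in the Goal can be pictured as a lookup from one unified name to the tag each provider uses. The sketch below is illustrative only: the unified names are hypothetical, and the NCBI and Ensembl tag names are assumptions based on commonly seen GFF3 files; the reference files listed under Inputs remain the authority.

```python
# Hypothetical unified attribute names mapped to assumed provider tags.
UNIFIED_ATTRIBUTES = {
    "symbol": {"ncbi": "gene", "ensembl": "Name"},
    "biotype": {"ncbi": "gene_biotype", "ensembl": "biotype"},
    "description": {"ncbi": "description", "ensembl": "description"},
}


def harmonize(attributes: dict, convention: str) -> dict:
    """Rename provider-specific tags to unified names; unknown tags are dropped."""
    unified = {}
    for unified_name, originals in UNIFIED_ATTRIBUTES.items():
        original_name = originals[convention]
        if original_name in attributes:
            unified[unified_name] = attributes[original_name]
    return unified


# Invented attribute sets for illustration only.
NCBI_GENE = {"gene": "EX1", "gene_biotype": "protein_coding", "gbkey": "Gene"}
ENSEMBL_GENE = {"Name": "EX1", "biotype": "protein_coding", "logic_name": "example"}

UNIFIED_FROM_NCBI = harmonize(NCBI_GENE, "ncbi")
UNIFIED_FROM_ENSEMBL = harmonize(ENSEMBL_GENE, "ensembl")
```

In a schema, the same table would be carried by per-slot mappings back to the original attribute names rather than by a Python dictionary; the dictionary only shows the shape of the information.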

Testing:
After creating the schema, validate it by running the following commands:

linkml lint
linkml generate pydantic


Ensure both commands execute successfully without errors or warnings.
### Model
- openai/gpt-5

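The two validation commands from the Run 4 prompt could be wrapped in a small helper such as the sketch below. It assumes that the `linkml` executable is on `PATH` and that both subcommands accept the schema path as a positional argument. Treating any output on stderr as a failure is one strict reading of "without errors or warnings", not something the prompt specifies. The function names are hypothetical.

```python
from __future__ import annotations

import subprocess


def build_commands(schema_path: str) -> list[list[str]]:
    """Return the two validation commands, in the order the prompt gives them."""
    return [
        ["linkml", "lint", schema_path],
        ["linkml", "generate", "pydantic", schema_path],
    ]


def run_checks(schema_path: str) -> bool:
    """Run both commands; stop at the first failure or stderr output."""
    for command in build_commands(schema_path):
        result = subprocess.run(command, capture_output=True, text=True)
        # Assumption: any stderr text counts as an error or warning.
        if result.returncode != 0 or result.stderr.strip():
            print("FAILED:", " ".join(command))
            print(result.stderr)
            return False
    return True
```

Calling `run_checks` with the path of the generated schema file returns `True` only when both commands exit with status 0 and print nothing to stderr.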
## Run 5
### User Prompt
Task: Design a LinkML schema that models gene annotations derived from GFF3 gene features and organizes them into a reusable “genome annotation” package. The schema should:

1. Represent individual genes (derived from GFF3 rows where type=gene)
2. Represent a genome annotation dataset that the genes are referenced from
3. Represent the reference genome assembly used by the dataset
4. Provide a top-level collection to hold multiple gene annotations, genome annotations, and assemblies

Background: You are an expert in data modeling and the LinkML framework. The goal is to capture the essential structure and semantics of gene features from GFF3, while structuring the schema into a gene record class that references a dataset class (the genome annotation) with assembly context and minimal, clear controlled vocabularies.

Inputs: Use ONLY the following reference files to understand GFF3 structure and attribute conventions:

- linkml-schema-ai-tools/gff3.md — general GFF3 specification summary
- linkml-schema-ai-tools/ncbi_gff3.txt — documentation from NCBI GFF3 format
- linkml-schema-ai-tools/ensembl_gff3.md — documentation from Ensembl GFF3 format

Goal and required structure:

- Provide the following classes and relationships:

  1. GeneAnnotation (is_a: gene)

     - Represents a single gene derived from a GFF3 row with type=gene.
     - Must include: a small set of unifying attributes from GFF3’s attributes column, minimal biotype-like classification, and an explicit reference to the genome annotation dataset it came from.
     - Should include gene identity and optional location context; coordinates are allowed but keep them minimal and gene-appropriate.

  2. GenomeAnnotation (is_a: genome)

     - Represents a genome annotation dataset (e.g., an authority’s release of gene annotations).
     - Must include dataset-level metadata such as version, digest, content_url, and authority.
     - Must reference a genome assembly.

  3. GenomeAssembly (is_a: named thing; mixin: thing with taxon)

     - Represents the genome assembly used by the genome annotation.
     - Include minimal assembly metadata (e.g., version, strain).

  4. AnnotationCollection (tree_root: true)

     - A root container class for transporting sets of GeneAnnotation, GenomeAnnotation, and GenomeAssembly instances.
     - Provide list attributes for each of these item types, using inlined_as_list: true.

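The four classes and their references can be pictured with plain Python dataclasses, as in the sketch below. This is only an illustration of the intended shape and is not generated LinkML output. The field names `id`, `referenced_in`, `reference_assembly`, and the three list names are assumptions, and the Biolink parents (gene, genome, named thing) are not modeled here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class GenomeAssembly:
    id: str
    version: Optional[str] = None
    strain: Optional[str] = None


@dataclass
class GenomeAnnotation:
    id: str
    version: str
    digest: str
    content_url: str
    authority: str
    # Either an inlined assembly or just its identifier (assumed field name).
    reference_assembly: Union[GenomeAssembly, str]


@dataclass
class GeneAnnotation:
    id: str
    # Either an inlined dataset or just its identifier.
    referenced_in: Union[GenomeAnnotation, str]


@dataclass
class AnnotationCollection:
    """Root container holding one list per item type."""
    gene_annotations: List[GeneAnnotation] = field(default_factory=list)
    genome_annotations: List[GenomeAnnotation] = field(default_factory=list)
    genome_assemblies: List[GenomeAssembly] = field(default_factory=list)


# Invented identifiers and values for illustration only.
assembly = GenomeAssembly(id="assembly-1", version="1.0")
dataset = GenomeAnnotation(
    id="annotation-1",
    version="2024-01",
    digest="sha256:example",
    content_url="https://example.org/annotation.gff3",
    authority="example-authority",
    reference_assembly=assembly.id,
)
collection = AnnotationCollection(
    gene_annotations=[GeneAnnotation(id="gene-1", referenced_in=dataset.id)],
    genome_annotations=[dataset],
    genome_assemblies=[assembly],
)
```

Referencing the dataset and assembly by identifier keeps each object in exactly one list of the root container, which is the arrangement that list-style inlining at the collection level suggests.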
Class details:

1. GeneAnnotation (is_a: gene)

   - Core identity and provenance:

     - slots: source_id, molecular_type
     - attributes:
       - referenced_in: required, inlined, any_of: [GenomeAnnotation, string] (the genome annotation dataset this gene came from)

   - GFF3 gene columns (keep minimal and appropriate for “gene”):

     - seqid, source, type (constrain to “gene”), start, end, score, strand
     - Do NOT include “phase” for the gene feature (phase is for CDS)
     - Use a one-based integer type for start and end

   - Unified attributes (Column 9 harmonization across NCBI/Ensembl):

     - molecular_type: any_of: [BioType, string] with small controlled vocabulary (see enums)

     - source_id: schema:identifier for the authority-specific identifier for this gene

     - Optional harmonized attributes with annotations for original provenance:

     - Provide slot_usage with exact_mappings/narrow_mappings to GFF3 attributes where applicable (e.g., ID, Name, Alias, Dbxref) and to the GFF3 columns (seqid, source, type, start, end, score, strand).

   - Constraints:

     - type should be constrained to “gene” via an enum (e.g., GeneFeatureType with permissible value “gene”)
     - start/end should use a custom integer type with minimum_value: 1
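
The two constraints above can be sketched in Python as a single-value enum plus a one-based integer check. The enum name `GeneFeatureType` comes from the prompt; the helper names are hypothetical. The `start <= end` check is added here from the GFF3 specification and is not requested by the prompt.

```python
from enum import Enum


class GeneFeatureType(Enum):
    """Single permissible value for the GFF3 type column."""
    gene = "gene"


def one_based(value) -> int:
    """Convert to int and reject anything below 1 (GFF3 coordinates are 1-based)."""
    number = int(value)
    if number < 1:
        raise ValueError(f"coordinate must be >= 1, got {number}")
    return number


def check_gene_coordinates(feature_type: str, start, end):
    """Validate the type, start, and end columns of one gene row."""
    kind = GeneFeatureType(feature_type)  # raises ValueError unless "gene"
    start_pos, end_pos = one_based(start), one_based(end)
    if start_pos > end_pos:
        # Added check: the GFF3 specification requires start <= end.
        raise ValueError(f"start {start_pos} is greater than end {end_pos}")
    return kind, start_pos, end_pos
```

A row with type `mRNA`, a start of `0`, or a start greater than its end raises `ValueError`, which mirrors what a validator for the enum and the minimum-value type would reject.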

This is a manually curated project that contains stable and un-versioned reference sequences designed specifically for reporting sequence variants with clinical implications.
The sequences of each locus (also called LRG) are chosen in collaboration with research and diagnostic laboratories, LSDB (locus specific database) curators and mutation consortia with expertise in the region of interest.
LRG website: http://www.lrg-sequence.org
LRG data are freely available in several formats (FASTA, BED, XML, Tabulated) at this address: http://www.lrg-sequence.org/downloads