From d5e6b6150493b04b486b4eb5bc0934a7ef264797 Mon Sep 17 00:00:00 2001 From: woollard Date: Fri, 13 Oct 2023 11:18:03 +0100 Subject: [PATCH 01/34] added tag_querying documentation for the new tagging functionality. --- .../programmatic-access/tag_querying.rst | 262 ++++++++++++++++++ 1 file changed, 262 insertions(+) create mode 100644 retrieval/programmatic-access/tag_querying.rst diff --git a/retrieval/programmatic-access/tag_querying.rst b/retrieval/programmatic-access/tag_querying.rst new file mode 100644 index 00000000..9df39684 --- /dev/null +++ b/retrieval/programmatic-access/tag_querying.rst @@ -0,0 +1,262 @@ +======================== +Text Tags On ENA Objects +======================== + +----------------- +Table of Contents +----------------- + +* What are Tags and Why are they Useful? +* How many Tags can an Object Possess? +* What Tags are Available? +* How are the Tags Created? +* Miscellaneous + +.. _my-reference-label: + +-------------------------------------- +What are Tags and Why are they Useful? +-------------------------------------- +The tags are controlled textual annotations provided to objects, such as sample and taxonomy. + +The purpose of these is to make searching and filtering much easier. In ENA they are often used to determine object membership of certain data portals. Vice versa they can also be used to easily access vignettes of data from which to build a new data portal rapidly. + +Examples: + +* Find all pathogenic samples by using the “pathogen” tag (this is used to drive the data coverage of the `Pathogens Portal `_.) +* Use “marine:high_confidence” tag to find all samples that are highly likely to be from the marine environment. +* Find all records in ENA data that have a corresponding record cross-referenced to the `WoRMS - World Register of Marine Species `_, by searching “xref:worms”. + +The tagging system has proved useful in determining the object membership of certain domain specific data portals such as Pathogens Portal. 
Conversely, they can also be used to easily obtain vignettes of data from which to build a new data portal rapidly. + +------------------------------------ +How many Tags can an Object Possess? +------------------------------------ +An object such as a sample can have zero, one, or multiple tags. + +A sample could, for example, be tagged as both “marine:high_confidence” and “terrestrial:low_confidence”. + +------------------------ +What Tags are Available? +------------------------ + +Most of the sample and taxonomy tags have the format: high_level_tag:low_level_tag. The high level tag often provides broader context for the more granular low level tag. + + +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Table of Object High Level Tags +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + +.. csv-table:: High Level Tags + :header: "high level tag", "description", "object type" + :widths: 20, 300, 50 + + "pathogen", "The sample has been automatically determined to belong to the Pathogens Portal", "assembly; sample; sequence; study; secondary_study; taxonomy" + "coastal_brackish", "The sample has been automatically determined by evaluation of GPS and other parameters to have some evidence of being collected from either a coastal or brackish environment.", "read_run; sample; taxonomy" + "freshwater", "The sample has been automatically determined by evaluation of GPS and other parameters to have some evidence of being collected from a freshwater environment.", "read_run; sample; taxonomy" + "marine", "The sample has been automatically determined by evaluation of GPS and other parameters to have some evidence of being collected from a marine environment.", "read_run; sample; taxonomy" + "terrestrial", "The sample has been automatically determined by evaluation of GPS and other parameters to have some evidence of being collected from a terrestrial environment.", "read_run; sample; taxonomy" + "datahub", "The sample has been automatically determined to belong to a datahub. 
Currently, tags have been generated for `FAANG `_ and `Pathogen `_", "analysis; read_run; sample; secondary_study" + "xref", "The sample has been referenced in a repository external to EMBL-EBI. Currently, tags have been generated for WoRMS and UniEuk.", "Depends on how the user submitted" + "covid19", "The sample has been automatically determined to belong to the COVID19 portal.", "analysis; read_run; sample; sequence; study" + + + +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Table of All Object High and Low Level Tags +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +These tags apply to several types of objects, especially sample and taxonomy. Please see the object types in the previous (high level tag) +table to see which objects each tag applies to. + +.. list-table:: Object High and Low Level Tags + :widths: 15 10 30 10 10 + :header-rows: 1 + + * - tag presentation + - high level tag + - low level tag + - description + - comment + * - pathogen + - pathogen + - + - Object is some type of pathogen + - There will likely be other low level tags to provide context. + * - pathogen:priority + - pathogen + - priority + - + - + * - pathogen:bacterium + - pathogen + - bacterium + - Object is of a bacterium organism. + - At the time of documentation the bacterium is not specifically pathogenic. + * - pathogen:fungus + - pathogen + - fungus + - Object is of a fungus organism. + - At the time of documentation the fungus is not specifically pathogenic. + * - pathogen:helminth + - pathogen + - helminth + - + - At the time of documentation the helminth is not specifically pathogenic. + * - pathogen:protozoan + - pathogen + - protozoan + - Object is of a protozoan organism. + - At the time of documentation the protozoan is not specifically pathogenic. + * - pathogen:virus + - pathogen + - virus + - Object is of a virus organism. + - At the time of documentation the virus is not specifically pathogenic. 
+ * - coastal_brackish + - coastal_brackish + - + - Some evidence that the object is “coastal or brackish” environment associated. + - There will likely be other low level tags to provide context. + * - coastal_brackish:high_confidence + - coastal_brackish + - high_confidence + - Strong evidence that the object is “coastal or brackish” environment associated. + - + * - coastal_brackish:medium_confidence + - coastal_brackish + - medium_confidence + - Moderate evidence that the object is “coastal or brackish” environment associated. + - + * - coastal_brackish:low_confidence + - coastal_brackish + - low_confidence + - Weak evidence that the object is “coastal or brackish” environment associated. + - + * - freshwater + - freshwater + - + - Some evidence that the object is “freshwater” environment associated. + - There will likely be other low level tags to provide context. + * - freshwater:high_confidence + - freshwater + - high_confidence + - Strong evidence that the object is freshwater environment associated. + - + * - freshwater:medium_confidence + - freshwater + - medium_confidence + - Moderate evidence that the object is freshwater environment associated. + - + * - freshwater:low_confidence + - freshwater + - low_confidence + - Weak evidence that the object is freshwater environment associated. + - + * - marine + - marine + - + - Some evidence that the object is “marine” environment associated. + - There will likely be other low level tags to provide context. + * - marine:high_confidence + - marine + - high_confidence + - Strong evidence that the object is marine environment associated. + - + * - marine:medium_confidence + - marine + - medium_confidence + - Moderate evidence that the object is marine environment associated. + - + * - marine:low_confidence + - marine + - low_confidence + - Weak evidence that the object is marine environment associated. + - + * - terrestrial + - terrestrial + - + - Some evidence that the object is terrestrial (land) environment associated. 
+ - There will likely be other low level tags to provide context. + * - terrestrial:high_confidence + - terrestrial + - high_confidence + - Strong evidence that the object is terrestrial (land) environment associated. + - + * - terrestrial:medium_confidence + - terrestrial + - medium_confidence + - Moderate evidence that the object is terrestrial (land) environment associated. + - + * - terrestrial:low_confidence + - terrestrial + - low_confidence + - Weak evidence that the object is terrestrial (land) environment associated. + - + * - datahub:faang + - datahub + - faang + - Is a `Functional Annotation of ANimal Genomes project (FAANG) `_ sample and present in that datahub + - + * - datahub:metagenome + - datahub + - metagenome + - Is a metagenome and present in that datahub + - + * - xref:arrayexpress + - xref + - arrayexpress + - Object associated with an `ArrayExpress `_ record + - An xref is available that links to ArrayExpress + * - xref:europepmc + - xref + - europepmc + - Object associated with a `Europe PMC `_ record + - An xref is available that links to Europe PMC + * - xref:pubmed + - xref + - pubmed + - Object associated with an `NCBI PubMed `_ record + - An xref is available that links to NCBI PubMed + * - xref:worms + - xref + - worms + - Object associated with a `WoRMS `_ record + - + * - xref:unieuk + - xref + - unieuk + - Object associated with a `UniEuk (Universal taxonomic framework and integrated reference gene databases for Eukaryotic biology, ecology, and evolution) `_ record + - An xref is available that links to UniEuk + * - covid19 + - + - covid19 + - Object associated with covid19 + - + * - covid19Host + - + - covid19Host + - Object associated with a covid19 Host + - + +------------------------- +How are the Tags Created? +------------------------- + +The tags are typically assigned by automatic processes that analyse the user-supplied metadata around an object. 
+ +For example, the identification of “marine” sample records is systematically assessed by a combination of geo-coordinates and taxonomic evidence. We further qualify each identification with a level of confidence, which is determined by the combination of evidence available on the record to support the assertion. + +This is an evolving and continuously improving process: the algorithms and rule-sets used for classification can be updated as new insights are obtained, so the assigned tags are regularly refreshed. The flexibility of this system allows new classifications to be created easily, enabling the definition of new, high-level contextual groupings for ENA data and making discovery more intuitive for certain user communities. + + +------------- +Miscellaneous +------------- + +The tags are all less than 21 Unicode characters in length. + +N.B. The tags described in this page are not to be confused with Locus Tags. + + From 8d30ebe910ac3b56318cdfe615756103f556ad03 Mon Sep 17 00:00:00 2001 From: woollard Date: Tue, 17 Oct 2023 14:21:11 +0100 Subject: [PATCH 02/34] ignoring ../.vscode/ --- .gitignore | 1 + 1 file changed, 1 insertion(+) diff --git a/.gitignore b/.gitignore index 090a1f02..6d8e917e 100644 --- a/.gitignore +++ b/.gitignore @@ -1,2 +1,3 @@ .idea .DS_Store +.vscode/ From e93de1b399e357d322fc072acfff11a4b542d6f3 Mon Sep 17 00:00:00 2001 From: woollard Date: Tue, 16 Jan 2024 15:53:23 +0000 Subject: [PATCH 03/34] Created initial markdown for the sample checklist introductions and background --- .../sample_checklist_introduction.md | 25 +++++++++++++++++++ 1 file changed, 25 insertions(+) create mode 100644 submit/samples/SampleChecklists/sample_checklist_introduction.md diff --git a/submit/samples/SampleChecklists/sample_checklist_introduction.md b/submit/samples/SampleChecklists/sample_checklist_introduction.md new file mode 100644 index 00000000..9923296d --- /dev/null +++ 
b/submit/samples/SampleChecklists/sample_checklist_introduction.md @@ -0,0 +1,25 @@ +# Introduction + +Sample checklists are used to ensure that both the minimum core metadata and metadata specific to different sample types are submitted to ENA. Please see: [background to sample checklists in ENA](https://ena-browser-docs.readthedocs.io/en/latest/browser/sample-checklists.html) and the available [ENA sample checklists](https://www.ebi.ac.uk/ena/browser/checklists). + +The [Genomic Standards Consortium (GSC)](http://www.gensc.org//pages/projects/mixs-gsc-project.html) works with many communities to generate the “Minimum Information about any (X) Sequence” (MIxS) specifications. ENA and other INSDC members implement the MIxS standards. Essentially, these consist of: +* Community specific checklists, but with each having a core of shared metadata terms. +* Metadata terms, each with a specific name and definition. +* Sometimes a required pattern for the value, ranging from an integer to free text. + +## Working together on Improving Standards +ENA collaborates with [GSC](http://www.gensc.org//pages/projects/mixs-gsc-project.html), [INSDC](https://www.insdc.org/) and other standards bodies to help meet our increasingly diverse user needs and increase interoperability. Sequencing technologies continue to evolve at pace, and scientists apply them to investigate basic biology, disease and biodiversity. + +One consideration with these standards is that the actual implementation varies between organisations. Generally, we try to minimise the differences to increase interoperability. Here are some examples: +* In ENA, we use the **long term name** (called "title" in GSC MIxS) rather than the **short term name**. This is because some of the short names are ambiguous abbreviations, so the longer names provide more clarity. +* In MIxS, the checklists are called **combinations**; these consist of **core** terms and **extension** terms. 
In ENA, a subset of these terms is not in the sample checklists, e.g. taxonomy is handled separately. +* In ENA, some terms cover broader concepts than in MIxS, e.g. we use the **depth** term more generally: rather than just **soil depth**, the same term also covers **depth below sea level**. +* Several MIxS terms, such as **miscellaneous attribute**, are not used in the ENA checklists, as they are ambiguous and not interoperable. +We regularly exchange suggested changes to definitions, term names and additional terms with these standards bodies. + +## Time Scales of Updates +We try to strike a balance between being stable and conservative and being responsive enough to the community's needs. +* Generally, ENA and other INSDC members commit to checklist updates following the major MIxS releases, e.g. 4.0, 5.0, 6.0, 7.0. These typically occur every 2 to 3 years. + * Updates, even with much automation, can take many weeks of full-time-equivalent work to add and quality control. + * Sometimes terms change names and then change back again between sub-releases. +* If important terms, improved term definitions or even checklists are needed by ENA's user communities, we often add them rapidly. 
From e6a1f9fee8c5bc61261aad0bbfb4ce6e56bc3a1e Mon Sep 17 00:00:00 2001 From: woollard Date: Tue, 16 Jan 2024 15:53:49 +0000 Subject: [PATCH 04/34] Created initial markdown for the sample checklist MIxS_V6.2 update --- .../2024-01-31:Incorporating_MIxS_V6.2.md | 81 +++++++++++++++++++ 1 file changed, 81 insertions(+) create mode 100644 submit/samples/SampleChecklists/SampleChecklistUpdates/2024-01-31:Incorporating_MIxS_V6.2.md diff --git a/submit/samples/SampleChecklists/SampleChecklistUpdates/2024-01-31:Incorporating_MIxS_V6.2.md b/submit/samples/SampleChecklists/SampleChecklistUpdates/2024-01-31:Incorporating_MIxS_V6.2.md new file mode 100644 index 00000000..12f16d2f --- /dev/null +++ b/submit/samples/SampleChecklists/SampleChecklistUpdates/2024-01-31:Incorporating_MIxS_V6.2.md @@ -0,0 +1,81 @@ +# ENA Checklists Update Incorporating MIxS V6.2 +Checklists Updated: January 2024 + +## Summary of ENA Checklists after the MIxS v6.2 Update +* Four new MIxS checklists have been added to ENA: GSC MIxS Agriculture, GSC MIxS Food and Production, GSC MIxS Symbiont, and GSC MIxS Hydrocarbon. +* Fifteen existing MIxS checklists in ENA had new checklist terms added. + * Three had many new terms: GSC MIxS built environment (66), GSC MIxS plant-associated (24) and GSC MIxS sediment (14). + * Twelve checklists had between 1 and 8 new terms added. +* 368 new MIxS terms were added to the ENA checklist system. There are now 1031 ENA sample checklist terms. +* 47 aliases (synonyms) of terms were added, e.g. where the MIxS term name had changed, or there was now a MIxS term for the same concept as an existing legacy ENA term. Wherever appropriate we use the MIxS term. + +This and similar metadata updates are important to both: +1. meet the needs of the diverse data submitters to ENA and +2. ensure interoperability of ENA-submitted metadata with that of other INSDC members and other portals. Please see the background to sample checklists in ENA for more information. 
+ +This will take effect from 31-Jan-2024. + +--- +## Introduction +[Please read this background about sample level checklists](../sample_checklist_introduction.md) and GSC MIxS. + +A growing proportion of ENA's sample level checklists come from MIxS; currently, 22 of the 52 sample checklists are MIxS-derived. Most of ENA’s other checklists come from legacy sources. + + +## Four New MIxS Derived Checklists in ENA + +| New checklist Name in ENA | Deeper background to the checklist creation | Comment for ENA | +|------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| GSC MIxS Agriculture | [Community-Driven Metadata Standards for Agricultural Microbiome Research](https://apsjournals.apsnet.org/doi/10.1094/PBIOMES-09-19-0051-P) | | +| GSC MIxS Food and Production | | Built from five MIxS packages, as there is much overlap (food-human foods, food-farm environment, food-food production facility, food-animal and animal feed). N.B. A dozen terms are currently excluded, as they were mainly agriculture and/or soil sample related. 
| +| GSC MIxS Symbiont | [MIxS-SA: a MIxS extension defining the minimum information standard for sequence data from symbiont-associated micro-organisms](https://www.nature.com/articles/s43705-022-00092-w) | | +| GSC MIxS Hydrocarbon | [MIxS-HCR: a MIxS extension defining a minimal information standard for sequence data from environments pertaining to hydrocarbon resources](https://www.nature.com/articles/s43705-022-00092-w) | All added apart from “additional info” | + +## Fifteen existing MIxS checklists in ENA have had new checklist terms added. + +* For twelve checklists between 1 and 8 new terms were added to these GSC MIxS checklists: air, host, human-associated, human-gut, human-oral, human-vaginal, microbial mat biofilm, miscellaneous natural or artificial environment, soil, wastewater sludge, and water +* For the following three checklists there was a more substantial addition: + * 66 terms added to the GSC MIxS built environment + * 24 terms added to the GSC MIxS plant-associated + * 14 terms added to the GSC MIxS sediment + + +# Summary Tables of Term Counts and Terms Added to Existing Checklists + +## Summary Table of Terms (all sample based) +| Count | What | +|-------|--------------------------------------------------------------------------| +| 1031 | total terms now in ENA | +| 368 | new terms not in ENA were added from MIxS | +| 47 | aliases added | +| 16 | existing definitions updated | +| 3 | MIxS v6.2 terms were not added to ENA, such as "miscellaneous attribute" | + +## Table of Terms Added to Each Checklist (all sample based) +Terms are listed only where they were added to existing checklists. 
+ + +| Checklist | New or existing | Comment | +|--------------------------------------------------------------------------------------------------|-------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| **GSC MIxS** Agriculture | New | N.B. From four or so MIxS packages | +| **GSC MIxS** Food and Production | New |
  • Combined from several MIxS packages, as there is so much overlap
  • About a dozen terms seemed out of place and a better fit for agriculture and/or soil, so these were excluded | +| **GSC MIxS** Symbiont | New | | +| **GSC MIxS** Hydrocarbon | New | All added apart from “additional info” | +| **GSC MIxS** air | existing | new terms added:
  • depth taxonomic
  • classification | +| **GSC MIxS** built environment | existing | new terms added: