Merge pull request #56 from poseidon-framework/poseidon27

stschiff · web-flow · commit 7898bb7e903a · 2023-01-31T10:12:42.000+01:00
Poseidon v2.7.0
diff --git a/POSEIDON_yml_fields.tsv b/POSEIDON_yml_fields.tsv
@@ -19,6 +19,8 @@ indFileChkSum	1	genotypeData	md5 checksum of the indFile	String		FALSE
 snpSet	1	genotypeData	Can be either 1240K, HumanOrigins or Other depending on the list of SNPs used	String	(1240K|HumanOrigins|Other)	FALSE
 jannoFile	0		relative path to jannoFile	String	Path	FALSE
 jannoFileChkSum	0		md5 checksum of the jannoFile	String		FALSE
+sequencingSourceFile	0		relative path to sequencingSourceFile	String	Path	FALSE
+sequencingSourceFileChkSum	0		md5 checksum of the sequencingSourceFile	String		FALSE
 bibFile	0		relative path to bibFile	String	Path	FALSE
 bibFileChkSum	0		md5 checksum of the bibFile	String		FALSE
 readmeFile	0		relative path to readmeFile	String	Path	FALSE
diff --git a/README.md b/README.md
@@ -148,4 +148,6 @@ V 1.1.0: The authors of @Gassenhauer_2021 made some previously restricted sample
 V 1.0.0: Creation of the package
 ```
 
+## The Sequencing Source file
 
+Poseidon 2.7.0 added an option to specify sequencing source data. This is a tab-separated table, much like the Janno file, but it follows a different schema, specified in the file `sequencingSourceFile_columns.tsv`. Note that the primary entities in this table are Sequencing entities (typically corresponding to DNA libraries or even multiple runs/lanes of the same library). The link to the individuals listed in the Janno file is made through a foreign-key relationship into `Poseidon_ID`.
diff --git a/sequencingSourceFile_columns.tsv b/sequencingSourceFile_columns.tsv
@@ -0,0 +1,22 @@
+sequencingSourceFile_column_name	description	data_type	multi	choice	range	choice_options	range_lower	range_upper	mandatory	unique
+
+Poseidon_ID	The Poseidon_ID field that this sequencing entity corresponds to, from the Janno-file.	String	FALSE	FALSE	FALSE				TRUE	FALSE
+sample_accession	The sample accession code as used in INSDC databases, including ENA and SRA (Example: SAMEA7050454)	String	FALSE	FALSE	FALSE				TRUE	TRUE
+study_accession	The study accession code as used in INSDC databases, including ENA and SRA (Example: PRJEB39316)	String	FALSE	FALSE	FALSE				FALSE	FALSE
+run_accession	The run accession code as used in INSDC databases, including ENA and SRA (Example: ERR4331996)	String	FALSE	FALSE	FALSE				FALSE	FALSE
+sample_alias	The sample alias defined by the submitter	String	FALSE	FALSE	FALSE				FALSE	FALSE
+secondary_sample_accession	A secondary sample accession, as used at the ENA for historical reasons (Example: ERS4811084)	String	FALSE	FALSE	FALSE				FALSE	TRUE
+first_public	The date (YYYY-MM-DD) this sample was first made public	Date	FALSE	FALSE	FALSE				FALSE	FALSE
+last_updated	The date (YYYY-MM-DD) this sample was last updated	Date	FALSE	FALSE	FALSE				FALSE	FALSE
+instrument_model	The name of the instrument used (Example: Illumina HiSeq 2500)	String	FALSE	FALSE	FALSE				FALSE	FALSE
+library_layout	The library layout of the sequencing entity (Example: SINGLE)	String	FALSE	FALSE	FALSE				FALSE	FALSE
+library_source	The source of the DNA library (Example: GENOMIC)	String	FALSE	FALSE	FALSE				FALSE	FALSE
+instrument_platform	The platform brand or type of the sequencer (Example: ILLUMINA)	String	FALSE	FALSE	FALSE				FALSE	FALSE
+library_name	This is the library name the submitter has entered. Can sometimes be useful to figure out which Poseidon_ID this entity belongs to	String	FALSE	FALSE	FALSE				FALSE	FALSE
+library_strategy	The strategy used to create the library (Example: WGS)	String	FALSE	FALSE	FALSE				FALSE	FALSE
+fastq_ftp	The FTP-link(s) (URL) to the FASTQ file(s) (Example: ftp.sra.ebi.ac.uk/vol1/fastq/ERR433/009/ERR4332639/ERR4332639.fastq.gz)	URL	TRUE	FALSE	FALSE				FALSE	FALSE
+fastq_aspera	The Aspera-link (URL) to the FASTQ-file(s). (Example: fasp.sra.ebi.ac.uk:/vol1/fastq/ERR433/009/ERR4332639/ERR4332639.fastq.gz)	URL	TRUE	FALSE	FALSE				FALSE	FALSE
+fastq_bytes	The size of the FASTQ-file(s) in bytes	Integer	TRUE	FALSE	TRUE		0	Inf	FALSE	FALSE
+fastq_md5	The MD5 hash(es) of the FASTQ-file(s)	String	TRUE	FALSE	FALSE				FALSE	FALSE
+read_count	The number of reads	Integer	FALSE	FALSE	TRUE		0	Inf	FALSE	FALSE
+submitted_ftp	The URL(s) to the originally submitted file(s) before it got converted to FASTQ. This can sometimes be helpful for processing	String	TRUE	FALSE	FALSE				FALSE	FALSE
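
The README addition describes the link between sequencing entities and individuals as a foreign-key relationship into `Poseidon_ID`, and the schema marks `Poseidon_ID` and `sample_accession` as mandatory and `sample_accession` as unique. A minimal sketch of how those three constraints could be checked is given below. It is not the validation performed by Poseidon's own tooling; the function names `read_tsv` and `check_sequencing_source` are hypothetical, the sample IDs and accessions are made-up placeholders, and the check covers only a subset of the schema (for example, it ignores `secondary_sample_accession` and all type and range rules).

```python
import csv
import io

# Taken from sequencingSourceFile_columns.tsv as listed in this diff.
MANDATORY = ("Poseidon_ID", "sample_accession")  # mandatory = TRUE
UNIQUE = ("sample_accession",)                   # subset of the unique = TRUE columns


def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def check_sequencing_source(ssf_rows, janno_rows):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    # Set of valid foreign-key targets from the Janno file.
    known_ids = {(row.get("Poseidon_ID") or "").strip() for row in janno_rows}
    seen = {col: set() for col in UNIQUE}
    for number, row in enumerate(ssf_rows, start=1):
        for col in MANDATORY:
            if not (row.get(col) or "").strip():
                problems.append(f"row {number}: missing mandatory value for {col}")
        pid = (row.get("Poseidon_ID") or "").strip()
        if pid and pid not in known_ids:
            problems.append(f"row {number}: Poseidon_ID {pid!r} not found in the Janno file")
        for col in UNIQUE:
            value = (row.get(col) or "").strip()
            if value:
                if value in seen[col]:
                    problems.append(f"row {number}: duplicate {col} {value!r}")
                seen[col].add(value)
    return problems


# Placeholder data for illustration only (not real accessions).
janno_text = "Poseidon_ID\tGenetic_Sex\tGroup_Name\nInd001\tF\tPopA\nInd002\tM\tPopA\n"
ssf_text = (
    "Poseidon_ID\tsample_accession\trun_accession\n"
    "Ind001\tACC1\tRUN1\n"
    "Ind001\tACC2\tRUN2\n"
    "Ind003\tACC2\tRUN3\n"
)

for problem in check_sequencing_source(read_tsv(ssf_text), read_tsv(janno_text)):
    print(problem)
```

With the placeholder data, the first two rows pass (several sequencing entities may point at the same `Poseidon_ID`, since that column is not unique), while the third row is reported twice: once for an unknown `Poseidon_ID` and once for a repeated `sample_accession`.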