Commit 9db1d1a

Merge pull request #27 from poseidon-framework/more_freedom
updated and simplified format definition
2 parents e866f6c + f8c92c3 commit 9db1d1a

File tree

2 files changed

+105
-120
lines changed

POSEIDON_yml_fields.tsv

Lines changed: 17 additions & 10 deletions
@@ -1,16 +1,23 @@
11
field level parent description type format mandatory unique
2-
poseidonVersion 0 poseidon v.2 package version (e.g. 2.0.1) String TRUE TRUE
2+
poseidonVersion 0 Poseidon package format version (e.g. 2.0.1) String TRUE TRUE
33
title 0 title of the package String TRUE TRUE
44
description 0 some descriptive words about the package String FALSE TRUE
5-
contributor 0 list of contributors to the package (not the data producer/publication author, but the poseidon package creator), each with name and email Array TRUE TRUE
5+
contributor 0 list of contributors to the package (not the data producer/publication author, but the Poseidon package creator), each with name and email Array TRUE TRUE
66
name 1 contributor name of one contributor String TRUE FALSE
77
email 1 contributor email of one contributor (must be a valid email address) String Email TRUE FALSE
8-
packageVersion 0 version of the package (should be changed/incremented when the package is changed) String FALSE TRUE
9-
lastModified 0 date of last modification of the poseidon package (should be updated when the package is changed) Date YYYY-MM-DD FALSE TRUE
10-
bibFile 0 file name (.bib) String Path FALSE TRUE
8+
packageVersion 0 version of the package (should be changed/incremented when the package is changed) String TRUE TRUE
9+
lastModified 0 date of last modification of the Poseidon package (should be updated when the package is changed) Date YYYY-MM-DD TRUE TRUE
1110
genotypeData 0 genotype file name section TRUE TRUE
12-
format 1 genotypeData file format definition, only allows PLINK right now String TRUE TRUE
13-
genoFile 1 genotypeData file name (.bed) String Path TRUE TRUE
14-
snpFile 1 genotypeData file name (.bim) String Path TRUE TRUE
15-
indFile 1 genotypeData file name (.fam String Path TRUE TRUE
16-
jannoFile 0 file name (.janno) String Path TRUE TRUE
11+
format 1 genotypeData file format definition, allows EIGENSTRAT and PLINK String TRUE TRUE
12+
genoFile 1 genotypeData relative path to genoFile String Path TRUE TRUE
13+
genoFileChkSum 1 genotypeData md5 checksum of the genoFile String FALSE TRUE
14+
snpFile 1 genotypeData relative path to snpFile String Path TRUE TRUE
15+
snpFileChkSum 1 genotypeData md5 checksum of the snpFile String FALSE TRUE
16+
indFile 1 genotypeData relative path to indFile String Path TRUE TRUE
17+
indFileChkSum 1 genotypeData md5 checksum of the indFile String FALSE TRUE
18+
jannoFile 0 relative path to jannoFile String Path FALSE TRUE
19+
jannoFileChkSum 1 genotypeData md5 checksum of the jannoFile String FALSE TRUE
20+
bibFile 0 relative path to bibFile String Path FALSE TRUE
21+
bibFileChkSum 1 genotypeData md5 checksum of the bibFile String FALSE TRUE
22+
readmeFile 0 relative path to readmeFile String Path FALSE TRUE
23+
changelogFile 0 relative path to changelogFile String Path FALSE TRUE
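
As an illustration of how the updated field table might be applied, the following Python sketch checks a parsed `POSEIDON.yml` mapping for the fields marked mandatory in the added (`+`) rows. The function name `missing_fields` and the wording of the messages are assumptions made for this example, not part of the schema. Loading the YAML file itself (for instance with PyYAML's `yaml.safe_load`) is left out so the block has no dependencies.

```python
# Fields marked TRUE in the "mandatory" column of the updated table.
MANDATORY_TOP_LEVEL = ["poseidonVersion", "title", "contributor",
                       "packageVersion", "lastModified", "genotypeData"]
MANDATORY_CONTRIBUTOR = ["name", "email"]
MANDATORY_GENOTYPE = ["format", "genoFile", "snpFile", "indFile"]
ALLOWED_FORMATS = {"EIGENSTRAT", "PLINK"}


def missing_fields(package):
    """Return a list of problems found in a parsed POSEIDON.yml mapping.

    Hypothetical helper: the name and message format are illustrative only.
    """
    problems = [f"missing field: {key}"
                for key in MANDATORY_TOP_LEVEL if key not in package]
    # Every contributor entry needs both a name and an email.
    for index, person in enumerate(package.get("contributor", [])):
        for key in MANDATORY_CONTRIBUTOR:
            if key not in person:
                problems.append(f"missing field: contributor[{index}].{key}")
    genotype = package.get("genotypeData")
    if genotype is not None:
        for key in MANDATORY_GENOTYPE:
            if key not in genotype:
                problems.append(f"missing field: genotypeData.{key}")
        file_format = genotype.get("format")
        if file_format is not None and file_format not in ALLOWED_FORMATS:
            problems.append(f"unsupported genotypeData.format: {file_format}")
    return problems
```

Optional fields such as the `...ChkSum` entries, `jannoFile`, `bibFile`, `readmeFile` and `changelogFile` are deliberately not checked here.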

README.md

Lines changed: 88 additions & 110 deletions
@@ -1,18 +1,15 @@
1-
# Poseidon v.2: DAG Genotype Data Organisation
1+
# Poseidon: Genotype Data Organisation
22

3-
Poseidon v.2 is a solution for genotype data organisation established within the Department of Archaeogenetics at the Max Planck Institute for the Science of Human History (MPI-SHH) in Jena.
3+
Poseidon is a solution for genotype data organisation established within the Department of Archaeogenetics at the Max Planck Institute for the Science of Human History (MPI-SHH) in Jena.
44

55
- [The Poseidon package](#the-poseidon-package)
6-
- [Structure](#structure)
7-
- [The `POSEIDON.yml` file](#the-poseidonyml-file-mandatory)
8-
- [The `X.janno` file](#the-xjanno-file-mandatory)
9-
- [The `X.bed`, `X.bim`, `X.fam` files](#the-xbed-xbim-xfam-files-mandatory)
10-
- [The `README.txt` file](#the-readmetxt-file-optional)
11-
- [The `CHANGELOG.txt` file](#the-changelogtxt-file-optional)
12-
- [The `LITERATURE.bib` file](#the-literaturebib-file-optional)
13-
- [Naming Poseidon v.2 `package`s](#naming-poseidon-v2-packages)
14-
15-
***
6+
* [Structure](#structure)
7+
* [The `POSEIDON.yml` file](#the--poseidonyml--file)
8+
* [Genotype data](#genotype-data)
9+
* [The `.janno` file](#the--janno--file)
10+
* [The `.bib` file](#the--bib--file)
11+
* [The `README.txt` file](#the--readmetxt--file)
12+
* [The `CHANGELOG.txt` file](#the--changelogtxt--file)
1613

1714
## The Poseidon package
1815

@@ -22,156 +19,137 @@ All ancient and modern data are distributed into so-called packages, which are d
2219

2320
Every package should have the following files:
2421

25-
- The `POSEIDON.yml` file
26-
- The `X.janno` file
27-
- The `X.bed`, `X.bim`, `X.fam` files
22+
- A `POSEIDON.yml` file to formally define the package
23+
- Genotype data in EIGENSTRAT or PLINK format
24+
- A `.janno` file to store context information
25+
- A `.bib` file for literature references
2826

29-
It also can contain the following files:
27+
It can also contain the following files:
3028

31-
- The `README.txt` file
32-
- The `CHANGELOG.txt` file
33-
- The `LITERATURE.bib` file
29+
- A `README.txt` file for arbitrary context information
30+
- A `CHANGELOG.txt` file to document changes to the package
3431

3532
Example:
3633

3734
```
3835
Switzerland_LNBA_Roswita/POSEIDON.yml
39-
Switzerland_LNBA_Roswita/Switzerland_LNBA.janno
4036
Switzerland_LNBA_Roswita/Switzerland_LNBA.plink.bed
4137
Switzerland_LNBA_Roswita/Switzerland_LNBA.plink.bim
4238
Switzerland_LNBA_Roswita/Switzerland_LNBA.plink.fam
39+
Switzerland_LNBA_Roswita/Switzerland_LNBA.janno
40+
Switzerland_LNBA_Roswita/Switzerland_LNBA.bib
4341
Switzerland_LNBA_Roswita/README.txt
4442
Switzerland_LNBA_Roswita/CHANGELOG.txt
45-
Switzerland_LNBA_Roswita/LITERATURE.bib
4643
```
4744

48-
### The `POSEIDON.yml` file [mandatory]
45+
### The `POSEIDON.yml` file
4946

50-
The `POSEIDON.yml` file lists metainformation in a standardized, machine-readable format.
47+
The `POSEIDON.yml` file lists relative file paths and metainformation in a standardized, machine-readable format.
5148

52-
- The `POSEIDON.yml` file must be a valid [YAML file](https://yaml.org/).
53-
- The fields of the `POSEIDON.yml` file are documented in the [POSEIDON_yml_fields.tsv file](https://github.com/poseidon-framework/poseidon2-schema/blob/master/POSEIDON_yml_fields.tsv) in this repository.
49+
- It must be a valid [YAML file](https://yaml.org/).
50+
- Its fields are documented in the [POSEIDON_yml_fields.tsv file](https://github.com/poseidon-framework/poseidon2-schema/blob/master/POSEIDON_yml_fields.tsv) in this repository.
5451

5552
Example:
5653

5754
```
58-
poseidonVersion: 2.0.1
59-
title: Schiffels_2016
60-
description: Genetic data published in Schiffels et al. 2016
55+
poseidonVersion: 2.0.2
56+
title: Switzerland_LNBA_Roswita
57+
description: LNBA Switzerland genetic data not yet published # optional
6158
contributor:
62-
- name: Stephan Schiffels
63-
email: stephan.schiffels@institute.org
59+
- name: Roswita Malone
60+
email: roswita.malone@institute.org
6461
- name: Paul Panther
65-
66-
packageVersion: 1.12
67-
lastModified: 2020-02-28
68-
bibFile: LITERATURE.bib
62+
63+
packageVersion: 1.1.2
64+
lastModified: 2021-01-28
6965
genotypeData:
7066
format: PLINK
71-
genoFile: Schiffels_2016.bed
72-
snpFile: Schiffels_2016.bim
73-
indFile: Schiffels_2016.fam
74-
jannoFile : Schiffels_2016.janno
67+
genoFile: Switzerland_LNBA_Roswita.bed
68+
genoFileChkSum: 95b093eefacc1d6499afcfe89b15d56c # optional
69+
snpFile: Switzerland_LNBA_Roswita.bim
70+
snpFileChkSum: 6771d7c873219039ba3d5bdd96031ce3 # optional
71+
indFile: Switzerland_LNBA_Roswita.fam
72+
indFileChkSum: f77dc756666dbfef3bb35191ae15a167 # optional
73+
jannoFile: Switzerland_LNBA_Roswita.janno
74+
jannoFileChkSum: 555d7733135ebcabd032d581381c5d6f # optional
75+
bibFile: sources.bib
76+
bibFileChkSum: 70cd3d5801cee8a93fc2eb40a99c63fa # optional
77+
readmeFile: README.txt # optional
78+
changelogFile: CHANGELOG.txt # optional
7579
```
7680

7781
When a package is modified in any way (e.g. updates of the context information in the `.janno` file), then the `packageVersion` field should be incremented and the `lastModified` field updated to the current date.
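
A minimal Python sketch of that update step, under stated assumptions: the text does not say which part of `packageVersion` to increment, so bumping the last of three dot-separated numbers is a choice made for this example, and the helper names `md5_of_file`, `bump_patch` and `mark_modified` are hypothetical. The md5 helper is included because a modified file also invalidates any optional `...ChkSum` value recorded for it.

```python
import datetime
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex md5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bump_patch(version):
    """Increment the last component of a 'major.minor.patch' version string.

    Assumption: exactly three dot-separated integers, as in the example above.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    return f"{major}.{minor}.{patch + 1}"


def mark_modified(package, today=None):
    """Return a copy of a parsed POSEIDON.yml mapping with version and date updated."""
    updated = dict(package)
    updated["packageVersion"] = bump_patch(str(package["packageVersion"]))
    # lastModified uses the YYYY-MM-DD format given in the field table.
    updated["lastModified"] = (today or datetime.date.today()).isoformat()
    return updated
```

After editing the `.janno` file, for example, one would call `mark_modified` on the loaded mapping and, if `jannoFileChkSum` is present, replace it with `md5_of_file` applied to the `.janno` path.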
7882

79-
### The `X.janno` file [mandatory]
80-
81-
The `.janno` file is a UTF-8 encoded, tab-separated text file with a header line. It holds a clearly defined set of context information (columns) for each sample (rows) in a package.
82-
83-
- The variables (columns), variable types and possible content of the janno file are documented in the [janno_columns.tsv file](https://github.com/poseidon-framework/poseidon2-schema/blob/master/janno_columns.tsv) in this repository.
84-
- A `.janno` file must have all of these columns in exactly this order with exactly these column names.
85-
- If information is unknown or a variable does not apply for a certain sample, then the respective cell(s) can be filled with the NULL value `n/a`. Ideally, a `.janno` file should have the least number of n/a-values possible.
86-
- The order of the samples (rows) in the `.janno` file must be equal to the order in the files that hold the genetic data.
87-
- The values in the columns **Individual_ID** and **Group_Name** must be equal to the terms used in the first and second column of the `.fam` file.
88-
- Multiple columns of the `.janno` file are list columns that hold multiple values (either strings or numerics) separated by `;`
89-
- The decimal separator for all floating point numbers is `.`
83+
### Genotype data
9084

91-
### The `X.bed`, `X.bim`, `X.fam` files [mandatory]
85+
Genotype data in Poseidon packages is stored either in PLINK (binary) or EIGENSTRAT format.
9286

93-
Binary plink genotype files consisting of [`.bed` (PLINK binary biallelic genotype table)](https://www.cog-genomics.org/plink/1.9/formats#bed), [`.bim` (PLINK extended MAP file)](https://www.cog-genomics.org/plink/1.9/formats#bim) and [`.fam` (PLINK sample information)](https://www.cog-genomics.org/plink/1.9/formats#fam).
87+
| | PLINK (binary) | EIGENSTRAT |
88+
|---|---|---|
89+
| genotype file | [`.bed` (binary biallelic genotype table)](https://www.cog-genomics.org/plink/1.9/formats#bed) | [`.geno` (genotype file)](https://github.com/DReichLab/EIG/blob/fb4fb59065055d3622e0f97f0149588eae630a3e/CONVERTF/README#L67) |
90+
| SNP file | [`.bim` (extended MAP file)](https://www.cog-genomics.org/plink/1.9/formats#bim) | [`.snp` (snp file)](https://github.com/DReichLab/EIG/blob/fb4fb59065055d3622e0f97f0149588eae630a3e/CONVERTF/README#L67) |
91+
| individual file | [`.fam` (sample information)](https://www.cog-genomics.org/plink/1.9/formats#fam) | [`.ind` (indiv file)](https://github.com/DReichLab/EIG/blob/fb4fb59065055d3622e0f97f0149588eae630a3e/CONVERTF/README#L67) |
9492

95-
### The `README.txt` file [optional]
93+
### The `.janno` file
9694

97-
The README.txt file contains arbitrary, human-readable information.
98-
99-
Example:
95+
The `.janno` file is a tab-separated text file with a header line. It holds a clearly defined set of context information (columns) for each sample (rows) in a package.
10096

101-
```
102-
This package contains a rather interesting set of samples.
103-
@Uebertruplf_2021 even claimed that they are the most important for this particular area and time period.
104-
```
97+
- The variables (columns), variable types and possible content of the janno file are documented in the [janno_columns.tsv file](https://github.com/poseidon-framework/poseidon2-schema/blob/master/janno_columns.tsv) in this repository.
98+
- A `.janno` file must have all of these columns in exactly this order with exactly these column names.
99+
- If information is unknown or a variable does not apply for a certain sample, then the respective cell(s) can be filled with the NULL value `n/a`.
100+
- The order of the samples (rows) in the `.janno` file must be equal to the order in the files that hold the genetic data.
101+
- The values in the columns **Individual_ID** and **Group_Name** must be equal to the terms used in the genetic data files.
102+
- Multiple columns of the `.janno` file are list columns that hold multiple values (either strings or numerics) separated by `;`.
103+
- The decimal separator for all floating point numbers is `.`.
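
The rules in this list can be sketched in Python roughly as follows. This is an illustration, not a reference parser: the helper names are hypothetical, and the test data would use a reduced header, whereas a real `.janno` file must carry every column defined in `janno_columns.tsv`. Only the `Individual_ID` column name is taken from the text above.

```python
import csv
import io

NULL_VALUE = "n/a"  # the NULL marker named in the rules above


def read_janno(text):
    """Parse tab-separated .janno text into a list of dicts, mapping 'n/a' to None."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    return [{column: (None if value == NULL_VALUE else value)
             for column, value in row.items()}
            for row in reader]


def split_list_cell(cell):
    """Split the cell of a list column on ';'; a NULL cell gives an empty list."""
    return [] if cell is None else cell.split(";")


def same_order(janno_rows, genotype_ids):
    """Check that Individual_ID values appear in the same order as in the genotype data.

    genotype_ids is the list of individual identifiers read from the
    individual file (.fam or .ind); extracting it is left to the caller.
    """
    return [row["Individual_ID"] for row in janno_rows] == list(genotype_ids)
```

Because the decimal separator is always `.`, numeric values obtained from `split_list_cell` can be passed to `float` directly, without locale handling.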
105104

106-
### The `CHANGELOG.txt` file [optional]
105+
### The `.bib` file
107106

108-
Documentation of important changes in the history of a package.
107+
[BibTeX](http://www.bibtex.org/) file with all references listed in the `.janno` file. The BibTeX keys must match the ones used in the `.janno` file.
109108

110109
Example:
111110

112111
```
113-
- 2021_10_01: Fixed a spelling mistake in the site name "Hosenacker"->"Rosenacker".
114-
- 2021_05_05: The authors of @Gassenhauer_2021 made some previously restricted samples for their publication available later and we added them.
115-
- 2021_03_08: Creation of the package.
112+
@article{CassidyPNAS2015,
113+
doi = {10.1073/pnas.1518445113},
114+
url = {https://doi.org/10.1073%2Fpnas.1518445113},
115+
year = 2015,
116+
month = {dec},
117+
publisher = {Proceedings of the National Academy of Sciences},
118+
volume = {113},
119+
number = {2},
120+
pages = {368--373},
121+
author = {Lara M. Cassidy and Rui Martiniano and Eileen M. Murphy and Matthew D. Teasdale and James Mallory and Barrie Hartwell and Daniel G. Bradley},
122+
title = {Neolithic and Bronze Age migration to Ireland and establishment of the insular Atlantic genome},
123+
journal = {Proceedings of the National Academy of Sciences}
124+
}
116125
```
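
One way to check the key requirement is sketched below in Python. It is a rough illustration under clear assumptions: the regular expression only recognises entries of the form `@type{key,` and is not a full BibTeX parser, and the function names are hypothetical. How the keys are collected from the `.janno` file is left to the caller, since the relevant column is not named here.

```python
import re

# Matches the key in entries such as "@article{CassidyPNAS2015,".
ENTRY_KEY = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")


def bib_keys(bib_text):
    """Return the set of entry keys found in BibTeX source text."""
    return set(ENTRY_KEY.findall(bib_text))


def missing_keys(janno_keys, bib_text):
    """Return, sorted, the keys used in the .janno file that the .bib file lacks."""
    return sorted(set(janno_keys) - bib_keys(bib_text))
```

An empty result from `missing_keys` means every reference used in the `.janno` file has a matching entry in the `.bib` file.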
117126

118-
### The `LITERATURE.bib` file [optional]
119-
120-
Bibtex file with all references mentioned in `POSEIDON.yml`, `README.txt` and `CHANGELOG.txt`
121-
122-
***
123-
124-
## Naming Poseidon v.2 `package`s
127+
### The `README.txt` file
125128

126-
The naming of packages should follow a simple scheme:
129+
Informal information accompanying the package.
127130

128-
Ancient published: YEAR_NAME_IDENTIFIER
129-
130-
```
131-
2018_Lamnidis_Fennoscandia
132-
2019_Wang_Caucasus
133-
2019_Flegontov_PaleoEskimo
134-
```
135-
136-
Ancient unpublished: IDENTIFIER_NAME
131+
Example:
137132

138133
```
139-
Switzerland_LNBA_Roswita
140-
Italy_Mesolithic_Paul
141-
SouthEastAsia_Simon
134+
This package contains a rather interesting set of samples relevant for the peopling of the Territory of Christmas Island in the Indian Ocean. We consider this especially relevant, because ...
142135
```
143136

144-
Modern published: YEAR_(NAME)_IDENTIFIER
137+
### The `CHANGELOG.txt` file
145138

146-
```
147-
2015_1000_Genomes-1240K_haploid_pulldown
148-
2016_Mallick_SGDP1240K_diploid_pulldown
149-
2014_Lazaridis_HOmodern
150-
2016_Lazaridis_HOmodern
151-
2019_Flegontov_HO_NewSiberian
152-
2018_Lipson_SEA
153-
```
139+
Documentation of important changes in the history of a package.
154140

155-
Modern unpublished: IDENTIFIER_NAME
141+
Example:
156142

157143
```
158-
Eurasia_newHO_Paul
159-
Afrika_newHO_Andrea
160-
```
161-
162-
Identifiers can be somewhat informal as long as the project is ongoing, they just need to be unique. As soon as a project gets published, we create a final version of the respective package with the YEAR_NAME_IDENTIFIER label.
163-
164-
External projects can be integrated similarly by using their publication name, or by temporary internal identifiers such as `Iron_Age_Boston_Share`.
144+
## 1.2.0
145+
- Fixed a spelling mistake in the site name "Hosenacker"->"Rosenacker".
165146
166-
***
147+
## 1.1.1
148+
- Added mtDNA contamination estimation to .janno file
167149
168-
## DAG internal procedures
150+
## 1.1.0
151+
- The authors of @Gassenhauer_2021 made some previously restricted samples for their publication available later and we added them.
169152
170-
Individual contributors would create packages in dedicated poseidon folders in their user project directories, e.g. `/project1/user/xyz/poseidon/2018_Lamnidis_Fennoscandia`. That way, subfolders belong to individual maintainers and be writable only by them.
171-
172-
The poseidon admins would then link these packages into the official `/projects1/poseidon` repo, located on the HPC storage unit of the MPI-SHH, where we distinguish ancient and modern genotype data:
173-
174-
```
175-
/projects1/poseidon/ancient/…
176-
/projects1/poseidon/modern/…
153+
## 1.0.0
154+
- Creation of the package.
177155
```
