Load real UK Biobank data
When loading real UK Biobank data, you may encounter an error like this:
```
2018-08-01 23:53:52,219 - ukbrest - INFO - Working on /var/lib/phenotype/example15_00.csv
[...]
2018-08-01 23:53:52,378 - ukbrest - WARNING - No encodings.txt found, assuming utf-8
2018-08-01 23:53:52,530 - ukbrest - ERROR - Unicode decoding error when reading CSV file. Activate debug to show more details.
```
This means the CSV file is not encoded in UTF-8. To fix it, you need to specify the correct encoding
for that file in a text file named encodings.txt in your phenotype folder
(where you have your CSV/HTML files). For the example message above (where the file being loaded is example15_00.csv), the content of your encodings.txt file should be:
```
example15_00.csv latin1
```
The encodings.txt file has one line per CSV file. You only need to list the files with encoding
mismatches; for the rest, utf-8 is assumed.
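If you want to check which encoding each of your files will be read with before launching the load, the encodings.txt format is simple enough to parse yourself. This is a minimal sketch (not ukbrest's actual implementation); the helper names are made up here, and the utf-8 default for unlisted files follows the description above:

```python
# Sketch: resolve the encoding that would be used for each CSV file,
# based on an encodings.txt with lines like "example15_00.csv latin1".
# Files not listed fall back to utf-8, as described above.
from pathlib import Path


def load_encodings(encodings_file):
    """Parse encodings.txt into a {filename: encoding} dict."""
    encodings = {}
    for line in Path(encodings_file).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        filename, encoding = line.split()
        encodings[filename] = encoding
    return encodings


def encoding_for(csv_name, encodings):
    """Return the declared encoding for a CSV file, defaulting to utf-8."""
    return encodings.get(csv_name, "utf-8")
```

You could then pass the resolved encoding to whatever reader you use (for example, the `encoding` argument of `pandas.read_csv`) to reproduce a decoding failure outside of ukbrest.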
Once the loading process finishes, you can get all the data-field codings by connecting to the PostgreSQL database and exporting a list of codings:

```
\copy (select distinct coding from fields where coding is not null) to /tmp/all_codings.txt (format csv)
```

The file /tmp/all_codings.txt is just a list of coding numbers, one per line, that you can use
to download all coding files using the download_codings.sh script:
```
$ mkdir /tmp/codings && cd /tmp/codings
$ [UKBREST_CODE]/utils/scripts/download_codings.sh /tmp/all_codings.txt
```

Once you have downloaded all coding files (with names like coding_100329.tsv for coding 100329), place them in a
folder named /mnt/ukbrest/phenotype/codings and run this command:
```
docker run --rm --net ukb \
  -v /full/path/phenotype/:/var/lib/phenotype \
  -e UKBREST_DB_URI="postgresql://test:test@pg:5432/ukb" \
  hakyimlab/ukbrest --load-codings
```

You'll see output like this:
```
2018-07-09 19:19:50,353 - ukbrest - INFO - Loading codings from /var/lib/phenotype/codings
2018-07-09 19:19:51,121 - ukbrest - INFO - Processing coding file: coding_489.tsv
2018-07-09 19:19:51,190 - ukbrest - INFO - Processing coding file: coding_238.tsv
[...]
```
Once finished, your database will have a table called codings that lets you link your data with,
for instance, ICD10 codes (through data-coding 19 in this case).
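Outside the database, the downloaded coding files themselves can be used to translate raw data values into their meanings. Below is a minimal sketch of that idea; the two-column (coding, meaning) tab-separated layout with a header row is an assumption here, so check it against your own coding_*.tsv files, and the example ICD10 value is purely illustrative:

```python
# Sketch: build a lookup from a coding file's contents so raw values
# (e.g. ICD10 codes from data-coding 19) can be mapped to their meanings.
# Assumes a tab-separated file with a header row and columns
# "coding" and "meaning"; verify this against your coding_*.tsv files.
import csv
import io


def load_coding_map(tsv_text):
    """Build a {coding: meaning} dict from a coding TSV's contents."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["coding"]: row["meaning"] for row in reader}
```

The same mapping is what the codings table gives you server-side: joining your query results against it (on the coding value) attaches the human-readable meaning to each code.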
You can also load other types of sample data, like Sample-QC and relatedness (see
this page for more
information). To do that, create a folder
/mnt/ukbrest/phenotype/samples_data and copy the Sample-QC file (ukb_sqc_vZ.txt)
there, renaming it to samplesqc.txt (note that this file does not have a sample ID column, so you must add this
column using the .fam file from your application; read more about that
here). Also copy the relatedness file (ukbA_rel_sP.txt)
there with the name relatedness.txt. Although the names samplesqc.txt and relatedness.txt are not mandatory, you must
use the .txt extension so ukbrest can find the files and load them. Finally, run this command:
```
docker run --rm --net ukb \
  -v /mnt/ukbrest/genotype/:/var/lib/genotype \
  -v /mnt/ukbrest/phenotype/:/var/lib/phenotype \
  -e UKBREST_DB_URI="postgresql://test:test@pg:5432/ukb" \
  miltondp/ukbrest --load-samples-data --identifier-columns relatedness.txt:ID1,ID2
```

A new table will be created for each file, which you can later use in your queries.
With this method you can load other kinds of sample data as well. Just put the files in the samples_data folder with
a .txt extension and run the command above. You can specify the ID columns with --identifier-columns (the format
is file1.txt:column1 file2.txt:column2), skip some columns with --skip-columns (the format is
file1.txt:column1 file2.txt:column2,column3), and specify file separators with --separators
(the format is file1.txt:, file2.txt:;).
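To make the column-list format concrete, here is a small sketch that parses the `file1.txt:column1 file2.txt:column2,column3` syntax used by --identifier-columns and --skip-columns. It is an illustration of the format, not ukbrest's own argument parser:

```python
# Sketch: parse the "file1.txt:col1 file2.txt:col2,col3" format used by
# --identifier-columns and --skip-columns (not ukbrest's own parser).
# Note: --separators uses a similar "file:value" syntax, but its value is
# a single separator character rather than a comma-separated column list.
def parse_file_columns(spec):
    """Turn 'f1.txt:A f2.txt:B,C' into {'f1.txt': ['A'], 'f2.txt': ['B', 'C']}."""
    result = {}
    for item in spec.split():
        filename, columns = item.split(":", 1)
        result[filename] = columns.split(",")
    return result
```

For example, the docker command above passes `relatedness.txt:ID1,ID2`, which declares ID1 and ID2 together as the identifier columns of relatedness.txt.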
TODO
- Load data dictionary (for this I need to ask people to create a conda environment; and also move code I have into this repository).
- Mention ukbconv commands to get csv and html files.
- What happens when CSV files have overlapping data-fields.
- Move content from here, about authentication and SSL to this page.