Load real UK Biobank data

Milton Pividori edited this page Aug 3, 2018 · 30 revisions

TODO list for this page:

  • Mention ukbconv commands to get csv and html files.
  • What happens when CSV files have overlapping data-fields.
  • Move content from here, about authentication and SSL to this page.

Unicode decoding errors

When loading real UK Biobank data, you may see this error:

2018-08-01 23:53:52,219 - ukbrest - INFO - Working on /var/lib/phenotype/example15_00.csv
[...]
2018-08-01 23:53:52,378 - ukbrest - WARNING - No encodings.txt found, assuming utf-8
2018-08-01 23:53:52,530 - ukbrest - ERROR - Unicode decoding error when reading CSV file. Activate debug to show more details.

This means the CSV file is not encoded in UTF-8, the encoding ukbrest assumes when no encodings.txt is found. To fix it, specify the correct encoding for that file in a text file named encodings.txt in your phenotype folder (where your CSV/HTML files are). For the example message above (where the file being loaded is example15_00.csv), the content of your encodings.txt file should be:

example15_00.csv latin1

The encodings.txt file has one line per CSV file. You only need to list the files with encoding mismatches; utf-8 is used for the rest.
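
If you have many CSV files, a small script can find the ones that are not valid UTF-8 and write the corresponding encodings.txt entries. The following is a minimal sketch and not part of ukbrest: the function names are hypothetical, and latin1 is only an assumed fallback (it accepts any byte sequence, so it cannot be verified automatically). Note that it overwrites any existing encodings.txt.

```python
from pathlib import Path


def find_non_utf8_csvs(phenotype_dir):
    """Return the names of CSV files in phenotype_dir that are not valid UTF-8."""
    bad = []
    for csv_path in sorted(Path(phenotype_dir).glob("*.csv")):
        try:
            # Read the file line by line; only decoding validity matters here.
            with open(csv_path, encoding="utf-8") as handle:
                for _ in handle:
                    pass
        except UnicodeDecodeError:
            bad.append(csv_path.name)
    return bad


def write_encodings_file(phenotype_dir, fallback_encoding="latin1"):
    """Write encodings.txt with one '<file name> <encoding>' line per bad file.

    fallback_encoding is an assumption: check that it really matches your data.
    """
    bad = find_non_utf8_csvs(phenotype_dir)
    if bad:
        lines = [f"{name} {fallback_encoding}\n" for name in bad]
        target = Path(phenotype_dir) / "encodings.txt"
        target.write_text("".join(lines), encoding="utf-8")
    return bad
```

For example, write_encodings_file("/mnt/ukbrest/phenotype") would return the list of affected file names and create encodings.txt only when at least one file fails to decode.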

Codings

Once the loading process finishes, you can get all the data field codings by connecting to the PostgreSQL database and exporting a list of codings:

\copy (select distinct coding from fields where coding is not null) to /mnt/all_codings.txt (format csv)

The file /mnt/all_codings.txt is just a list of coding numbers, one per line, that you can use to download all coding files using the download_codings.sh script:

$ mkdir /tmp/codings && cd /tmp/codings
$ [...]/misc/download_codings.sh /mnt/all_codings.txt

Once you have downloaded all the coding files (with names like coding_100329.tsv for coding 100329), place them in a folder named /mnt/ukbrest/phenotype/codings and run this command:

docker run --rm --net ukb \
  -v /full/path/phenotype/:/var/lib/phenotype \
  -e UKBREST_DB_URI="postgresql://test:test@pg:5432/ukb" \
  hakyimlab/ukbrest --load-codings

You'll see an output like this one:

2018-07-09 19:19:50,353 - ukbrest - INFO - Loading codings from /var/lib/phenotype/codings
2018-07-09 19:19:51,121 - ukbrest - INFO - Processing coding file: coding_489.tsv
2018-07-09 19:19:51,190 - ukbrest - INFO - Processing coding file: coding_238.tsv
[...]

Once it finishes, your database will have a table called codings that lets you link your data with, for instance, ICD10 codes (through data-coding 19 in this case).
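
As a sanity check before running --load-codings, you may want to confirm that every coding listed in all_codings.txt has a downloaded file. The sketch below is not part of ukbrest: the function name is hypothetical, and it only assumes the coding_<number>.tsv naming described above.

```python
from pathlib import Path


def missing_coding_files(codings_list, codings_dir):
    """Return coding numbers from codings_list that have no coding_<n>.tsv file."""
    codings_dir = Path(codings_dir)
    missing = []
    for line in Path(codings_list).read_text().splitlines():
        coding = line.strip()
        if not coding:
            continue  # skip blank lines
        if not (codings_dir / f"coding_{coding}.tsv").is_file():
            missing.append(coding)
    return missing


if __name__ == "__main__":
    # Paths taken from the steps above; adjust them to your setup.
    for coding in missing_coding_files(
        "/mnt/all_codings.txt", "/mnt/ukbrest/phenotype/codings"
    ):
        print(f"Missing file for coding {coding}")
```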

Loading other types of data

You can also load other kinds of sample data, such as Sample-QC and relatedness files (see this page for more information). To do that, create the folder /mnt/ukbrest/phenotype/samples_data and copy the Sample-QC file (ukb_sqc_vZ.txt) into it under the name samplesqc.txt. Note that this file does not have a sample ID column, so you must add one using the .fam file from your application (read more about that here). Then copy the relatedness file (ukbA_rel_sP.txt) into the same folder as relatedness.txt. The names samplesqc.txt and relatedness.txt are not mandatory, but the files must have the .txt extension so that ukbrest can find and load them. Finally, run this command:

docker run --rm --net ukb \
  -v /mnt/ukbrest/genotype/:/var/lib/genotype \
  -v /mnt/ukbrest/phenotype/:/var/lib/phenotype \
  -e UKBREST_DB_URI="postgresql://test:test@pg:5432/ukb" \
  miltondp/ukbrest --load-samples-data --identifier-columns relatedness.txt:ID1,ID2

A new table is created for each file, which you can later use in your queries. You can load other kinds of sample data the same way: put the files in the samples_data folder with the .txt extension and run the command above. You can specify the ID columns with --identifier-columns (format: file1.txt:column1 file2.txt:column2), skip columns with --skip-columns (format: file1.txt:column1 file2.txt:column2,column3), and set file separators with --separators (format: file1.txt:, file2.txt:;).
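
The Sample-QC preparation step mentioned in this section (adding a sample ID column from the .fam file) could be sketched as follows. This is not part of ukbrest, and the function name is hypothetical. It assumes that both files are whitespace-separated, that the Sample-QC rows are in the same order as the .fam rows, and that the individual ID is the second .fam column (the standard PLINK layout). It does not add a header row; check the linked documentation for the column names ukbrest expects.

```python
def add_sample_ids(fam_path, sqc_path, out_path):
    """Prepend the individual ID of each .fam row to the matching Sample-QC row.

    Assumes both files list the samples in the same order (row i of the .fam
    file describes the same sample as row i of the Sample-QC file).
    """
    with open(fam_path) as fam:
        # PLINK .fam columns: family ID, individual ID, father, mother, sex, phenotype.
        sample_ids = [line.split()[1] for line in fam if line.strip()]
    with open(sqc_path) as sqc:
        rows = [line.rstrip("\n") for line in sqc if line.strip()]
    if len(sample_ids) != len(rows):
        raise ValueError(
            f"Row count mismatch: {len(sample_ids)} .fam rows, {len(rows)} Sample-QC rows"
        )
    with open(out_path, "w") as out:
        for sample_id, row in zip(sample_ids, rows):
            out.write(f"{sample_id} {row}\n")
    return len(rows)
```

The function refuses to write anything if the row counts differ, since a silent misalignment would attach QC values to the wrong samples.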

Load withdrawals

TODO

Some useful SQL functions

TODO

Load data dictionary

TODO

load data dictionary (for this I need to ask people to create a conda environment; and also move code I have into this repository)
