Skip to content

Error converting raw/bim to h5/matrix_header files #12

@saurabhbelsare

Description

@saurabhbelsare

Hi, I'm trying to run a toy example with GEPSi, but am running into an error at the first step, to convert my genotypes to h5 format the way GEPSi expects. Here is my toy vcf generated from an msprime simulation:

##fileformat=VCFv4.2
##source=tskit 0.4.1
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=10000>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	tsk_0indv	tsk_1indv	tsk_2indv	tsk_3indv	tsk_4indv	tsk_5indv	tsk_6indv	tsk_7indv	tsk_8indv	tsk_9indv
1	409	.	T	A	.	PASS	.	GT	0|1	0|0	0|1	0|0	0|0	0|0	0|0	0|0	1|0	0|0
1	752	.	C	T	.	PASS	.	GT	0|0	1|1	0|0	1|1	1|1	1|0	1|1	1|0	0|0	0|1
1	1043	.	C	A	.	PASS	.	GT	0|0	1|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0
1	1207	.	A	C	.	PASS	.	GT	0|0	1|0	0|0	0|1	0|0	1|0	0|0	0|0	0|0	0|0
1	1952	.	C	A	.	PASS	.	GT	1|0	0|0	0|0	0|0	0|0	0|1	0|0	0|0	0|0	0|0
1	2521	.	T	C	.	PASS	.	GT	0|0	0|0	0|0	0|1	0|0	1|0	0|0	0|0	0|0	0|0
1	2542	.	C	T	.	PASS	.	GT	1|0	0|0	0|0	0|0	0|0	0|1	0|0	0|0	0|0	0|0
1	2707	.	C	T	.	PASS	.	GT	0|0	0|1	0|0	1|0	1|1	0|0	1|1	1|0	0|0	0|1
1	4352	.	C	T	.	PASS	.	GT	0|0	0|0	1|0	0|0	0|0	0|0	0|0	0|1	0|1	1|0
1	5416	.	T	A	.	PASS	.	GT	0|0	1|1	0|0	1|1	1|1	1|0	1|1	1|0	0|0	0|1
1	6656	.	A	G	.	PASS	.	GT	1|1	0|0	1|1	0|0	0|0	0|1	0|0	0|1	1|1	1|0
1	7850	.	T	A	.	PASS	.	GT	0|0	0|0	1|0	0|0	0|0	0|0	0|0	0|1	0|1	1|0
1	9263	.	A	T	.	PASS	.	GT	0|0	1|1	0|0	1|1	1|1	1|0	1|1	1|0	0|0	0|1

I converted this vcf to .raw and .bim files using both, plink1.9 and plink2 with these commands:

plink --vcf output.vcf --recode A --out output
plink --vcf output.vcf --make-just-bim --out output

I try to convert this to the h5/matrix_header format using the following command:

gepsi genotype --data_path ./ --matrix_name output.raw --snplist_name output.bim --ignore_gene_map --features "NONE" --annotation_name NONE

However, I'm getting errors at this step. They are different errors for plink1.9 and plink2.
The error while trying to convert raw/bim files generated from plink1.9 is

OverflowError: can't convert negative value to hsize_t
Exception ignored in: 'tables.utilsextension.malloc_dims'
Traceback (most recent call last):
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/tables/carray.py", line 250, in _g_create_common
    self._v_objectid = self._create_carray(self._v_new_title)
OverflowError: can't convert negative value to hsize_t
OverflowError: can't convert negative value to hsize_t
Exception ignored in: 'tables.utilsextension.malloc_dims'
Traceback (most recent call last):
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/tables/carray.py", line 250, in _g_create_common
    self._v_objectid = self._create_carray(self._v_new_title)
OverflowError: can't convert negative value to hsize_t
Traceback (most recent call last):
  File "/home/sbelsare/software/anaconda3/envs/gepsi/bin/gepsi", line 8, in <module>
    sys.exit(main())
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/scripts/main.py", line 22, in main
    data = genotype_sim.simulate_genotype()
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/scripts/genotype_simulator.py", line 36, in simulate_genotype
    return self.generate_genotype_file()
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/scripts/genotype_simulator.py", line 87, in generate_genotype_file
    sf, risk_alleles = self.script_result_clean()
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/scripts/genotype_simulator.py", line 136, in script_result_clean
    array_c = f.create_earray(f.root, 'data', tables.IntCol(), (0,num_feat), expectedrows=self.num_patients,filters=tables.Filters(complib='zlib', complevel=1))
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/tables/file.py", line 1380, in create_earray
    ptobj = EArray(parentnode, name,
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/tables/earray.py", line 158, in __init__
    super(EArray, self).__init__(parentnode, name, atom, shape, title,
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/tables/carray.py", line 219, in __init__
    super(Array, self).__init__(parentnode, name, new, filters,
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/tables/leaf.py", line 286, in __init__
    super(Leaf, self).__init__(parentnode, name, _log)
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/tables/node.py", line 264, in __init__
    self._v_objectid = self._g_create()
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/tables/earray.py", line 182, in _g_create
    return self._g_create_common(self._v_expectedrows)
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/tables/carray.py", line 250, in _g_create_common
    self._v_objectid = self._create_carray(self._v_new_title)
  File "tables/hdf5extension.pyx", line 1366, in tables.hdf5extension.Array._create_carray
tables.exceptions.HDF5ExtError: HDF5 error back trace

  File "H5S.c", line 1507, in H5Screate_simple
    invalid dataspace information

End of HDF5 error back trace

Problems creating the EArray.
Closing remaining open files:./genotype_100k_NONE.h5...done

and the error while trying to convert raw/bim files generated from plink2 is:

Traceback (most recent call last):
  File "/home/sbelsare/software/anaconda3/envs/gepsi/bin/gepsi", line 8, in <module>
    sys.exit(main())
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/scripts/main.py", line 22, in main
    data = genotype_sim.simulate_genotype()
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/scripts/genotype_simulator.py", line 36, in simulate_genotype
    return self.generate_genotype_file()
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/scripts/genotype_simulator.py", line 87, in generate_genotype_file
    sf, risk_alleles = self.script_result_clean()
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/scripts/genotype_simulator.py", line 139, in script_result_clean
    self.h5_append(df,f)
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/scripts/genotype_simulator.py", line 183, in h5_append
    f.root.data.append(a)
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/tables/earray.py", line 219, in append
    self._check_shape_append(nparr)
  File "/home/sbelsare/software/anaconda3/envs/gepsi/lib/python3.8/site-packages/tables/earray.py", line 196, in _check_shape_append
    raise ValueError(("the shapes of the appended object and the "
ValueError: the shapes of the appended object and the ``/data`` EArray differ in non-enlargeable dimension 1
Closing remaining open files:./genotype_100k_NONE.h5...done

Which version of plink should I use, and how should I fix the corresponding error to get the h5/matrix_header files?

Thank you!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions