1 | | -zarr |
| 1 | +Zarr |
2 | 2 | ==== |
3 | 3 |
4 | | -A minimal implementation of chunked, compressed, N-dimensional arrays |
5 | | -for Python. |
6 | | - |
7 | | -* Source code: https://github.com/alimanfoo/zarr |
8 | | -* Download: https://pypi.python.org/pypi/zarr |
9 | | -* Release notes: https://github.com/alimanfoo/zarr/releases |
| 4 | +Zarr is a Python package providing an implementation of compressed, |
| 5 | +chunked, N-dimensional arrays, designed for use in parallel |
| 6 | +computing. See the `documentation <http://zarr.readthedocs.io/>`_ for |
| 7 | +more information. |
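The chunked-and-compressed idea can be illustrated in a few lines of plain NumPy: each chunk is compressed independently, so any one chunk can be read back without touching the rest. This is a conceptual sketch only, using zlib rather than zarr's actual Blosc-based storage layout:

```python
import zlib

import numpy as np

# Conceptual sketch: store a 2-D array as independently compressed chunks.
# This mimics the idea behind zarr, not its actual storage format.
a = np.arange(10000, dtype='i4').reshape(100, 100)
chunk = (50, 50)

chunks = {}
for i in range(0, a.shape[0], chunk[0]):
    for j in range(0, a.shape[1], chunk[1]):
        block = np.ascontiguousarray(a[i:i + chunk[0], j:j + chunk[1]])
        chunks[(i, j)] = zlib.compress(block.tobytes())

# Each chunk decompresses independently, enabling partial reads.
restored = np.frombuffer(zlib.decompress(chunks[(0, 0)]),
                         dtype='i4').reshape(50, 50)
assert np.array_equal(restored, a[:50, :50])
```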
10 | 8 |
11 | 9 | .. image:: https://travis-ci.org/alimanfoo/zarr.svg?branch=master |
12 | 10 | :target: https://travis-ci.org/alimanfoo/zarr |
13 | | - |
14 | | -Installation |
15 | | ------------- |
16 | | - |
17 | | -Installation requires NumPy and Cython to be pre-installed. Zarr can
18 | | -currently only be installed on Linux.
19 | | - |
20 | | -Install from PyPI:: |
21 | | - |
22 | | - $ pip install -U zarr |
23 | | - |
24 | | -Install from GitHub:: |
25 | | - |
26 | | - $ pip install -U git+https://github.com/alimanfoo/zarr.git@master |
27 | | - |
28 | | -Status |
29 | | ------- |
30 | | - |
31 | | -Experimental, proof-of-concept. This is alpha-quality software. Things |
32 | | -may break, change or disappear without warning. |
33 | | - |
34 | | -Bug reports and suggestions welcome. |
35 | | - |
36 | | -Design goals |
37 | | ------------- |
38 | | - |
39 | | -* Chunking in multiple dimensions |
40 | | -* Resize any dimension |
41 | | -* Concurrent reads |
42 | | -* Concurrent writes |
43 | | -* Release the GIL during compression and decompression |
44 | | - |
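The last goal, releasing the GIL, is what makes thread-based parallelism worthwhile. A rough illustration using zlib, which (like Blosc) releases the GIL while compressing, so worker threads genuinely overlap rather than taking turns; this is a standalone sketch, not zarr code:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

import numpy as np

# Compress independent chunks from worker threads. Because the
# compressor releases the GIL, the threads run in parallel.
data = np.arange(1_000_000, dtype='i4')
chunks = np.split(data, 10)

with ThreadPoolExecutor(max_workers=4) as pool:
    compressed = list(pool.map(lambda c: zlib.compress(c.tobytes()), chunks))

# Round-trip to check nothing was lost.
restored = np.concatenate(
    [np.frombuffer(zlib.decompress(c), dtype='i4') for c in compressed])
assert np.array_equal(restored, data)
```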
45 | | -Usage |
46 | | ------ |
47 | | - |
48 | | -Create an array: |
49 | | - |
50 | | -.. code-block:: python
51 | | - |
52 | | - >>> import numpy as np |
53 | | - >>> import zarr |
54 | | - >>> z = zarr.empty(shape=(10000, 1000), dtype='i4', chunks=(1000, 100)) |
55 | | - >>> z |
56 | | - zarr.ext.SynchronizedArray((10000, 1000), int32, chunks=(1000, 100)) |
57 | | - cname: blosclz; clevel: 5; shuffle: 1 (BYTESHUFFLE) |
58 | | - nbytes: 38.1M; cbytes: 0; initialized: 0/100 |
59 | | -
60 | | -Fill it with some data: |
61 | | - |
62 | | -.. code-block:: python
63 | | - |
64 | | - >>> z[:] = np.arange(10000000, dtype='i4').reshape(10000, 1000) |
65 | | - >>> z |
66 | | - zarr.ext.SynchronizedArray((10000, 1000), int32, chunks=(1000, 100)) |
67 | | - cname: blosclz; clevel: 5; shuffle: 1 (BYTESHUFFLE) |
68 | | - nbytes: 38.1M; cbytes: 2.0M; ratio: 19.3; initialized: 100/100 |
69 | | -
70 | | -Obtain a NumPy array by slicing: |
71 | | - |
72 | | -.. code-block:: python
73 | | - |
74 | | - >>> z[:] |
75 | | - array([[ 0, 1, 2, ..., 997, 998, 999], |
76 | | - [ 1000, 1001, 1002, ..., 1997, 1998, 1999], |
77 | | - [ 2000, 2001, 2002, ..., 2997, 2998, 2999], |
78 | | - ..., |
79 | | - [9997000, 9997001, 9997002, ..., 9997997, 9997998, 9997999], |
80 | | - [9998000, 9998001, 9998002, ..., 9998997, 9998998, 9998999], |
81 | | - [9999000, 9999001, 9999002, ..., 9999997, 9999998, 9999999]], dtype=int32) |
82 | | - >>> z[:100] |
83 | | - array([[ 0, 1, 2, ..., 997, 998, 999], |
84 | | - [ 1000, 1001, 1002, ..., 1997, 1998, 1999], |
85 | | - [ 2000, 2001, 2002, ..., 2997, 2998, 2999], |
86 | | - ..., |
87 | | - [97000, 97001, 97002, ..., 97997, 97998, 97999], |
88 | | - [98000, 98001, 98002, ..., 98997, 98998, 98999], |
89 | | - [99000, 99001, 99002, ..., 99997, 99998, 99999]], dtype=int32) |
90 | | - >>> z[:, :100] |
91 | | - array([[ 0, 1, 2, ..., 97, 98, 99], |
92 | | - [ 1000, 1001, 1002, ..., 1097, 1098, 1099], |
93 | | - [ 2000, 2001, 2002, ..., 2097, 2098, 2099], |
94 | | - ..., |
95 | | - [9997000, 9997001, 9997002, ..., 9997097, 9997098, 9997099], |
96 | | - [9998000, 9998001, 9998002, ..., 9998097, 9998098, 9998099], |
97 | | - [9999000, 9999001, 9999002, ..., 9999097, 9999098, 9999099]], dtype=int32) |
98 | | -
99 | | -Resize the array and add more data: |
100 | | - |
101 | | -.. code-block:: python
102 | | - |
103 | | - >>> z.resize(20000, 1000) |
104 | | - >>> z |
105 | | - zarr.ext.SynchronizedArray((20000, 1000), int32, chunks=(1000, 100)) |
106 | | - cname: blosclz; clevel: 5; shuffle: 1 (BYTESHUFFLE) |
107 | | - nbytes: 76.3M; cbytes: 2.0M; ratio: 38.5; initialized: 100/200 |
108 | | - >>> z[10000:, :] = np.arange(10000000, dtype='i4').reshape(10000, 1000) |
109 | | - >>> z |
110 | | - zarr.ext.SynchronizedArray((20000, 1000), int32, chunks=(1000, 100)) |
111 | | - cname: blosclz; clevel: 5; shuffle: 1 (BYTESHUFFLE) |
112 | | - nbytes: 76.3M; cbytes: 4.0M; ratio: 19.3; initialized: 200/200 |
113 | | -
114 | | -For convenience, an ``append()`` method is also available, which can be
115 | | -used to append data along any axis:
116 | | - |
117 | | -.. code-block:: python
118 | | - |
119 | | - >>> a = np.arange(10000000, dtype='i4').reshape(10000, 1000) |
120 | | - >>> z = zarr.array(a, chunks=(1000, 100)) |
121 | | - >>> z.append(a+a) |
122 | | - >>> z |
123 | | - zarr.ext.SynchronizedArray((20000, 1000), int32, chunks=(1000, 100)) |
124 | | - cname: blosclz; clevel: 5; shuffle: 1 (BYTESHUFFLE) |
125 | | - nbytes: 76.3M; cbytes: 3.6M; ratio: 21.2; initialized: 200/200 |
126 | | - >>> z.append(np.vstack([a, a]), axis=1) |
127 | | - >>> z |
128 | | - zarr.ext.SynchronizedArray((20000, 2000), int32, chunks=(1000, 100)) |
129 | | - cname: blosclz; clevel: 5; shuffle: 1 (BYTESHUFFLE) |
130 | | - nbytes: 152.6M; cbytes: 7.6M; ratio: 20.2; initialized: 400/400 |
131 | | -
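Conceptually, ``append()`` is just a resize along the given axis followed by an assignment into the newly added region. In plain NumPy terms (a sketch of the semantics, not zarr's implementation):

```python
import numpy as np

def append(arr, values, axis=0):
    """Sketch of append-along-axis semantics via resize-and-assign."""
    old = arr.shape[axis]
    new_shape = list(arr.shape)
    new_shape[axis] += values.shape[axis]
    out = np.empty(new_shape, dtype=arr.dtype)  # the "resized" array
    # Copy the existing data, then fill the newly added region.
    idx_old = [slice(None)] * arr.ndim
    idx_old[axis] = slice(0, old)
    out[tuple(idx_old)] = arr
    idx_new = [slice(None)] * arr.ndim
    idx_new[axis] = slice(old, None)
    out[tuple(idx_new)] = values
    return out

a = np.arange(12).reshape(3, 4)
b = append(a, a, axis=1)
assert b.shape == (3, 8)
```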
132 | | -Persistence |
133 | | ------------ |
134 | | - |
135 | | -Create a persistent array (data stored on disk): |
136 | | - |
137 | | -.. code-block:: python
138 | | -
139 | | - >>> path = 'example.zarr' |
140 | | - >>> z = zarr.open(path, mode='w', shape=(10000, 1000), dtype='i4', chunks=(1000, 100)) |
141 | | - >>> z[:] = np.arange(10000000, dtype='i4').reshape(10000, 1000) |
142 | | - >>> z |
143 | | - zarr.ext.SynchronizedPersistentArray((10000, 1000), int32, chunks=(1000, 100)) |
144 | | - cname: blosclz; clevel: 5; shuffle: 1 (BYTESHUFFLE) |
145 | | - nbytes: 38.1M; cbytes: 2.0M; ratio: 19.3; initialized: 100/100 |
146 | | - mode: w; path: example.zarr |
147 | | -
148 | | -There is no need to close a persistent array. Data are automatically flushed |
149 | | -to disk. |
150 | | - |
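Persistence of this kind typically means writing each compressed chunk to its own file, so flushing happens chunk by chunk and there is nothing to close at the array level. A schematic only; zarr's actual on-disk layout is described in PERSISTENCE.rst:

```python
import os
import tempfile
import zlib

import numpy as np

# Schematic per-chunk persistence: each chunk is compressed and written
# to its own file. Not zarr's actual on-disk format.
path = tempfile.mkdtemp(suffix='.zarr')
a = np.arange(100, dtype='i4').reshape(10, 10)

for i in range(0, 10, 5):  # chunk rows of height 5
    blob = zlib.compress(np.ascontiguousarray(a[i:i + 5]).tobytes())
    with open(os.path.join(path, f'{i}.chunk'), 'wb') as f:
        f.write(blob)  # flushed when this one chunk file is closed

# Reading back a single chunk only touches one file.
with open(os.path.join(path, '5.chunk'), 'rb') as f:
    chunk = np.frombuffer(zlib.decompress(f.read()),
                          dtype='i4').reshape(5, 10)
assert np.array_equal(chunk, a[5:10])
```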
151 | | -If you're working with really big arrays, try the 'lazy' option: |
152 | | - |
153 | | -.. code-block:: python
154 | | -
155 | | - >>> path = 'big.zarr' |
156 | | - >>> z = zarr.open(path, mode='w', shape=(1e8, 1e7), dtype='i4', chunks=(1000, 1000), lazy=True) |
157 | | - >>> z |
158 | | - zarr.ext.SynchronizedLazyPersistentArray((100000000, 10000000), int32, chunks=(1000, 1000)) |
159 | | - cname: blosclz; clevel: 5; shuffle: 1 (BYTESHUFFLE) |
160 | | - nbytes: 3.6P; cbytes: 0; initialized: 0/1000000000 |
161 | | - mode: w; path: big.zarr |
162 | | -
163 | | -See the `persistence documentation <PERSISTENCE.rst>`_ for more |
164 | | -details of the file format. |
165 | | - |
166 | | -Tuning |
167 | | ------- |
168 | | - |
169 | | -``zarr`` is optimised for accessing and storing data in contiguous
170 | | -slices of the same size as, or larger than, chunks. It is not, and
171 | | -probably never will be, optimised for single-item access.
172 | | - |
173 | | -Chunk sizes of 1 MB or more generally work well. The optimal chunk
174 | | -shape will depend on the correlation structure in your data.
175 | | - |
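A quick way to reason about chunk shape is to count how many chunks a typical access pattern touches: the fewer chunks a slice overlaps, the less decompression work is done. An illustrative helper (not part of zarr):

```python
def chunks_touched(sel, chunks):
    """Number of chunks overlapped by a half-open (start, stop)
    selection in each dimension, for the given chunk shape."""
    n = 1
    for (start, stop), c in zip(sel, chunks):
        n *= (stop - 1) // c - start // c + 1
    return n

# Reading 10 full rows of a (10000, 1000) array:
print(chunks_touched([(0, 10), (0, 1000)], (1000, 100)))  # 10 chunks
print(chunks_touched([(0, 10), (0, 1000)], (10, 1000)))   # 1 chunk
```

Row-shaped chunks suit row-wise reads; column-shaped chunks would suit column-wise reads, which is why the best chunk shape depends on your access pattern.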
176 | | -``zarr`` is designed for use in parallel computations working |
177 | | -chunk-wise over data. Try it with `dask.array |
178 | | -<http://dask.pydata.org/en/latest/array.html>`_. If using zarr in a
179 | | -multi-threaded program, set it to use Blosc in contextual mode::
180 | | - |
181 | | - >>> zarr.set_blosc_options(use_context=True) |
182 | | - |
183 | | -Acknowledgments |
184 | | ---------------- |
185 | | - |
186 | | -``zarr`` uses `c-blosc <https://github.com/Blosc/c-blosc>`_ internally for |
187 | | -compression and decompression and borrows code heavily from |
188 | | -`bcolz <http://bcolz.blosc.org/>`_. |
189 | | - |