Commit 5d395ab

description of dataset sizes
1 parent 445e198 commit 5d395ab

1 file changed: +37 -34 lines changed

README.md

Lines changed: 37 additions & 34 deletions
@@ -1,27 +1,29 @@
-# Parallel Data Concatenation for High Energy Physics Data Analysis
+# Parallel HDF5 Dataset Concatenation for High Energy Physics Data Analysis

-This software package contains C++ programs for concatenating multiple HDF5
-files into a single one by appending individual datasets one after another.
+This software package contains C++ programs for concatenating HDF5 datasets
+across multiple files into a single file by appending individual datasets one
+after another.

 ## Input HDF5 Files
 * Each file contains multiple groups, each representing a "relational database
 table".
-* Each group contains multiple datasets, each representing a column of the
+* Each group contains multiple datasets. The number of datasets in a group can
+be different from others. Each dataset can be considered as a column of the
 database table.
 * Datasets in the same group are 2D arrays sharing the same size of 1st
-dimension (most significant). The size of 2nd dimension may be different.
-* Some of the datasets are actually 1D arrays whose 2nd dimension if of size 1.
-* Datasets can be of size zero, i.e. either dimension is of size 0.
-* All the files have the same "schema", i.e. same structure of groups and
-datasets.
-* A dataset in an input file may be of different 1st dimension size from the
-one in other files, while the 2nd dimension should be of the same size
-across files.
+(most significant) dimension. The 2nd dimension size may be different.
+* Some of the datasets are actually 1D arrays whose 2nd dimension is of size 1.
+* Datasets can be of size zero, i.e. the 1st dimension being of size 0.
+* All the files have the same "schema", i.e. same numbers of groups and
+datasets with the same names.
+* The size of 1st dimension of a dataset in an input file may be different from
+the dataset with the same name in other files. The 2nd dimension should be of
+the same size across all input files.

 ## Software Requirements
 * A C++ compiler that support ISO C++0x standard or higher
 * MPI C and C++ compilers
-* An HDF5 library version 1.10.5 and later built with parallel I/O feature enabled
+* An HDF5 library version 1.10.5 and later built with parallel I/O feature enabled

 ## Instructions to Build
 0. If building from a git clone of this repository, then run command below first.
@@ -42,7 +44,7 @@ files into a single one by appending individual datasets one after another.
 2. Run command "make" to create the executable file named "ph5_concat"

 ## Command to Run
-* Run command and command-line options are:
+* Command-line options are:
 ```
 mpiexec -n <np> ./ph5_concat [-h|-q|-d|-r|-s|-p|-x] [-t num] [-m size] [-k name] [-z level] [-b size] [-o outfile] [-i infile]

@@ -84,7 +86,27 @@ files into a single one by appending individual datasets one after another.
 read by all processes collectively (i.e. shared-file reads) and then all
 processes collectively write to the output file.

-## An example output shown on screen from a run on Cori using 128 MPI processes.
+## Sample input and output files
+* There are four sample input files provided in folder `examples`.
+  + examples/sample_input_1.h5
+  + examples/sample_input_2.h5
+  + examples/sample_input_3.h5
+  + examples/sample_input_4.h5
+* Sample run commands
+```
+mpiexec -n 2 ./ph5_concat -i examples/sample_list.txt -o sample_output.h5
+mpiexec -n 4 ./ph5_concat -i examples/sample_list.txt -o sample_output.h5 -k evt
+```
+The output shown on screen is stored in `examples/sample_stdout.txt`.
+* Sample output files
+  + The output files from concatenating the 4 sample files are available in
+`examples/sample_output.h5` whose metadata dumped from command below is
+also available in `examples/sample_output.metadata`.
+```
+h5dump -Hp sample_output.h5
+```
+
+## An example timing output from a run on Cori using 128 MPI processes.
 ```
 % srun -n 128 ./ph5_concat -i ./nd_list_128.txt -o /scratch1/FS_1M_128/nd_out.h5 -b 512 -k evt -x

@@ -141,25 +163,6 @@ files into a single one by appending individual datasets one after another.
 Close output files total: 0.4799
 End-to-end: 314.8095
 ```
-## Sample input and output files
-* There are four sample input files provided in folder `examples`.
-  + examples/sample_input_1.h5
-  + examples/sample_input_2.h5
-  + examples/sample_input_3.h5
-  + examples/sample_input_4.h5
-* Sample run commands
-```
-mpiexec -n 2 ./ph5_concat -i examples/sample_list.txt -o sample_output.h5
-mpiexec -n 4 ./ph5_concat -i examples/sample_list.txt -o sample_output.h5 -k evt
-```
-The output shown on screen is stored in `examples/sample_stdout.txt`.
-* Sample output files
-  + The output files from concatenating the 4 sample files are available in
-`examples/sample_output.h5` whose metadata dumped from command below is
-also available in `examples/sample_output.metadata`.
-```
-h5dump -Hp sample_output.h5
-```

 ## Questions/Comments:
 * Sunwoo Lee <[email protected]>
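
Note on the dataset-size rules described in this change: the updated README says that datasets with the same name are appended along the 1st dimension, that the 2nd dimension must match across all input files, and that a dataset may have a 1st dimension of size 0. The following is a minimal in-memory sketch of those rules only. It is not code from ph5_concat and makes no HDF5 or MPI calls; the `Dataset2D` struct and the `concat_rows` function are hypothetical names introduced here for illustration, and the element type `double` is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for one 2D dataset read from one input file,
// stored in row-major order (rows * cols elements).
struct Dataset2D {
    std::size_t rows;          // 1st (most significant) dimension; may be 0
    std::size_t cols;          // 2nd dimension; 1 for datasets that are really 1D
    std::vector<double> data;  // element type is an assumption for this sketch
};

// Append the same-named dataset from each input file along the 1st dimension.
// Throws if the 2nd dimension is not identical across all parts.
Dataset2D concat_rows(const std::vector<Dataset2D>& parts) {
    Dataset2D out{0, 0, {}};
    if (parts.empty()) return out;

    out.cols = parts.front().cols;
    for (const Dataset2D& p : parts) {
        if (p.cols != out.cols)
            throw std::invalid_argument("2nd dimension differs across input files");
        if (p.data.size() != p.rows * p.cols)
            throw std::invalid_argument("buffer size does not match dimensions");
        // A part with rows == 0 contributes nothing but is still valid.
        out.rows += p.rows;
        out.data.insert(out.data.end(), p.data.begin(), p.data.end());
    }
    // Internal consistency check on the concatenated result.
    assert(out.data.size() == out.rows * out.cols);
    return out;
}
```

Because the buffers are row-major, appending along the 1st dimension reduces to appending the raw element ranges, which is why only the 2nd dimension needs to agree between files.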
