1- # Parallel Data Concatenation for High Energy Physics Data Analysis
1+ # Parallel HDF5 Dataset Concatenation for High Energy Physics Data Analysis
22
3- This software package contains C++ programs for concatenating multiple HDF5
4- files into a single one by appending individual datasets one after another.
3+ This software package contains C++ programs for concatenating HDF5 datasets
4+ across multiple files into a single file by appending individual datasets one
5+ after another.
56
67## Input HDF5 Files
78* Each file contains multiple groups, each representing a "relational database
89 table".
9- * Each group contains multiple datasets, each representing a column of the
10+ * Each group contains multiple datasets. The number of datasets in a group can
11+ be different from others. Each dataset can be considered as a column of the
1012 database table.
1113* Datasets in the same group are 2D arrays sharing the same size of 1st
12- dimension (most significant). The size of 2nd dimension may be different.
13- * Some of the datasets are actually 1D arrays whose 2nd dimension if of size 1.
14- * Datasets can be of size zero, i.e. either dimension is of size 0.
15- * All the files have the same "schema", i.e. same structure of groups and
16- datasets.
17- * A dataset in an input file may be of different 1st dimension size from the
18- one in other files, while the 2nd dimension should be of the same size
19- across files.
14+ (most significant) dimension. The 2nd dimension size may be different.
15+ * Some of the datasets are actually 1D arrays whose 2nd dimension is of size 1.
16+ * Datasets can be of size zero, i.e. the 1st dimension being of size 0.
17+ * All the files have the same "schema", i.e. same numbers of groups and
18+ datasets with the same names.
19+ * The size of 1st dimension of a dataset in an input file may be different from
20+ the dataset with the same name in other files. The 2nd dimension should be of
21+ the same size across all input files.
2022
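The input rules above can be sketched as a small planning step. This is a minimal illustration under stated assumptions, not the code used by `ph5_concat`: it uses no HDF5 calls, and the types `Shape`, `FileSchema`, `Placement` and the functions `plan_concatenation` and `total_rows` are hypothetical names introduced here. It checks that every file has the same dataset names and the same 2nd dimension size per dataset, then computes where each file's rows would land in the concatenated dataset (an exclusive prefix sum over the 1st dimension sizes). It does not check that datasets within one group share the same 1st dimension size.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Shape of one 2D dataset; rows is the 1st (most significant) dimension.
struct Shape {
    std::size_t rows;
    std::size_t cols;
};

// Hypothetical in-memory stand-in for one input file:
// full dataset name (e.g. "/group/dataset") -> shape.
using FileSchema = std::map<std::string, Shape>;

// Where one input file's rows land in the concatenated dataset.
struct Placement {
    std::size_t row_offset;  // first output row written from this file
    std::size_t rows;        // number of rows contributed (may be 0)
};

// For every dataset, compute the placement of each input file in the output.
// Throws if the files do not share the same schema or 2nd dimension sizes.
std::map<std::string, std::vector<Placement>>
plan_concatenation(const std::vector<FileSchema>& files)
{
    std::map<std::string, std::vector<Placement>> plan;
    if (files.empty()) return plan;

    const FileSchema& first = files.front();
    for (const FileSchema& f : files) {
        if (f.size() != first.size())
            throw std::runtime_error("files differ in number of datasets");
    }

    for (const auto& entry : first) {
        const std::string& name = entry.first;
        const Shape& ref = entry.second;
        std::vector<Placement> placements;
        std::size_t offset = 0;  // running sum of 1st dimension sizes
        for (const FileSchema& f : files) {
            auto it = f.find(name);
            if (it == f.end())
                throw std::runtime_error("dataset missing: " + name);
            if (it->second.cols != ref.cols)
                throw std::runtime_error("2nd dimension mismatch: " + name);
            placements.push_back(Placement{offset, it->second.rows});
            offset += it->second.rows;
        }
        plan[name] = placements;
    }
    return plan;
}

// Size of the 1st dimension of the concatenated dataset.
std::size_t total_rows(const std::vector<Placement>& placements)
{
    if (placements.empty()) return 0;
    return placements.back().row_offset + placements.back().rows;
}
```

For example, if a dataset has 3, 0, and 5 rows in three input files, the computed row offsets are 0, 3, and 3, and the concatenated dataset has 8 rows. A zero-size dataset contributes no rows but still has to match the 2nd dimension size.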
2123## Software Requirements
2224* A C++ compiler that supports the ISO C++0x standard or higher
2325* MPI C and C++ compilers
24- * An HDF5 library version 1.10.5 and later built with parallel I/O feature enabled
26+ * An HDF5 library version 1.10.5 and later built with parallel I/O feature enabled
2527
2628## Instructions to Build
27290. If building from a git clone of this repository, then run command below first.
@@ -42,7 +44,7 @@ files into a single one by appending individual datasets one after another.
42442. Run command "make" to create the executable file named "ph5_concat"
4345
4446## Command to Run
45- * Run command and command-line options are:
47+ * Command-line options are:
4648 ```
4749 mpiexec -n <np> ./ph5_concat [-h|-q|-d|-r|-s|-p|-x] [-t num] [-m size] [-k name] [-z level] [-b size] [-o outfile] [-i infile]
4850
@@ -84,7 +86,27 @@ files into a single one by appending individual datasets one after another.
8486 read by all processes collectively (i.e. shared-file reads) and then all
8587 processes collectively write to the output file.
8688
87- ## An example output shown on screen from a run on Cori using 128 MPI processes.
89+ ## Sample input and output files
90+ * There are four sample input files provided in folder `examples`.
91+ + examples/sample_input_1.h5
92+ + examples/sample_input_2.h5
93+ + examples/sample_input_3.h5
94+ + examples/sample_input_4.h5
95+ * Sample run commands
96+ ```
97+ mpiexec -n 2 ./ph5_concat -i examples/sample_list.txt -o sample_output.h5
98+ mpiexec -n 4 ./ph5_concat -i examples/sample_list.txt -o sample_output.h5 -k evt
99+ ```
100+ The output shown on screen is stored in `examples/sample_stdout.txt`.
101+ * Sample output files
102+ + The output files from concatenating the 4 sample files are available in
103+ `examples/sample_output.h5` whose metadata dumped from command below is
104+ also available in `examples/sample_output.metadata`.
105+ ```
106+ h5dump -Hp sample_output.h5
107+ ```
108+
109+ ## An example timing output from a run on Cori using 128 MPI processes.
88110 ```
89111 % srun -n 128 ./ph5_concat -i ./nd_list_128.txt -o /scratch1/FS_1M_128/nd_out.h5 -b 512 -k evt -x
90112
@@ -141,25 +163,6 @@ files into a single one by appending individual datasets one after another.
141163 Close output files total: 0.4799
142164 End-to-end: 314.8095
143165 ```
144- ## Sample input and output files
145- * There are four sample input files provided in folder `examples`.
146- + examples/sample_input_1.h5
147- + examples/sample_input_2.h5
148- + examples/sample_input_3.h5
149- + examples/sample_input_4.h5
150- * Sample run commands
151- ```
152- mpiexec -n 2 ./ph5_concat -i examples/sample_list.txt -o sample_output.h5
153- mpiexec -n 4 ./ph5_concat -i examples/sample_list.txt -o sample_output.h5 -k evt
154- ```
155- The output shown on screen is stored in `examples/sample_stdout.txt`.
156- * Sample output files
157- + The output files from concatenating the 4 sample files are available in
158- `examples/sample_output.h5` whose metadata dumped from command below is
159- also available in `examples/sample_output.metadata`.
160- ```
161- h5dump -Hp sample_output.h5
162- ```
163166
164167## Questions/Comments:
165168