- Load the environment /projects/pace/condaEnvs/pacePreprocess2
- Load the module openmpi/4.0.4/gcc-8.4.0
- Python v3.7.6
- PyTorch v1.7.1
- numpy v1.19.1
- matplotlib v3.2.2
- mpi4py v3.0.3
- scikit-learn v0.21.0
- OpenMPI v4.0.4
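As a quick sanity check that the activated environment matches these versions, the following snippet prints the installed versions:

```python
# Print installed versions to confirm they match the pinned environment.
import torch, numpy, matplotlib, mpi4py, sklearn

print("PyTorch     :", torch.__version__)       # expect 1.7.1
print("numpy       :", numpy.__version__)       # expect 1.19.1
print("matplotlib  :", matplotlib.__version__)  # expect 3.2.2
print("mpi4py      :", mpi4py.__version__)      # expect 3.0.3
print("scikit-learn:", sklearn.__version__)     # expect 0.21.0
```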
The purpose of the tool is to perform a smart downselection of a large number of datapoints. Typically, large numerical simulations generate billions, or even trillions, of datapoints. However, there may be redundancy in the dataset, which unnecessarily inflates memory and computing requirements. Here, redundancy is defined as closeness in feature space. The method is called phase-space sampling.
bash run2D.sh: Example of downsampling a 2D combustion dataset. First, the downsampling is performed (mpiexec -np 4 python main_iterative.py input). Then, the loss function for each flow iteration is plotted (python plotLoss.py input). Finally, the samples are visualized (python visualizeDownSampled_subplots.py input). All figures are saved under the folder Figures.
To avoid managing packages yourself, run bash run2D_poetry.sh. This requires poetry.
The code is GPU+MPI-parallelized: a) the dataset is loaded and shuffled in parallel, b) the probability evaluation (the most expensive step) is done in parallel, c) the downsampling is done in parallel, and d) only the training is offloaded to a GPU, if one is available. The memory usage of the root processor is higher than that of the other processors, since it alone is in charge of the normalizing flow training and the sampling probability adjustment. To run the code in parallel, execute mpiexec -np num_procs python main_iterative.py input.
In the code, arrays whose names carry the suffix _ denote data distributed across the processors.
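As a minimal illustration of this convention (a sketch, not the package's actual loading routine), the snippet below scatters a dataset across ranks with mpi4py; data_ is the per-rank shard:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nprocs = comm.Get_size()

# Root loads the full dataset and splits it into one chunk per rank.
if rank == 0:
    data = np.load("data/combustion2DToDownsampleSmall.npy")
    chunks = np.array_split(data, nprocs)
else:
    chunks = None

# Each rank receives its own shard; the trailing "_" marks distributed arrays.
data_ = comm.scatter(chunks, root=0)
print(f"rank {rank}: {data_.shape[0]} samples")
```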
The computation of the nearest neighbor distance is parallelized using the scikit-learn implementation. It is accelerated on systems where hyperthreading is enabled (your laptop, but NOT the Eagle HPC).
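For reference, the mean nearest-neighbor distance (the mean dist diagnostic discussed below) can be computed with scikit-learn as follows; n_jobs=-1 enables the thread-level parallelism mentioned above. This is a sketch, not the package's internal routine:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mean_nn_dist(samples):
    """Mean distance from each sample to its nearest neighbor.

    Higher values indicate a more uniform spread in feature space.
    """
    nn = NearestNeighbors(n_neighbors=2, n_jobs=-1)  # n_jobs=-1: use all threads
    nn.fit(samples)
    dist, _ = nn.kneighbors(samples)
    return dist[:, 1].mean()  # column 0 is the distance of each point to itself
```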
When using GPU+MPI-parallelism on Eagle, you need to specify the number of MPI tasks (srun -n 36 python main_iterative.py)
When using MPI-parallelism alone on Eagle, you do not need to specify the number of MPI tasks (srun python main_iterative.py)
Running on a GPU only accelerates execution by ~30% for the examples provided here. Running with many MPI tasks linearly decreases both the execution time of the probability evaluation and the memory required per core.
Parallelization tested with up to 36 cores on Eagle.
Parallelization tested with up to 4 cores on MacOS Catalina v10.15.7.
We provide the data for running a 2D downsampling example. The data is located at data/combustion2DToDownsampleSmall.npy.
The dataset to downsample has size $N \times d$, where $N$ is the number of samples and $d$ the dimension of each sample.
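For example, the provided 2D dataset can be inspected with numpy:

```python
import numpy as np

# Load the example dataset shipped with the repository.
data = np.load("data/combustion2DToDownsampleSmall.npy")
print(data.shape)  # (N, d): one row per sample, one column per feature
```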
All hyperparameters can be controlled via an input file (see run2D.sh).
We recommend fixing the number of flow calculation iterations to 2.
When increasing the number of dimensions, we recommend adjusting the hyperparameters. A 2-dimensional example (input) and an 11-dimensional (highdim/input11D) example are provided to guide the user.
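For illustration only, an input file is a plain-text collection of settings; the key names below are hypothetical, so consult the input file used by run2D.sh for the actual names and values:

```
# Hypothetical input file: key names are illustrative, not the package's actual ones.
# Consult the "input" file shipped with run2D.sh for the real settings.
dataFile : data/combustion2DToDownsampleSmall.npy   # dataset to downsample
nFlowIter : 2                                       # number of flow iterations (2 recommended)
nSamples : 1000                                     # size of the downselected dataset
```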
It may not be obvious how to evaluate the uniformity of the obtained phase-space samples. During the code execution, a mean dist is displayed. This corresponds to the average distance between each data point and its nearest neighbor. The higher the distance, the more uniformly distributed the dataset. The distance is first shown for a random sampling case, and then displayed at every iteration. The mean distance should be higher than for the random sampling case. In addition, the second iteration should lead to a better mean distance than the first one. A warning message is displayed in case the second flow iteration did not improve the sampling. An error message is displayed in case the last flow iteration did not improve the sampling compared to the random case.
The computational cost associated with the nearest neighbor computations scales as $O(N \log N)$ with a tree-based search, where $N$ is the number of samples.
During training of the normalizing flow, the negative log-likelihood is displayed. The user should ensure that the normalizing flow has learned something useful about the distribution by verifying that the loss is close to convergence. The log of the loss is written to a csv file in the folder TrainingLog. The loss of the second training iteration should be higher than that of the first iteration. If this is not the case, or if more iterations are needed, the normalizing flow training may need to be run closer to convergence. A warning message will be issued in that case.
A script is provided to visualize the losses. Execute python plotLoss.py input where input is the name of the input file used to perform the downsampling.
Suppose one wants to downsample a dataset of size $N \times d$, where $N \gg 1$. First, the probability map of the dataset is estimated.
Next, the code uses the probability map to define a sampling probability that downselects samples uniformly spanning the feature space. The probability map is obtained by training a Neural Spline Flow, whose implementation was obtained from the Neural Spline Flow repository. The number of samples in the final dataset can be controlled via the input file.
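As a minimal sketch of the downselection step, assuming a simple inverse-probability acceptance rule (the actual code additionally adjusts the sampling probability iteratively; the function below is illustrative, not part of the package):

```python
import numpy as np

def downselect(prob, n_target, seed=0):
    """Accept each sample with probability inversely proportional to its
    estimated phase-space density, so retained samples spread uniformly.

    prob: density estimate of each sample (e.g. from the trained flow).
    n_target: desired number of samples after downselection (on average).
    """
    rng = np.random.default_rng(seed)
    weights = 1.0 / np.clip(prob, 1e-30, None)               # favor rare samples
    accept = np.clip(n_target * weights / weights.sum(), 0.0, 1.0)
    keep = rng.random(len(prob)) < accept
    return np.flatnonzero(keep)
```

Here downselect(prob, n_target=1000) would return the indices of the retained samples; the number of retained samples is close to n_target on average, provided the acceptance probabilities do not saturate at 1.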
For comparison, the result obtained with a random sampling is shown in the repository figures.
Input file is provided in highdim/input11D
The folder data-efficientML is NOT necessary for using the phase-space sampling package. It only contains the code necessary to reproduce the results shown in the paper:
Published version (open access)
Preprint version (open access)
@article{hassanaly2023UIPS,
  title={Uniform-in-Phase-Space Data Selection with Iterative Normalizing Flows},
  author={Hassanaly, Malik and Perry, Bruce A. and Mueller, Michael E. and Yellapantula, Shashank},
  journal={Data-centric Engineering},
  pages={e11},
  volume={4},
  year={2023},
  publisher={Cambridge University Press}
}
An overview presentation is available in documentation/methodOverview.pptx.
Malik Hassanaly: (malik.hassanaly!at!nrel!gov)





