Parallelization in WarpX
========================

When running a simulation, the domain is split into independent
rectangular sub-domains (called **grids**). This is the way AMReX, a core
component of WarpX, handles parallelization and mesh refinement. Furthermore,
this decomposition makes load balancing possible: each MPI rank typically computes
a few grids, and a rank with a lot of work can transfer one or several **grids**
to its neighbors.

A user does not specify this decomposition explicitly. Instead, the user gives hints to
the code, and the actual decomposition is determined at runtime, depending on
the parallelization. The main user-defined parameters are
``amr.max_grid_size`` and ``amr.blocking_factor``.
| 16 | + |
| 17 | +AMReX ``max_grid_size`` and ``blocking_factor`` |
| 18 | +----------------------------------------------- |
| 19 | + |
| 20 | +* ``amr.max_grid_size`` is the maximum number of points per **grid** along each |
| 21 | + direction (default ``amr.max_grid_size=32`` in 3D). |
| 22 | + |
| 23 | +* ``amr.blocking_factor``: The size of each **grid** must be divisible by the |
| 24 | + `blocking_factor` along all dimensions (default ``amr.blocking_factor=8``). |
| 25 | + Note that the ``max_grid_size`` also has to be divisible by ``blocking_factor``. |
| 26 | + |
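For illustration, here is a minimal sketch of how these two parameters appear
in a WarpX input file; the values are placeholders rather than recommendations,
chosen only to satisfy the divisibility constraint above::

   # each grid is at most 64 cells long in each direction
   amr.max_grid_size   = 64
   # every grid dimension is a multiple of 8 (and 64 is divisible by 8)
   amr.blocking_factor = 8
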
These parameters can have a dramatic impact on the code performance. Each
**grid** in the decomposition is surrounded by guard cells, thus increasing the
amount of data, computation and communication. Hence, a too-small
``max_grid_size`` may ruin the code performance.

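As a rough illustration of this overhead (the numbers below are hypothetical,
not WarpX defaults): a cubic grid of :math:`n^3` cells with :math:`g` guard
cells on each side carries data for :math:`(n+2g)^3` cells, i.e. a relative
overhead of

.. math::

   \left(\frac{n+2g}{n}\right)^3 - 1 .

For, say, :math:`n=32` and :math:`g=8` this is :math:`(48/32)^3 - 1 \approx 2.4`
(more than twice the useful data), while doubling the grid size to
:math:`n=64` reduces it to :math:`(80/64)^3 - 1 \approx 1.0`.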

On the other hand, a too-large ``max_grid_size`` is likely to result in a single
grid per MPI rank, thus preventing load balancing. By setting these two
parameters, the user gives the code some flexibility while avoiding
pathological behaviors.

For more information on this decomposition, see the
`Gridding and Load Balancing <https://amrex-codes.github.io/amrex/docs_html/ManagingGridHierarchy_Chapter.html>`__
page in the AMReX documentation.

For specific information on the dynamic load balancer used in WarpX, visit the
`Load Balancing <https://amrex-codes.github.io/amrex/docs_html/LoadBalancing.html>`__
page in the AMReX documentation.

The best values for these parameters strongly depend on a number of factors,
including numerical parameters:

* Algorithms used (Maxwell/spectral field solver, filters, order of the
  particle shape factor)

* Number of guard cells (which depends on the particle shape factor, the
  type and order of the Maxwell solver, the filters used, etc.)

* Number of particles per cell, and the number of species

as well as the MPI decomposition and the computer architecture used for the run:

* GPU or CPU

* Number of OpenMP threads

* Amount of high-bandwidth memory.

Below is a list of experience-based parameters
that were observed to give good performance on specific supercomputers.

Rule of thumb for 3D runs on NERSC Cori KNL
-------------------------------------------

For a 3D simulation with a few (1-4) particles per cell, using the FDTD Maxwell
solver on Cori KNL for a well load-balanced problem (in our case, a laser
wakefield acceleration simulation in a boosted frame in the quasi-linear
regime), the following set of parameters provided good performance:

* ``amr.max_grid_size=64`` and ``amr.blocking_factor=64`` so that the size of
  each grid is fixed to ``64**3`` (we are not using load balancing here).

* **8 MPI ranks per KNL node**, with ``OMP_NUM_THREADS=8`` (that is 64 threads
  per KNL node, i.e. 1 thread per physical core, and 4 cores left to the
  system).

* **2 grids per MPI rank**, *i.e.*, 16 grids per KNL node (a sketch combining
  these settings into input-file lines is given below).
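
Putting these hints together, a minimal sketch of the corresponding input-file
lines follows. The domain size and node count are hypothetical and chosen only
so that the numbers work out: on, say, 4 KNL nodes (32 MPI ranks), the domain
below decomposes into 64 grids of ``64**3`` cells, i.e. 2 grids per rank::

   # hypothetical 256^3 cell domain -> (256/64)^3 = 64 grids of 64^3 cells
   amr.n_cell          = 256 256 256
   # fix the size of every grid to exactly 64^3
   amr.max_grid_size   = 64
   amr.blocking_factor = 64

``OMP_NUM_THREADS=8`` is then exported in the job script, with 8 MPI ranks
requested per KNL node.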