= CLVisc: a (3+1)D viscous hydrodynamic program parallelized on GPU using OpenCL =
The program simulates the evolution of the strongly coupled quark-gluon plasma produced in relativistic heavy-ion collisions.
Please cite the following paper if you use CLVisc in a publication or reuse part of its code:
@article{Pang:2018zzo,
author = "Pang, Long-Gang and Petersen, Hannah and Wang, Xin-Nian",
title = "{Pseudorapidity distribution and decorrelation of
anisotropic flow within CLVisc hydrodynamics}",
year = "2018",
eprint = "1802.04449",
archivePrefix = "arXiv",
primaryClass = "nucl-th",
SLACcitation = "%%CITATION = ARXIV:1802.04449;%%"
}
Copyright (C) 2018, Long-Gang Pang
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
==Installation==
First of all, get CLVisc using:
git clone https://gitlab.com/snowhitiger/PyVisc.git
1. Install OpenCL
(1) On a MacBook Pro, OpenCL is supported by default; skip this step.
(2) On Linux with an Nvidia GPU, install CUDA, which ships with OpenCL: https://developer.nvidia.com/cuda-downloads
(3) On Linux with an AMD GPU, install the AMD APP SDK from http://developer.amd.com/amd-accelerated-parallel-processing-app-sdk/
(4) On a computing cluster with GPUs, ask the IT support staff about OpenCL/CUDA support.
2. Download and install the latest Anaconda from https://www.continuum.io/downloads
Important: please choose Python 2.7 (although most of the code works well with Python 3.*).
Notice: if you use Python 2.7 from another source, please also install matplotlib, h5py, pandas and sympy.
These 4 packages are delivered with Anaconda by default.
3. Install PyOpenCL
`conda install -c conda-forge pyopencl`
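Before running the hydro scripts, it can help to check that PyOpenCL actually sees an OpenCL device. The following is a minimal sketch, not part of CLVisc: the helper name `list_opencl_devices` is made up for this example, and it relies only on `pyopencl.get_platforms()` and `Platform.get_devices()`. It returns an empty list when PyOpenCL or an OpenCL driver is missing.

```python
def list_opencl_devices():
    """Return a list of (platform name, device name) pairs.

    Hypothetical helper for illustration, not part of CLVisc.
    Returns [] if PyOpenCL is not installed or no OpenCL platform is found.
    """
    try:
        import pyopencl as cl
    except ImportError:
        return []

    try:
        platforms = cl.get_platforms()
    except cl.Error:
        # Raised e.g. when no OpenCL driver (ICD) is installed
        return []

    found = []
    for platform in platforms:
        try:
            devices = platform.get_devices()
        except cl.Error:
            # A platform without usable devices
            devices = []
        for device in devices:
            found.append((platform.name, device.name))
    return found


if __name__ == "__main__":
    devices = list_opencl_devices()
    if not devices:
        print("No OpenCL device found; check the driver and PyOpenCL install.")
    for index, (platform_name, device_name) in enumerate(devices):
        print("%d: %s / %s" % (index, platform_name, device_name))
```

The index printed here is only the position in this list; whether it matches the gpu_id argument used by the CLVisc scripts is not guaranteed.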
At this point you can run ideal.py and visc.py in the pyvisc/ directory to run one ideal and one viscous hydro event.
The hydrodynamic evolution produces the evolution history and the freeze-out hypersurface and writes them to the result/
directory. To calculate smooth particle spectra or to sample hadrons from the hypersurface, one needs to
additionally install *cmake* and *gsl*.
4. Install cmake
(1) For MacBook,
Run in Terminal app:
`ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)" < /dev/null 2> /dev/null`
and press *enter/return* key. Wait for the command to finish.
Run:
`brew install cmake`
Done! You can now use cmake.
(2) For Linux, install cmake with the package manager of your distribution, e.g. `sudo apt-get install cmake` on Debian/Ubuntu.
5. Install gsl library
(1) For MacBook, `brew install gsl`
(2) For Linux, install gsl with the package manager of your distribution, e.g. `sudo apt-get install libgsl-dev` on Debian/Ubuntu.
6. Event-by-event hydro using trento initial condition
(1) Install trento
cd 3rdparty/trento_with_participant_plane/
mkdir build
cd build
cmake ..
make
make install
(2) Compile the MC sampling spectra calculation subroutines
cd sampler/
mkdir build
cd build
cmake ..
make
(3) Compile the smooth spectra calculation subroutines
cd CLSmoothSpec/
mkdir build
cd build
cmake ..
make
Notice: this step will fail on MacOS versions > 10.8 because Apple deprecated some OpenCL functions.
Please use this subroutine on a GPU cluster or a Linux machine.
This program will be updated in the future to account for the MacOS changes.
(4) In PyVisc/pyvisc/, modify the output path in ebe_trento.py and run
#python ebe_trento.py collision_sys centrality gpu_id num_of_events
python ebe_trento.py auau200 0_5 0 100
python ebe_trento.py pbpb2760 20_30 0 100
python ebe_trento.py pbpb5020 30_40 0 100
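To run several centrality classes in a row, the command line above can be assembled from Python. This is only a sketch: the helper names `build_ebe_command` and `run_centralities` are made up for this example, and it is an assumption that every centrality label shown is available for every collision system. By default it only prints the commands (`dry_run=True`).

```python
import subprocess
import sys


def build_ebe_command(collision_sys, centrality, gpu_id, num_of_events):
    """Build the argument list for one ebe_trento.py run.

    Argument order follows the usage line:
    python ebe_trento.py collision_sys centrality gpu_id num_of_events
    """
    return [sys.executable, "ebe_trento.py", collision_sys, centrality,
            str(gpu_id), str(num_of_events)]


def run_centralities(collision_sys, centralities, gpu_id, num_of_events,
                     dry_run=True):
    """Run (or, with dry_run=True, only print) one job per centrality class."""
    commands = [build_ebe_command(collision_sys, c, gpu_id, num_of_events)
                for c in centralities]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # Must be started from the directory that contains ebe_trento.py
            subprocess.check_call(command)
    return commands


if __name__ == "__main__":
    run_centralities("pbpb2760", ["0_5", "20_30", "30_40"],
                     gpu_id=0, num_of_events=100)
```

Passing `dry_run=False` runs the jobs one after another on the same GPU, so each job finishes before the next one starts.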
7. If you get the error "No module named 'mako'", install Mako using
pip install --user Mako
8. Modify cache_dir in cache.py if the cluster does not have a /tmp directory.
In pyopencl 2016.2 the function to edit starts near line 322 of
anaconda2/lib/python2.7/site-packages/pyopencl-2016.2-py2.7-linux-x86_64.egg/pyopencl/cache.py
def _create_built_program_from_source_cached(ctx, src, options_bytes,
        devices, cache_dir, include_path):
    from os.path import join

    if cache_dir is None:
        import appdirs
        # Original default, commented out:
        #cache_dir = join(appdirs.user_cache_dir("pyopencl", "pyopencl"),
        #        "pyopencl-compiler-cache-v2-py%s" % (
        #            ".".join(str(i) for i in sys.version_info),))
        # Replace this path with a writable directory on your cluster
        cache_dir = '/lustre/nyx/hyihp/lpang/tmp/'
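An alternative to editing the installed cache.py might be to choose a writable directory at run time. The sketch below uses only the standard library; the helper name `pick_cache_dir` is made up for this example and is not part of PyOpenCL or CLVisc. It is an assumption that the installed PyOpenCL version accepts a `cache_dir` keyword in `Program.build`, and CLVisc itself would need to be changed to pass it.

```python
import os
import tempfile


def pick_cache_dir(candidates):
    """Return the first candidate directory that exists (or can be created)
    and is writable; fall back to a fresh temporary directory otherwise.

    Hypothetical helper for illustration, not part of PyOpenCL or CLVisc.
    """
    for path in candidates:
        try:
            if not os.path.isdir(path):
                os.makedirs(path)
        except OSError:
            continue
        if os.access(path, os.W_OK):
            return path
    # tempfile honours the TMPDIR environment variable
    return tempfile.mkdtemp(prefix="pyopencl-cache-")


if __name__ == "__main__":
    cache_dir = pick_cache_dir([
        os.path.join(os.path.expanduser("~"), ".cache", "pyopencl"),
    ])
    print(cache_dir)
    # With PyOpenCL available, the directory could then be passed as
    #   program = cl.Program(ctx, src).build(cache_dir=cache_dir)
    # (assumes the installed version supports the cache_dir keyword)
```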
==Examples==
1. cd pyvisc
python ideal.py
2. cd pyvisc
python visc.py
Notice: visc.py has large GPU memory demands and can only run on GPUs with more than 5 GB of memory.
3. cd pyvisc
Modify ebe_trento.py to run event-by-event hydrodynamics with the Trento initial condition.
==The BSZ dependence==
For a 385*385*115 lattice, the running time per step (in seconds) for ideal and viscous hydrodynamics at different BSZ is:
BSZ           8      16     32     64     128
Ideal(s)      0.37   0.218  0.178  0.155  0.157
Visc(s)-GPU   3.12   1.65   1.17   1.01   1.17
Visc(s)-CPU   6.64   6.45   6.63   7.0    7.58
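The table can be read off more easily with a few lines of Python. This sketch only reuses the numbers listed above; the function and variable names are made up for this example.

```python
# Per-step running times in seconds, copied from the table above
bsz_values = [8, 16, 32, 64, 128]
timings = {
    "Ideal": [0.37, 0.218, 0.178, 0.155, 0.157],
    "Visc-GPU": [3.12, 1.65, 1.17, 1.01, 1.17],
    "Visc-CPU": [6.64, 6.45, 6.63, 7.0, 7.58],
}


def best_bsz(times):
    """Return (BSZ, time) for the smallest per-step time."""
    best_time = min(times)
    return bsz_values[times.index(best_time)], best_time


def gpu_speedups():
    """Ratio of CPU to GPU time for the viscous run at each BSZ."""
    return [cpu / gpu for cpu, gpu
            in zip(timings["Visc-CPU"], timings["Visc-GPU"])]


if __name__ == "__main__":
    for name in ("Ideal", "Visc-GPU", "Visc-CPU"):
        bsz, seconds = best_bsz(timings[name])
        print("%-9s fastest at BSZ=%d (%.3f s per step)" % (name, bsz, seconds))
    for bsz, ratio in zip(bsz_values, gpu_speedups()):
        print("BSZ=%-3d viscous GPU speedup over CPU: %.2f" % (bsz, ratio))
```

For these measurements, BSZ=64 gives the shortest time per step in both GPU rows, while the CPU run is fastest at BSZ=16.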
==The importance of concurrent reading from Global memory.==
Here I used NX=NY=NZ=201 for a test. In principle the time cost of visc_src_alongx,
visc_src_alongy and visc_src_alongz should be the same. However, the line profiler, run with
{{{kernprof -l -v visc.py }}}
reports 41.9% vs 38.4% vs 6.9% of the total time for the x, y and z directions.
Why is the difference so big? It can be explained by the order of the data in global memory,
where we use:
{{{
for (int i = 0; i < NX; i++ )
for (int j = 0; j < NY; j++ )
for (int k = 0; k < NZ; k++ ) {
    // k is the fastest-varying index: neighbours in z are adjacent in memory
    pimn[i*NY*NZ + j*NZ + k] = some_number;
}
}}}
The data is contiguous along the z direction, which makes reading from global memory into
local memory much faster along z than along x and y, because neighbouring work-items can access
neighbouring addresses concurrently (coalesced access).
Total time: 19.2475 s
File: visc.py
Function: IS_stepUpdate at line 165
Line # Hits Time Per Hit % Time Line Contents
==============================================================
165 @profile
166 def IS_stepUpdate(self, step):
167 #print "ideal update finished"
168 52 152 2.9 0.0 NX, NY, NZ, BSZ = self.cfg.NX, self.cfg.NY, self.cfg.NZ, self.cfg.BSZ
169
170 52 143954 2768.3 0.7 self.kernel_IS.visc_src_christoffel(self.queue, (NX*NY*NZ,), None,
171 52 134 2.6 0.0 self.d_IS_src, self.d_pi[step], self.ideal.d_ev[step],
172 52 581359 11180.0 3.0 self.ideal.tau, np.int32(step)).wait()
173
174 52 159298 3063.4 0.8 self.kernel_IS.visc_src_alongx(self.queue, (BSZ, NY, NZ), (BSZ, 1, 1),
175 52 143 2.8 0.0 self.d_IS_src, self.d_udx, self.d_pi[step], self.ideal.d_ev[step],
176 52 8055724 154917.8 41.9 self.eos_table, self.ideal.tau).wait()
177
178 #print "udx along x"
179
180 51 156991 3078.3 0.8 self.kernel_IS.visc_src_alongy(self.queue, (NX, BSZ, NZ), (1, BSZ, 1),
181 51 151 3.0 0.0 self.d_IS_src, self.d_udy, self.d_pi[step], self.ideal.d_ev[step],
182 51 7381515 144735.6 38.4 self.eos_table, self.ideal.tau).wait()
183
184 #print "udy along y"
185 51 157382 3085.9 0.8 self.kernel_IS.visc_src_alongz(self.queue, (NX, NY, BSZ), (1, 1, BSZ),
186 51 137 2.7 0.0 self.d_IS_src, self.d_udz, self.d_pi[step], self.ideal.d_ev[step],
187 51 1329880 26076.1 6.9 self.eos_table, self.ideal.tau).wait()
188
189 #print "udz along z"
190 51 302246 5926.4 1.6 self.kernel_IS.update_pimn(self.queue, (NX*NY*NZ,), None,
191 51 141 2.8 0.0 self.d_pi[3-step], self.d_goodcell, self.d_pi[1], self.d_pi[step],
192 51 82 1.6 0.0 self.ideal.d_ev[1], self.ideal.d_ev[2], self.d_udiff,
193 51 94 1.8 0.0 self.d_udx, self.d_udy, self.d_udz, self.d_IS_src,
194 51 978116 19178.7 5.1 self.eos_table, self.ideal.tau, np.int32(step)
195 ).wait()
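The memory layout described above can be made concrete by computing how far apart neighbouring cells are in the flattened array. This is a sketch with made-up helper names; it counts array elements only and does not model the number of components stored per cell.

```python
def flat_index(i, j, k, NY, NZ):
    """Flattened index of cell (i, j, k) for the for(i, j, k) storage order."""
    return i * NY * NZ + j * NZ + k


def neighbour_strides(NY, NZ):
    """Distance in array elements between neighbouring cells along x, y, z."""
    origin = flat_index(0, 0, 0, NY, NZ)
    return {
        "x": flat_index(1, 0, 0, NY, NZ) - origin,
        "y": flat_index(0, 1, 0, NY, NZ) - origin,
        "z": flat_index(0, 0, 1, NY, NZ) - origin,
    }


if __name__ == "__main__":
    # Grid size used in the test above: NX = NY = NZ = 201
    strides = neighbour_strides(NY=201, NZ=201)
    for axis in ("x", "y", "z"):
        print("stride along %s: %d elements" % (axis, strides[axis]))
```

Work-items in a work-group along z read adjacent elements, while work-items along y or x read elements that are NZ or NY*NZ apart, so their reads cannot be combined into one contiguous memory transaction.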
==Usage of vloadn to speed up global data access==
In kernel_visc.cl, one needs to load (pitt, pitx, pity, pitz) and (pixt, pixx, pixy, pixz) in src_alongx,
(pitt, pitx, pity, pitz) and (piyt, piyx, piyy, piyz) in src_alongy, and
(pitt, pitx, pity, pitz) and (pizt, pizx, pizy, pizz) in src_alongz.
Since the data are stored in for(i, j, k) order, loading data along z is faster than along y.
However, self.kernel_visc.kt_src_alongx is also much faster than self.kernel_visc.kt_src_alongy.
This may be caused by the contiguous addresses of pixx, pixy, pizx.
I tried to use vload4, but it does not speed up the code, which suggests that the compiler already did this optimization.
Total time: 8.60907 s
File: visc.py
Function: visc_stepUpdate at line 124
Line # Hits Time Per Hit % Time Line Contents
==============================================================
124 @profile
125 def visc_stepUpdate(self, step):
126 ''' Do step update in kernel with KT algorithm for visc evolution
127 Args:
128 gpu_ev_old: self.d_ev[1] for the 1st step,
129 self.d_ev[2] for the 2nd step
130 step: the 1st or the 2nd step in runge-kutta
131 '''
132 # upadte d_Src by KT time splitting, along=1,2,3 for 'x','y','z'
133 # input: gpu_ev_old, tau, size, along_axis
134 # output: self.d_Src
135 108 324 3.0 0.0 NX, NY, NZ, BSZ = self.cfg.NX, self.cfg.NY, self.cfg.NZ, self.cfg.BSZ
136 108 334097 3093.5 3.9 self.kernel_visc.kt_src_christoffel(self.queue, (NX*NY*NZ, ), None,
137 108 278 2.6 0.0 self.ideal.d_Src, self.ideal.d_ev[step],
138 108 162 1.5 0.0 self.d_pi[step], self.eos_table,
139 108 854615 7913.1 9.9 self.ideal.tau, np.int32(step)
140 ).wait()
141
142 108 296185 2742.5 3.4 self.kernel_visc.kt_src_alongx(self.queue, (BSZ, NY, NZ), (BSZ, 1, 1),
143 108 275 2.5 0.0 self.ideal.d_Src, self.ideal.d_ev[step],
144 108 166 1.5 0.0 self.d_pi[step], self.eos_table,
145 108 1313623 12163.2 15.3 self.ideal.tau).wait()
146
147 108 296962 2749.6 3.4 self.kernel_visc.kt_src_alongy(self.queue, (NX, BSZ, NZ), (1, BSZ, 1),
148 108 251 2.3 0.0 self.ideal.d_Src, self.ideal.d_ev[step],
149 108 167 1.5 0.0 self.d_pi[step], self.eos_table,
150 108 2435409 22550.1 28.3 self.ideal.tau).wait()
151
152 108 296962 2749.6 3.4 self.kernel_visc.kt_src_alongz(self.queue, (NX, NY, BSZ), (1, 1, BSZ),
153 108 261 2.4 0.0 self.ideal.d_Src, self.ideal.d_ev[step],
154 108 180 1.7 0.0 self.d_pi[step], self.eos_table,
155 108 1093978 10129.4 12.7 self.ideal.tau).wait()
156
157 # if step=1, T0m' = T0m + d_Src*dt, update d_ev[2]
158 # if step=2, T0m = T0m + 0.5*dt*d_Src, update d_ev[1]
159 # Notice that d_Src=f(t,x) at step1 and
160 # d_Src=(f(t,x)+f(t+dt, x(t+dt))) at step2
161 # output: d_ev[] where need_update=2 for step 1 and 1 for step 2
162 108 409861 3795.0 4.8 self.kernel_visc.update_ev(self.queue, (NX*NY*NZ, ), None,
163 108 277 2.6 0.0 self.ideal.d_ev[3-step], self.ideal.d_ev[1],
164 108 169 1.6 0.0 self.d_pi[0], self.d_pi[3-step],
165 108 147 1.4 0.0 self.ideal.d_Src,
166 108 1274723 11803.0 14.8 self.eos_table, self.ideal.tau, np.int32(step)).wait()