Commit 49dec72

Merge pull request #715 from lattice/release/0.9.x
Release/0.9.x
2 parents 9425ca6 + 341c238 commit 49dec72

574 files changed

+106140
-98460
lines changed

.gitignore

Lines changed: 9 additions & 0 deletions
@@ -6,4 +6,13 @@
 tests/*_test
 make.inc
 milc_interface/*
+*#*
 *.pyc
+tunecache.tsv
+profile.tsv
+config.log
+CMakeCache.txt
+CMakeFiles
+externals
+.tags*
+autom4te.cache/*

CMakeLists.txt

Lines changed: 381 additions & 137 deletions
Large diffs are not rendered by default.

LICENSE

Lines changed: 70 additions & 4 deletions
@@ -1,5 +1,5 @@
 
-Copyright (c) 2009-2016 QUDA Developers
+Copyright (c) 2009-2017, QUDA Developers
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
@@ -22,7 +22,7 @@ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 QUDA is supported by NVIDIA, and includes the NVIDIA-licensed
 libraries cub and generics.
 
-Copyright (c) 2011-2015, NVIDIA Corporation
+Copyright (c) 2011-2017, NVIDIA Corporation
 All rights reserved.
 
 Redistribution and use in source and binary forms, with or without
@@ -81,10 +81,12 @@ code segements are supplied under the following license:
 OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 QUDA leverages Google Test for unit testing, contained within tests/gtest.h
-and tests/gtest-all.cc. These files are supplied under the following
+and tests/gtest-all.cc and FindEigen from ceres-solver, contained in
+cmake/FindEigen.cmake.
+These files are supplied under the following
 license:
 
-Copyright 2008, Google Inc.
+Copyright 2008,2015, Google Inc.
 All rights reserved.
 
 Redistribution and use in source and binary forms, with or without
@@ -113,6 +115,62 @@ license:
 (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
+QUDA uses the FINDNVML.cmake script contained in cmake/FindNVML.cmake.
+This file is supplied under the following license.
+
+Jiri Kraus, NVIDIA Corp (nvidia.com - jkraus)
+Copyright (c) 2008 - 2014 NVIDIA Corporation. All rights reserved.
+
+This code is licensed under the MIT License. See the FindNVML.cmake script
+for the text of the license.
+
+The MIT License
+
+License for the specific language governing rights and limitations under
+Permission is hereby granted, free of charge, to any person obtaining a
+copy of this software and associated documentation files (the "Software"),
+to deal in the Software without restriction, including without limitation
+the rights to use, copy, modify, merge, publish, distribute, sublicense,
+and/or sell copies of the Software, and to permit persons to whom the
+Software is furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included
+in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
+OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.
+
+QUDA uses the branchlut integer-to-char conversion functions taken
+from itoa benchmark of Milo Yip (https://github.com/miloyip/itoa-benchmark).
+Located in include/uint_to_char.h, these routines are supplied under
+the following license.
+
+Copyright (C) 2014 Milo Yip
+
+Permission is hereby granted, free of charge, to any person
+obtaining a copy of this software and associated documentation
+files (the "Software"), to deal in the Software without
+restriction, including without limitation the rights to use, copy,
+modify, merge, publish, distribute, sublicense, and/or sell copies
+of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
+HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
+WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.
 
 QUDA leverages LLNL's PMPI wrapper generator for generation of NVTX
 wrappers to enable the Visual Profiler to display MPI timeline
@@ -178,3 +236,11 @@ following license:
 those of the United States Government or Lawrence Livermore
 National Security, LLC, and shall not be used for advertising or
 product endorsement purposes.
+
+
+Additional Notices
+
+QUDA utilizes Maxim Milakov's int_fastdiv library for fast run-time
+integer division. This is distributed under the Apache License,
+Version 2.0. See declaration at top of int_fastdiv.h for license
+specifics.

Makefile renamed to Makefile.LastResort

Lines changed: 7 additions & 2 deletions
@@ -1,7 +1,12 @@
-all: make.inc lib tests
+all:
+	@echo 'Please use cmake to build quda, see the README file.'
+	@exit 1
+
+old: make.inc lib tests
 
 make.inc:
-	@echo 'Please run configure to create make.inc before building.'
+	@echo 'Please run autoconf and ./configure to create make.inc before building.'
+	@echo 'Note that this is DEPRECATED and you are on your own.'
 	@exit 1
 
 lib:
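
The stub `all` target above directs users to cmake. As a rough illustration only, the following Python sketch assembles the argument lists for an out-of-source configure and build. The helper name `cmake_commands` and the directory names are hypothetical; `QUDA_MAX_MULTI_BLAS_N` and its default of 4 come from the NEWS entry in this commit. Any other options the project may require are not shown, and the `-S`/`-B` flags assume a reasonably recent cmake.

```python
import shlex


def cmake_commands(source_dir, build_dir, max_multi_blas_n=4):
    """Return (configure, build) argument lists for an out-of-source cmake build.

    source_dir and build_dir are placeholder paths chosen by the caller.
    """
    configure = [
        "cmake",
        "-S", source_dir,   # source tree (assumes cmake with -S/-B support)
        "-B", build_dir,    # separate build tree
        f"-DQUDA_MAX_MULTI_BLAS_N={max_multi_blas_n}",
    ]
    build = ["cmake", "--build", build_dir]
    return configure, build


if __name__ == "__main__":
    # Hypothetical directory names; NEWS suggests 8..16 for production solvers.
    configure, build = cmake_commands("quda", "quda-build", max_multi_blas_n=8)
    print(shlex.join(configure))
    print(shlex.join(build))
```

The lists can be passed to `subprocess.run` unchanged; the sketch only prints them so that nothing is executed by accident.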

NEWS

Lines changed: 91 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,94 @@
1+
Version 0.9.0 - 24 July 2018
2+
3+
- Add support for CUDA 9.x: QUDA 0.9.0 is supported on CUDA 7.0-9.2.
4+
5+
- Continued focus on optimization of multi-GPU execution, with
6+
particular emphasis on Dslash scaling. For more details on
7+
optimizing multi-GPU performance, see
8+
https://github.com/lattice/quda/wiki/Multi-GPU-Support
9+
10+
- On systems that support it, QUDA now uses direct peer-to-peer
11+
communication between GPUs with in the same node. The Dslash policy
12+
autotuner will ascertain the optimal commuication route to take,
13+
whether it be to route through CPU memory, use DMA copy engines or
14+
directly write the halo buffer to neighboring GPUs.
15+
16+
- On systems that support it, QUDA will take advantage of GPU Direct
17+
RDMA. This is enabled through setting the environment variable
18+
QUDA_ENABLE_GDR=1 which will augment the dslash tuning policies to
19+
include policies using GPU-aware MPI to facilitate direct GPU-NIC
20+
communication. This can improve strong scaling by up to 3x.
21+
22+
- Improved precision when using half precision (use rounding instead
23+
of truncation when converting to/from float).
24+
25+
- Add support for symmetric preconditioning for 4-d preconditioned
26+
Shamir and Mobius Dirac operators.
27+
28+
- Added initial support for multi-right-hand-side staggered Dirac
29+
operator (treat the rhs index as a fifth dimension).
30+
31+
- Added initial implementation of block CG linear solver.
32+
33+
- Added BiCGStab(l) linear solver. The parameter "l" corresponds to
34+
the size of the space to perform GCR-style residual minimization.
35+
This is typically much better behaved than BiCGStab for the Wilson
36+
and Wilson-clover linear systems.
37+
38+
- Initial version of adaptive multigrid fully implemented into QUDA.
39+
40+
- Creation of multi-blas and multi-reduction framework, this is
41+
essential for high performance for pipelined, block and
42+
communication-avoiding solvers that work on "matrices of vectors" as
43+
opposed to "scalars of vectors". The max tile size used by the
44+
multi-blas framework is set by QUDA_MAX_MULTI_BLAS_N cmake
45+
parameter, which default to 4 for reduced compile time. For
46+
production use of such solvers, this should be increase to 8..16.
47+
48+
- Optimization of multi-shift solver using multi-blas framework to permit
49+
kernel fusion of all shift updates.
50+
51+
- Complete rewrite and optimization of clover inversion, HISQ force
52+
kernels, HISQ link fattening algorithms using accessors.
53+
54+
- QUDA can now directly load/store from MILC's site structure array.
55+
This removes the need to unpack and pack data prior to calling QUDA,
56+
and dramatically reduces CPU overhead.
57+
58+
- Removal of legacy data structures and kernels. In particular
59+
original single-GPU only ASQTAD fermion force has been removed.
60+
61+
- Implementation of STOUT fattening kernel.
62+
63+
- Significant improvement to the cmake build system to improve
64+
compilation speed and aid productivity. In particular, QUDA now
65+
supports being built as a shared library which greatly reduces link
66+
time.
67+
68+
- Autoconf and configure build system is no longer supported.
69+
70+
- Automated unit testing of dslash_test and blas_test are now enabled
71+
using ctest.
72+
73+
- Adds support for MPS, enabled through setting the environment
74+
variable QUDA_ENABLE_MPS=1. This allow GPUs to be oversubscribed by
75+
multiple processes, which can improve overall job throughput.
76+
77+
- Implemented self-profiler that builds on top of autotuning
78+
framework. Kernel profile is output to profile_n.tsv, where n=0,
79+
with n incremented with each call to saveProfile (which dumps the
80+
profile to disk). An equivalent algorithm policy profile is output
81+
to profile_async_n.tsv which contains policies such as a complete
82+
dslash. Filename prefix and path can be overridden using
83+
QUDA_PROFILE_OUTPUT_BASE environment variable.
84+
85+
- Implemented simple tracing facility that dumps the flow of kernels
86+
called through a single execution to trace.tsv. Enabled with
87+
environment variable QUDA_ENABLE_TRACE=1.
88+
89+
- Multiple bug fixes and clean up to the library. Many of these are
90+
listed here: https://github.com/lattice/quda/milestone/15?closed=1
91+
192
Version 0.8.0 - 1st February 2016
293

394
- Removed all Tesla-generation GPU support from QUDA (sm_1x). As a
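
Several of the 0.9.0 entries above are switched on through environment variables. A minimal sketch of preparing such an environment for a child process is given below. The variable names (QUDA_ENABLE_GDR, QUDA_ENABLE_MPS, QUDA_ENABLE_TRACE, QUDA_PROFILE_OUTPUT_BASE) are taken from the NEWS text; the helper `quda_environment` and the table `QUDA_FLAGS` are hypothetical, and how QUDA interprets any value other than the documented `1` is not covered here.

```python
import os

# Environment variables named in the 0.9.0 NEWS entry.
QUDA_FLAGS = {
    "gdr": "QUDA_ENABLE_GDR",      # GPU Direct RDMA dslash policies
    "mps": "QUDA_ENABLE_MPS",      # allow GPUs to be oversubscribed
    "trace": "QUDA_ENABLE_TRACE",  # dump kernel flow to trace.tsv
}


def quda_environment(base_env=None, gdr=False, mps=False, trace=False,
                     profile_base=None):
    """Return a copy of base_env (default: os.environ) with QUDA switches set.

    Hypothetical helper: only the variable names come from the NEWS text.
    """
    env = dict(os.environ if base_env is None else base_env)
    requested = {"gdr": gdr, "mps": mps, "trace": trace}
    for key, enabled in requested.items():
        if enabled:
            env[QUDA_FLAGS[key]] = "1"
    if profile_base is not None:
        # Overrides the filename prefix and path of the profile output.
        env["QUDA_PROFILE_OUTPUT_BASE"] = profile_base
    return env


if __name__ == "__main__":
    env = quda_environment(base_env={}, gdr=True, trace=True)
    for name in sorted(env):
        print(f"{name}={env[name]}")
```

The returned dictionary can be handed to `subprocess.run(..., env=env)` when launching an application linked against QUDA; the application name and launch command are left to the caller.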

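The half-precision entry in the 0.9.0 notes says that conversion to and from float now rounds instead of truncating. The sketch below illustrates why that matters, under loudly stated assumptions: it models half precision as 16-bit fixed point with a scale of 32767 applied to values in [-1, 1]. That storage model, the scale, and all function names are assumptions made for illustration; the excerpt does not describe QUDA's actual format.

```python
SCALE = 32767  # assumed scale: largest positive signed 16-bit integer


def to_fixed_truncate(x):
    """Convert a float in [-1, 1] to fixed point, truncating toward zero."""
    return int(x * SCALE)


def to_fixed_round(x):
    """Convert a float in [-1, 1] to fixed point, rounding to nearest."""
    return int(round(x * SCALE))


def from_fixed(q):
    """Convert a fixed-point integer back to float."""
    return q / SCALE


def max_round_trip_error(convert, samples):
    """Largest absolute error after float -> fixed -> float."""
    return max(abs(from_fixed(convert(x)) - x) for x in samples)


# Evenly spaced test values covering [-1, 1].
SAMPLES = [2.0 * i / 9999 - 1.0 for i in range(10000)]

if __name__ == "__main__":
    print("truncate:", max_round_trip_error(to_fixed_truncate, SAMPLES))
    print("round:   ", max_round_trip_error(to_fixed_round, SAMPLES))
```

In this model, rounding bounds the round-trip error by half a quantization step (0.5 / SCALE), whereas truncation can lose almost a full step and always errs toward zero.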