diff --git a/.gitignore b/.gitignore
Lines changed: 9 additions & 0 deletions
diff --git a/CMakeLists.txt b/CMakeLists.txt
Lines changed: 381 additions & 137 deletions
diff --git a/LICENSE b/LICENSE
Lines changed: 70 additions & 4 deletions
diff --git a/Makefile b/Makefile.LastResort
Makefile renamed to Makefile.LastResort
Lines changed: 7 additions & 2 deletions
diff --git a/NEWS b/NEWS
Lines changed: 91 additions & 0 deletions
@@ -6,4 +6,13 @@
 tests/*_test
 make.inc
 milc_interface/*
+*#*
 *.pyc
+tunecache.tsv
+profile.tsv
+config.log
+CMakeCache.txt
+CMakeFiles
+externals
+.tags*
+autom4te.cache/*
@@ -1,5 +1,5 @@
 
-Copyright (c) 2009-2016 QUDA Developers
+Copyright (c) 2009-2017, QUDA Developers
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
@@ -22,7 +22,7 @@ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 QUDA is supported by NVIDIA, and includes the NVIDIA-licensed
 libraries cub and generics.
 
-Copyright (c) 2011-2015, NVIDIA Corporation
+Copyright (c) 2011-2017, NVIDIA Corporation
 All rights reserved.
 
 Redistribution and use in source and binary forms, with or without
@@ -81,10 +81,12 @@ code segements are supplied under the following license:
   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 QUDA leverages Google Test for unit testing, contained within tests/gtest.h
-and tests/gtest-all.cc.  These files are supplied under the following
+and tests/gtest-all.cc and FindEigen from ceres-solver, contained in
+cmake/FindEigen.cmake.
+These files are supplied under the following
 license:
 
-  Copyright 2008, Google Inc.
+  Copyright 2008,2015, Google Inc.
   All rights reserved.
 
   Redistribution and use in source and binary forms, with or without
@@ -113,6 +115,62 @@ license:
   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
+QUDA uses the FindNVML.cmake script contained in cmake/FindNVML.cmake.
+This file is supplied under the following license.
+
+  Jiri Kraus, NVIDIA Corp (nvidia.com - jkraus)
+  Copyright (c) 2008 - 2014 NVIDIA Corporation.  All rights reserved.
+
+  This code is licensed under the MIT License.  See the FindNVML.cmake script
+  for the text of the license.
+
+  The MIT License
+
+  License for the specific language governing rights and limitations under
+  Permission is hereby granted, free of charge, to any person obtaining a
+  copy of this software and associated documentation files (the "Software"),
+  to deal in the Software without restriction, including without limitation
+  the rights to use, copy, modify, merge, publish, distribute, sublicense,
+  and/or sell copies of the Software, and to permit persons to whom the
+  Software is furnished to do so, subject to the following conditions:
+
+  The above copyright notice and this permission notice shall be included
+  in all copies or substantial portions of the Software.
+
+  THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
+  OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+  FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+  THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+  LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+  FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+  DEALINGS IN THE SOFTWARE.
+
+QUDA uses the branchlut integer-to-char conversion functions taken
+from itoa benchmark of Milo Yip (https://github.com/miloyip/itoa-benchmark).
+Located in include/uint_to_char.h, these routines are supplied under
+the following license.
+
+    Copyright (C) 2014 Milo Yip
+
+    Permission is hereby granted, free of charge, to any person
+    obtaining a copy of this software and associated documentation
+    files (the "Software"), to deal in the Software without
+    restriction, including without limitation the rights to use, copy,
+    modify, merge, publish, distribute, sublicense, and/or sell copies
+    of the Software, and to permit persons to whom the Software is
+    furnished to do so, subject to the following conditions:
+
+    The above copyright notice and this permission notice shall be
+    included in all copies or substantial portions of the Software.
+
+    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+    EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+    MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+    NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
+    HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
+    WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+    DEALINGS IN THE SOFTWARE.
 
 QUDA leverages LLNL's PMPI wrapper generator for generation of NVTX
 wrappers to enable the Visual Profiler to display MPI timeline
@@ -178,3 +236,11 @@ following license:
      those of the United States Government or Lawrence Livermore
      National Security, LLC, and shall not be used for advertising or
      product endorsement purposes.
+
+
+  Additional Notices
+
+  QUDA utilizes Maxim Milakov's int_fastdiv library for fast run-time
+  integer division.  This is distributed under the Apache License,
+  Version 2.0.  See declaration at top of int_fastdiv.h for license
+  specifics.
@@ -1,7 +1,12 @@
-all: make.inc lib tests
+all: 
+	@echo 'Please use cmake to build quda, see the README file.'
+	@exit 1
+
+old:	make.inc lib tests
 
 make.inc:
-	@echo 'Please run configure to create make.inc before building.'
+	@echo 'Please run autoconf and ./configure to create make.inc before building.'
+	@echo 'Note that this is DEPRECATED and you are on your own.'
 	@exit 1
 
 lib:
 
@@ -1,3 +1,94 @@
+Version 0.9.0 - 24 July 2018
+
+- Add support for CUDA 9.x: QUDA 0.9.0 is supported on CUDA 7.0-9.2.
+
+- Continued focus on optimization of multi-GPU execution, with
+  particular emphasis on Dslash scaling.  For more details on
+  optimizing multi-GPU performance, see
+  https://github.com/lattice/quda/wiki/Multi-GPU-Support
+
+- On systems that support it, QUDA now uses direct peer-to-peer
+  communication between GPUs within the same node.  The Dslash policy
+  autotuner will ascertain the optimal communication route to take,
+  whether it be to route through CPU memory, use DMA copy engines or
+  directly write the halo buffer to neighboring GPUs.
+
+- On systems that support it, QUDA will take advantage of GPU Direct
+  RDMA.  This is enabled through setting the environment variable
+  QUDA_ENABLE_GDR=1 which will augment the dslash tuning policies to
+  include policies using GPU-aware MPI to facilitate direct GPU-NIC
+  communication.  This can improve strong scaling by up to 3x.
+
+- Improved precision when using half precision (use rounding instead
+  of truncation when converting to/from float).
+
+- Add support for symmetric preconditioning for 4-d preconditioned
+  Shamir and Mobius Dirac operators.
+
+- Added initial support for multi-right-hand-side staggered Dirac
+  operator (treat the rhs index as a fifth dimension).
+
+- Added initial implementation of block CG linear solver.
+
+- Added BiCGStab(l) linear solver.  The parameter "l" corresponds to
+  the size of the space to perform GCR-style residual minimization.
+  This is typically much better behaved than BiCGStab for the Wilson
+  and Wilson-clover linear systems.
+
+- Initial version of adaptive multigrid fully implemented into QUDA.
+
+- Creation of multi-blas and multi-reduction framework.  This is
+  essential for high performance for pipelined, block and
+  communication-avoiding solvers that work on "matrices of vectors" as
+  opposed to "scalars of vectors".  The max tile size used by the
+  multi-blas framework is set by the QUDA_MAX_MULTI_BLAS_N cmake
+  parameter, which defaults to 4 for reduced compile time.  For
+  production use of such solvers, this should be increased to 8-16.
+
+- Optimization of multi-shift solver using multi-blas framework to permit
+  kernel fusion of all shift updates.
+
+- Complete rewrite and optimization of clover inversion, HISQ force
+  kernels, HISQ link fattening algorithms using accessors.
+
+- QUDA can now directly load/store from MILC's site structure array.
+  This removes the need to unpack and pack data prior to calling QUDA,
+  and dramatically reduces CPU overhead.
+
+- Removal of legacy data structures and kernels.  In particular, the
+  original single-GPU-only ASQTAD fermion force has been removed.
+
+- Implementation of STOUT fattening kernel.
+
+- Significant improvement to the cmake build system to improve
+  compilation speed and aid productivity.  In particular, QUDA now
+  supports being built as a shared library which greatly reduces link
+  time.
+
+- Autoconf and configure build system is no longer supported.
+
+- Automated unit testing of dslash_test and blas_test is now enabled
+  using ctest.
+
+- Added support for MPS, enabled through setting the environment
+  variable QUDA_ENABLE_MPS=1.  This allows GPUs to be oversubscribed by
+  multiple processes, which can improve overall job throughput.
+
+- Implemented self-profiler that builds on top of autotuning
+  framework.  Kernel profile is output to profile_n.tsv, where n
+  starts at 0 and is incremented with each call to saveProfile (which
+  dumps the profile to disk).  An equivalent algorithm policy profile
+  is output to profile_async_n.tsv, which contains policies such as a
+  complete dslash.  Filename prefix and path can be overridden using
+  the QUDA_PROFILE_OUTPUT_BASE environment variable.
+
+- Implemented simple tracing facility that dumps the flow of kernels
+  called through a single execution to trace.tsv.  Enabled with
+  environment variable QUDA_ENABLE_TRACE=1.
+
+- Multiple bug fixes and clean up to the library. Many of these are
+  listed here: https://github.com/lattice/quda/milestone/15?closed=1
+
 Version 0.8.0 - 1st February 2016
 
 - Removed all Tesla-generation GPU support from QUDA (sm_1x).  As a