High-performance matrix fields on arbitrary D-dimensional lattices in Julia.
- Per-site matrices (size `NC1×NC2`) stored in column-major layout: `(NC1, NC2, X, Y, Z, …)`
- MPI domain decomposition via a Cartesian communicator (halo width `nw`, periodic BCs)
- GPU-ready through JACC.jl (portable CPU/GPU kernels; CUDA/ROCm/Threads)
- Fast, allocation-free indexing helpers for kernels: `DIndexer`, `linearize`, `delinearize`, `shiftindices`
This package focuses on scalable, halo-exchange–based lattice algorithms with minimal allocations and clean multi-backend execution.
Applications: This package is designed to support large-scale simulations on structured lattices. A key application area is lattice QCD, where gauge fields and fermion fields are represented as matrix-valued objects on a multi-dimensional lattice. In future developments, LatticeMatrices.jl is planned to be integrated into Gaugefields.jl and LatticeDiracOperators.jl, providing the underlying data structures and linear algebra kernels for gauge and fermion dynamics.
Current limitation. Multi‑GPU execution and hybrid MPI+threads parallelism are experimental and not yet thoroughly tested; treat them as provisional.
pkg> add LatticeMatrices
Requirements:
- Julia ≥ 1.11
using LatticeMatrices
# Build an indexer for a D-dimensional lattice (1-based indices)
gsize = (16, 16, 16, 16) # global lattice size
d = DIndexer(gsize) # computes column-major strides internally (first index fastest)
# Convert between linear and multi-index (1-based)
L = linearize(d, (1, 1, 1, 1)) # -> 1
ix = delinearize(d, 4) # -> (4, 1, 1, 1) on this shape
# Apply shifts componentwise
p = shiftindices((4, 1, 1, 1), (1, 0, 0, 0)) # -> (5, 1, 1, 1)
Signatures:
struct DIndexer{D,dims,strides} end
DIndexer(dims_in::NTuple{D,<:Integer}) where {D}
DIndexer(dims_in::AbstractVector{<:Integer})
# 1-based linearization/delinearization (no heap allocs; GPU-friendly)
linearize(::DIndexer{D,dims,strides}, idx::NTuple{D,Int32})::Int32
delinearize(::DIndexer{D,dims,strides}, L::Integer, offset::Int32=0)::NTuple{D,Int32}
# elementwise shifting for index tuples
shiftindices(indices, shift)
The `offset` argument of `delinearize` is handy to map into halo regions, e.g. pass `offset = nw`.
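For intuition, the indexing arithmetic can be sketched in plain Julia. This is an illustrative reimplementation of the documented semantics (column-major, first index fastest, 1-based, with a uniform halo offset), not the package's actual `DIndexer` code:

```julia
# Sketch of column-major linearization, mirroring the documented DIndexer
# semantics. Illustrative only; not LatticeMatrices' internal implementation.
function linearize_colmajor(dims::NTuple{N,Int}, idx::NTuple{N,Int}) where {N}
    L = 0
    stride = 1
    for d in 1:N
        L += (idx[d] - 1) * stride  # first dimension varies fastest
        stride *= dims[d]
    end
    return L + 1                    # back to 1-based
end

# Inverse map; `offset` is added to every component, e.g. offset = nw to
# land inside the halo-padded local array.
function delinearize_colmajor(dims::NTuple{N,Int}, L::Integer, offset::Int=0) where {N}
    r = L - 1
    idx = Int[]
    for n in dims
        push!(idx, r % n + 1 + offset)
        r ÷= n
    end
    return Tuple(idx)
end
```

For example, `delinearize_colmajor((16, 16, 16, 16), 4)` recovers `(4, 1, 1, 1)`, matching the `delinearize` example above.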
The core container stores a halo-padded array on each rank and manages halo exchange without MPI derived datatypes (faces are packed into contiguous buffers).
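The idea of packing faces into contiguous buffers (rather than relying on MPI derived datatypes) can be sketched for one direction of a halo-padded array. The helper name and layout assumptions here are illustrative, not the package's internals:

```julia
# Illustrative sketch: pack the +X interior face of a halo-padded array
# laid out as (NC1, NC2, X, Y) into a contiguous send buffer, as one would
# before an MPI.Isend. Hypothetical helper, not LatticeMatrices' code.
function pack_plus_face(A::Array{T,4}, nw::Int) where {T}
    NC1, NC2, Nxh, Ny = size(A)   # Nxh includes 2nw halo cells along X
    Nx = Nxh - 2nw                # interior extent along X
    buf = Array{T}(undef, NC1, NC2, nw, Ny)
    # Interior X indices run nw+1 .. nw+Nx; the last nw interior slices
    # (Nx+1 .. Nx+nw) are what the +X neighbor needs for its -X halo.
    buf .= @view(A[:, :, Nx+1:Nx+nw, :])
    return buf
end
```

Because the buffer is a dense `Array`, it can be handed to MPI as-is, which is the point of packing: no strided datatype description is needed.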
using LatticeMatrices, MPI, JACC, LinearAlgebra
JACC.@init_backend
MPI.Init()
dim = 4
gsize = ntuple(_ -> 16, dim) # global spatial size per dimension
nw = 1 # ghost width
NC = 3 # per-site matrix size (NC×NC)
# Choose a Cartesian process grid (PEs) of length `dim`
nprocs = MPI.Comm_size(MPI.COMM_WORLD)
n1 = max(nprocs ÷ 2, 1)
PEs = ntuple(i -> i == 1 ? n1 : (i == 2 ? nprocs ÷ n1 : 1), dim)
# Construct an empty lattice matrix (device array via JACC.zeros)
M = LatticeMatrix(NC, NC, dim, gsize, PEs; nw, elementtype=ComplexF64)
# Or initialize from an existing array (broadcast to ranks)
A = rand(ComplexF64, NC, NC, gsize...)
M2 = LatticeMatrix(A, dim, PEs; nw)
# Halo exchange across all spatial dimensions
set_halo!(M)
# Global gather helpers (host reconstruction on rank 0)
G = gather_matrix(M; root=0) # rank 0: Array(NC, NC, gsize...)
Gall = gather_and_bcast_matrix(M; root=0) # all ranks receive the same Array
Key type:
struct LatticeMatrix{D,T,AT,NC1,NC2,nw,DI} <: Lattice{D,T,AT}
nw::Int
phases::SVector{D,T} # per-direction phase (applied at wrap boundaries)
NC1::Int
NC2::Int
gsize::NTuple{D,Int}
cart::MPI.Comm # Cartesian communicator
coords::NTuple{D,Int} # 0-based Cartesian coords
dims::NTuple{D,Int} # process grid (PEs)
nbr::NTuple{D,NTuple{2,Int}} # neighbors (minus, plus)
A::AT # local array (NC1, NC2, X, Y, Z, …) with halos
buf::Vector{AT} # four face buffers per spatial dim
myrank::Int
PN::NTuple{D,Int} # local interior size per dim (no halos)
comm::MPI.Comm # original communicator
indexer::DI # DIndexer for global sizes
end
Constructors:
LatticeMatrix(NC1, NC2, dim, gsize, PEs;
nw=1, elementtype=ComplexF64, phases=ones(dim), comm0=MPI.COMM_WORLD)
LatticeMatrix(A, dim, PEs; nw=1, phases=ones(dim), comm0=MPI.COMM_WORLD)
- Layout: `(NC1, NC2, X, Y, Z, …)`; halos are the outer `nw` cells on each spatial dim.
- Phases: wrap-around phases per dimension (applied on the boundary faces during exchange).
- Exchange: `set_halo!(ls)` calls `exchange_dim!(ls, d)` for each spatial dimension `d`.
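For intuition, here is what such an exchange accomplishes along one dimension in the single-process case: the halos receive periodically wrapped interior faces, multiplied by the boundary phase. This is a sketch of the effect, not of `exchange_dim!`'s MPI implementation:

```julia
# Sketch: fill the X halos of a halo-padded array (NC1, NC2, X) on a single
# process with periodic wrapping, applying a phase on the wrapped faces.
# Illustrative of what exchange_dim! achieves, not its actual MPI code.
function fill_halo_x!(A::Array{T,3}, nw::Int, phase::T) where {T}
    Nxh = size(A, 3)
    Nx = Nxh - 2nw                       # interior extent along X
    # minus halo (1:nw) gets the last nw interior slices, times the phase
    A[:, :, 1:nw] .= phase .* @view(A[:, :, Nx+1:Nx+nw])
    # plus halo gets the first nw interior slices, times the phase
    A[:, :, Nx+nw+1:Nxh] .= phase .* @view(A[:, :, nw+1:2nw])
    return A
end
```

With `phase = 1` this is plain periodic wrapping; an antiperiodic direction would use `phase = -1`.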
Per-site matrix operations follow BLAS-like semantics. The test suite shows full coverage (plain/adjoint inputs, shifted views):
# Random per-site matrices
A1 = rand(ComplexF64, NC, NC, gsize...)
A2 = rand(ComplexF64, NC, NC, gsize...)
A3 = rand(ComplexF64, NC, NC, gsize...)
M1 = LatticeMatrix(NC, NC, dim, gsize, PEs; nw)
M2 = LatticeMatrix(A2, dim, PEs; nw)
M3 = LatticeMatrix(A3, dim, PEs; nw)
# Choose a site (using DIndexer + halos)
indexer = DIndexer(gsize)
L = 4
idx_halo = Tuple(delinearize(indexer, L, Int32(nw))) # with halo offset
idx_core = Tuple(delinearize(indexer, L, Int32(0))) # core (no halo)
# Reference (host) product at a single site:
a1 = A1[:, :, idx_core...]
a2 = A2[:, :, idx_core...]
a3 = A3[:, :, idx_core...]
mul!(a1, a2, a3)
# Lattice product (device-backed); updates M1.A at that site:
mul!(M1, M2, M3)
m1 = M1.A[:, :, idx_halo...]
@assert a1 ≈ m1
# Matrix exponential at each site (in-place):
expt!(M1, M2, 1)
m1 = M1.A[:, :, idx_halo...]
a1 = exp(a2)
@assert a1 ≈ m1
# Trace and sum over all sites (returns a scalar)
println(tr(M1))
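The scalar returned by `tr` corresponds, on the host, to the sum of per-site traces over the lattice. A reference computation against a gathered host array can be sketched as follows (the helper name is illustrative):

```julia
using LinearAlgebra

# Reference: sum of per-site traces of a host array laid out as
# (NC, NC, X, Y, ...). Globally, tr on a LatticeMatrix should match this
# value computed from the gathered array.
function sitetrace_sum(A::AbstractArray{T}) where {T}
    NC = size(A, 1)
    s = zero(T)
    for site in eachslice(reshape(A, NC, NC, :); dims=3)
        s += tr(site)   # trace of the NC×NC matrix at this site
    end
    return s
end
```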
Adjoints and shifted operands are supported via lightweight wrappers:
M3p = Shifted_Lattice(M3, (1, 0, 0, 0)) # shift M3 by +1 along X (periodic)
mul!(M1, M2', M3p) # all combinations are covered in the tests:
# (A, B, C), (A, B', C), (A, B, C'), etc.
Convenience:
# Reduced sums (interior region only)
s = allsum(M) # MPI.Reduce to root (returns the global sum on rank 0)

using LatticeMatrices, MPI, JACC, LinearAlgebra
JACC.@init_backend
MPI.Init()
dim = 2
gsize = (8, 8)
NC = 3
PEs = (2, 2) # process grid (2×2)
M1 = LatticeMatrix(NC, NC, dim, gsize, PEs)
M2 = LatticeMatrix(rand(ComplexF64, NC, NC, gsize...), dim, PEs)
M3 = LatticeMatrix(rand(ComplexF64, NC, NC, gsize...), dim, PEs)
mul!(M1, M2, M3) # per-site product: M1 = M2 * M3

shift = (1, 0) # shift by +1 along X
M2s = Shifted_Lattice(M2, shift)
mul!(M1, M2s, M3) # M1 = (M2 shifted) * M3
The shift is applied with periodic wrapping across the global lattice size.
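The periodic wrap that a shifted operand implies can be emulated on a host array with `mod1`. A sketch with an illustrative helper name, assuming the `(NC, NC, X, Y)` layout:

```julia
# Sketch: read the per-site matrix at (x, y) of a host array after shifting
# by `shift` with periodic wrapping across the global size, emulating what
# a Shifted_Lattice operand does on the lattice. Illustrative helper only.
function shifted_site(A::Array{T,4}, x::Int, y::Int, shift::NTuple{2,Int}) where {T}
    Nx, Ny = size(A, 3), size(A, 4)
    xs = mod1(x + shift[1], Nx)   # wraps Nx+1 back to 1, 0 back to Nx
    ys = mod1(y + shift[2], Ny)
    return A[:, :, xs, ys]
end
```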
mul!(M1, M2', M3) # M1 = adjoint(M2) * M3
mul!(M1, M2, M3') # M1 = M2 * adjoint(M3)
mul!(M1, M2', M3') # M1 = adjoint(M2) * adjoint(M3)
All combinations of shifted and adjoint operands are supported and tested in test/runtests.jl.
(v0.3 and above; experimental) We provide Enzyme-based AD extensions and test cases. See test/adtest/ad.jl for a concrete comparison between
automatic differentiation and numerical differentiation using calc_action_loopfn. The loop body is factored
into a small helper function (_calc_action_step!), which makes Enzyme AD more reliable for loop-heavy code.
Example (runs the AD vs numerical comparison with calc_action_loopfn):
using Enzyme
using LatticeMatrices, MPI, JACC
JACC.@init_backend
MPI.Init()
include("test/adtest/ad.jl") # runs main() in the script
Note: the AD result here follows Enzyme's complex differentiation convention. For a complex variable
U = X + iY, the gradient reported by Enzyme is
dS/dU_ij = dS/dX_ij + i dS/dY_ij.
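This convention can be checked numerically on a scalar example: for S(U) = |U|² = X² + Y², the reported gradient should be dS/dX + i dS/dY = 2X + 2iY = 2U. A finite-difference sketch (no Enzyme required):

```julia
# Numerical check of the complex-gradient convention dS/dU = dS/dX + i dS/dY
# on the toy action S(U) = |U|^2. Finite differences only; Enzyme not needed.
S(u::ComplexF64) = abs2(u)

function numgrad(S, u::ComplexF64; h=1e-6)
    dx = (S(u + h) - S(u - h)) / (2h)           # ∂S/∂X (real direction)
    dy = (S(u + im * h) - S(u - im * h)) / (2h) # ∂S/∂Y (imaginary direction)
    return dx + im * dy
end
```

For S(U) = |U|², this finite-difference gradient agrees with the analytic value 2U, which is what Enzyme reports under the same convention.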
Exactly what test/runtests.jl does:
# CPU single process
julia --project -e 'using Pkg; Pkg.test("LatticeMatrices")'
# MPI (choose ranks and an MPI launcher)
mpiexec -n 4 julia --project test/runtests.jl
# With GPUs (example; make sure CUDA/ROCm works and select a JACC backend)
julia --project -e 'using JACC; JACC.@init_backend; using Pkg; Pkg.test()'
Internally, the tests:
- sweep `dim = 1:4` and `NC = 2:4`,
- construct `LatticeMatrix` objects on a Cartesian grid `PEs`,
- verify `mul!` for all nine combinations with/without adjoint and with/without shifts,
- use `DIndexer` to map between linear and multi-indices, including halo offsets.
# Indexing
DIndexer(::NTuple{D,<:Integer})
DIndexer(::AbstractVector{<:Integer})
linearize(::DIndexer{D,dims,strides}, ::NTuple{D,Int32})::Int32
delinearize(::DIndexer{D,dims,strides}, ::Integer, ::Int32=0)::NTuple{D,Int32}
shiftindices(indices, shift)
# Lattice
LatticeMatrix(NC1, NC2, dim, gsize, PEs; nw=1, elementtype=ComplexF64,
phases=ones(dim), comm0=MPI.COMM_WORLD)
LatticeMatrix(A, dim, PEs; nw=1, phases=ones(dim), comm0=MPI.COMM_WORLD)
set_halo!(ls)
exchange_dim!(ls, d::Int)
gather_matrix(ls; root=0)::Union{Array{T},Nothing}
gather_and_bcast_matrix(ls; root=0)::Array{T}
allsum(ls) # Reduce(SUM) to root over interior
# Lightweight wrappers
struct Shifted_Lattice{D,shift}; data::D; end
struct Adjoint_Lattice{D}; data::D; end
# Base.adjoint(::Lattice) and Base.adjoint(::Shifted_Lattice) return Adjoint_Lattice
MIT (see LICENSE).
Built on the excellent Julia HPC stack: MPI.jl, JACC.jl, and the Julia standard libraries.
LatticeMatrices.jl uses JACC.jl for performance-portable execution. Follow JACC's recommended flow to select one backend per project/session:
- Set a backend (writes/updates `LocalPreferences.toml` and adds the backend package):
julia> import JACC
julia> JACC.set_backend("cuda") # or "amdgpu" or "threads" (default)
- Initialize at top level so your code doesn't need backend-specific imports:
import JACC
JACC.@init_backend # must be at top-level scope
- Switching backends: re-run `JACC.set_backend("amdgpu")` (or `"threads"`, `"cuda"`) in the same project to switch; this updates `LocalPreferences.toml`. Restart your Julia session so extensions load for the new backend, then call `JACC.@init_backend` again.
Notes:
- Without calling `@init_backend`, using a non-`"threads"` backend will raise errors like `get_backend(::Val(:cuda))` when invoking JACC functions.
- `JACC.array` / `JACC.array_type()` help you stay backend-agnostic in your APIs.
References: JACC quick start and usage in the upstream README.