Skip to content

Parallel assembly in JuliaFEM #219

@KristofferC

Description

@KristofferC

The following is a roadmap/outline for how threaded assembly could start to be implemented in JuliaFEM. The scope is limited to shared memory parallel assembly. It will be continuously updated.

Sample assembly code

In order to make it easier to examplify things, here is some smaple code for an assembly routine taken from JuliaFEM:

function assemble_elements!{B}(problem::Problem{Heat}, assembly::Assembly,
                               elements::Vector{Element{B}}, time::Float64)

    bi = BasisInfo(B)
    ndofs = length(B)
    Ke = zeros(ndofs, ndofs)
    fe = zeros(ndofs)

    for element in elements
        fill!(Ke, 0.0)
        fill!(fe, 0.0)
        for ip in get_integration_points(element)
            J, detJ, N, dN = element_info!(bi, element, ip, time)
            k = element("thermal conductivity", ip, time)
            Ke += ip.weight * k*dN'*dN * detJ
            if haskey(element, "heat source")
                f = element("heat source", ip, time)
                fe += ip.weight * N'*f * detJ
            end
        end

        gdofs = get_gdofs(problem, element)
        add!(assembly.K, gdofs, gdofs, Ke)
        add!(assembly.f, gdofs, fe)
    end
end

Goal

The goal of parallel assembly is, of course, to be able to exploit all threads on the computer to speed up the assembly. Linear solvers are in practice always able to use multiple threads so if the assembly process is serial, that might be a bottleneck in the FEM simulation. There are two main difficulties with parallel assembly, getting it correct and getting it fast.

Correctness

As always with parallel code, one has to be careful not to introduce data races, which means that two threads might write to the same location at the same time. This means that every write operation must be audited to see if there is potential for another thread to write to that location in the same threaded loop. In the assembly rouine above we have writes to two types of data structures which I call "local buffers" and "global buffers"

Local buffers

Local buffers are small enough in size such that allocating a copy of them on each thread is not
prohibitivly expensive. In the code above these would be bi, Ke and fe.
For a parallell assembly this would be allocated

To simplify things, I would suggest that each problem defines a struct

struct HeatAssemblyBuffers{B}
   bi::BasisInfo{B}
   Ke::Matrix{Float64}
   fe::Vector{Float64}
end

HeatAssembly(B, ndofs::Int) = HeatAssembly(BasisInfo(B), zeros(ndofs, ndofs), zeros(ndofs))

We could then generate buffers for all threads as

[HeatAssemblyBuffers(B, ndofs) for i in 1:Threads.nthreads()]

If each thread uses its own HeatAssemblyBuffer then there should be no data race when writing to local buffers.

Global buffers

Global buffers are data structure that are large enough that we don't want to allocate them in every thread.
This include the global stiffness matrix and the global "force" vector. If we do a parallel assembly naively by just threading the loop over the elements the data race is obvious. Two threads working on elements that share degrees of freedom will try to e.g. assemble into the global force vector at the same location at the same time. The solution to this is to make sure that this never happens.

One way of solving the above problem is "mesh coloring". By assigning a "color" to each element such that no element with the same color share degrees of freedom, we can assemble the elements of that color in parallel.
We then wait for all the threads to be done before moving on to the next color. There are many algorithms for coloring meshes but to start with, a greedy one that just goes element by element and take the first color that is allowed should be good enough. An implementation of this can be found at https://github.com/KristofferC/JuAFEM.jl/blob/master/src/Grid/coloring.jl.

Another way to solve the problem above is to not assemble into a global structure but have e.g. a COO structure for the global stifness that each thread push into, and then in the end, sum all those together.
This is, however, quite likely bad for performance.

Performance

Allocations local buffers

Julia contains a "stop the world" GC which means that every time the GC runs, all threads are stopped and wait for it to finish. If the assembly loop allocates a lot, and we run multiple threads, the GC will have to run quite often, likely removing much of the speedup in using multiple threads. It is therefore very important that the loop over the element doesn't allocate. I would argue, it should allocate 0.

As an exmaple, even if we rewrite the line

Ke += ip.weight * k*dN'*dN * detJ

to the more efficient

Ke .+= (ip.weight * k * detJ) .* (dN' * dN)

this still need to allocate a temporary for (dN' * dN).
There are some solutions to this:

  1. Use something like StaticArrays for the matrices which do not allocate.
  2. Use something like Tensors.jl which do not allocate
  3. Preallocate a buffer and use in-place matrix multiplication:
    mul!(dN_dN, dN', dN)
    Ke .+= (ip.weight * k * detJ) .* dN_dN

I would suggest trying to do point 1 or 2 because having to allocate so many in-place buffers can make the code hard to read and static data structures fits well for this problem.

Allocations global buffers

Using the coloring technique to assemble directly into a CSC sparse matrix means that there is zero allocation from this assembly step. The same matrix can be used for all time step. Using the method
where each thread has its own COO matrix to assemble into, we need to convert this to CSC and add all the contributions together for every assembly step which will likely be expensive and require a lot of memory.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions