@jeremylt
Member

fixes #1865

@jeremylt
Member Author

OK, this still needs testing, but it should make BP5/BP6-type problems, where the nodes are collocated with the quadrature points, much faster.
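
For anyone following along, a minimal 1D sketch of why collocation helps: a Lagrange basis evaluated at its own nodes is the identity matrix, so the interpolation contraction from nodes to quadrature points is a no-op when the two point sets coincide. The helper names `lagrange_interp_matrix` and `contract` are made up for illustration (they are not libCEED functions), and the 4-point Gauss-Lobatto-Legendre set is just an example.

```python
def lagrange_interp_matrix(nodes, points):
    """B[q][i] = value of the i-th Lagrange basis function (built on `nodes`) at points[q]."""
    matrix = []
    for x in points:
        row = []
        for i, xi in enumerate(nodes):
            value = 1.0
            for j, xj in enumerate(nodes):
                if j != i:
                    value *= (x - xj) / (xi - xj)
            row.append(value)
        matrix.append(row)
    return matrix


def contract(matrix, vector):
    """Apply a 1D basis matrix to a vector of nodal values."""
    return [sum(b * u for b, u in zip(row, vector)) for row in matrix]


# 4 Gauss-Lobatto-Legendre points on [-1, 1], used as both nodes and quadrature points
gll_4 = [-1.0, -(1.0 / 5.0) ** 0.5, (1.0 / 5.0) ** 0.5, 1.0]

B = lagrange_interp_matrix(gll_4, gll_4)
is_identity = all(
    abs(B[q][i] - (1.0 if q == i else 0.0)) < 1e-14
    for q in range(4)
    for i in range(4)
)

u = [0.5, -1.25, 2.0, 3.5]   # arbitrary nodal values
interpolated = contract(B, u)  # same values back, so this sweep can be skipped

print("B is identity:", is_identity)
print("interp changes nothing:", interpolated == u)
```

In 3D the same argument applies to each of the three 1D sweeps of the tensor-product interpolation.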

@jeremylt jeremylt force-pushed the jeremy/shared-collocated branch from 38020cd to 84326e1 Compare July 28, 2025 19:56
@jeremylt jeremylt force-pushed the jeremy/shared-collocated branch from 84326e1 to 0ccda8e Compare July 28, 2025 20:08
@jeremylt
Member Author

jeremylt commented Jul 28, 2025

A quick check on my laptop shows the expected outcome: doing half as many tensor contractions is faster.

After:

$ ./build/petsc-bps -ceed /gpu/cuda/gen -problem bp6

-- CEED Benchmark Problem 6 -- libCEED + PETSc --
  MPI:
    Hostname                                : taliensis
    Total ranks                             : 1
    Ranks per compute node                  : 1
  PETSc:
    PETSc Vec Type                          : seqcuda
  libCEED:
    libCEED Backend                         : /gpu/cuda/gen
    libCEED Backend MemType                 : device
  Mesh:
    Solution Order (P)                      : 2
    Quadrature  Order (Q)                   : 2
    Additional quadrature points (q_extra)  : 0
    Global nodes                            : 8
    Local Elements                          : 27
    Element topology                        : hexahedron
    Owned nodes                             : 8
    DoF per node                            : 3
  KSP:
    KSP Type                                : cg
    KSP Convergence                         : CONVERGED_RTOL
    Total KSP Iterations                    : 2
    Final rnorm                             : 1.227144e-30
  Performance:
    L2 Error                                : 8.049086e-16
    CG Solve Time                           : 0.000509553 (0.000509553) sec
    DoFs/Sec in CG                          : 0.0942002 (0.0942002) million

-- CEED Benchmark Problem 6 -- libCEED + PETSc --
  MPI:
    Hostname                                : taliensis
    Total ranks                             : 1
    Ranks per compute node                  : 1
  PETSc:
    PETSc Vec Type                          : seqcuda
  libCEED:
    libCEED Backend                         : /gpu/cuda/gen
    libCEED Backend MemType                 : device
  Mesh:
    Solution Order (P)                      : 3
    Quadrature  Order (Q)                   : 3
    Additional quadrature points (q_extra)  : 0
    Global nodes                            : 125
    Local Elements                          : 27
    Element topology                        : hexahedron
    Owned nodes                             : 125
    DoF per node                            : 3
  KSP:
    KSP Type                                : cg
    KSP Convergence                         : CONVERGED_RTOL
    Total KSP Iterations                    : 4
    Final rnorm                             : 5.373627e-15
  Performance:
    L2 Error                                : 2.222572e-01
    CG Solve Time                           : 0.000622976 (0.000622976) sec
    DoFs/Sec in CG                          : 2.4078 (2.4078) million

-- CEED Benchmark Problem 6 -- libCEED + PETSc --
  MPI:
    Hostname                                : taliensis
    Total ranks                             : 1
    Ranks per compute node                  : 1
  PETSc:
    PETSc Vec Type                          : seqcuda
  libCEED:
    libCEED Backend                         : /gpu/cuda/gen
    libCEED Backend MemType                 : device
  Mesh:
    Solution Order (P)                      : 4
    Quadrature  Order (Q)                   : 4
    Additional quadrature points (q_extra)  : 0
    Global nodes                            : 512
    Local Elements                          : 27
    Element topology                        : hexahedron
    Owned nodes                             : 512
    DoF per node                            : 3
  KSP:
    KSP Type                                : cg
    KSP Convergence                         : DIVERGED_ITS
    Total KSP Iterations                    : 5
    Final rnorm                             : 3.090721e-01
  Performance:
    L2 Error                                : 2.069256e-02
    CG Solve Time                           : 0.000727691 (0.000727691) sec
    DoFs/Sec in CG                          : 10.5539 (10.5539) million

-- CEED Benchmark Problem 6 -- libCEED + PETSc --
  MPI:
    Hostname                                : taliensis
    Total ranks                             : 1
    Ranks per compute node                  : 1
  PETSc:
    PETSc Vec Type                          : seqcuda
  libCEED:
    libCEED Backend                         : /gpu/cuda/gen
    libCEED Backend MemType                 : device
  Mesh:
    Solution Order (P)                      : 5
    Quadrature  Order (Q)                   : 5
    Additional quadrature points (q_extra)  : 0
    Global nodes                            : 1331
    Local Elements                          : 27
    Element topology                        : hexahedron
    Owned nodes                             : 1331
    DoF per node                            : 3
  KSP:
    KSP Type                                : cg
    KSP Convergence                         : DIVERGED_ITS
    Total KSP Iterations                    : 5
    Final rnorm                             : 1.302704e+00
  Performance:
    L2 Error                                : 8.785086e-02
    CG Solve Time                           : 0.000770822 (0.000770822) sec
    DoFs/Sec in CG                          : 25.9009 (25.9009) million

-- CEED Benchmark Problem 6 -- libCEED + PETSc --
  MPI:
    Hostname                                : taliensis
    Total ranks                             : 1
    Ranks per compute node                  : 1
  PETSc:
    PETSc Vec Type                          : seqcuda
  libCEED:
    libCEED Backend                         : /gpu/cuda/gen
    libCEED Backend MemType                 : device
  Mesh:
    Solution Order (P)                      : 6
    Quadrature  Order (Q)                   : 6
    Additional quadrature points (q_extra)  : 0
    Global nodes                            : 2744
    Local Elements                          : 27
    Element topology                        : hexahedron
    Owned nodes                             : 2744
    DoF per node                            : 3
  KSP:
    KSP Type                                : cg
    KSP Convergence                         : DIVERGED_ITS
    Total KSP Iterations                    : 5
    Final rnorm                             : 1.599026e+00
  Performance:
    L2 Error                                : 1.322905e-01
    CG Solve Time                           : 0.00101692 (0.00101692) sec
    DoFs/Sec in CG                          : 40.4751 (40.4751) million

-- CEED Benchmark Problem 6 -- libCEED + PETSc --
  MPI:
    Hostname                                : taliensis
    Total ranks                             : 1
    Ranks per compute node                  : 1
  PETSc:
    PETSc Vec Type                          : seqcuda
  libCEED:
    libCEED Backend                         : /gpu/cuda/gen
    libCEED Backend MemType                 : device
  Mesh:
    Solution Order (P)                      : 7
    Quadrature  Order (Q)                   : 7
    Additional quadrature points (q_extra)  : 0
    Global nodes                            : 4913
    Local Elements                          : 27
    Element topology                        : hexahedron
    Owned nodes                             : 4913
    DoF per node                            : 3
  KSP:
    KSP Type                                : cg
    KSP Convergence                         : DIVERGED_ITS
    Total KSP Iterations                    : 5
    Final rnorm                             : 1.935311e+00
  Performance:
    L2 Error                                : 2.045741e-01
    CG Solve Time                           : 0.00109257 (0.00109257) sec
    DoFs/Sec in CG                          : 67.4509 (67.4509) million

-- CEED Benchmark Problem 6 -- libCEED + PETSc --
  MPI:
    Hostname                                : taliensis
    Total ranks                             : 1
    Ranks per compute node                  : 1
  PETSc:
    PETSc Vec Type                          : seqcuda
  libCEED:
    libCEED Backend                         : /gpu/cuda/gen
    libCEED Backend MemType                 : device
  Mesh:
    Solution Order (P)                      : 8
    Quadrature  Order (Q)                   : 8
    Additional quadrature points (q_extra)  : 0
    Global nodes                            : 8000
    Local Elements                          : 27
    Element topology                        : hexahedron
    Owned nodes                             : 8000
    DoF per node                            : 3
  KSP:
    KSP Type                                : cg
    KSP Convergence                         : DIVERGED_ITS
    Total KSP Iterations                    : 5
    Final rnorm                             : 1.877897e+00
  Performance:
    L2 Error                                : 2.588382e-01
    CG Solve Time                           : 0.00095697 (0.00095697) sec
    DoFs/Sec in CG                          : 125.396 (125.396) million

-- CEED Benchmark Problem 6 -- libCEED + PETSc --
  MPI:
    Hostname                                : taliensis
    Total ranks                             : 1
    Ranks per compute node                  : 1
  PETSc:
    PETSc Vec Type                          : seqcuda
  libCEED:
    libCEED Backend                         : /gpu/cuda/gen
    libCEED Backend MemType                 : device
  Mesh:
    Solution Order (P)                      : 9
    Quadrature  Order (Q)                   : 9
    Additional quadrature points (q_extra)  : 0
    Global nodes                            : 12167
    Local Elements                          : 27
    Element topology                        : hexahedron
    Owned nodes                             : 12167
    DoF per node                            : 3
  KSP:
    KSP Type                                : cg
    KSP Convergence                         : DIVERGED_ITS
    Total KSP Iterations                    : 5
    Final rnorm                             : 1.981151e+00
  Performance:
    L2 Error                                : 3.197081e-01
    CG Solve Time                           : 0.00143622 (0.00143622) sec
    DoFs/Sec in CG                          : 127.074 (127.074) million

Before:

$ ./build/petsc-bps -ceed /gpu/cuda/gen -problem bp6

-- CEED Benchmark Problem 6 -- libCEED + PETSc --
  MPI:
    Hostname                                : taliensis
    Total ranks                             : 1
    Ranks per compute node                  : 1
  PETSc:
    PETSc Vec Type                          : seqcuda
  libCEED:
    libCEED Backend                         : /gpu/cuda/gen
    libCEED Backend MemType                 : device
  Mesh:
    Solution Order (P)                      : 2
    Quadrature  Order (Q)                   : 2
    Additional quadrature points (q_extra)  : 0
    Global nodes                            : 8
    Local Elements                          : 27
    Element topology                        : hexahedron
    Owned nodes                             : 8
    DoF per node                            : 3
  KSP:
    KSP Type                                : cg
    KSP Convergence                         : CONVERGED_RTOL
    Total KSP Iterations                    : 2
    Final rnorm                             : 1.227144e-30
  Performance:
    L2 Error                                : 8.049086e-16
    CG Solve Time                           : 0.00057244 (0.00057244) sec
    DoFs/Sec in CG                          : 0.0838516 (0.0838516) million

-- CEED Benchmark Problem 6 -- libCEED + PETSc --
  MPI:
    Hostname                                : taliensis
    Total ranks                             : 1
    Ranks per compute node                  : 1
  PETSc:
    PETSc Vec Type                          : seqcuda
  libCEED:
    libCEED Backend                         : /gpu/cuda/gen
    libCEED Backend MemType                 : device
  Mesh:
    Solution Order (P)                      : 3
    Quadrature  Order (Q)                   : 3
    Additional quadrature points (q_extra)  : 0
    Global nodes                            : 125
    Local Elements                          : 27
    Element topology                        : hexahedron
    Owned nodes                             : 125
    DoF per node                            : 3
  KSP:
    KSP Type                                : cg
    KSP Convergence                         : CONVERGED_RTOL
    Total KSP Iterations                    : 4
    Final rnorm                             : 5.360939e-15
  Performance:
    L2 Error                                : 2.222572e-01
    CG Solve Time                           : 0.000553244 (0.000553244) sec
    DoFs/Sec in CG                          : 2.71128 (2.71128) million

-- CEED Benchmark Problem 6 -- libCEED + PETSc --
  MPI:
    Hostname                                : taliensis
    Total ranks                             : 1
    Ranks per compute node                  : 1
  PETSc:
    PETSc Vec Type                          : seqcuda
  libCEED:
    libCEED Backend                         : /gpu/cuda/gen
    libCEED Backend MemType                 : device
  Mesh:
    Solution Order (P)                      : 4
    Quadrature  Order (Q)                   : 4
    Additional quadrature points (q_extra)  : 0
    Global nodes                            : 512
    Local Elements                          : 27
    Element topology                        : hexahedron
    Owned nodes                             : 512
    DoF per node                            : 3
  KSP:
    KSP Type                                : cg
    KSP Convergence                         : DIVERGED_ITS
    Total KSP Iterations                    : 5
    Final rnorm                             : 3.090721e-01
  Performance:
    L2 Error                                : 2.069256e-02
    CG Solve Time                           : 0.000725386 (0.000725386) sec
    DoFs/Sec in CG                          : 10.5875 (10.5875) million

-- CEED Benchmark Problem 6 -- libCEED + PETSc --
  MPI:
    Hostname                                : taliensis
    Total ranks                             : 1
    Ranks per compute node                  : 1
  PETSc:
    PETSc Vec Type                          : seqcuda
  libCEED:
    libCEED Backend                         : /gpu/cuda/gen
    libCEED Backend MemType                 : device
  Mesh:
    Solution Order (P)                      : 5
    Quadrature  Order (Q)                   : 5
    Additional quadrature points (q_extra)  : 0
    Global nodes                            : 1331
    Local Elements                          : 27
    Element topology                        : hexahedron
    Owned nodes                             : 1331
    DoF per node                            : 3
  KSP:
    KSP Type                                : cg
    KSP Convergence                         : DIVERGED_ITS
    Total KSP Iterations                    : 5
    Final rnorm                             : 1.302704e+00
  Performance:
    L2 Error                                : 8.785086e-02
    CG Solve Time                           : 0.000847515 (0.000847515) sec
    DoFs/Sec in CG                          : 23.5571 (23.5571) million

-- CEED Benchmark Problem 6 -- libCEED + PETSc --
  MPI:
    Hostname                                : taliensis
    Total ranks                             : 1
    Ranks per compute node                  : 1
  PETSc:
    PETSc Vec Type                          : seqcuda
  libCEED:
    libCEED Backend                         : /gpu/cuda/gen
    libCEED Backend MemType                 : device
  Mesh:
    Solution Order (P)                      : 6
    Quadrature  Order (Q)                   : 6
    Additional quadrature points (q_extra)  : 0
    Global nodes                            : 2744
    Local Elements                          : 27
    Element topology                        : hexahedron
    Owned nodes                             : 2744
    DoF per node                            : 3
  KSP:
    KSP Type                                : cg
    KSP Convergence                         : DIVERGED_ITS
    Total KSP Iterations                    : 5
    Final rnorm                             : 1.599026e+00
  Performance:
    L2 Error                                : 1.322905e-01
    CG Solve Time                           : 0.00113271 (0.00113271) sec
    DoFs/Sec in CG                          : 36.3377 (36.3377) million

-- CEED Benchmark Problem 6 -- libCEED + PETSc --
  MPI:
    Hostname                                : taliensis
    Total ranks                             : 1
    Ranks per compute node                  : 1
  PETSc:
    PETSc Vec Type                          : seqcuda
  libCEED:
    libCEED Backend                         : /gpu/cuda/gen
    libCEED Backend MemType                 : device
  Mesh:
    Solution Order (P)                      : 7
    Quadrature  Order (Q)                   : 7
    Additional quadrature points (q_extra)  : 0
    Global nodes                            : 4913
    Local Elements                          : 27
    Element topology                        : hexahedron
    Owned nodes                             : 4913
    DoF per node                            : 3
  KSP:
    KSP Type                                : cg
    KSP Convergence                         : DIVERGED_ITS
    Total KSP Iterations                    : 5
    Final rnorm                             : 1.935311e+00
  Performance:
    L2 Error                                : 2.045741e-01
    CG Solve Time                           : 0.00192776 (0.00192776) sec
    DoFs/Sec in CG                          : 38.2282 (38.2282) million

-- CEED Benchmark Problem 6 -- libCEED + PETSc --
  MPI:
    Hostname                                : taliensis
    Total ranks                             : 1
    Ranks per compute node                  : 1
  PETSc:
    PETSc Vec Type                          : seqcuda
  libCEED:
    libCEED Backend                         : /gpu/cuda/gen
    libCEED Backend MemType                 : device
  Mesh:
    Solution Order (P)                      : 8
    Quadrature  Order (Q)                   : 8
    Additional quadrature points (q_extra)  : 0
    Global nodes                            : 8000
    Local Elements                          : 27
    Element topology                        : hexahedron
    Owned nodes                             : 8000
    DoF per node                            : 3
  KSP:
    KSP Type                                : cg
    KSP Convergence                         : DIVERGED_ITS
    Total KSP Iterations                    : 5
    Final rnorm                             : 1.877897e+00
  Performance:
    L2 Error                                : 2.588382e-01
    CG Solve Time                           : 0.00125059 (0.00125059) sec
    DoFs/Sec in CG                          : 95.9549 (95.9549) million

-- CEED Benchmark Problem 6 -- libCEED + PETSc --
  MPI:
    Hostname                                : taliensis
    Total ranks                             : 1
    Ranks per compute node                  : 1
  PETSc:
    PETSc Vec Type                          : seqcuda
  libCEED:
    libCEED Backend                         : /gpu/cuda/gen
    libCEED Backend MemType                 : device
  Mesh:
    Solution Order (P)                      : 9
    Quadrature  Order (Q)                   : 9
    Additional quadrature points (q_extra)  : 0
    Global nodes                            : 12167
    Local Elements                          : 27
    Element topology                        : hexahedron
    Owned nodes                             : 12167
    DoF per node                            : 3
  KSP:
    KSP Type                                : cg
    KSP Convergence                         : DIVERGED_ITS
    Total KSP Iterations                    : 5
    Final rnorm                             : 1.981151e+00
  Performance:
    L2 Error                                : 3.197081e-01
    CG Solve Time                           : 0.00178699 (0.00178699) sec
    DoFs/Sec in CG                          : 102.13 (102.13) million
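
To make the two logs easier to compare, here is a small sketch that tabulates the "DoFs/Sec in CG" values from both runs and their ratio. The numbers are copied from the output above; nothing is re-measured. These are single runs on a 27-element mesh, so small differences at low order may well be noise.

```python
# DoFs/Sec in CG (million), copied from the logs above, keyed by P (= Q)
after = {2: 0.0942002, 3: 2.4078, 4: 10.5539, 5: 25.9009,
         6: 40.4751, 7: 67.4509, 8: 125.396, 9: 127.074}
before = {2: 0.0838516, 3: 2.71128, 4: 10.5875, 5: 23.5571,
          6: 36.3377, 7: 38.2282, 8: 95.9549, 9: 102.13}

# Ratio > 1 means the "After" run was faster
speedup = {p: after[p] / before[p] for p in sorted(after)}

print(" P    after   before  ratio")
for p, ratio in speedup.items():
    print(f"{p:2d} {after[p]:8.4f} {before[p]:8.4f}  {ratio:5.2f}")
```

The gain shows up mainly at higher order (roughly 1.2x to 1.3x at P = 8 and 9, and about 1.76x at P = 7), while P = 3 and P = 4 come out marginally slower in this particular pair of runs.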

@jeremylt
Member Author

OK, another TODO for a future PR: reuse the Interp when a QFunction uses both Interp and Grad as inputs or outputs. This PR is already big, so I don't want to add that here.
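
One possible reading of this TODO, sketched in 1D with plain Python: if the gradient is evaluated as an interpolation to the quadrature points followed by a collocated derivative (an assumption about the kernel structure, not something confirmed in this thread), then the interpolated values can be computed once and shared between the Interp and Grad evaluations. All helper names below are hypothetical, not libCEED API.

```python
def lagrange_interp_matrix(nodes, points):
    """B[q][i] = i-th Lagrange basis function (on `nodes`) evaluated at points[q]."""
    matrix = []
    for x in points:
        row = []
        for i, xi in enumerate(nodes):
            value = 1.0
            for j, xj in enumerate(nodes):
                if j != i:
                    value *= (x - xj) / (xi - xj)
            row.append(value)
        matrix.append(row)
    return matrix


def lagrange_deriv_matrix(nodes, points):
    """G[q][i] = derivative of the i-th Lagrange basis function (on `nodes`) at points[q]."""
    matrix = []
    for x in points:
        row = []
        for i, xi in enumerate(nodes):
            total = 0.0
            for j, xj in enumerate(nodes):
                if j == i:
                    continue
                term = 1.0 / (xi - xj)
                for n, xn in enumerate(nodes):
                    if n != i and n != j:
                        term *= (x - xn) / (xi - xn)
                total += term
            row.append(total)
        matrix.append(row)
    return matrix


def contract(matrix, vector):
    """Apply a 1D basis matrix to a vector."""
    return [sum(b * u for b, u in zip(row, vector)) for row in matrix]


nodes = [-1.0, 0.0, 1.0]                              # P = 3 example nodes
qpts = [-(3.0 / 5.0) ** 0.5, 0.0, (3.0 / 5.0) ** 0.5]  # Q = 3 Gauss points
u = [x * x for x in nodes]                            # nodal values of f(x) = x^2

B = lagrange_interp_matrix(nodes, qpts)  # nodes -> quadrature points
G = lagrange_deriv_matrix(nodes, qpts)   # direct gradient from nodal values
D = lagrange_deriv_matrix(qpts, qpts)    # collocated derivative on the quadrature points

u_q = contract(B, u)            # Interp, computed once
grad_direct = contract(G, u)    # Grad without reuse
grad_reused = contract(D, u_q)  # Grad built on the already interpolated values

max_diff = max(abs(a - b) for a, b in zip(grad_direct, grad_reused))
print("max difference between the two gradients:", max_diff)
```

The two gradients agree to rounding error here because the Q quadrature points can represent the degree P - 1 interpolant exactly (Q >= P).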

@jeremylt jeremylt merged commit 41ece66 into main Jul 29, 2025
29 checks passed
@jeremylt jeremylt deleted the jeremy/shared-collocated branch July 29, 2025 15:15
Development

Successfully merging this pull request may close these issues.

GPU Shared/Gen Collocated Quadrature
