This repository contains the source code for PIM-TC, an implementation of the Triangle Counting (TC) algorithm optimized for the UPMEM Processing-in-Memory (PIM) architecture.
Triangle Counting (TC) is a fundamental graph analysis task used in various domains. However, TC is memory-bound and often struggles to scale on conventional CPU/GPU systems due to high memory bandwidth requirements and low data reuse. Processing-in-Memory (PIM) offers a promising solution by placing computation closer to memory.
This work presents the first TC algorithm designed for and evaluated on the real-world UPMEM PIM system. Our implementation addresses the specific limitations of this architecture:
- Expensive Inter-Core Communication: We employ a vertex coloring technique to partition the graph edges among the PIM cores such that triangles can be counted locally without the need for communication between PIM cores.
- Limited DPU Memory: We use reservoir sampling within each PIM core's local DRAM bank (MRAM) to handle graphs larger than the available memory, providing an approximate count with statistical adjustments.
- Host-PIM Data Transfer Overhead: Uniform sampling at the host level reduces the number of edges transferred to the PIM cores, speeding up data preparation and execution for approximate results.
- High-Degree Nodes: We optionally use the Misra-Gries summary algorithm during host processing to identify and re-label high-degree nodes, optimizing the edge-iterator approach on the PIM cores.
Our implementation accepts graphs in the standard Coordinate List (COO) format and demonstrates significant speedups compared to CPU-based methods for dynamic graph workloads where the COO format allows for efficient updates. Refer to the Dynamic Graph TC Branch for an implementation specifically designed to support dynamic graph updates.
![Partitioning of the graph's edges among the PIM cores through the host processor](img/edges_partitioning.png)
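The partitioning shown above can be illustrated with a short sketch (our illustration in Python, not the repository's host code; the hash-based coloring and function names are assumptions):

```python
# Illustrative sketch of the vertex-coloring partitioning (not the repository's
# code). Each node gets one of C colors; an edge (u, v) is replicated to every
# DPU that owns a color triple containing {color(u), color(v)}, so the three
# edges of any triangle meet on exactly one DPU and no inter-DPU communication
# is needed.
from itertools import combinations_with_replacement

C = 4  # number of colors; requires Binomial(C + 2, 3) DPUs (20 for C=4)
triples = list(combinations_with_replacement(range(C), 3))  # one DPU per color triple
dpu_of_triple = {t: i for i, t in enumerate(triples)}

def color(node, seed=0):
    # Stand-in for the host's seeded node coloring (an assumption)
    return hash((node, seed)) % C

def dpus_for_edge(u, v):
    a, b = sorted((color(u), color(v)))
    # Replicate the edge to every DPU owning a triple that contains {a, b};
    # a triangle with colors {a, b, c} has all three edges on DPU {a, b, c}
    return {dpu_of_triple[tuple(sorted((a, b, x)))] for x in range(C)}
```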
Repository structure:

```
.
├── LICENSE                     # Project license information
├── Makefile                    # Build configuration for the project
├── README.md                   # This file
├── common/                     # Common headers used by host and DPU
│   └── common.h
├── dpu/                        # Source code to be run on UPMEM DPUs
│   ├── dpu_util.c
│   ├── dpu_util.h
│   ├── locate_nodes.c
│   ├── locate_nodes.h
│   ├── quicksort.c
│   ├── quicksort.h
│   ├── task.c
│   ├── triangle_counter.c
│   └── triangle_counter.h
├── host/                       # Source code running on the host CPU
│   ├── app.c
│   ├── handle_edges_parallel.c
│   ├── handle_edges_parallel.h
│   ├── host_util.c
│   ├── host_util.h
│   ├── mg_hashtable.c
│   └── mg_hashtable.h
└── img/                        # Images used in README
    └── edges_partitioning.png
```
- UPMEM SDK: The code was tested with UPMEM SDK v2024.1.0. Download it from the UPMEM website and ensure the SDK environment is activated (`source <sdk_path>/upmem_env.sh`).
- Build Tools: A standard C compiler (such as `gcc`) and `make`.
Compile the host and DPU code using the provided Makefile:
```
make
```

This will create the executable `app` in the `bin/` directory.
Before compiling, you can adjust the following parameters:
- `NR_TASKLETS` (in Makefile): Number of tasklets (software threads) per DPU. Must be a power of 2. The paper uses 16 for best performance.
- `NR_DPUS` (in Makefile): Number of DPUs to use. This is determined by the number of colors (`C`) chosen for vertex coloring via the `-c` flag (see below). The Makefile value should match the value derived from `C` using the formula `NR_DPUS = Binomial(C + 2, 3)` (see the helper after this list). Ensure the target system has at least this many DPUs available.
- `NR_THREADS` (in Makefile): Number of CPU threads used by the host for parallel processing (e.g., reading the graph, preparing batches). Set this to the number of available CPU cores/threads for best host performance (the paper uses 32).
- `WRAM_BUFFER_SIZE` (in `dpu_util.h`): Size of the buffer in the DPU's scratchpad memory (WRAM) used to hold edges during processing. Must not exceed 2048 bytes.
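The `NR_DPUS` formula can be double-checked with a few lines of Python (a convenience sketch, not part of the repository):

```python
# Convenience helper: number of DPUs required for C colors,
# i.e., one DPU per unordered color triple.
from math import comb

def nr_dpus(c):
    return comb(c + 2, 3)

# nr_dpus(10) == 220, nr_dpus(12) == 364, nr_dpus(23) == 2300
```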
Navigate to the `bin` directory after compilation and execute the application:

```
./app -c <nr_colors> -f <path_to_graph_file> [OPTIONS]
```
Required Arguments:
- `-c <nr_colors>`: (Required) Number of colors (`C`) to use for vertex coloring. This determines the number of DPUs allocated (`Binomial(C + 2, 3)`).
- `-f <path_to_graph_file>`: (Required) Path to the input graph file in Coordinate List (COO) format, stored as a Matrix Market (`.mtx`) file.
Optional Arguments (for approximation and optimization):
- `-p <keep_percentage>`: Probability (0.0 to 1.0) of keeping an edge during uniform sampling at the host level. The default is `1.0` (no sampling); values below 1.0 enable approximation for faster processing.
- `-M <sample_size>`: Maximum number of edges to store in each DPU's MRAM bank for reservoir sampling. If the number of edges assigned to a DPU exceeds this value, sampling occurs. Defaults to the maximum possible based on available MRAM if not specified. Enables approximation if triggered.
- `-k <Misra_Gries_dictionary_size>`: Size (`K`) of the dictionary used by the Misra-Gries algorithm on the host (per thread) to find frequent (high-degree) nodes. Setting it enables the high-degree node optimization (see the sketch after this list).
- `-t <nr_most_frequent_nodes_sent>`: Number (`t`) of the most frequent nodes (identified by Misra-Gries) whose edges are remapped before being sent to the DPUs. The default is 5. Ignored if `-k` is not set.
- `-s <seed>`: Seed for the random number generation used in coloring and sampling. A random seed is used if not specified.
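For context, `-k` and `-t` rely on the Misra-Gries frequent-items summary. Below is a minimal Python sketch of the idea (our illustration, not the host's `mg_hashtable.c` implementation):

```python
# Misra-Gries sketch (illustrative, not the repository's code): keep at most
# K counters over the stream of edge endpoints; the nodes holding the t
# largest counters at the end approximate the most frequent
# (i.e., highest-degree) nodes.
def misra_gries_top_nodes(edges, K, t):
    counters = {}
    for u, v in edges:
        for node in (u, v):
            if node in counters:
                counters[node] += 1
            elif len(counters) < K:
                counters[node] = 1
            else:
                # No free counter: decrement all, dropping those that hit zero
                for key in list(counters):
                    counters[key] -= 1
                    if counters[key] == 0:
                        del counters[key]
    return sorted(counters, key=counters.get, reverse=True)[:t]
```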
Example:

```
# Run exact count using 10 colors (-> 220 DPUs) on graph.mtx
./app -c 10 -f ../data/graph.mtx

# Run approximate count using 8 colors (-> 120 DPUs), keeping 10% of edges (p=0.1)
./app -c 8 -p 0.1 -f ../data/graph.mtx

# Run exact count using 12 colors (-> 364 DPUs) with Misra-Gries (K=1000, t=20)
./app -c 12 -k 1000 -t 20 -f ../data/graph.mtx
```
This section provides details on the datasets and procedures used in the experiments described in the accompanying paper to facilitate reproducibility.
The input graphs must be preprocessed before being used. The preprocessing involves:

1. Reading the raw graph file (assuming COO format, e.g., space-separated node ID pairs per line).
2. Removing self-loops (edges `(u, u)`).
3. Removing duplicate edges (treating `(u, v)` and `(v, u)` as the same edge).
4. Shuffling the resulting unique edges randomly.
5. Prepending a header line to make the format Matrix Market (`max_node_id max_node_id num_edges`).
A simple Python script can perform steps 1-3. Using an equivalent implementation in C++ might offer better performance for very large graphs.
Python Preprocessing Function:
```python
def preprocess_graph(input_filename, output_filename):
    seen_edges = set()
    max_node_id = -1
    valid_edge_count = 0
    with open(input_filename, 'r') as infile, open(output_filename, 'w') as outfile:
        for line in infile:
            edge_nodes = line.split()
            if len(edge_nodes) != 2:
                continue  # Skip blank or malformed lines
            u, v = map(int, edge_nodes)
            max_node_id = max(max_node_id, u, v)
            # Remove self-loops
            if u == v:
                continue
            # Treat (u, v) and (v, u) as the same edge
            edge = tuple(sorted((u, v)))
            # Add to the set and write only if it is a new edge
            if edge not in seen_edges:
                seen_edges.add(edge)
                outfile.write(f"{u} {v}\n")  # Write the original u v, as in COO
                valid_edge_count += 1
    print("Finished processing.")
    print(f"Max Node ID found: {max_node_id}")
    print(f"Number of unique, non-self-loop edges: {valid_edge_count}")
    return max_node_id, valid_edge_count
```
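For example (the filenames are placeholders):

```python
# Clean the raw edge list and record the values needed for the header below
max_node_id, valid_edge_count = preprocess_graph("raw_graph.txt", "cleaned_graph.txt")
```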
Post-processing Steps:

- Shuffle the cleaned graph: `shuf cleaned_graph.txt > shuffled_graph.txt`
- Prepend a header to obtain the Matrix Market format: use the `max_node_id` and `valid_edge_count` values returned by the preprocessing step to create the final input file (`final_graph.mtx`). The header line must be `max_node_id max_node_id valid_edge_count` (a small sketch follows this list).
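A minimal way to prepend the header in Python (a convenience sketch; it reuses the values returned by `preprocess_graph` and the `shuf` output above):

```python
def prepend_header(shuffled_filename, final_filename, max_node_id, valid_edge_count):
    with open(shuffled_filename, 'r') as infile, open(final_filename, 'w') as outfile:
        # Matrix Market header expected by the application: rows cols num_edges
        outfile.write(f"{max_node_id} {max_node_id} {valid_edge_count}\n")
        for line in infile:
            outfile.write(line)

prepend_header("shuffled_graph.txt", "final_graph.mtx", max_node_id, valid_edge_count)
```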
The following table details the graphs used, including their sources and properties after preprocessing.
| Graph Name | Source | Valid Edges | Unique Nodes | Max Node ID | Max Degree | Average Degree | Triangles | Connected Triples | Global Clustering Coefficient |
|---|---|---|---|---|---|---|---|---|---|
| Kronecker 23 | Graph500 Generator (N=23, `-e 16`) | 129,335,985 | 4,609,311 | 8,388,607 | 257,484 | 56.12 | 4,675,811,428 | 672,277,292,688 | 0.0209 |
| Kronecker 24 | Graph500 Generator (N=24, `-e 16`) | 260,383,358 | 8,870,393 | 16,777,214 | 407,017 | 58.71 | 10,285,674,980 | 1,783,231,368,677 | 0.0173 |
| V1r | SuiteSparse (GenBank) | 232,705,452 | 214,005,017 | 214,005,018 | 8 | 2.17 | 49 | 307,294,717 | 4.784e-7 |
| LiveJournal | SNAP | 42,851,237 | 4,847,571 | 4,847,571 | 20,333 | 17.68 | 285,730,264 | 7,269,503,753 | 0.1179 |
| Orkut | SNAP | 117,185,083 | 3,072,441 | 3,072,627 | 33,313 | 76.28 | 627,584,181 | 45,625,466,571 | 0.0413 |
| Human-Jung | Network Repository | 267,844,669 | 784,262 | 1,822,418 | 21,743 | 683.05 | 41,727,013,307 | 425,248,603,910 | 0.2944 |
| WikipediaEdit | Konect | 255,688,945 | 42,541,517 | 42,640,546 | 3,026,864 | 12.02 | 881,439,081 | 33,785,432,110,811 | 7.827e-5 |
Graph500 Kronecker Generator:
- Repository: https://github.com/RapidsAtHKUST/Graph500KroneckerGraphGenerator
- Specification: https://graph500.org/?page_id=12
- Generation Command: `./generator_omp N -e 16 -o output.txt`
  - `N`: Scale factor.
  - `-e 16`: Edge factor.
- The output file requires preprocessing as described above.
General Methodology: For each configuration tested in the paper, 2 warmup runs were performed, followed by 5 measurement runs; the average time of the 5 measurement runs is reported in the paper. Ensure the PIM application is compiled and the datasets are preprocessed. Run PIM commands from the `bin/` directory.
1. PIM DPU Scaling:
- Modify `NR_DPUS` in the `Makefile` to match the number of DPUs corresponding to the chosen color count `C` (using `NR_DPUS = Binomial(C + 2, 3)`). Recompile (`make clean && make`).
- Run for the various `C` values tested in the paper on all graphs.

```
# Example for C=20 (1540 DPUs - adjust the Makefile first)
./app -c 20 -f ../data/final_graph.mtx
```
2. PIM Misra-Gries Summary:
- Use a fixed number of colors, `C=23` (corresponding to 2300 DPUs: set `NR_DPUS=2300` in the Makefile and recompile).
- Iterate through `K` in `{0, 2500, 5000, 10000}` and `T` in `{0, 10, 50, 100, 250}`. Note that `K=0` disables Misra-Gries.

```
# Example run
./app -c 23 -k 5000 -t 50 -f ../data/final_graph.mtx
```
3. PIM Uniform Sampling:
- Use `C=23` (`NR_DPUS=2300`).
- Use the best performing `K` and `T` parameters found in the Misra-Gries tests for each graph:

| Graph | Best K | Best T |
|---|---|---|
| Kronecker 23 | 10000 | 250 |
| Kronecker 24 | 5000 | 250 |
| V1r | 0 | 0 |
| LiveJournal | 0 | 0 |
| Orkut | 0 | 0 |
| Human-Jung | 0 | 0 |
| WikipediaEdit | 2500 | 250 |

- Iterate through `p` in `{0.5, 0.25, 0.1, 0.01}`.

```
# Example run for Kronecker 23 with p=0.1
./app -c 23 -p 0.1 -k 10000 -t 250 -f ../data/kronecker_23.mtx

# Example run for LiveJournal with p=0.5 (no Misra-Gries needed)
./app -c 23 -p 0.5 -k 0 -t 0 -f ../data/liveJournal.mtx
```
4. PIM Reservoir Sampling:
- Use `C=23` (`NR_DPUS=2300`).
- Use the best performing `K` and `T` parameters (same as above).
- Calculate the maximum expected number of edges assigned to a DPU. Using the estimate for the most loaded DPUs (`6N`): `max_sample_size_approx = (6.0 / (C**2)) * total_edges`.
- Iterate through `p` in `{0.5, 0.25, 0.1, 0.01}`.
- Set `SAMPLE_SIZE = round(p * max_sample_size_approx)`. Use the `-M SAMPLE_SIZE` option with the `final_graph.mtx` file.

```
# Example calculation for Kronecker 23 (129,335,985 edges)
# C = 23
# total_edges = 129335985
# max_sample_size_approx = (6.0 / (23*23)) * 129335985 ≈ 1466949
# For p = 0.1, SAMPLE_SIZE = round(0.1 * 1466949) = 146695

# Example run for Kronecker 23 with p=0.1 for the reservoir size
./app -c 23 -M 146695 -k 10000 -t 250 -f ../data/kronecker_23.mtx
```
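The sample-size calculation can be wrapped in a small helper (a convenience sketch based on the `6N/C^2` estimate above, not part of the repository):

```python
# Compute the -M value for reservoir sampling from the load estimate above
def reservoir_sample_size(total_edges, C, p):
    max_sample_size_approx = (6.0 / (C ** 2)) * total_edges
    return round(p * max_sample_size_approx)

# reservoir_sample_size(129335985, 23, 0.1) == 146695
```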
5. CPU/GPU Comparisons (static graphs):
- CPU Baseline (Bader-Research):
  1. Clone the repository: `git clone https://github.com/Bader-Research/triangle-counting.git`
  2. Modify `src/main.c`: comment out all `benchmarkTC_P` or `benchmarkTC` lines except the one for `tc_MapJIK_P`.
  3. Compile the code according to the repository's instructions (e.g., using `make`). Let the compiled executable be `tc_cpu`.
  4. Run the command: `./tc_cpu -P -f ../data/graph.mtx -x -q`
  5. The reported time for `tc_MapJIK_P` is the triangle counting time used for comparison. Note that the internal COO-to-CSR conversion time performed by this code is excluded from the comparison results.
- GPU Baseline (cuGraph):
  1. Use the preprocessed graph file without the header.
  2. Ensure `cugraph` and `cudf` are installed in your Python environment.
  3. Use the following Python function:

```python
import cugraph
import cudf
import time

def run_cugraph_tc(file_path):
    start_load = time.time()
    # Read the graph from COO (space separated, no header)
    gdf = cudf.read_csv(filepath_or_buffer=file_path, delimiter=' ',
                        dtype=['int32', 'int32'], header=None, names=['src', 'dst'])
    G = cugraph.Graph(directed=False)
    G.from_cudf_edgelist(gdf, source='src', destination='dst')
    load_graph_time = time.time() - start_load
    print(f"Graph Loading Time: {load_graph_time:.4f} s")

    start_count = time.time()
    count = cugraph.triangle_count(G)
    count_time = time.time() - start_count
    print(f"Triangle Count Time: {count_time:.4f} s")
    return count / 3
```
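For instance (the path below is a placeholder for the header-less edge list produced during preprocessing):

```python
# Hypothetical path to the header-less edge list
triangles = run_cugraph_tc("../data/final_graph_noheader.txt")
print(f"Triangles: {triangles}")
```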
6. CPU/GPU Comparisons (dynamic graphs):
Refer to the Dynamic Graph TC Branch.
If you find this repository useful, please use the following citation:
Lorenzo Asquini, Manos Frouzakis, Juan Gómez-Luna, Mohammad Sadrosadati, Onur Mutlu, Francesco Silvestri, "Accelerating Triangle Counting with Real Processing-in-Memory Systems", arXiv:2505.04269 [cs.AR], 2025.
BibTeX entry:
```bibtex
@misc{asquini2025accelerating,
  title={Accelerating Triangle Counting with Real Processing-in-Memory Systems},
  author={Lorenzo Asquini and Manos Frouzakis and Juan Gómez-Luna and Mohammad Sadrosadati and Onur Mutlu and Francesco Silvestri},
  year={2025},
  eprint={2505.04269},
  archivePrefix={arXiv},
  primaryClass={cs.AR}
}
```
This project is licensed under the MIT License - see the LICENSE file for details.
This work was supported in part by the Big-Mobility project (Uni-Impresa call, University of Padova), MUR PRIN 20174LF3T8 AHEAD, MUR PNRR CN00000013 (HPC, Big Data, Quantum Computing), ETH Future Computing Laboratory (EFCL), Huawei ZRC Storage Team, Semiconductor Research Corporation, AI Chip Center for Emerging Smart Systems (ACCESS), InnoHK funding, Hong Kong SAR, and European Union's Horizon programme [101047160 - BioPIM]. We acknowledge generous gifts from Google, Huawei, Intel, and Microsoft.