This repository explores parallel computing paradigms through practical implementations completed as part of the Parallel Systems curriculum at the National Technical University of Athens. The work spans fundamental parallel/distributed computing architectures: shared memory systems, GPU accelerators, and distributed computing environments.
The first module is the classic Game of Life simulation with OpenMP parallelization. The main aspects of the project include:
- Multi-threaded implementation with dynamic thread allocation
- Systematic performance benchmarking
- Visual representation of the results using charts
- Execution time analysis
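The per-generation update can be sketched as below. This is a minimal illustration, not the repo's actual kernel: the grid size, names, and boundary handling (cells outside the grid count as dead) are assumptions.

```c
/* One Game of Life generation on an N x N grid (illustrative sketch).
 * The row loop is shared across OpenMP threads; without -fopenmp the
 * pragma is ignored and the code runs correctly in serial. */
#define N 8

static int alive(const int g[N][N], int i, int j) {
    if (i < 0 || i >= N || j < 0 || j >= N) return 0;  /* outside = dead */
    return g[i][j];
}

void step(const int cur[N][N], int next[N][N]) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            int n = 0;
            for (int di = -1; di <= 1; di++)
                for (int dj = -1; dj <= 1; dj++)
                    if (di || dj) n += alive(cur, i + di, j + dj);
            /* Standard rules: survive with 2-3 neighbours, birth with 3. */
            next[i][j] = cur[i][j] ? (n == 2 || n == 3) : (n == 3);
        }
    }
}
```

Because each `next[i][j]` depends only on the previous generation, the rows can be updated independently, which is what makes the straightforward `parallel for` correct.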
Explored two distinct parallelization strategies for the clustering algorithm:
- Strategy 1: Synchronized shared-cluster approach leveraging OpenMP directives such as `#pragma omp parallel for` for parallel calculation of `NewCluster` and `NewClusterSize`
- Strategy 2: Replication-based design with reduction, with the final calculation assigned to thread 0 (the main thread)
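Strategy 2 can be sketched as follows. The `NewCluster`/`NewClusterSize` names come from the project; everything else (1-D points, function signature) is an illustrative assumption. Each thread accumulates into private copies that OpenMP's array-section reduction merges, after which the main thread finalizes the centroids.

```c
/* Replication-based clustering iteration (sketch): per-thread partial
 * sums are combined by the reduction clause, then centroids are
 * finalized serially. Runs correctly in serial without -fopenmp. */
#define NPTS 6
#define K 2

void cluster_iter(const double pts[NPTS], double centroid[K]) {
    double NewCluster[K] = {0};
    int NewClusterSize[K] = {0};

    #pragma omp parallel for reduction(+:NewCluster[:K]) reduction(+:NewClusterSize[:K])
    for (int i = 0; i < NPTS; i++) {
        int best = 0;                       /* nearest centroid */
        for (int c = 1; c < K; c++)
            if ((pts[i] - centroid[c]) * (pts[i] - centroid[c]) <
                (pts[i] - centroid[best]) * (pts[i] - centroid[best]))
                best = c;
        NewCluster[best] += pts[i];
        NewClusterSize[best]++;
    }

    /* Final calculation on the main thread, outside the parallel loop. */
    for (int c = 0; c < K; c++)
        if (NewClusterSize[c] > 0)
            centroid[c] = NewCluster[c] / NewClusterSize[c];
}
```

Replication trades memory (one accumulator set per thread) for the synchronization that Strategy 1 pays on the shared arrays.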
Also explored thread-affinity configuration via `GOMP_CPU_AFFINITY`, which binds threads to specific cores, along with cache-coherence challenges and memory-locality optimizations through NUMA-aware strategies.
Conducted an empirical study of locking mechanisms:
- Typical implementations: pthread mutexes
pthread_mutex_lockand spinlockspthread_spin_lock - Custom solutions: test-and-set (
tas_lock), test-and-test-and-set (ttas_lock), array-based locks (array_lock), CLH queue locks (clh_lock) - Comparative analysis with OpenMP's directives
#pragma omp criticaland#pragma omp atomic
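The test-and-test-and-set idea can be sketched with C11 atomics. This mirrors what a `ttas_lock` typically does, but the types and function names here are illustrative, not the repo's API.

```c
#include <stdatomic.h>

/* TTAS lock sketch: spin on a plain read until the lock looks free,
 * and only then attempt the atomic exchange. Waiting threads spin on
 * a locally cached copy instead of hammering the bus with writes. */
typedef struct { atomic_int flag; } ttas_lock_t;

void ttas_init(ttas_lock_t *l) { atomic_store(&l->flag, 0); }

void ttas_acquire(ttas_lock_t *l) {
    for (;;) {
        while (atomic_load_explicit(&l->flag, memory_order_relaxed))
            ;                                  /* local spin: read only  */
        if (!atomic_exchange(&l->flag, 1))     /* test-and-set attempt   */
            return;                            /* saw 0 -> 1: we own it  */
    }
}

void ttas_release(ttas_lock_t *l) {
    atomic_store_explicit(&l->flag, 0, memory_order_release);
}
```

Under contention this is the classic improvement over plain test-and-set: the inner read loop keeps the cache line in shared state, so coherence traffic only occurs when the lock is actually released.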
- Parallelization of a recursive implementation with OpenMP tasks
- Performance profiling across various matrix dimensions
- Comparison of tiled and recursive approaches (additional work)
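The OpenMP task pattern behind the recursive formulation can be sketched on a simple divide-and-conquer reduction. This is a generic illustration of the task/taskwait structure, assuming a serial cutoff; the repo's actual recursive kernel operates on matrices.

```c
/* Recursive divide-and-conquer with OpenMP tasks (generic sketch).
 * One half of the range is spawned as a task, the other is computed
 * by the current thread; taskwait joins the child before combining.
 * Without -fopenmp the pragmas are ignored and the code runs serially. */
long rec_sum(const int *a, int lo, int hi) {
    if (hi - lo <= 64) {                 /* serial cutoff for small work */
        long s = 0;
        for (int i = lo; i < hi; i++) s += a[i];
        return s;
    }
    int mid = lo + (hi - lo) / 2;
    long left, right;
    #pragma omp task shared(left)
    left = rec_sum(a, lo, mid);          /* spawned as a child task */
    right = rec_sum(a, mid, hi);         /* computed by this thread */
    #pragma omp taskwait                 /* join before combining   */
    return left + right;
}
```

In a real run the top-level call would sit inside `#pragma omp parallel` with `#pragma omp single`, so one thread seeds the task tree and the rest of the team steals work from it.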
Investigated concurrent linked list implementations through multiple synchronization paradigms:
- Coarse-grain locking
- Fine-grain locking
- Optimistic synchronization
- Lazy synchronization
- Non-blocking synchronization
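The simplest of the five schemes, coarse-grain locking, can be sketched as below. One mutex guards the entire sorted list, so every operation serializes; the finer-grained schemes exist precisely to relax this. Names and signatures here are illustrative.

```c
#include <pthread.h>
#include <stdlib.h>

/* Coarse-grain locked sorted list (sketch): a single mutex protects
 * the whole structure, making correctness trivial at the cost of all
 * concurrency between operations. */
typedef struct node { int key; struct node *next; } node_t;

typedef struct {
    node_t *head;
    pthread_mutex_t lock;
} list_t;

void list_init(list_t *l) {
    l->head = NULL;
    pthread_mutex_init(&l->lock, NULL);
}

int list_insert(list_t *l, int key) {  /* returns 0 if key already present */
    pthread_mutex_lock(&l->lock);
    node_t **p = &l->head;
    while (*p && (*p)->key < key) p = &(*p)->next;
    if (*p && (*p)->key == key) { pthread_mutex_unlock(&l->lock); return 0; }
    node_t *n = malloc(sizeof *n);
    n->key = key;
    n->next = *p;
    *p = n;
    pthread_mutex_unlock(&l->lock);
    return 1;
}

int list_contains(list_t *l, int key) {
    pthread_mutex_lock(&l->lock);
    node_t *p = l->head;
    while (p && p->key < key) p = p->next;
    int found = (p && p->key == key);
    pthread_mutex_unlock(&l->lock);
    return found;
}
```

Fine-grain locking replaces the single mutex with a per-node lock and hand-over-hand traversal; the optimistic and lazy variants go further by searching without locks and validating afterwards.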
Progressively refined GPU implementations demonstrating optimization techniques:
- Naive version: Direct kernel implementation for cluster assignment
- Transpose version: Data transposition for coalesced access patterns
- Shared-memory version: Shared memory utilization for bandwidth reduction
- Full-offload version: Entire computation executed on the GPU, eliminating CPU-GPU transfer overhead
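The layout change behind the transpose version can be shown host-side in plain C. With points stored as `[point][coord]`, consecutive GPU threads (one per point) stride through memory `ndim` apart; transposing to `[coord][point]` makes thread `i` read element `i` of each row, so global-memory accesses coalesce. The function name and signature are illustrative.

```c
/* Transpose a point set from point-major [npoints][ndim] layout to
 * coordinate-major [ndim][npoints] layout, so that one-thread-per-point
 * GPU kernels read consecutive addresses (coalesced access). */
void transpose_points(const double *src, double *dst, int npoints, int ndim) {
    for (int p = 0; p < npoints; p++)
        for (int d = 0; d < ndim; d++)
            dst[d * npoints + p] = src[p * ndim + d];
}
```

The shared-memory version then attacks a different bottleneck, staging frequently reused data (e.g. the centroids) in on-chip memory instead of rereading them from DRAM.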
Developed a message-passing variant using MPI infrastructure:
- Implementation distributed across multiple nodes
- Scalability comparison with shared-memory OpenMP implementation
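The data decomposition an MPI variant typically uses can be sketched without the MPI runtime itself. This helper computes the half-open row range each rank would own; the name and signature are illustrative, not the repo's code.

```c
/* Row-block decomposition per MPI rank (sketch). Remainder rows are
 * spread over the lowest-numbered ranks, so block sizes differ by at
 * most one row; each rank gets the half-open range [*lo, *hi). */
void block_range(int nrows, int nprocs, int rank, int *lo, int *hi) {
    int base = nrows / nprocs;
    int rem  = nrows % nprocs;
    *lo = rank * base + (rank < rem ? rank : rem);
    *hi = *lo + base + (rank < rem ? 1 : 0);
}
```

Each rank then works on its block and exchanges only boundary data with neighbours, which is what the scalability comparison against the shared-memory OpenMP version measures.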
Distributed implementation of iterative solvers for 2D thermal diffusion:
- Method 1: Standard Jacobi iteration
- Method 2: Successive over-relaxation (Gauss-Seidel variant)
- Method 3: Red-Black ordering SOR
Results were visualized for both fixed-iteration and convergence-based termination criteria, with scalability analysis across process counts.
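The core stencil the three methods share can be sketched with a single Jacobi sweep. Grid size and names are illustrative; boundary values are assumed fixed and handled by the caller, and the repo's solver distributes this computation over MPI processes.

```c
/* One Jacobi sweep for 2-D thermal diffusion on a GRID x GRID mesh
 * (sketch): each interior point becomes the average of its four
 * neighbours from the previous iterate. SOR and Red-Black SOR vary
 * the update order and blend in the old value with a relaxation factor. */
#define GRID 6

void jacobi_sweep(const double u[GRID][GRID], double unew[GRID][GRID]) {
    for (int i = 1; i < GRID - 1; i++)
        for (int j = 1; j < GRID - 1; j++)
            unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j]
                               + u[i][j-1] + u[i][j+1]);
}
```

Jacobi reads only the old grid, so it parallelizes trivially; Gauss-Seidel/SOR converge faster but introduce dependencies, which the Red-Black ordering breaks by updating the two "colors" of the checkerboard in alternating, independent half-sweeps.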
This hands-on experience bridged theoretical concepts with real-world parallel computing challenges, establishing a solid foundation in modern high-performance computing with widely used frameworks and libraries such as OpenMP, CUDA, and MPI.