in a grid, cuda follows the SIMT structure ie., single instruction multiple threads. the indices of threads define which portion of the data they work on.
the parameters <<<blocksPerGrid, dimensionsOfBlock>>> are of type dim3, a vector type that holds 3 integer values. we can use fewer than 3 dimensions by setting the unused dimensions to 1. example: dim3 dimGrid(32, 1, 1).
these can have any name as long as type is dim3.
if it's just one dimension, you can pass a plain integer; no need for dim3. the remaining two dimensions automatically default to 1.
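a minimal sketch of both launch styles, assuming a hypothetical kernel named myKernel:

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(float *data) { /* ... */ }

int main() {
    float *d_data;
    cudaMalloc((void**)&d_data, 1024 * sizeof(float));

    // explicit dim3: a 1D grid of 32 blocks, each with 256 threads
    dim3 dimGrid(32, 1, 1);
    dim3 dimBlock(256, 1, 1);
    myKernel<<<dimGrid, dimBlock>>>(d_data);

    // shorthand: plain integers work for 1D; y and z default to 1
    myKernel<<<32, 256>>>(d_data);

    cudaFree(d_data);
    return 0;
}
```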
gridDim.x can take values between 1 and 2^31 - 1; gridDim.y and gridDim.z can range from 1 to 2^16 - 1 (65535).
blockIdx.x can range from 0 to gridDim.x - 1, and so on for the other dimensions.
max possible total size of a block is 1024 threads. they can be distributed in any way across the dimensions, e.g. (256, 2, 2), (2, 2, 256), as long as the product doesn't exceed 1024.
a grid dimension can be smaller than a block dimension and vice versa.
each threadIdx also has three coordinates.
the labels of blocks are written with the dimensions reversed: block(1, 0) means the block that has blockIdx.y = 1 and blockIdx.x = 0.
thread(1, 0, 2) has threadIdx.z = 1, threadIdx.y = 0 and threadIdx.x = 2. this is done to help us with mapping thread coordinates for multidimensional data.
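a sketch of that mapping, assuming a hypothetical kernel that writes each element's own linear index; the (row, col) names and the kernel name are illustrative:

```cuda
__global__ void mapIndex(int *out, int width) {
    // threadIdx.x / blockIdx.x give the column: the fastest varying
    // dimension, matching row-major data in memory
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    // flatten the 2d coordinate into the 1d offset
    out[row * width + col] = row * width + col;
}
```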
this reversing of dimensions is because cuda follows a row-major layout. since columns are the fastest varying dimension, consecutive threadIdx.x values map to consecutive memory locations, which makes these accesses faster.
c requires the number of columns to be known at compile time, but the whole point of using dynamically allocated arrays is that we can handle data of varying size. hence we usually flatten a dynamically allocated 2d array into an equivalent 1d array.
because memory is "flat" in modern computers, all multi dimensional data is flattened.
if the data dimension is static, cuda allows us to use higher dimensional indexing. but under the hood, it is still linearized.
memory space is a program's private workspace in the computer's memory; it's where the program's data and instructions are kept. when a program needs some data, it uses the starting address and the number of bytes to access it.
floats need 4 bytes and doubles need 8 bytes; the bytes of these multibyte variables are stored consecutively in memory.
row major layout is where all elements of a row are stored consecutively in memory.
accessing the value at the jth row and ith column of M, ie., M[j][i], assuming there are 4 values in each row, is done via the offset j * 4 + i.
column major layout is the transposed form of row major layout: all elements of a column are stored consecutively.
blurring is usually done as a weighted sum of a neighbourhood of the image. it belongs to the convolution pattern.
usually weights depend on how far away a pixel is from the current position; when the weights follow a gaussian distribution, this is called gaussian blur.
cuda kernel launches are asynchronous. cudaDeviceSynchronize() forces the host to wait till the gpu has finished executing all preceding cuda calls. this ensures the kernel has completed before any copying is done and catches any errors that occur during kernel execution.
gpu's dram is relatively slow compared to the cuda cores, so every time a thread needs to fetch memory from there, latency is introduced.
shared memory is a fast on-chip cache that is shared by all threads within a single block. it's closer to the cuda cores and has significantly lower latency and higher bandwidth than global memory.
once data has been loaded into the shared memory by a block, all threads in that block will have access to it.
gpu memory hierarchy in terms of distance and speed: registers (fastest memory on the gpu; frequently used variables are stored here; practically instantaneous) -> L1 cache (serves a small group of warps within a streaming multiprocessor) -> shared memory (resides within the streaming multiprocessor, the core processing unit of the gpu; lower latency and higher bandwidth than global memory) -> L2 cache (on chip but slower than l1 cache and shared memory; still faster than global memory) -> device memory / dram (largest memory space, but located off the chip and connected via a memory bus)
on chip means things that reside on the same silicon die as the processor's core logic. off chip means things that are not located on the same die, but connected via memory bus or external interfaces.
cooperative loading is when the threads of a block load data into shared memory together, ie., each thread within the block is responsible for loading some part of the data.
synchronization: all threads in the block must finish loading data before any thread starts reading from there. __syncthreads() will make sure all threads in the block will wait till all threads reach this call.
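a sketch of cooperative loading plus the barrier, assuming a hypothetical kernel that doubles each element through a shared-memory tile; the names TILE and scale are illustrative:

```cuda
#define TILE 256

__global__ void scale(const float *in, float *out, int n) {
    __shared__ float tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // cooperative loading: each thread copies one element into shared memory
    if (i < n) tile[threadIdx.x] = in[i];

    // barrier: no thread reads the tile until every thread has finished loading
    __syncthreads();

    if (i < n) out[i] = 2.0f * tile[threadIdx.x];
}
```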
cudaMalloc() uses void** because it is supposed to be a generic memory allocation function, designed to allocate memory for any data type: not just float but also double, structs etc.
since d_xxx has a type like float* or int*, its address is cast to void**, hence the syntax (void**)&d_xxx.
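a short sketch of that cast with two different element types; the pointer names are made up:

```cuda
#include <cuda_runtime.h>

int main() {
    float *d_x;   // device pointers, declared with their real types
    double *d_y;

    // cudaMalloc writes the allocated device address through a void**,
    // so we pass the address of each typed pointer, cast to void**
    cudaMalloc((void**)&d_x, 256 * sizeof(float));
    cudaMalloc((void**)&d_y, 256 * sizeof(double));

    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```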
cuda provides events that allow us to precisely measure time on the gpu. create cuda events -> record start and stop events -> synchronize to make sure the gpu actually reached the stop event -> calculate the difference between the stop and start times.
cudaEventSynchronize(stop) makes sure the gpu has actually reached the stop event; only after this should the difference be calculated.
sometimes, for small programs, cpu and gpu might consume the same time. this could be because of gpu kernel launch overhead, cpu cache effects working in favor of the cpu method, and the noise in micro benchmarking in general.
time on the gpu can be measured by creating start and stop cudaEvent_t variables. use cudaEventCreate(&start) and cudaEventCreate(&stop) to initialize them, then cudaEventRecord(start, 0) to record the start. after the kernel launch, call cudaEventRecord(stop, 0) to record a stop event and cudaEventSynchronize(stop) to ensure the gpu has reached that point. then cudaEventElapsedTime(&time_gpu, start, stop) stores the elapsed milliseconds in the time_gpu variable.
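the full sequence as a sketch, around a hypothetical doubling kernel:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    int n = 1 << 20;
    float *d_x;
    cudaMalloc((void**)&d_x, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);              // mark the start on the gpu timeline
    kernel<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaEventRecord(stop, 0);               // mark the stop after the launch
    cudaEventSynchronize(stop);             // wait till the gpu reaches the stop event

    float time_gpu = 0.0f;
    cudaEventElapsedTime(&time_gpu, start, stop);
    printf("kernel took %f ms\n", time_gpu);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
```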
time on the cpu can be measured using clock() from time.h. create start and stop variables, call clock() and store the results in them. calculate time using ((double)(stop - start) / CLOCKS_PER_SEC) * 1000 to get milliseconds (cast before dividing, otherwise the integer division truncates).
the areas where processing happens in a gpu are called streaming multiprocessors (SMs). threads within an SM are grouped into warps (typically 32 threads on nvidia gpus). hence it is recommended to keep threadsPerBlock a multiple of 32.
inside a warp, the threads execute in lockstep, following a single instruction, multiple data style.
occupancy is a measure of how occupied each SM is with its threads. too few threads: might not fully utilize the parallel processing power of the gpu. too many threads per block: can reduce the number of blocks that can reside inside an SM, potentially limiting parallelism.
Basic Linear Algebra Subprograms (BLAS) is a standard for publishing libraries that perform basic linear algebra operations. there are three levels in it: the first level works with vector-vector operations, the second with matrix-vector operations and the third with matrix-matrix operations.
BLAS functions are used as building blocks for higher level algebraic functions.