Start 2 blocks with 3 threads each. Each thread prints its block and thread index.
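
A minimal sketch of such a program (the kernel name hello is illustrative):

    #include <cstdio>

    __global__ void hello() {
        // blockIdx and threadIdx identify this thread within the grid
        printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
    }

    int main() {
        hello<<<2, 3>>>();        // 2 blocks, 3 threads each
        cudaDeviceSynchronize();  // wait for the kernel so its output is flushed
        return 0;
    }

The six output lines may appear in any order, since block scheduling is not deterministic.
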
Host and device use a common virtual address space. The host allocates memory, CUDA threads write it, and the host reads it back. Memory is migrated transparently between host and device by the runtime system. To improve efficiency, memory can be prefetched with cudaMemPrefetchAsync.
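
A minimal sketch of the managed-memory pattern (sizes and names are illustrative):

    #include <cstdio>

    __global__ void increment(int *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1;
    }

    int main() {
        const int n = 1 << 20;
        int *data;
        cudaMallocManaged(&data, n * sizeof(int));  // one pointer, valid on host and device
        for (int i = 0; i < n; ++i) data[i] = i;    // host writes

        int device;
        cudaGetDevice(&device);
        cudaMemPrefetchAsync(data, n * sizeof(int), device);  // optional: migrate pages to the GPU up front

        increment<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();              // host must not read before the kernel has finished

        printf("data[42] = %d\n", data[42]);  // host reads; pages migrate back on demand
        cudaFree(data);
        return 0;
    }
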
Allocate separate memory on host and device. The host writes the memory and copies it to the device. A kernel on the device alters the memory. The host copies the device memory back.
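
A sketch of the explicit-copy pattern (the negate kernel is illustrative):

    #include <cstdio>

    __global__ void negate(float *v, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] = -v[i];
    }

    int main() {
        const int n = 256;
        float host[n];
        for (int i = 0; i < n; ++i) host[i] = float(i);

        float *dev;
        cudaMalloc(&dev, n * sizeof(float));
        cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
        negate<<<1, n>>>(dev, n);
        cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);  // also synchronizes
        printf("host[3] = %f\n", host[3]);  // -3
        cudaFree(dev);
        return 0;
    }
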
In order to check CUDA function calls for errors, it is convenient to enclose them in a macro.
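
One possible form of such a macro (the name CHECK is a placeholder):

    #include <cstdio>
    #include <cstdlib>

    #define CHECK(call)                                              \
        do {                                                         \
            cudaError_t err = (call);                                \
            if (err != cudaSuccess) {                                \
                fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                        cudaGetErrorString(err));                    \
                exit(EXIT_FAILURE);                                  \
            }                                                        \
        } while (0)

    int main() {
        float *p;
        CHECK(cudaMalloc(&p, 1 << 20));
        CHECK(cudaFree(p));
        return 0;
    }

Because it is a macro, it records the file and line of the failing call.
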
Discrete convolution of a signal f and an impulse response g. To avoid case distinctions at the boundaries, f is assumed to be zero-padded at the beginning and at the end. To obtain identical results on the GPU and the CPU, including rounding, the compiler has to be instructed to use fused multiply-add floating-point operations on the CPU as well. This is achieved by compiling with nvcc -Xcompiler -mfma convolution.cu
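
A sketch of the idea (names and sizes are illustrative). The signal is stored as fpad with NG-1 zeros on each side so the kernel needs no boundary checks; nvcc contracts the multiply-add into an FMA by default, and -Xcompiler -mfma typically lets the host compiler do the same for the CPU reference:

    #include <cstdio>

    const int NF = 8, NG = 3;            // signal and impulse response lengths
    const int NOUT = NF + NG - 1;        // full convolution length
    const int NPAD = NF + 2 * (NG - 1);  // f stored with NG-1 zeros on each side

    __global__ void convolve(const float *fpad, const float *g, float *out) {
        int n = blockIdx.x * blockDim.x + threadIdx.x;
        if (n >= NOUT) return;
        float acc = 0.0f;
        for (int k = 0; k < NG; ++k)
            acc += fpad[n + NG - 1 - k] * g[k];  // contracted to an FMA by nvcc
        out[n] = acc;
    }

    int main() {
        float fpad[NPAD] = {0}, g[NG] = {1.0f, 0.5f, 0.25f}, ref[NOUT], out[NOUT];
        for (int i = 0; i < NF; ++i) fpad[NG - 1 + i] = float(i + 1);

        for (int n = 0; n < NOUT; ++n) {  // CPU reference; -mfma allows the same contraction
            float acc = 0.0f;
            for (int k = 0; k < NG; ++k) acc += fpad[n + NG - 1 - k] * g[k];
            ref[n] = acc;
        }

        float *df, *dg, *dout;
        cudaMalloc(&df, sizeof fpad); cudaMalloc(&dg, sizeof g); cudaMalloc(&dout, sizeof out);
        cudaMemcpy(df, fpad, sizeof fpad, cudaMemcpyHostToDevice);
        cudaMemcpy(dg, g, sizeof g, cudaMemcpyHostToDevice);
        convolve<<<1, NOUT>>>(df, dg, dout);
        cudaMemcpy(out, dout, sizeof out, cudaMemcpyDeviceToHost);

        for (int n = 0; n < NOUT; ++n)
            printf("%2d: gpu %g cpu %g%s\n", n, out[n], ref[n],
                   out[n] == ref[n] ? "" : "  MISMATCH");
        return 0;
    }
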
Demonstrate that data in global memory is stored persistently between two kernel invocations. First, the threads of a grid write their thread id to global memory. Then a new grid is started in which the threads verify that the data in global memory is still valid.
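
A sketch with two successive grids (kernel names are illustrative):

    #include <cstdio>

    __global__ void write_ids(int *buf) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buf[i] = i;  // first grid writes its thread ids
    }

    __global__ void check_ids(const int *buf, int *errors) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (buf[i] != i) atomicAdd(errors, 1);  // second grid verifies them
    }

    int main() {
        const int blocks = 4, threads = 256, n = blocks * threads;
        int *buf, *errors, zero = 0, result;
        cudaMalloc(&buf, n * sizeof(int));
        cudaMalloc(&errors, sizeof(int));
        cudaMemcpy(errors, &zero, sizeof(int), cudaMemcpyHostToDevice);

        write_ids<<<blocks, threads>>>(buf);
        check_ids<<<blocks, threads>>>(buf, errors);  // a new grid reads the same global memory

        cudaMemcpy(&result, errors, sizeof(int), cudaMemcpyDeviceToHost);
        printf("%d errors\n", result);  // expect 0: global memory persists between launches
        cudaFree(buf); cudaFree(errors);
        return 0;
    }
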
If a local variable is declared with __shared__ in a kernel, it is shared among all threads in a block running the kernel.
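
A sketch using a statically sized shared array for a block-wise sum (names and sizes are illustrative):

    #include <cstdio>

    __global__ void block_sum(const int *in, int *out) {
        __shared__ int partial[256];  // one copy per block, shared by all its threads
        int t = threadIdx.x;
        partial[t] = in[blockIdx.x * blockDim.x + t];
        __syncthreads();              // all writes must be visible before anyone reads

        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (t < stride) partial[t] += partial[t + stride];
            __syncthreads();
        }
        if (t == 0) out[blockIdx.x] = partial[0];
    }

    int main() {
        const int blocks = 2, threads = 256, n = blocks * threads;
        int host[n], sums[blocks];
        for (int i = 0; i < n; ++i) host[i] = 1;
        int *din, *dout;
        cudaMalloc(&din, sizeof host);
        cudaMalloc(&dout, sizeof sums);
        cudaMemcpy(din, host, sizeof host, cudaMemcpyHostToDevice);
        block_sum<<<blocks, threads>>>(din, dout);
        cudaMemcpy(sums, dout, sizeof sums, cudaMemcpyDeviceToHost);
        printf("%d %d\n", sums[0], sums[1]);  // 256 256
        cudaFree(din); cudaFree(dout);
        return 0;
    }
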
Parallel median filter using dynamic shared memory to reduce slow global memory accesses.
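
A sketch of a 1D median-of-three variant; the shared-memory size is passed as the third launch parameter and picked up by the extern __shared__ declaration (names are illustrative):

    #include <cstdio>

    __device__ float median3(float a, float b, float c) {
        return fmaxf(fminf(a, b), fminf(fmaxf(a, b), c));
    }

    // each block loads its tile plus one halo element on each side into shared memory
    __global__ void median_filter(const float *in, float *out, int n) {
        extern __shared__ float tile[];  // size blockDim.x + 2, set at launch
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        int lid = threadIdx.x + 1;

        tile[lid] = (gid < n) ? in[gid] : 0.0f;
        if (threadIdx.x == 0)
            tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
        if (threadIdx.x == blockDim.x - 1)
            tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
        __syncthreads();

        if (gid < n)  // three reads from fast shared memory instead of global memory
            out[gid] = median3(tile[lid - 1], tile[lid], tile[lid + 1]);
    }

    int main() {
        const int n = 1024, threads = 256;
        float host[n], result[n];
        for (int i = 0; i < n; ++i) host[i] = (i % 7 == 0) ? 100.0f : 1.0f;  // impulsive noise
        float *din, *dout;
        cudaMalloc(&din, sizeof host);
        cudaMalloc(&dout, sizeof host);
        cudaMemcpy(din, host, sizeof host, cudaMemcpyHostToDevice);
        size_t shmem = (threads + 2) * sizeof(float);  // dynamic shared memory per block
        median_filter<<<n / threads, threads, shmem>>>(din, dout, n);
        cudaMemcpy(result, dout, sizeof host, cudaMemcpyDeviceToHost);
        printf("out[7] = %f\n", result[7]);  // spike replaced by its neighbors' value
        cudaFree(din); cudaFree(dout);
        return 0;
    }
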
Two threads in the same warp write a shared variable at exactly the same time. One of the writes succeeds, but which one is undefined.
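
A minimal sketch of the race (names are illustrative):

    #include <cstdio>

    __global__ void race(int *result) {
        __shared__ int s;
        s = threadIdx.x;  // all 32 threads of the warp write s simultaneously
        __syncthreads();
        if (threadIdx.x == 0) *result = s;  // which write won is unspecified
    }

    int main() {
        int *d, h;
        cudaMalloc(&d, sizeof(int));
        race<<<1, 32>>>(d);
        cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
        printf("s = %d\n", h);  // some value in [0, 32); which one is unspecified
        cudaFree(d);
        return 0;
    }
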
The kernel has so many local variables that not all of them can be stored in registers. This leads to register spilling. Compile with nvcc -Xptxas -v register.cu to obtain the register usage. Compile with nvcc -ptx register.cu to check in the PTX code that the loop is unrolled and many registers are used.
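
A sketch of a kernel that provokes spilling; whether and how much it spills depends on the architecture and compiler version, and -Xptxas -v reports it as spill stores and spill loads (names are illustrative):

    #include <cstdio>

    __global__ void spill(const float *in, float *out) {
        float v[256];                  // more live values than the 255-registers-per-thread limit
        #pragma unroll
        for (int i = 0; i < 256; ++i)  // fully unrolled, so every v[i] is a separate live value
            v[i] = in[threadIdx.x + i];
        float acc = 0.0f;
        #pragma unroll
        for (int i = 0; i < 256; ++i)
            acc += v[i] * v[255 - i];
        out[threadIdx.x] = acc;
    }

    int main() {
        float *din, *dout;
        cudaMalloc(&din, 1024 * sizeof(float));
        cudaMalloc(&dout, 256 * sizeof(float));
        spill<<<1, 256>>>(din, dout);
        cudaDeviceSynchronize();
        cudaFree(din); cudaFree(dout);
        return 0;
    }
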
Kernels running in parallel streams.
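
A minimal sketch with two streams (the busy kernel just burns time):

    #include <cstdio>

    __global__ void busy(float *x) {
        float v = x[threadIdx.x];
        for (int i = 0; i < (1 << 20); ++i) v = v * 1.0000001f + 1e-7f;
        x[threadIdx.x] = v;
    }

    int main() {
        float *a, *b;
        cudaMalloc(&a, 256 * sizeof(float));
        cudaMalloc(&b, 256 * sizeof(float));
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        busy<<<1, 256, 0, s1>>>(a);  // the two launches are independent and
        busy<<<1, 256, 0, s2>>>(b);  // may execute concurrently on the GPU

        cudaDeviceSynchronize();
        cudaStreamDestroy(s1); cudaStreamDestroy(s2);
        cudaFree(a); cudaFree(b);
        return 0;
    }
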
Measuring the time overlap of two parallel streams.
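
A sketch using CUDA events: each stream records a start and a stop event around its kernel, and the wall time is taken from the first start to the last stop (assuming stream 2 finishes last). If the two per-stream times add up to more than the wall time, the streams overlapped. Names are illustrative:

    #include <cstdio>

    __global__ void busy(float *x) {
        float v = x[threadIdx.x];
        for (int i = 0; i < (1 << 20); ++i) v = v * 1.0000001f + 1e-7f;
        x[threadIdx.x] = v;
    }

    int main() {
        float *a, *b;
        cudaMalloc(&a, 256 * sizeof(float));
        cudaMalloc(&b, 256 * sizeof(float));
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1); cudaStreamCreate(&s2);
        cudaEvent_t start1, stop1, start2, stop2;
        cudaEventCreate(&start1); cudaEventCreate(&stop1);
        cudaEventCreate(&start2); cudaEventCreate(&stop2);

        cudaEventRecord(start1, s1);
        busy<<<1, 256, 0, s1>>>(a);
        cudaEventRecord(stop1, s1);

        cudaEventRecord(start2, s2);
        busy<<<1, 256, 0, s2>>>(b);
        cudaEventRecord(stop2, s2);

        cudaEventSynchronize(stop1);
        cudaEventSynchronize(stop2);

        float t1, t2, wall;
        cudaEventElapsedTime(&t1, start1, stop1);
        cudaEventElapsedTime(&t2, start2, stop2);
        cudaEventElapsedTime(&wall, start1, stop2);  // first start to last stop
        printf("stream 1: %.3f ms, stream 2: %.3f ms, wall: %.3f ms\n", t1, t2, wall);
        printf("overlap approx %.3f ms\n", t1 + t2 - wall);  // > 0: the streams ran concurrently
        return 0;
    }
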
Two parallel streams transfer memory to and from the host asynchronously.
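
A sketch of the pattern; the copies are truly asynchronous only with page-locked host buffers, hence cudaHostAlloc instead of malloc (sizes and names are illustrative):

    #include <cstdio>

    int main() {
        const int n = 1 << 20;
        float *h1, *h2, *d1, *d2;
        cudaHostAlloc(&h1, n * sizeof(float), cudaHostAllocDefault);  // pinned host memory
        cudaHostAlloc(&h2, n * sizeof(float), cudaHostAllocDefault);
        cudaMalloc(&d1, n * sizeof(float));
        cudaMalloc(&d2, n * sizeof(float));
        for (int i = 0; i < n; ++i) { h1[i] = 1.0f; h2[i] = 2.0f; }

        cudaStream_t s1, s2;
        cudaStreamCreate(&s1); cudaStreamCreate(&s2);
        cudaMemcpyAsync(d1, h1, n * sizeof(float), cudaMemcpyHostToDevice, s1);
        cudaMemcpyAsync(d2, h2, n * sizeof(float), cudaMemcpyHostToDevice, s2);
        // kernels working on d1 and d2 could be launched here in s1 and s2
        cudaMemcpyAsync(h1, d1, n * sizeof(float), cudaMemcpyDeviceToHost, s1);
        cudaMemcpyAsync(h2, d2, n * sizeof(float), cudaMemcpyDeviceToHost, s2);
        cudaDeviceSynchronize();  // wait for both streams

        cudaFreeHost(h1); cudaFreeHost(h2);
        cudaFree(d1); cudaFree(d2);
        cudaStreamDestroy(s1); cudaStreamDestroy(s2);
        return 0;
    }
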
Multiplying two matrices using tensor cores. For simplicity, the matrices are not subdivided into tiles. The compiler needs to know the target architecture. Compile e.g. with nvcc -arch=sm_89 matrix.cu
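
A sketch of a single 16x16x16 product with the WMMA API; one warp performs the whole multiplication (names are illustrative):

    #include <cstdio>
    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // one warp multiplies a 16x16 half tile pair into a float accumulator
    __global__ void matmul(const half *a, const half *b, float *c) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> fb;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;
        wmma::fill_fragment(fc, 0.0f);
        wmma::load_matrix_sync(fa, a, 16);  // leading dimension 16
        wmma::load_matrix_sync(fb, b, 16);
        wmma::mma_sync(fc, fa, fb, fc);     // executed on the tensor cores
        wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
    }

    int main() {
        half ha[256], hb[256];
        float hc[256];
        for (int i = 0; i < 256; ++i) {
            ha[i] = __float2half(1.0f);
            hb[i] = __float2half(1.0f);
        }
        half *da, *db;
        float *dc;
        cudaMalloc(&da, sizeof ha);
        cudaMalloc(&db, sizeof hb);
        cudaMalloc(&dc, sizeof hc);
        cudaMemcpy(da, ha, sizeof ha, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, sizeof hb, cudaMemcpyHostToDevice);
        matmul<<<1, 32>>>(da, db, dc);  // exactly one warp
        cudaMemcpy(hc, dc, sizeof hc, cudaMemcpyDeviceToHost);
        printf("c[0] = %f\n", hc[0]);   // 16.0: dot product of 16 ones
        return 0;
    }
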