Start 2 blocks with 3 threads each. Each thread prints its block and thread index.
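
A minimal sketch of such a program (the kernel name hello is illustrative):

    #include <cstdio>

    __global__ void hello() {
        // blockIdx and threadIdx identify this thread within the grid
        printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
    }

    int main() {
        hello<<<2, 3>>>();        // 2 blocks, 3 threads each
        cudaDeviceSynchronize();  // wait for the kernel so its output is flushed
        return 0;
    }

The six output lines may appear in any order, since block scheduling is not deterministic.
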
Host and device use a common virtual address space. The host allocates memory, CUDA threads write it, and the host reads it back. Memory is migrated transparently between host and device by the runtime system. To improve efficiency, memory can be prefetched with cudaMemPrefetchAsync.
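
A minimal sketch of the managed-memory pattern (sizes and names are illustrative):

    #include <cstdio>

    __global__ void increment(int *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1;
    }

    int main() {
        const int n = 1 << 20;
        int *data;
        cudaMallocManaged(&data, n * sizeof(int));  // one pointer, valid on host and device
        for (int i = 0; i < n; ++i) data[i] = i;    // host writes

        int device;
        cudaGetDevice(&device);
        cudaMemPrefetchAsync(data, n * sizeof(int), device);  // optional: migrate pages to the GPU up front

        increment<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();              // host must not read before the kernel has finished

        printf("data[42] = %d\n", data[42]);  // host reads; pages migrate back on demand
        cudaFree(data);
        return 0;
    }
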
Allocate separate memory on host and device. The host writes the memory and copies it to the device. A kernel on the device alters the memory. The host copies the device memory back.
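
A sketch of the explicit-copy pattern (the negate kernel is illustrative):

    #include <cstdio>

    __global__ void negate(float *v, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] = -v[i];
    }

    int main() {
        const int n = 256;
        float host[n];
        for (int i = 0; i < n; ++i) host[i] = float(i);

        float *dev;
        cudaMalloc(&dev, n * sizeof(float));
        cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
        negate<<<1, n>>>(dev, n);
        cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);  // also synchronizes
        printf("host[3] = %f\n", host[3]);  // -3
        cudaFree(dev);
        return 0;
    }
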
In order to check CUDA function calls for errors, it is convenient to enclose them in a macro.
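
One possible form of such a macro (the name CHECK is a placeholder):

    #include <cstdio>
    #include <cstdlib>

    #define CHECK(call)                                              \
        do {                                                         \
            cudaError_t err = (call);                                \
            if (err != cudaSuccess) {                                \
                fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                        cudaGetErrorString(err));                    \
                exit(EXIT_FAILURE);                                  \
            }                                                        \
        } while (0)

    int main() {
        float *p;
        CHECK(cudaMalloc(&p, 1 << 20));
        CHECK(cudaFree(p));
        return 0;
    }

Because it is a macro, it records the file and line of the failing call.
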
Discrete convolution of a signal f and an impulse response g. To avoid case distinctions at the boundaries, f is assumed to be zero-padded at the beginning and at the end. To obtain identical results on the GPU and the CPU, including rounding, the compiler has to be instructed to use fused multiply-add floating-point operations on the CPU as well. This is achieved by compiling with nvcc -Xcompiler -mfma convolution.cu
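
A sketch of the idea (names and sizes are illustrative). The signal is stored as fpad with NG-1 zeros on each side so the kernel needs no boundary checks; nvcc contracts the multiply-add into an FMA by default, and -Xcompiler -mfma typically lets the host compiler do the same for the CPU reference:

    #include <cstdio>

    const int NF = 8, NG = 3;            // signal and impulse response lengths
    const int NOUT = NF + NG - 1;        // full convolution length
    const int NPAD = NF + 2 * (NG - 1);  // f stored with NG-1 zeros on each side

    __global__ void convolve(const float *fpad, const float *g, float *out) {
        int n = blockIdx.x * blockDim.x + threadIdx.x;
        if (n >= NOUT) return;
        float acc = 0.0f;
        for (int k = 0; k < NG; ++k)
            acc += fpad[n + NG - 1 - k] * g[k];  // contracted to an FMA by nvcc
        out[n] = acc;
    }

    int main() {
        float fpad[NPAD] = {0}, g[NG] = {1.0f, 0.5f, 0.25f}, ref[NOUT], out[NOUT];
        for (int i = 0; i < NF; ++i) fpad[NG - 1 + i] = float(i + 1);

        for (int n = 0; n < NOUT; ++n) {  // CPU reference; -mfma allows the same contraction
            float acc = 0.0f;
            for (int k = 0; k < NG; ++k) acc += fpad[n + NG - 1 - k] * g[k];
            ref[n] = acc;
        }

        float *df, *dg, *dout;
        cudaMalloc(&df, sizeof fpad); cudaMalloc(&dg, sizeof g); cudaMalloc(&dout, sizeof out);
        cudaMemcpy(df, fpad, sizeof fpad, cudaMemcpyHostToDevice);
        cudaMemcpy(dg, g, sizeof g, cudaMemcpyHostToDevice);
        convolve<<<1, NOUT>>>(df, dg, dout);
        cudaMemcpy(out, dout, sizeof out, cudaMemcpyDeviceToHost);

        for (int n = 0; n < NOUT; ++n)
            printf("%2d: gpu %g cpu %g%s\n", n, out[n], ref[n],
                   out[n] == ref[n] ? "" : "  MISMATCH");
        return 0;
    }
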
Demonstrate that data in global memory is stored persistently between two kernel invocations. First, the threads of a grid write their thread id to global memory. Then a new grid is started in which the threads verify that the data in global memory is still valid.
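
A sketch with two successive grids (kernel names are illustrative):

    #include <cstdio>

    __global__ void write_ids(int *buf) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buf[i] = i;  // first grid writes its thread ids
    }

    __global__ void check_ids(const int *buf, int *errors) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (buf[i] != i) atomicAdd(errors, 1);  // second grid verifies them
    }

    int main() {
        const int blocks = 4, threads = 256, n = blocks * threads;
        int *buf, *errors, zero = 0, result;
        cudaMalloc(&buf, n * sizeof(int));
        cudaMalloc(&errors, sizeof(int));
        cudaMemcpy(errors, &zero, sizeof(int), cudaMemcpyHostToDevice);

        write_ids<<<blocks, threads>>>(buf);
        check_ids<<<blocks, threads>>>(buf, errors);  // a new grid reads the same global memory

        cudaMemcpy(&result, errors, sizeof(int), cudaMemcpyDeviceToHost);
        printf("%d errors\n", result);  // expect 0: global memory persists between launches
        cudaFree(buf); cudaFree(errors);
        return 0;
    }
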
If a local variable is declared with __shared__ in a kernel, it is shared among all threads in a block running the kernel.
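
A sketch using a statically sized shared array for a block-wise sum (names and sizes are illustrative):

    #include <cstdio>

    __global__ void block_sum(const int *in, int *out) {
        __shared__ int partial[256];  // one copy per block, shared by all its threads
        int t = threadIdx.x;
        partial[t] = in[blockIdx.x * blockDim.x + t];
        __syncthreads();              // all writes must be visible before anyone reads

        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (t < stride) partial[t] += partial[t + stride];
            __syncthreads();
        }
        if (t == 0) out[blockIdx.x] = partial[0];
    }

    int main() {
        const int blocks = 2, threads = 256, n = blocks * threads;
        int host[n], sums[blocks];
        for (int i = 0; i < n; ++i) host[i] = 1;
        int *din, *dout;
        cudaMalloc(&din, sizeof host);
        cudaMalloc(&dout, sizeof sums);
        cudaMemcpy(din, host, sizeof host, cudaMemcpyHostToDevice);
        block_sum<<<blocks, threads>>>(din, dout);
        cudaMemcpy(sums, dout, sizeof sums, cudaMemcpyDeviceToHost);
        printf("%d %d\n", sums[0], sums[1]);  // 256 256
        cudaFree(din); cudaFree(dout);
        return 0;
    }
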
Parallel median filter using dynamic shared memory to reduce slow global memory accesses.
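
A sketch of a 1D median-of-three variant; the shared-memory size is passed as the third launch parameter and picked up by the extern __shared__ declaration (names are illustrative):

    #include <cstdio>

    __device__ float median3(float a, float b, float c) {
        return fmaxf(fminf(a, b), fminf(fmaxf(a, b), c));
    }

    // each block loads its tile plus one halo element on each side into shared memory
    __global__ void median_filter(const float *in, float *out, int n) {
        extern __shared__ float tile[];  // size blockDim.x + 2, set at launch
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        int lid = threadIdx.x + 1;

        tile[lid] = (gid < n) ? in[gid] : 0.0f;
        if (threadIdx.x == 0)
            tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
        if (threadIdx.x == blockDim.x - 1)
            tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
        __syncthreads();

        if (gid < n)  // three reads from fast shared memory instead of global memory
            out[gid] = median3(tile[lid - 1], tile[lid], tile[lid + 1]);
    }

    int main() {
        const int n = 1024, threads = 256;
        float host[n], result[n];
        for (int i = 0; i < n; ++i) host[i] = (i % 7 == 0) ? 100.0f : 1.0f;  // impulsive noise
        float *din, *dout;
        cudaMalloc(&din, sizeof host);
        cudaMalloc(&dout, sizeof host);
        cudaMemcpy(din, host, sizeof host, cudaMemcpyHostToDevice);
        size_t shmem = (threads + 2) * sizeof(float);  // dynamic shared memory per block
        median_filter<<<n / threads, threads, shmem>>>(din, dout, n);
        cudaMemcpy(result, dout, sizeof host, cudaMemcpyDeviceToHost);
        printf("out[7] = %f\n", result[7]);  // spike replaced by its neighbors' value
        cudaFree(din); cudaFree(dout);
        return 0;
    }
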
Two threads in the same warp write a shared variable at exactly the same time. One of the writes succeeds, but which one is undefined.
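
A minimal sketch of the race (names are illustrative):

    #include <cstdio>

    __global__ void race(int *result) {
        __shared__ int s;
        s = threadIdx.x;  // all 32 threads of the warp write s simultaneously
        __syncthreads();
        if (threadIdx.x == 0) *result = s;  // which write won is unspecified
    }

    int main() {
        int *d, h;
        cudaMalloc(&d, sizeof(int));
        race<<<1, 32>>>(d);
        cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
        printf("s = %d\n", h);  // some value in [0, 32); which one is unspecified
        cudaFree(d);
        return 0;
    }
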
The kernel has so many local variables that not all of them can be stored in registers. This leads to register spilling. Compile with nvcc -Xptxas -v register.cu to obtain the register usage. Compile with nvcc -ptx register.cu to check in the PTX code that the loop is unrolled and many registers are used.
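
A sketch of a kernel that provokes spilling; whether and how much it spills depends on the architecture and compiler version, and -Xptxas -v reports it as spill stores and spill loads (names are illustrative):

    #include <cstdio>

    __global__ void spill(const float *in, float *out) {
        float v[256];                  // more live values than the 255-registers-per-thread limit
        #pragma unroll
        for (int i = 0; i < 256; ++i)  // fully unrolled, so every v[i] is a separate live value
            v[i] = in[threadIdx.x + i];
        float acc = 0.0f;
        #pragma unroll
        for (int i = 0; i < 256; ++i)
            acc += v[i] * v[255 - i];
        out[threadIdx.x] = acc;
    }

    int main() {
        float *din, *dout;
        cudaMalloc(&din, 1024 * sizeof(float));
        cudaMalloc(&dout, 256 * sizeof(float));
        spill<<<1, 256>>>(din, dout);
        cudaDeviceSynchronize();
        cudaFree(din); cudaFree(dout);
        return 0;
    }
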
Kernels running in parallel streams.
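
A minimal sketch with two streams (the busy kernel just burns time):

    #include <cstdio>

    __global__ void busy(float *x) {
        float v = x[threadIdx.x];
        for (int i = 0; i < (1 << 20); ++i) v = v * 1.0000001f + 1e-7f;
        x[threadIdx.x] = v;
    }

    int main() {
        float *a, *b;
        cudaMalloc(&a, 256 * sizeof(float));
        cudaMalloc(&b, 256 * sizeof(float));
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        busy<<<1, 256, 0, s1>>>(a);  // the two launches are independent and
        busy<<<1, 256, 0, s2>>>(b);  // may execute concurrently on the GPU

        cudaDeviceSynchronize();
        cudaStreamDestroy(s1); cudaStreamDestroy(s2);
        cudaFree(a); cudaFree(b);
        return 0;
    }
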
Measuring the time overlap of two parallel streams.
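
A sketch using CUDA events: each stream records a start and a stop event around its kernel, and the wall time is taken from the first start to the last stop (assuming stream 2 finishes last). If the two per-stream times add up to more than the wall time, the streams overlapped. Names are illustrative:

    #include <cstdio>

    __global__ void busy(float *x) {
        float v = x[threadIdx.x];
        for (int i = 0; i < (1 << 20); ++i) v = v * 1.0000001f + 1e-7f;
        x[threadIdx.x] = v;
    }

    int main() {
        float *a, *b;
        cudaMalloc(&a, 256 * sizeof(float));
        cudaMalloc(&b, 256 * sizeof(float));
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1); cudaStreamCreate(&s2);
        cudaEvent_t start1, stop1, start2, stop2;
        cudaEventCreate(&start1); cudaEventCreate(&stop1);
        cudaEventCreate(&start2); cudaEventCreate(&stop2);

        cudaEventRecord(start1, s1);
        busy<<<1, 256, 0, s1>>>(a);
        cudaEventRecord(stop1, s1);

        cudaEventRecord(start2, s2);
        busy<<<1, 256, 0, s2>>>(b);
        cudaEventRecord(stop2, s2);

        cudaEventSynchronize(stop1);
        cudaEventSynchronize(stop2);

        float t1, t2, wall;
        cudaEventElapsedTime(&t1, start1, stop1);
        cudaEventElapsedTime(&t2, start2, stop2);
        cudaEventElapsedTime(&wall, start1, stop2);  // first start to last stop
        printf("stream 1: %.3f ms, stream 2: %.3f ms, wall: %.3f ms\n", t1, t2, wall);
        printf("overlap approx %.3f ms\n", t1 + t2 - wall);  // > 0: the streams ran concurrently
        return 0;
    }
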
Two parallel streams transfer memory to and from the host asynchronously.
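
A sketch of the pattern; the copies are truly asynchronous only with page-locked host buffers, hence cudaHostAlloc instead of malloc (sizes and names are illustrative):

    #include <cstdio>

    int main() {
        const int n = 1 << 20;
        float *h1, *h2, *d1, *d2;
        cudaHostAlloc(&h1, n * sizeof(float), cudaHostAllocDefault);  // pinned host memory
        cudaHostAlloc(&h2, n * sizeof(float), cudaHostAllocDefault);
        cudaMalloc(&d1, n * sizeof(float));
        cudaMalloc(&d2, n * sizeof(float));
        for (int i = 0; i < n; ++i) { h1[i] = 1.0f; h2[i] = 2.0f; }

        cudaStream_t s1, s2;
        cudaStreamCreate(&s1); cudaStreamCreate(&s2);
        cudaMemcpyAsync(d1, h1, n * sizeof(float), cudaMemcpyHostToDevice, s1);
        cudaMemcpyAsync(d2, h2, n * sizeof(float), cudaMemcpyHostToDevice, s2);
        // kernels working on d1 and d2 could be launched here in s1 and s2
        cudaMemcpyAsync(h1, d1, n * sizeof(float), cudaMemcpyDeviceToHost, s1);
        cudaMemcpyAsync(h2, d2, n * sizeof(float), cudaMemcpyDeviceToHost, s2);
        cudaDeviceSynchronize();  // wait for both streams

        cudaFreeHost(h1); cudaFreeHost(h2);
        cudaFree(d1); cudaFree(d2);
        cudaStreamDestroy(s1); cudaStreamDestroy(s2);
        return 0;
    }
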
Multiplying two matrices using tensor cores. For simplicity, the matrices are not subdivided into tiles. The compiler needs to know the target architecture. Compile e.g. with nvcc -arch=sm_89 matrix.cu
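
A sketch of a single 16x16x16 product with the WMMA API; one warp performs the whole multiplication (names are illustrative):

    #include <cstdio>
    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // one warp multiplies a 16x16 half tile pair into a float accumulator
    __global__ void matmul(const half *a, const half *b, float *c) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> fb;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;
        wmma::fill_fragment(fc, 0.0f);
        wmma::load_matrix_sync(fa, a, 16);  // leading dimension 16
        wmma::load_matrix_sync(fb, b, 16);
        wmma::mma_sync(fc, fa, fb, fc);     // executed on the tensor cores
        wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
    }

    int main() {
        half ha[256], hb[256];
        float hc[256];
        for (int i = 0; i < 256; ++i) {
            ha[i] = __float2half(1.0f);
            hb[i] = __float2half(1.0f);
        }
        half *da, *db;
        float *dc;
        cudaMalloc(&da, sizeof ha);
        cudaMalloc(&db, sizeof hb);
        cudaMalloc(&dc, sizeof hc);
        cudaMemcpy(da, ha, sizeof ha, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, sizeof hb, cudaMemcpyHostToDevice);
        matmul<<<1, 32>>>(da, db, dc);  // exactly one warp
        cudaMemcpy(hc, dc, sizeof hc, cudaMemcpyDeviceToHost);
        printf("c[0] = %f\n", hc[0]);   // 16.0: dot product of 16 ones
        return 0;
    }
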