A detailed introduction to CUDA can be found in the lecture notes for this course.
To use CUDA under Linux, you have to install Nvidia's own (proprietary) driver. As this is a risky process, it is recommended to install a separate Linux system on a (fast) USB stick and experiment there. I am using Debian 13 with KDE Plasma (X11). A detailed description of how to install the Nvidia driver can be found in the Debian Wiki. Once the driver is installed, the command sudo apt install nvidia-cuda-dev nvidia-cuda-toolkit installs the CUDA toolkit. In case of problems, ChatGPT is of great help.
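To check that the driver and the toolkit work together, a small test program can help. The following is only a sketch: the file name hello.cu, the kernel fill_indices and the helper gpu_works are names I made up for illustration. It should compile with nvcc hello.cu -o hello.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread writes its global index into the output array.
__global__ void fill_indices(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

// Runs the kernel once and returns true if the GPU produced the expected values.
bool gpu_works()
{
    const int n = 256;
    int host[n];
    int *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(int)) != cudaSuccess) return false;

    fill_indices<<<(n + 127) / 128, 128>>>(dev, n);
    cudaError_t launch = cudaGetLastError();   // catches launch errors
    cudaError_t copy = cudaMemcpy(host, dev, n * sizeof(int),
                                  cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(dev);
    if (launch != cudaSuccess || copy != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (host[i] != i) return false;
    return true;
}

int main()
{
    int runtime = 0, driver = 0;
    cudaRuntimeGetVersion(&runtime);
    cudaDriverGetVersion(&driver);
    printf("CUDA runtime version %d, driver supports up to %d\n", runtime, driver);
    puts(gpu_works() ? "GPU test passed" : "GPU test FAILED");
    return 0;
}
```

If the test fails although the program compiles, a mismatch between the driver and the toolkit version is a likely cause; the two version numbers printed first give a hint.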
On Windows, the Nvidia compiler nvcc works only with the Visual Studio C++ compiler as host compiler, not with other C++ compilers. Use the command-line compiler cl.exe, which comes with Visual Studio.
A list of examples demonstrating basic concepts of CUDA can be found here.
cuBLAS and cuBLASLt are libraries for linear algebra operations such as matrix multiplication; they can use Tensor Cores on GPUs that have them. Some examples can be found here.
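As a first impression of cuBLAS, here is a minimal sketch of a single-precision matrix multiplication with cublasSgemm. The helper gemm and the file name gemm.cu are my own names, error checks are left out for brevity, and for such a tiny matrix cuBLAS will most likely not use Tensor Cores at all. Note that cuBLAS expects matrices in column-major order. It should compile with nvcc gemm.cu -o gemm -lcublas.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Computes C = A * B for n x n matrices stored in column-major order.
// Error checks are omitted to keep the sketch short.
std::vector<float> gemm(const std::vector<float> &A,
                        const std::vector<float> &B, int n)
{
    const size_t bytes = sizeof(float) * n * n;
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C; the leading dimension is n for all three.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    std::vector<float> C(n * n);
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);  // waits for the GEMM
    cublasDestroy(handle);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}

int main()
{
    const int n = 2;
    // A = [1 2; 3 4] and B = [5 6; 7 8], written column by column.
    std::vector<float> A = {1, 3, 2, 4};
    std::vector<float> B = {5, 7, 6, 8};
    std::vector<float> C = gemm(A, B, n);
    for (int i = 0; i < n; ++i)          // print row by row
        printf("%g %g\n", C[i], C[i + n]);
    return 0;
}
```

The product is [19 22; 43 50], so the program should print the rows 19 22 and 43 50.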
Several tools report information about your GPU, such as the number of streaming processors, the amount of memory, the clock frequency, and more.
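Part of this information can also be queried from within a program through the CUDA runtime API. The following sketch (the file name query.cu and the helper print_device are my own choice) prints a few properties of every GPU it finds. The clock rate is read with cudaDeviceGetAttribute, because I assume that the corresponding field of cudaDeviceProp is not available in every toolkit version.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints a few properties of device d; returns the first error encountered.
cudaError_t print_device(int d)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, d);
    if (err != cudaSuccess) return err;

    int clock_khz = 0;   // core clock in kHz
    err = cudaDeviceGetAttribute(&clock_khz, cudaDevAttrClockRate, d);
    if (err != cudaSuccess) return err;

    printf("Device %d: %s\n", d, prop.name);
    printf("  compute capability:        %d.%d\n", prop.major, prop.minor);
    printf("  streaming multiprocessors: %d\n", prop.multiProcessorCount);
    printf("  global memory:             %.0f MiB\n",
           prop.totalGlobalMem / (1024.0 * 1024.0));
    printf("  core clock:                %.0f MHz\n", clock_khz / 1000.0);
    return cudaSuccess;
}

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int d = 0; d < count; ++d)
        print_device(d);
    return 0;
}
```

Compile with nvcc query.cu -o query. The program reports streaming multiprocessors (SMs); the number of cores per SM depends on the GPU architecture and is not returned directly by this API.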
If you want to optimize your programs, you need to find out where computing time is lost. This is the purpose of a profiler. A list of tools can be found here.
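Before reaching for a profiler, a quick first measurement of a single kernel is possible with CUDA events. This is only a sketch and no replacement for a profiler; the saxpy kernel and the helper time_saxpy_ms are example names of my own, and error checks are omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// y = a * x + y, one element per thread.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Returns the run time of one saxpy launch in milliseconds, measured on the GPU.
float time_saxpy_ms(int n)
{
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                          // timestamp before the kernel
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaEventRecord(stop);                           // timestamp after the kernel
    cudaEventSynchronize(stop);                      // wait until stop is reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    cudaFree(y);
    return ms;
}

int main()
{
    time_saxpy_ms(1 << 20);                          // warm-up run, result discarded
    printf("saxpy on 2^20 elements: %.3f ms\n", time_saxpy_ms(1 << 20));
    return 0;
}
```

The first launch of a kernel usually includes one-time start-up costs, which is why the sketch discards a warm-up run before printing a time.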
A list with helpful videos and literature on CUDA can be found here.