NVIDIA CUDA  
Developer's Notes

Page Uploaded: 13/10/2024
Page Updated: 15/10/2024

Info, opinions, tips and learning

Introduction

GPGPU and AI are the future; invest in them.

nVidia provides CUDA (Compute Unified Device Architecture), an invaluable parallel computing platform, at your disposal for free in every nVidia product, be it a cheap gaming GeForce GPU, a professional workstation Quadro card or a dedicated GPGPU solution such as a Tesla card.

Why CUDA?

The tech world seems to want to jump straight from standard serial computing to AI, skipping many steps. CUDA is that middle ground, sitting there to fill the missing gaps with its parallel computing nature.

Let it be known: parallel computing is not a panacea. There are only a limited number of cases where it can be applied. The same goes for AI: it will solve many problems but it will introduce others. As for serial computing, well, as general as it is, there are problems that simply can't be solved with it.
But there is a place for everything. All these technologies have their uses; the important thing is to let them coexist rather than replace one with another, which would be a catastrophic mistake.

CUDA & You

If you've ever been in need of a general purpose parallel computing platform then you are in the 0.00000001% category on the planet. It sounds like you are alone out there, but it doesn't have to be that way; you'll see that CUDA can be applied to a broader range of cases. It is only a matter of spreading the word so that other programmers can take advantage of their hardware with CUDA.

These parallel computing platforms are known to work better if your program satisfies the following conditions (although they are not mandatory):

  • You have a bazillion tasks to perform.
  • A task does not depend on the results of a previous task.

Under these criteria there is little chance that you are ever going to need CUDA for your everyday programs. BUT, the truth is that you can use CUDA the way you want! That's right, you can use CUDA processors much like regular host processors, even for small programs that execute once. The CUDA platform was born on systems with an x86 CPU as the host processor, although I predict CUDA processors specially designed to act as hosts in the future. It sounds great, but please note that there are currently limitations, the most notable being:

  • CUDA devices are isolated when executing a kernel. A CUDA kernel does not have access to the host's resources (there are a couple of exceptions) and you will spend a non-trivial amount of time reading and writing host/device memory (to mitigate this, there is the Unified Memory system, which automatically manages allocated memory). You have to put a lot of effort into differentiating host and device functionality.
  • You have access to most of the C++11 language features, but not to the standard library.

CUDA & Me

Some time ago I crossed paths with CUDA a couple of times: the first was back in 2011, when I made a CUDA 4.0 program in driver mode using the D programming language, just to prove that CUDA could be used from any language with C ABI compatibility. The second was in 2017, when I migrated a program from x86 to CUDA 8.0. That program executed a bazillion exercises for a crazy theory I developed about forecasting methodologies on sequential random numbers...a silly idea and a recreational project I still maintain and improve to this day. While I was never able to finish a single work unit in more than 4 months with the x86 version, the CUDA version did it in 12 hours, using a 640-CUDA-core device!
Nowadays I'm set on using CUDA as much as I can in all of my programs, be they of a serial or parallel nature. I currently own a bunch of nVidia CMP cards for development. I did the DLI CUDA courses and got certified; I'm taking it more seriously now and thinking of going full pro.

One of the coolest things about CUDA is that it doesn't require setup code: CUDA programs are your regular C++ programs plus nVidia extensions. It is pretty straightforward, just keep the documentation at hand. Another cool thing is hardware scalability: you can develop on a $70 device (currently around 384 CUDA cores) to make sure everything works as expected and then drop a $700 device (currently around 3584 CUDA cores) into your production environment to ramp up performance and transform your system into a time travel machine!
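
As a minimal illustration of what "regular C++ plus extensions" looks like in practice (a classic vector add; the names and sizes here are purely illustrative):

    #include <cuda_runtime.h>
    #include <cstdio>

    // __global__ marks a function that runs on the device and is launched from the host.
    __global__ void add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n)
            c[i] = a[i] + b[i];
    }

    int main()
    {
        const int n = 1 << 20;
        float *a, *b, *c;

        // Unified Memory keeps this first example short; manual copies work too.
        cudaMallocManaged(&a, n * sizeof(float));
        cudaMallocManaged(&b, n * sizeof(float));
        cudaMallocManaged(&c, n * sizeof(float));
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        add<<<(n + 255) / 256, 256>>>(a, b, c, n);       // <<<blocks, threads>>> extension
        cudaDeviceSynchronize();

        printf("c[0] = %f\n", c[0]);                     // expect 3.0
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }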

Managed memory or not?

In CUDA you can use managed memory or you can manage device/host memory manually yourself. It is partly a matter of personal choice, but it is also governed by the program itself: what it is trying to accomplish, how, and what is allowed within those boundaries.

I personally do manual memory management. It is the most flexible option and also the most native method for overlapping copy and compute operations.

Managed memory is better described as a coding productivity feature. But it introduces performance penalties due to page faults, and then you have to worry about prefetching memory to partially mitigate those penalties, adding more lines to the code. Copy/compute overlap can be achieved with managed memory, but it requires yet another set of steps, growing the code even more.

Managed memory doesn't sound so productive nor so managed right now, does it?
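
To make the comparison concrete, here is a rough side-by-side sketch; the kernel process, the size n and the launch configuration blocks/threads are placeholders assumed to be defined elsewhere:

    // (a) Managed memory: one allocation visible to host and device; prefetching
    //     softens the page-fault penalty mentioned above.
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));
    // ... fill data on the host ...
    int device;
    cudaGetDevice(&device);
    cudaMemPrefetchAsync(data, n * sizeof(float), device);
    process<<<blocks, threads>>>(data, n);
    cudaDeviceSynchronize();

    // (b) Manual management: explicit host/device buffers and copies, which is
    //     also the natural starting point for copy/compute overlap.
    float *h_data = (float *)malloc(n * sizeof(float));
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    // ... fill h_data on the host ...
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
    process<<<blocks, threads>>>(d_data, n);
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);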

Tips

Here I'll give you a few pieces of useful advice:

  • If you have a graphics card (GeForce or Quadro) that is going to be used for CUDA then I recommend using it as a dedicated computing card. Shut down your computer, plug in your CUDA card and leave it as is; do not connect any monitor to it. This reduces driver latency, since the card doesn't have to be paired with a monitor, used for desktop graphics processing or kept busy satisfying other OS requests. Use integrated graphics or a cheap discrete graphics card as your main display output. Keep your runtime as clean as possible: leave the CUDA device undisturbed while processing and avoid touching host resources associated with your CUDA kernels.
  • If you are on a Windows OS then disable Timeout Detection and Recovery (TDR). This is a Windows Display Driver Model feature that affects every GPU in your system, not only your nVidia CUDA devices. TDR is a watchdog that resets the display driver if the graphics device becomes unresponsive for 2 seconds (by default), giving control of the system back to the OS. This is a Windows stability feature, but it negatively affects CUDA programs: if you have a very intensive computation kernel, its execution time will go way beyond 2 seconds, Windows will reset the driver, and your CUDA program will halt and fail every time.
    Disabling TDR solves the issue altogether. To disable TDR in a development environment, the easiest way is to launch "Nsight Monitor" with administrator privileges, right-click Nsight Monitor's icon in the system tray and select "Options". In the options window, under the "General" section, set "WDDM TDR enabled" to "False". Finally, restart your system and you are ready to go.
    To disable TDR in a production environment, or on a system where Nsight Monitor is not available, you'll have to edit the registry using "Regedit". In Regedit, locate the path "HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers" and set the value named "TdrLevel" to "0". If you can't find this value then you can create it manually as a DWORD. Restart your system and you are ready to go.
    Not everyone will agree with disabling TDR, but there are cases where these display adapter restrictions are not needed because they break the correct operation of GPUs used as dedicated processing devices.
  • By default, the function cudaDeviceSynchronize makes the host thread busy-wait (spin), raising host CPU usage to 100%. Since this is all about using CUDA for optimization and efficiency, you should call the following line once before executing a long-lived kernel one or more times (sketched after this list):

    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

    This will instruct the host thread to wait on a synchronization primitive such as a mutex. This is very useful if you are leaving a kernel to execute all night: it will keep your system quiet, CPU usage low and power consumption down.
  • One feature that you'll love about CUDA is FULL support for lambda closures in kernel code! That's right, you'll be able to do crazy stuff with lambda functions. My advice: include the header #include <nvfunctional> and you'll have access to the class nvstd::function, which is the equivalent of std::function; pair the two (nvstd::function + lambdas) and you'll be limitless. I've already played with this dark magic, it works (sketched after this list).
  • The most significant optimization (the one with the most noticeable performance gains) I have come across in kernel code is making variables and memory local: copy variables and memory blocks into thread-local storage, do all your computations there, and finally copy the results back to global memory (if needed). Do this as much as you can and in most cases your kernel will finish its work faster (sketched after this list).
  • More to come...
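
A minimal sketch of the blocking-synchronization tip; the kernel name and launch configuration are placeholders, not taken from any particular program:

    #include <cuda_runtime.h>

    __global__ void long_lived_kernel()
    {
        // Placeholder for the actual long-running computation.
    }

    int main()
    {
        // Set once, before any runtime call that initializes the CUDA context.
        cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

        long_lived_kernel<<<1024, 256>>>();

        // The host thread now blocks on a synchronization primitive instead of
        // spin-waiting, keeping CPU usage near 0% while the kernel runs.
        cudaDeviceSynchronize();
        return 0;
    }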
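
A small sketch of the nvstd::function + lambda tip; the squaring kernel is just an illustrative toy:

    #include <nvfunctional>

    __global__ void apply(int *data, int n)
    {
        // nvstd::function behaves much like std::function, but in device code.
        nvstd::function<int(int)> op = [](int x) { return x * x; };

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = op(data[i]);
    }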
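
A toy illustration of the "make it local" tip: stage the needed global memory into thread-local variables (registers), compute on those, and write the result back once. The stencil and its weights are made up for the example:

    __global__ void smooth(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < 1 || i >= n - 1)
            return;

        // Local copies: each global read happens exactly once per thread.
        float left   = in[i - 1];
        float center = in[i];
        float right  = in[i + 1];

        // All arithmetic happens on local values, then one write back to global memory.
        out[i] = 0.25f * left + 0.5f * center + 0.25f * right;
    }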

NVIDIA Deep Learning Institute (DLI)

NVIDIA has this thing called the DLI. Its main focus is training professionals to solve modern problems using NVIDIA technologies and other standards. They provide online courses directly. You can enroll in the courses by yourself in self-paced mode, or in larger groups (usually with your colleagues at your company) with a live instructor.
I did a couple of the self-paced courses related to accelerated computing with CUDA. Overall I had a great experience, learned a lot, and the price was right for each individual course.
Some of the courses offer a certificate of competency that you can share and validate with a URL. This means very little, believe me. It just proves that you (hopefully and honestly) completed the course. Just a little something to brag about in your CV.

Do you require previous knowledge or professional experience with CUDA in order to take the fundamental DLI courses?

No, not at all. You can start learning CUDA from scratch with the DLI courses (the introductory ones, of course). I had a background in CUDA and that certainly made the courses easier for me. But you'll definitely need experience in programming (C/C++ or Python, depending on the course) and "intuition" with computer systems: SSH-like tools and Linux (command line and desktop).

Why did I enroll in the CUDA courses offered by the DLI?

I enrolled to make sure I wasn't missing anything about CUDA. And sure enough, I was missing some of the good stuff. I had most (but not all) of the fundamentals covered, but I was short on profiling, advanced techniques and multi-GPU. To my surprise, the courses did provide tangible value.

I was traveling when I took the courses. I had two spare weeks, so I thought it was the perfect time to do it. Now, this was key in my decision to enroll in the courses instead of reading an online article or grinding through the official documentation: since I had no access to my CUDA development hardware, enrolling gave me access to a remote node with CUDA devices and a remote lab with a Linux desktop. I found that absolutely perfect. All you need is a laptop with an Internet connection and you are ready to learn.

Let me explain that, depending on the course, you get on average 24 metered hours of resource (lab/GPU) time. This doesn't mean you must finish the course within 24 hours of enrollment. Instead you could, for example, do 2 hours at first, wait a week, do another 3, wait another week, and then do whatever time is required, but the combined sessions cannot exceed the 24-hour mark. Believe me, you'll finish the courses way before the available time runs out, leaving you with spare resource time to do whatever you want with the remote hardware. You'll definitely want to use those resources for practicing.

Should you enroll in the CUDA courses offered by the DLI?

The answer is: It depends. On what? It depends on the specific course and your personal situation.

If you don't have access to a CUDA device then your only option for learning while executing your work in real time is to enroll.

I initially learned CUDA directly from the official documentation. The problem with learning from there is that it requires you to grind through the documents and example source files for hours, if not days or even weeks.
If you are not the self-teaching type and prefer information to be funneled to you, then you might consider taking a course or starting with one of the NVIDIA technical blog posts.

Next I'll go through each course I took and give you my opinion and a recommendation on whether you should take it or not, and under what circumstances.

DLI CUDA courses

0 - An Even Easier Introduction to CUDA

This is (at the time of this writing) a free course and practically a mirror of Mark Harris's blog post. A great introduction by all means.
If you just want to learn CUDA then I'd suggest taking either of them.
Mark's blog post is more accessible and provides further links to advanced topics. The DLI course, on the other hand, offers access to a remote CUDA device (if resources are available) so that you can experience the process interactively and get real-time results. However, this is currently done through Google's Colab, whose interface can be a bit overwhelming for some.

1 - Getting Started with Accelerated Computing with CUDA C/C++

This course has my overall approval and recommendation.
It is a very well put together course, concentrating all the major topics in one place. You get the introductory material and, optionally, more advanced topics. It is a more robust version of "An Even Easier Introduction to CUDA", plus an improved remote Jupyter Lab with a guaranteed CUDA device (usually a very capable one) and an improved visual profiler (Nsight Systems) on a remote Linux desktop.
The final test (an N-body simulation) is well suited to the course. I even came up with an alternative solution involving "Dynamic Parallelism", but you'd better stick to the course material and you'll be fine.
I really enjoyed the overall experience and the material from this course; I had lots of fun doing it.
This course offers 24 hours of resource time. Single GPU available.

2 - Accelerating CUDA C++ Applications with Concurrent Streams

This is a no from me; I can't recommend this course. It focuses on copy/compute overlap, which is already concisely covered in the advanced topics section of "Getting Started with Accelerated Computing with CUDA C/C++". As the material itself admits, this course has a lot of boilerplate: a 30-minute read is spread out into a 4-hour lesson. However, it does cover the case where your work count (or array size) is not evenly divisible by the number of streams.
Having said all that, allow me to talk about how detailed and slow-paced this course is. Everything is very well constructed and it starts with a perfect introduction to the course and Jupyter Lab, something that "Getting Started with Accelerated Computing with CUDA C/C++" doesn't have (and badly needs). There are lots of videos and it feels like the instructor is right there following your every step. On its own, this course is very well put together.
About copy/compute overlap: it's one of those niche optimization techniques whose benefits are mainly tied to the memory copy operations. It makes more sense when you have a massive dataset, when your kernel is part of a latency-critical pipeline, or when you need to continuously launch kernels in a loop; if your program consists of a single kernel launch with no follow-up then it's not worth the hassle (a rough sketch of the pattern follows at the end of this course's notes).
This course has no final test.
This course offers 24 hours of resource time. Single GPU available.
You might also want to take a look at CUDA Graphs.
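
For reference, a rough sketch of the copy/compute overlap pattern discussed above, assuming a kernel process(float *, int), a pinned host buffer h_data (allocated with cudaMallocHost), a device buffer d_data and a size n evenly divisible by the number of streams (handling the uneven case is what the course adds on top of this):

    const int nStreams = 4;
    const int chunk = n / nStreams;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s)
        cudaStreamCreate(&streams[s]);

    for (int s = 0; s < nStreams; ++s) {
        int offset = s * chunk;
        // Each chunk gets its own stream, so the copy of one chunk can overlap
        // with the compute of another.
        cudaMemcpyAsync(d_data + offset, h_data + offset, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + offset, chunk);
        cudaMemcpyAsync(h_data + offset, d_data + offset, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }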

3 - Scaling Workloads Across Multiple GPUs with CUDA C++

Yes, I do recommend this course. There is not much material available online about multi-GPU nodes, so you have no better choice.
This course was exactly what I was hoping for. Now I can better leverage my multi-GPU development computer (once I have access to it again).
I already had an idea of how to manage multiple CUDA devices from looking at some functions in the official documentation, and it turned out to be just that. If you just want a quick start, have a look at the official documentation on how to query devices and how to select them; that is all, very simple (a minimal sketch follows below).
I honestly recommend just taking the course, because it also covers copy/compute overlap with multiple GPUs. Bear in mind that you'll need previous knowledge of, or experience with, CUDA streams and single-GPU copy/compute overlap.
This course has no final test.
This course offers 8 hours of resource time, a lot less than the other courses, maybe because it provides four GPUs instead of one and they have to limit that much compute power.
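
For the quick-start route, here is a minimal sketch of the query/select pattern from the official documentation; the kernel launch is commented out as a placeholder:

    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        int count = 0;
        cudaGetDeviceCount(&count);                 // how many CUDA devices are visible

        for (int d = 0; d < count; ++d) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);      // query each device's capabilities
            printf("Device %d: %s, %d SMs\n", d, prop.name, prop.multiProcessorCount);

            cudaSetDevice(d);                       // subsequent calls target device d
            // some_kernel<<<blocks, threads>>>(...);
        }

        // Kernel launches are asynchronous, so synchronize each device at the end.
        for (int d = 0; d < count; ++d) {
            cudaSetDevice(d);
            cudaDeviceSynchronize();
        }
        return 0;
    }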

4 - Optimizing CUDA Machine Learning Codes With Nsight Profiling Tools

I haven't done this course yet, but it's on my wishlist; I'm leaving this mostly as a placeholder. I want to take it because I'm a bit short on profiling. I followed the profiling lessons to perfection, but I don't think that's enough; there's more to it, I think. I took an overview of Nsight Compute (not Nsight Systems) in the official documentation and that thing is deep. I don't have the time to grind through it, so I'm going to leave it for the course.
First I need to get a bigger screen and a faster Internet connection, because when I was doing the other courses, in the visual profiling portions, I struggled quite a bit with lag and everything looked tiny on my small screen. That remote Linux desktop really squeezes your bandwidth.

Things to know before/during/after the DLI courses

Course material is contained within labs (Jupyter). To access a lab you must start a session by hitting, well... the start button. After that you have to wait several minutes, from 7 to 12 depending on the course, while the software and hardware are allocated just for you. Once a session is active, the resource usage countdown starts. Beware: if you close your browser window, the session will remain active and you could accidentally deplete your course access time. To properly terminate a session, hit the stop button.

You might need to consult the course material some time after you have completed a course. First, there is the resource time: once it has expired, I imagine you can't start a new session to access the material. Second, I read somewhere that access to a course is removed 6 months after completion; I don't know whether that's true or not. My recommendation is to back up/download the complete material. Start with the entire lab structure, which is where the core of the course is. To do that, start a fresh session and launch the lab. Do not open any notebook or any other file; better yet, don't touch anything. Then immediately open a new terminal ("File > New > Terminal") and issue the following command: "tar chvfz notebook.tar.gz *" (without the quotes). A new "tar.gz" file will be created containing all the files. Download the "tar.gz" file via "right-click > Download" (in the file panel).
Important: if you close a session then all of the lab's state is lost, including any files generated by you; all gone. When you start a new session, your lab is restarted from scratch. Keep this in mind if you have done substantial work in the lab and want to preserve it. Make incremental backups as needed.

Additionally, I recommend downloading each notebook as HTML so that you can easily access the material without having to set up a complete local lab system. The best way to do this is to use the lab's integrated export functionality: "File > Export Notebook As... > HTML". A nice thing about the exported HTML file is that its file references are relative to the lab's file structure, so if you download and unpack the whole lab and then place the HTML file next to the corresponding notebook's ipynb file, the HTML file will correctly reference images and source files.

You should download the videos as well. This is pretty straightforward; the links are easily visible in their corresponding cells. They look something like this (modified for copyright reasons): https://dli.v56.eur-1.amazonaws.com/videos/z-bd-01-v1/01-intro-01.mp4
Then, under Linux, you can simply download the video with wget like this (in a local terminal): wget https://dli.v56.eur-1.amazonaws.com/videos/z-bd-01-v1/01-intro-01.mp4
Under Windows you can try pasting the link into your web browser's address bar. You can also install wget and use the same command as above, or even use a download manager like BitComet.
Some of the videos, depending on the course, are in 4K. They can take up a lot of storage for such simple content. If you can't afford the space then you can resize the videos to 1080p and/or re-encode them. VLC can be used for such conversions as well as for watching the videos.

While you are in the visual profiling sections of a course, you'll want to gain as much screen real estate as possible. Go fullscreen in your web browser when visual profiling; you usually do that by hitting the F11 key (at least in most browsers).

Be honest with yourself. You could easily cheat and rush the courses in minutes. You are there for something, take the time to understand the material and do things as intended.

Keep practicing after the courses. Develop and maintain a personal program, implement all the things you've learned and keep it very structured and clean, so that you can use it as a template for new programs. Keep reading about CUDA online, improve your code base as needed.