Random Stuff

Developer's Notes

Just some thoughts about programming topics


Yep, it's Vulkan, the latest and greatest graphics (and compute) API maintained by the Khronos Group. It feels great to have a fresh flavor besides OpenGL, and I'm glad to be part of a generation of developers that saw this API come to life and will get to see it grow over time. I really like this API, especially the level of abstraction it provides (which is very low). The model is exploded into hundreds of structures, functions, types and enums to give you control over everything.

Learning Vulkan is neither easy nor quick, but it's free. Save thousands of dollars and 5 years of your life by teaching yourself, for free, at home. All you have to do is go to the Vulkan registry and download the latest API specification; it's all in there, you don't have to look elsewhere. You'll have to read this document many times and spend a lot of time experimenting but trust me, it's fun... difficult but fun. Nothing is impossible and you'll be able to do anything with Vulkan.

Vulkan is not perfect
Sure, nobody is. What I do not like about Vulkan:
Well, this API was designed to support devices for another 20 years (like OpenGL) and that includes devices that do not even exist yet!
From PCs, through smart wearables, to who knows what, Vulkan had to abstract down to the level of its weakest link: tile based renderers, or tilers as I like to call them. Tilers are implemented by device architectures that are low on framebuffer memory and/or bandwidth, such as smart phones and smart watches. Tilers can only work on a single region of the framebuffer at a time (other tiler architectures may support a more advanced approach), and Vulkan made it easier to work with this kind of device by introducing a -nightmare- called "render pass". A render pass allows you to work with any kind of device seamlessly at the expense of breaking your rendering pipeline into multiple steps. What's the big deal, you may ask? The big deal is that you must have a crystal ball to see the future, because Vulkan needs to know in advance everything you'll do in order to compose a render pass (and its subpasses). It is not always possible to know what the user (not the programmer) is trying to do, since many commands depend on user input; sure, it is doable, but it is not a simple task. This is contrary to other APIs designed for the more standard immediate renderers implemented by desktop graphics cards and modern gaming consoles. These immediate renderers have full access to the framebuffer and other resources, thus requiring no extra steps or fragmentation of tasks, but in Vulkan they are gimped by the render pass for the sake of the much marketed predictability.
When you record a secondary command buffer (record once, use many times) that is going to be executed within a render pass, you must specify the render pass and the index of the subpass in which the command buffer is going to be executed. If you change your render pass structure then, depending on the level of change, you may or may not be able to use your command buffer again. If you can't reuse it, you must record it again. The exact same thing applies to graphics pipeline state objects and framebuffers: Recreate or die.
Render passes are no problem at all if you are only programming in Vulkan, but they surely are a nightmare if you have a "graphics interface" that all APIs must comply with; in that case all the other APIs must emulate the Vulkan behaviour, adding extra latency. It is hard to demonstrate in text, but I can guarantee that you'll get stuck in the documentation for a while when you reach the render pass point.

Another topic I want to bring to the table is the shader resource binding mechanism: Vulkan, just like D3D12, uses descriptors to bind resources to shaders, but Vulkan has a weakness in its core design. Resources are managed using "descriptor sets": these sets are allocated and then the resource handles are assigned to them; after that you can bind the sets to the pipeline to match each shader slot with its corresponding resource parameter. The problem is that a descriptor set can't be modified while issuing commands, even if the pipeline is not using it! This means that you can't change a shader's parameters from within a command buffer! The function "vkUpdateDescriptorSets" is the only way to update a descriptor set, but it only works on the CPU side and is not inlined with any GPU command; that is, you can't call this function from a command buffer, nor is there an equivalent function for the inlined GPU side. To make things worse, if descriptor sets are not allocated using the correct flag, then updating them with "vkUpdateDescriptorSets" will invalidate any command buffer that references them. This design is very constraining; it is not flexible at all and greatly limits what you can do with command buffers.
Vulkan extensions to the rescue! Later on, a couple of engineers from the graphics industry came to the obvious conclusion that Vulkan was missing trivial descriptor set functionality, so they proposed and introduced the VK_KHR_push_descriptor extension. This extension introduces the concept of "push descriptors", which are descriptors whose memory and lifetime are managed by the command buffer they are pushed to. There is no need to create descriptor pools, allocate descriptor sets or bind them; when using push descriptors, "VkDescriptorSet" objects are simply no longer needed. Push descriptors are much more flexible, allowing you to change parameters freely between commands. Just take a look at the "vkCmdPushDescriptorSetKHR" function in the documentation and the "push_descriptors" example in the SDK. By using this extension you will get Vulkan to behave much more like other APIs, which is what you would have expected since the 1.0 spec release. Use push descriptors as much as you can, or give them a try at least; they are easier than normal descriptor sets and according to the official documentation they might give you a little extra performance. Again, this is trivial functionality and does not require special hardware capabilities; you just need an updated device driver with support for the "VK_KHR_push_descriptor" extension, which all vendors should have implemented by now.

Design and programming considerations

  • Pack all instance function pointers and device function pointers into an instance structure and a device structure respectively. This will save your life when doing a multi device implementation.
  • Pack any other Vulkan object created by the instance handle within the instance structure. Do the same with objects created by the device handle.
  • You must enable the validation layer on the instance and device(s). Just do it! Print debug messages to console as your most basic callback option.
  • All Vulkan objects that are created must be destroyed, and objects that are allocated must be freed. Be careful to keep track of the lifetime of these objects, and use parent class destructors (or smart pointers in C++11) to destroy/free them unless they are explicitly destroyed/freed in mid-rendering.
  • You'll end up recording an entire render pass inside a primary command buffer every frame. Optimize performance by also recording all commands that change every frame within this same primary command buffer, in an inlined subpass. Execute all other pre-recorded commands from secondary command buffers within a subpass dedicated to that purpose. If you mix inlined and pre-recorded commands then you must sort them into alternating subpasses (inlined and non-inlined modes); the real trick is to know in advance how many subpasses you'll execute (you must expose a way to specify this beforehand).
  • Naturally, you'll have to keep track of all pipeline state objects used by a command buffer. Do not be afraid of creating many pipeline state objects. Do not waste time trying to optimize memory by using and recreating only one PSO at a time, instead, try to keep track of many PSO variants and reuse those.
  • If you do have a "graphics interface" model then design your interface to behave more like Vulkan and not like D3D12 or another graphics API. If you do it the other way around then you may fall short in design, because you may not be able to pack Vulkan into a more generalized implementation. If you go for the Vulkan-like design then some functions might have to be emulated in the non-Vulkan implementations though.
  • Staging resources/buffers: use them if there is device local memory available, otherwise fall back to host visible memory. Host visible memory is still very convenient if your application is simple and does not require top notch performance. Device local memory is faster for the GPU to access once the data is placed where it belongs, but there's a difference: a host visible memory implementation is immediate, meaning that its data transfer operation is done right away, while a device local memory implementation is deferred, meaning that its data transfer operation must be recorded to a command buffer and then submitted to a device's graphics queue.
    For the latter case, if you are not working in a deferred context (you need a buffer ready as soon as you create it), you can of course pack the transfer into a function and use a fence to wait for the data transfer operation to complete. This should behave like an immediate context function, but please note that there are other objects and resources involved in order to do this. If this suits you well and you can overcome these details then certainly go for it. If you are working in a deferred context (the commands are always recorded and expected to execute later in the pipeline) then do not worry about any details, just use staging resources/buffers as usual.
    This is hard to comment on; it all depends on what you are trying to do and on your actual implementation.
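
For the subpass point above, here is a minimal sketch of one way to get the subpass count ahead of time. It is plain C++ with no Vulkan calls, and the names (Recording, PlannedSubpass, planSubpasses) are made up for illustration, not part of any API. It assumes the frame can be described up front as an ordered list of work items, each one either recorded inline or pre-recorded in a secondary command buffer, and it simply groups consecutive items of the same kind into one subpass:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical tag: how a piece of frame work gets recorded.
enum class Recording { Inline, Secondary };

// One planned subpass: its recording mode and how many work items it holds.
struct PlannedSubpass {
    Recording mode;
    std::size_t itemCount;
};

// Groups consecutive work items that share a recording mode into one subpass,
// so the number of subpasses is known before the render pass is created.
std::vector<PlannedSubpass> planSubpasses(const std::vector<Recording>& frameWork) {
    std::vector<PlannedSubpass> plan;
    for (Recording mode : frameWork) {
        if (!plan.empty() && plan.back().mode == mode) {
            ++plan.back().itemCount;    // same mode: extend the current subpass
        } else {
            plan.push_back({mode, 1});  // mode switch: start a new subpass
        }
    }
    return plan;
}
```

The size of the returned plan is the number of subpasses to declare, and each entry's mode would map to VK_SUBPASS_CONTENTS_INLINE or VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS when beginning the render pass or moving to the next subpass.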

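The advice about keeping many pipeline state objects around boils down to a cache. A minimal sketch, assuming each state variant can be reduced to a hashable key; PipelineVariantCache, PipelineKey and PipelineHandle are hypothetical names, and the handle is only a stand-in for a real VkPipeline:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical stand-ins: a compact key describing one state variant and an
// opaque pipeline handle (a VkPipeline in real code).
using PipelineKey = std::uint64_t;
using PipelineHandle = std::uint64_t;

class PipelineVariantCache {
public:
    // createFn builds the pipeline for a key; it only runs on a cache miss.
    explicit PipelineVariantCache(std::function<PipelineHandle(PipelineKey)> createFn)
        : create_(std::move(createFn)) {}

    // Returns the pipeline for this key, creating it the first time it is asked for.
    PipelineHandle get(PipelineKey key) {
        auto it = cache_.find(key);
        if (it != cache_.end()) {
            return it->second;          // variant already exists: reuse it
        }
        const PipelineHandle handle = create_(key);
        cache_.emplace(key, handle);    // remember the new variant
        return handle;
    }

    std::size_t size() const { return cache_.size(); }

private:
    std::function<PipelineHandle(PipelineKey)> create_;
    std::unordered_map<PipelineKey, PipelineHandle> cache_;
};
```

In a real renderer the create callback is where vkCreateGraphicsPipelines would be called, and the key would be derived from whatever state actually varies between draws.
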
Direct3D 12

The next iteration of Microsoft's graphics and processing API, "D3D12". Although it was released a year before Vulkan, I think it is a much better thought-out API than Vulkan. It is well designed for modern graphics architectures but it is set to have a shorter lifespan than Vulkan. I recommend this API if you need absolute control of resources without breaking your head (much), or if you just want to take the trip to a new graphics API and use it as a case study. D3D12 is very interesting, flexible and fun. It will allow you to do anything you want, even crash the system on purpose if that's what you wish.

D3D12 is not perfect either
So far I've encountered only a single caveat, and it is related to how resources are presented to shaders: descriptors. D3D12 introduces the concepts of views, descriptor handles and descriptor heaps. Together they work fine and are very efficient; the problem is that the implementation has a couple of limitations:

  • As per the official documentation: "SetDescriptorHeaps can be called on a bundle, but the bundle descriptor heaps must match the calling command list descriptor heap". This means that you will not be able to change shader parameters in a bundle from a different preassembled descriptor heap. You can only do this in a direct command list.
  • There is no GPU command in Direct3D 12.0 to copy a descriptor from one heap to another. This means that there is no function like CopyDescriptorsSimple that is callable from a direct command list or bundle. You can only do this from the CPU side and it won't be inlined with other GPU commands.
If you are working extensively with bundles then you will run into the two points mentioned above, and they will greatly limit what you can do with bundles, descriptor heaps and shader parameters. To mitigate these issues I've found that the best shader resource binding mechanism is to use a single, big, monolithic descriptor heap for shared resources (D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV) and another one for samplers (D3D12_DESCRIPTOR_HEAP_TYPE_SAMPLER). This way you will only have a single heap of each type and you won't need to swap heaps, saving a bit of that overhead. Also, when creating resources, you can immediately precompute their CPU and GPU descriptor handles and keep them at hand, ready to use when needed. It is a fast and highly compatible solution.
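
The monolithic heap idea can be sketched as a simple linear allocator that hands out slots and precomputes both handles. This is only an illustration: CpuHandle and GpuHandle are stand-ins for D3D12_CPU_DESCRIPTOR_HANDLE and D3D12_GPU_DESCRIPTOR_HANDLE so that it compiles without the SDK, and MonolithicHeap and DescriptorSlot are made-up names. In real code the start handles would come from GetCPUDescriptorHandleForHeapStart and GetGPUDescriptorHandleForHeapStart, and the increment from GetDescriptorHandleIncrementSize:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-ins for the D3D12 CPU/GPU descriptor handle structs (assumption).
struct CpuHandle { std::size_t ptr; };
struct GpuHandle { std::uint64_t ptr; };

// A slot in the heap with both handles precomputed at allocation time.
struct DescriptorSlot {
    std::uint32_t index;
    CpuHandle cpu;
    GpuHandle gpu;
};

class MonolithicHeap {
public:
    MonolithicHeap(CpuHandle cpuStart, GpuHandle gpuStart,
                   std::uint32_t increment, std::uint32_t capacity)
        : cpuStart_(cpuStart), gpuStart_(gpuStart),
          increment_(increment), capacity_(capacity) {}

    // Hands out the next free slot: handle = heap start + index * increment.
    DescriptorSlot allocate() {
        assert(next_ < capacity_ && "descriptor heap is full");
        const std::uint32_t index = next_++;
        DescriptorSlot slot;
        slot.index = index;
        slot.cpu.ptr = cpuStart_.ptr + static_cast<std::size_t>(index) * increment_;
        slot.gpu.ptr = gpuStart_.ptr + static_cast<std::uint64_t>(index) * increment_;
        return slot;
    }

    std::uint32_t remaining() const { return capacity_ - next_; }

private:
    CpuHandle cpuStart_;
    GpuHandle gpuStart_;
    std::uint32_t increment_;
    std::uint32_t capacity_;
    std::uint32_t next_ = 0;
};
```

Each resource would keep its DescriptorSlot next to it, so the CPU handle is at hand when creating the view and the GPU handle when binding it. This sketch never frees slots; a free list would be the obvious next step.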

More -nonsense blah blah- to come...


GPGPU and AI, that is the future, invest in it.
nVidia has provided CUDA (Compute Unified Device Architecture), an invaluable platform at your disposal, for free, in every nVidia product be it a cheap gaming GeForce GPU, a professional workstation Quadro card or a dedicated GPGPU solution such as a Tesla card.
If you've ever been in need of a general purpose parallel computing platform then you are in the 0.00000001% category on the planet. It sounds like you are alone out there but it is all a matter of perception. These parallel computing platforms are known to work better if your program satisfies the following conditions (although they are not mandatory):

  • You have a bazillion tasks to perform.
  • A task does not depend on the results of a previous task.

Under these criteria there is little chance that you are ever going to need CUDA for your everyday programs, BUT the truth is that you can use CUDA any way you want! That's right, you can use CUDA processors similarly to regular host processors, even for small programs that execute once. The CUDA platform was born with systems having an x86 CPU as a host processor, although I can foresee CUDA processors specially designed for hosting systems in the future. It sounds great, but please note that there are currently limitations, the most notable being:

  • CUDA devices are isolated when executing a kernel. A CUDA kernel does not have access to the host's resources (there are a couple of exceptions) and you will spend a fair amount of time reading and writing between host and device memory. You have to put a lot of effort into differentiating host and device functionality.
  • You have access to most of the C++11 language features, but not to the standard library.

Personally, I have come across CUDA only twice in my life: The first time was back in 2011, when I made a CUDA 4.0 program in driver mode using the D programming language, just to prove CUDA could be used with any language with C ABI compatibility. The second time was in 2017, when I migrated a program from x86 to CUDA 8.0. This program was for executing a bazillion exercises for a crazy theory I developed about forecasting methodologies in sequential numbers... a silly idea, but at least it was a recreational project, not an academic one (I only wasted my own time, not anyone else's). While the x86 version was never able to finish a single work unit in more than 4 months, the CUDA version did it in 12 hours, using a device with 640 CUDA cores!
Currently I'm set on using CUDA as much as I can in all of my programs, be they serial or parallel in nature.
One of the coolest things I like about CUDA is that it doesn't require setup code; CUDA programs are your regular C++ programs plus nVidia extensions. It is pretty straightforward, just keep the documentation at hand. Another cool thing is hardware scalability: You can develop on a $70 device (currently around 384 CUDA cores) to make sure everything works as expected and then you can drop a $700 device (currently around 3584 CUDA cores) into your production environment to ramp up performance and transform your system into a time travel machine!

Here I'll give you a couple of useful pieces of advice, from one CUDA beginner to another:

  • If you have a graphics card (GeForce or Quadro) that is going to be used for CUDA then I recommend using it as a dedicated computing card. Shut down your computer, plug in your CUDA card and leave it as is; do not connect any monitor to it. This will reduce latency in the driver, since it does not have to pair the card with a monitor, use it for desktop graphics processing or satisfy other OS requests. Use integrated graphics or a cheap discrete graphics card as your main display output. Make your runtime as clean as possible, leave the CUDA device undisturbed while processing and avoid using host resources associated with your CUDA kernels.
  • If you are under a Windows OS then disable Timeout Detection and Recovery (TDR). This is a Windows Display Driver Model feature that affects every GPU in your system, not only your nVidia CUDA devices. TDR is a watchdog that resets the display driver if the graphics device becomes unresponsive for 2 seconds (by default), giving control of the system back to the OS. This is a stability feature of Windows, but it negatively affects CUDA programs: if you have a very intensive computation kernel then its execution time will be way beyond 2 seconds, which will result in Windows resetting the driver, and your CUDA program will halt and fail every time.
    Disabling TDR solves the issue altogether. To disable TDR in a development environment, the easiest way is to launch "Nsight Monitor" with administrator privileges, right click Nsight Monitor's icon located in the system tray and then select "Options". In the options window, under the "General" section, set WDDM TDR enabled to "False". Finally restart your system and you are ready to go.
    To disable TDR in a production environment, or to manually disable TDR in a system where Nsight Monitor is not available, you'll have to edit the registry using "Regedit". In Regedit locate the path "HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers" and set the value named "TdrLevel" to "0". If you can't find this value then you can create it manually as a DWORD. Restart your system and you are ready to go.
    Not everyone will agree with disabling TDR, but there are cases where these display adapter restrictions are not needed because they break the correct operation of GPUs used as dedicated processing devices.
  • By default, the function cudaDeviceSynchronize makes the host thread wait in a lockless (busy-waiting) state, raising the host CPU usage to 100%. Since this is all about using CUDA for optimization and efficiency, you should call the following line once before executing a long lived kernel one or more times:


    This will instruct the host thread to wait on a synchronization primitive such as a mutex object. This is very useful if you are leaving a kernel to execute all night. It will keep your system quiet, CPU usage low and energy efficient.
  • One feature that you'll love about CUDA is FULL support for lambda closures in kernel code! That's right, you'll be able to do crazy stuff using lambda functions. My advice is: include the header <nvfunctional> and you'll have access to the class nvstd::function, which is the equivalent of std::function; pair both elements and you'll be limitless. I've already played with this dark magic, it works.
  • The most significant optimization (the one with the most noticeable performance gains) I have come across in kernel code is making variables and memory local, that is, copying variables and memory blocks to thread local memory, doing all your computations there, and finally copying the variables and memory back to global memory (if needed). Do this as much as you can and you'll see that your kernel will generally finish its work faster.
  • More to come...
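
About the cudaDeviceSynchronize tip above: the call that matches that description is, as far as I can tell, cudaSetDeviceFlags with the cudaDeviceScheduleBlockingSync flag. Treat that as an assumption rather than the exact line. The sketch below is host code that needs the CUDA toolkit to build, and useBlockingSync is a hypothetical helper name:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: asks the runtime to block the host thread on a
// synchronization primitive instead of spinning while it waits for the device.
// Assumption: call it early, before other CUDA work on the device.
bool useBlockingSync() {
    const cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaSetDeviceFlags failed: %s\n", cudaGetErrorString(err));
        return false;
    }
    return true;
}
```

After this, a later cudaDeviceSynchronize should let the host thread sleep while a long lived kernel runs.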