Yep, it's Vulkan, the latest and greatest graphics (and compute) API maintained by the Khronos Group. It feels great to have a fresh flavor aside from OpenGL, and I'm glad to be part of the generation of developers that saw this API come to life and will get to see it grow over time. I really like this API, especially the level of abstraction it provides (which is very low). The model is exploded into hundreds of structures, functions, types and enums just to give you control over everything.
Learning Vulkan is neither easy nor cheap in time, but it is free. Save thousands of dollars and five years of your life by self-learning --for free-- at your own home. All you have to do is go to the Vulkan registry and download the latest API specification; it's all in there, you don't have to look elsewhere. You'll have to read this document many times and spend a lot of time experimenting, but trust me, it's fun... difficult but fun. Nothing is impossible and you'll be able to do anything with Vulkan.
Vulkan is not perfect
Sure, nobody is. What I do not like about Vulkan:
Well, this API was designed to support devices for another 20 years (like OpenGL) and that includes devices that do not even exist yet!
From PCs through smart wearables to who knows what, Vulkan had to abstract down to the level of its weakest link: tile-based renderers, or tilers as I like to call them. Tilers are implemented by architectures that are low on framebuffer memory and/or bandwidth, such as smartphones and smartwatches. A tiler can only work on a single region of the framebuffer at a time (some tiler architectures support more advanced approaches), and Vulkan made it easier to work with this kind of device by introducing a -nightmare- called the "render pass". A render pass lets you target any kind of device seamlessly, at the expense of breaking your rendering pipeline into multiple steps. What's the big deal? you may ask. The big deal is that you need a crystal ball to see the future, because Vulkan must know in advance everything you'll do in order to compose a render pass (and its subpasses). It is not always possible to know what the user (not the programmer) is trying to do; many commands depend on user input. Sure, it is doable, but it is not a simple task. This is contrary to other APIs designed around the more standard immediate renderers implemented by desktop graphics cards and modern gaming consoles. These immediate renderers have full access to the framebuffer and other resources, requiring no extra steps or fragmentation of tasks, but in Vulkan they are gimped by the render pass for the sake of the much-marketed predictability.
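The "declare everything in advance" requirement looks roughly like this. A minimal sketch assuming a single color attachment and one subpass; every attachment, subpass, and attachment usage must be spelled out before a single draw command exists:

```c
#include <vulkan/vulkan.h>

VkRenderPass create_render_pass(VkDevice device, VkFormat colorFormat)
{
    VkAttachmentDescription color = {
        .format         = colorFormat,
        .samples        = VK_SAMPLE_COUNT_1_BIT,
        .loadOp         = VK_ATTACHMENT_LOAD_OP_CLEAR,
        .storeOp        = VK_ATTACHMENT_STORE_OP_STORE,
        .stencilLoadOp  = VK_ATTACHMENT_LOAD_OP_DONT_CARE,
        .stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE,
        .initialLayout  = VK_IMAGE_LAYOUT_UNDEFINED,
        .finalLayout    = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR,
    };
    VkAttachmentReference colorRef = {
        .attachment = 0,
        .layout     = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
    };
    /* Every subpass, and which attachments it reads and writes,
     * must be known here -- before anything is recorded. */
    VkSubpassDescription subpass = {
        .pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS,
        .colorAttachmentCount = 1,
        .pColorAttachments    = &colorRef,
    };
    VkRenderPassCreateInfo info = {
        .sType           = VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO,
        .attachmentCount = 1,
        .pAttachments    = &color,
        .subpassCount    = 1,
        .pSubpasses      = &subpass,
    };
    VkRenderPass renderPass = VK_NULL_HANDLE;
    vkCreateRenderPass(device, &info, NULL, &renderPass);
    return renderPass;
}
```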
When you record a secondary command buffer (record once, use many times) that is going to be executed within a render pass, you must specify the render pass instance and the index of the subpass in which the command buffer will be executed. If you change your render pass structure, then depending on the extent of the change you may or may not be able to use your command buffer again. If you can't reuse it, you must record it again. The exact same thing applies to graphics pipeline state objects and framebuffers: recreate or die.
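Concretely, the render pass and subpass index get baked into the secondary command buffer at record time through the inheritance info. A sketch, assuming the buffer and render pass already exist:

```c
#include <vulkan/vulkan.h>

void record_secondary(VkCommandBuffer cmd, VkRenderPass renderPass,
                      uint32_t subpassIndex, VkFramebuffer framebuffer)
{
    VkCommandBufferInheritanceInfo inherit = {
        .sType       = VK_STRUCTURE_TYPE_COMMAND_BUFFER_INHERITANCE_INFO,
        .renderPass  = renderPass,   /* must stay compatible with the pass it runs in */
        .subpass     = subpassIndex, /* fixed at record time */
        .framebuffer = framebuffer,  /* optional; may be VK_NULL_HANDLE */
    };
    VkCommandBufferBeginInfo begin = {
        .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,
        .flags = VK_COMMAND_BUFFER_USAGE_RENDER_PASS_CONTINUE_BIT,
        .pInheritanceInfo = &inherit,
    };
    vkBeginCommandBuffer(cmd, &begin);
    /* ... record draw commands ... */
    vkEndCommandBuffer(cmd);
}
```

If the render pass is replaced with an incompatible one, everything recorded this way has to be recorded again.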
Render passes are no problem at all if you are only programming in Vulkan, but they are surely a nightmare if you have a "graphics interface" to which all APIs must conform; in that case all the other APIs must emulate the Vulkan behaviour, adding extra latency. It is hard to demonstrate in text, but I can guarantee that you'll get stuck for a while in the documentation when you reach the render pass chapter.
Another topic I want to bring to the table is the shader resource binding mechanism. Vulkan, just like D3D12, uses descriptors to bind resources to shaders, but Vulkan has a weakness in its core design. Resources are managed using "descriptor sets": these sets are allocated, the resource handles are assigned to them, and after that you can bind the sets to the pipeline to match each shader slot with its corresponding resource parameter. The problem is that a descriptor set can't be modified while issuing commands, even if the pipeline is not using it! This means that you can't change a shader's parameters! The function "vkUpdateDescriptorSets" is the only way to update a descriptor set, but it only works on the CPU side, not inlined with any GPU command; that is, you can't call this function from a command buffer, nor is there an equivalent function for the inlined GPU side. To make things worse, if descriptor sets are not allocated with the correct flag, then updating them with "vkUpdateDescriptorSets" will invalidate any command buffer that references them. This design is very constrained; it is not flexible at all and greatly limits what you can do with command buffers. Vulkan extensions to the rescue! Later, a couple of engineers from the graphics industry came to the obvious conclusion that Vulkan was missing trivial descriptor set functionality, so they proposed and introduced the VK_KHR_push_descriptor extension. This extension introduces "push descriptors": descriptors whose memory and lifetime are managed by the command buffer they are pushed to. There is no need to create descriptor pools, allocate descriptor sets or bind them; the "VkDescriptorSet" object becomes unnecessary when using push descriptors. Push descriptors are much more flexible, allowing you to change parameters freely between commands.
Just take a look at the "vkCmdPushDescriptorSetKHR" function in the documentation and the "push_descriptors" example in the SDK. By using this extension you will get Vulkan to behave much more like other APIs, which is what you would have expected in the first place since the 1.0 spec release. Use push descriptors as much as you can, or at least give them a try; they are easier than normal descriptor sets and, according to the official documentation, they might give you a little extra performance. Again, this is trivial functionality and does not require special hardware capabilities; you just need an updated device driver with support for the "VK_KHR_push_descriptor" extension, which all vendors should have implemented by now.
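A minimal sketch of the push path: pushing a uniform buffer binding straight into a command buffer, no pool, no allocated set. It assumes the pipeline layout was created from a set layout with VK_DESCRIPTOR_SET_LAYOUT_CREATE_PUSH_DESCRIPTOR_BIT_KHR, and that the extension function was fetched with vkGetDeviceProcAddr:

```c
#include <vulkan/vulkan.h>

void push_uniform_buffer(PFN_vkCmdPushDescriptorSetKHR pfnPushDescriptorSet,
                         VkCommandBuffer cmd, VkPipelineLayout pipelineLayout,
                         VkBuffer uniformBuffer, VkDeviceSize size)
{
    VkDescriptorBufferInfo bufferInfo = {
        .buffer = uniformBuffer,
        .offset = 0,
        .range  = size,
    };
    VkWriteDescriptorSet write = {
        .sType           = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET,
        .dstSet          = VK_NULL_HANDLE, /* ignored for push descriptors */
        .dstBinding      = 0,
        .descriptorCount = 1,
        .descriptorType  = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,
        .pBufferInfo     = &bufferInfo,
    };
    /* The parameter can change freely between draws in the same
     * command buffer -- exactly what plain descriptor sets forbid. */
    pfnPushDescriptorSet(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS,
                         pipelineLayout, 0 /* set */, 1, &write);
}
```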
Design and programming considerations
Pack all instance function pointers and device function pointers into an instance structure and a device structure respectively; this will save your life when doing a multi-device implementation.
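One way to pack the device-level entry points. The struct name and members are hypothetical, the loading calls are the real ones; pointers obtained through vkGetDeviceProcAddr also dispatch directly for that device, skipping the loader trampoline:

```c
#include <vulkan/vulkan.h>

typedef struct DeviceFuncs {
    VkDevice             handle;
    PFN_vkCreateBuffer   CreateBuffer;
    PFN_vkAllocateMemory AllocateMemory;
    PFN_vkQueueSubmit    QueueSubmit;
    /* ...one member per device-level function you use... */
} DeviceFuncs;

void load_device_funcs(DeviceFuncs *df, VkDevice device)
{
    df->handle         = device;
    df->CreateBuffer   = (PFN_vkCreateBuffer)
        vkGetDeviceProcAddr(device, "vkCreateBuffer");
    df->AllocateMemory = (PFN_vkAllocateMemory)
        vkGetDeviceProcAddr(device, "vkAllocateMemory");
    df->QueueSubmit    = (PFN_vkQueueSubmit)
        vkGetDeviceProcAddr(device, "vkQueueSubmit");
}
```

The instance-level table works the same way with vkGetInstanceProcAddr.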
Pack any other Vulkan object created by the instance handle within the instance structure. Do the same with objects created by the device handle.
Enable the validation layers on the instance and device(s). Just do it! Print debug messages to the console as your most basic callback option.
All Vulkan objects that are created must be destroyed, and objects that are allocated must be freed. Be careful, keep track of the lifetime of these objects, and use parent class destructors (or smart pointers in C++11) to destroy/free them unless they are explicitly destroyed/freed mid-rendering.
You'll end up recording an entire render pass inside a primary command buffer every frame. Optimize performance by also recording all commands that change every frame within this very same primary command buffer, in an inlined subpass. Execute all other pre-recorded commands in secondary command buffers within a subpass dedicated to this purpose. If you mix inlined and pre-recorded commands, you must sort them into alternating subpasses (inlined and non-inlined modes); the real trick is to know in advance how many subpasses you'll execute (you must expose a way to specify this beforehand).
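The alternating-subpass scheme can be sketched like this, assuming a render pass created with (at least) two subpasses: subpass 0 takes the inlined per-frame commands, subpass 1 executes the pre-recorded secondary command buffers.

```c
#include <vulkan/vulkan.h>

void record_frame(VkCommandBuffer primary, VkRenderPass renderPass,
                  VkFramebuffer framebuffer, VkExtent2D extent,
                  uint32_t secondaryCount, const VkCommandBuffer *secondaries)
{
    VkRenderPassBeginInfo beginPass = {
        .sType       = VK_STRUCTURE_TYPE_RENDER_PASS_BEGIN_INFO,
        .renderPass  = renderPass,
        .framebuffer = framebuffer,
        .renderArea  = { {0, 0}, extent },
    };
    /* Subpass 0: commands recorded inline, re-recorded every frame. */
    vkCmdBeginRenderPass(primary, &beginPass, VK_SUBPASS_CONTENTS_INLINE);
    /* ... per-frame draw commands ... */

    /* Subpass 1: only vkCmdExecuteCommands is legal in this mode. */
    vkCmdNextSubpass(primary, VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS);
    vkCmdExecuteCommands(primary, secondaryCount, secondaries);

    vkCmdEndRenderPass(primary);
}
```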
Naturally, you'll have to keep track of all pipeline state objects used by a command buffer. Do not be afraid of creating many pipeline state objects. Do not waste time trying to save memory by recreating a single PSO over and over; instead, keep track of many PSO variants and reuse those.
If you do have a "graphics interface" model, then design your interface to behave more like Vulkan and not like D3D12 or another graphics API. If you do it the other way around, you may fall short in design because you may not be able to fit Vulkan into a more generalized implementation. If you go for the Vulkan-like design, though, some functions may have to be emulated in the non-Vulkan implementations.
Staging resources/buffers: use them if device-local memory is available, otherwise fall back to host-visible memory. Host-visible memory is still very convenient if your application is simple and does not require top-notch performance. Device-local memory is faster for the GPU to access once the data is placed where it belongs, but there's a difference: a host-visible memory implementation is immediate, meaning its data transfer operation is done right away, while a device-local memory implementation is deferred, meaning its data transfer operation must be recorded to a command buffer and then submitted to a device's graphics queue.
For the latter case, if you are not working in a deferred context (you need a buffer ready as soon as you create it), you can of course pack the transfer into a function and use a fence to wait for the operation to complete. This should behave like an immediate-context function, but note that there are other objects and resources involved in doing this. If this suits you and you can live with those details, then certainly go for it. If you are working in a deferred context (the commands are always recorded and expected to execute later in the pipeline), then do not worry about any of this and use staging resources/buffers as normal.
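The "immediate-looking" upload described above boils down to record, submit, wait. A sketch, assuming the command buffer is already allocated and both buffers were created with the proper usage flags:

```c
#include <vulkan/vulkan.h>

void staged_upload(VkDevice device, VkQueue queue, VkCommandBuffer cmd,
                   VkBuffer staging, VkBuffer deviceLocal, VkDeviceSize size)
{
    VkCommandBufferBeginInfo begin = {
        .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,
        .flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT,
    };
    vkBeginCommandBuffer(cmd, &begin);
    VkBufferCopy region = { .srcOffset = 0, .dstOffset = 0, .size = size };
    vkCmdCopyBuffer(cmd, staging, deviceLocal, 1, &region);
    vkEndCommandBuffer(cmd);

    VkFenceCreateInfo fenceInfo = { .sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO };
    VkFence fence = VK_NULL_HANDLE;
    vkCreateFence(device, &fenceInfo, NULL, &fence);

    VkSubmitInfo submit = {
        .sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .commandBufferCount = 1,
        .pCommandBuffers    = &cmd,
    };
    vkQueueSubmit(queue, 1, &submit, fence);

    /* Block until the transfer completes -- this is what makes the
     * deferred path look immediate to the caller. */
    vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
    vkDestroyFence(device, fence, NULL);
}
```

Those "other objects and resources involved" are exactly the fence, the one-shot command buffer, and the staging buffer itself.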
This is hard to comment on; it all depends on what you are trying to do and your actual implementation.
Direct3D 12
The next iteration of Microsoft's graphics and compute API: D3D12. Although it was released a year before Vulkan, I think it is a much better-thought-out API than Vulkan. It is well designed for modern graphics architectures, but it is set to have a shorter lifespan than Vulkan. I recommend this API if you need absolute control of resources without breaking your head (much), or if you just want to take the trip to a new graphics API and use it as a case study. D3D12 is very interesting, flexible and fun. It will let you do anything you want, even crash the system on purpose if that's what you wish.
D3D12 is not perfect either
So far I've encountered only a single caveat, related to how resources are presented to shaders: descriptors. D3D12 introduces the concepts of views, descriptor handles and descriptor heaps. Altogether they work fine and are very efficient; the problem is that the implementation has a couple of limitations:
As per the official documentation: "SetDescriptorHeaps can be called on a bundle, but the bundle descriptor heaps must match the calling command list descriptor heap". This means that you will not be able to change shader parameters in a bundle from a different preassembled descriptor heap; you can only do this in a direct command list.
There is no GPU command in Direct3D 12.0 to copy a descriptor from one heap to another. That is, there is no function like CopyDescriptorsSimple that is callable from a direct command list or bundle; you can only copy descriptors from the CPU side, and the copy won't be inlined with other GPU commands.
If you work extensively with bundles, you will run into the two points mentioned above, and they will greatly limit what you can do with bundles, descriptor heaps and shader parameters. To mitigate the issues, I've found that the best shader resource binding mechanism is to use a single, big, monolithic descriptor heap for shared resources (D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV) and another one for samplers (D3D12_DESCRIPTOR_HEAP_TYPE_SAMPLER). This way you only have one heap per type and never need to swap heaps, saving a bit of that overhead. Also, when creating resources, you can immediately precompute their CPU and GPU descriptor handles and keep them at hand, ready to use when needed. It is a fast and highly compatible solution.
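The precompute-the-handles idea is simple pointer arithmetic. Below is a driver-free sketch in plain C: the handle struct only mimics D3D12_CPU_DESCRIPTOR_HANDLE so the snippet is self-contained, and the offset rule is the one D3D12 itself uses, where the increment comes from ID3D12Device::GetDescriptorHandleIncrementSize.

```c
#include <stddef.h>

/* Mimics D3D12_CPU_DESCRIPTOR_HANDLE; defined here only so the
 * sketch compiles without the D3D12 headers. */
typedef struct CpuHandle { size_t ptr; } CpuHandle;

/* Slot i of a heap lives at heapStart + i * increment. Compute this
 * once per resource at creation time and cache the result. */
CpuHandle descriptor_at(CpuHandle heapStart, size_t index, size_t increment)
{
    CpuHandle h = { heapStart.ptr + index * increment };
    return h;
}
```

The GPU-visible handle of the same slot is computed the same way from the heap's GPU start address, so both can be cached side by side.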
More -nonsense blah blah- to come...
CUDA
I've moved all CUDA related content to its own page right here.