SP33D2
Need Another Hit
Ever stop in the middle of writing some code when it hits you? That nagging feeling. It’s in the back of your mind, relentlessly gnawing away. Making you question your decisions. Making you wonder what you’re doing with your life. You’re plugging away happily, when suddenly you remember: I haven’t optimized any code in hours.
It’s no way to work.
It’s no way to live.
Let’s get back to our roots, blog.
New Edition
For this edition of the blog, we’re hopping off the usual tracks and onto a different set entirely: Vulkan driver optimization. I can already hear what you’re thinking.
Vulkan drivers are already fast. Go back to doing something useful, like making GL work.
First: no they’re not.
Second: I’m doing the opposite of that.
Third: shut up, it’s my blog.
How does one optimize Vulkan drivers? As we all know, any great optimization needs a test case. In the case of Vulkan, everyone who’s anyone (me) uses Zink as their test case. The reasons for this are many and varied because I say they are, but the important one to keep in mind is, as always, drawoverhead.
For those who can’t remember the times I have previously blogged about the world’s premier benchmarking tool, don’t worry. As with any great form of entertainment, I’ve prepared a recap.
TL;DR drawoverhead
Suppose you are a large gaming-oriented company that sells hardware. Your hardware runs on a battery. The battery has a finite charge. Every bit of power drained from the battery powers the device so that users can play games.
Wouldn’t it be great if that battery lasted longer?
There are a number of ways to achieve this goal. Because this is my blog, I’m going to talk about the one that doesn’t involve underclocking or reducing graphical settings.
Obviously I’m talking about optimization, the process by which a smart and talented reader of StackOverflow copies and pastes code in exactly the right sequence such that, upon resolving all compilation errors, execution of a program utilizes fewer resources.
Now because everyone reading this is a GPU driver developer, we all know that optimization comes in two forms, optimizing for CPU and optimizing for GPU. But GPU optimization is easy. Anyone can do that. You just strap on your RadeonGPUProfiler or Nsight or whatever other profiler you fancy, stare at the pretty graphs, and the problem spots reveal themselves.
So we’re done with GPU optimization, but we’re not optimized enough yet. The battery still doesn’t last forever, and the users are still complaining on Reddit that they can’t even finish a casual playthrough of Elden Ring or a boss fight in Monster Hunter Rise without needing to plug in.
This brings us to “CPU optimization”, the process by which we use more esoteric tools like perf or dtrace or custom instrumentation to generate possibly-useful traces of where the CPU may or may not be executing optimally because it’s a filthy liar that doesn’t even know what line of code it’s executing half the time. But still, we need test cases, and unlike GPU profiling, CPU profiling typically isn’t useful with only a single frame’s worth of sample data.
Thus, drawoverhead, which provides a view of how various GL operations impact CPU utilization by executing millions of draw calls per second to provide copious samples for profiling.
Why Not drawoverhead?
This is where the blog is going to take a turn for the bizarre. Some people, it seems, don’t want to use Zink for benchmarking and profiling.
I know.
I’m shocked, hurt, appalled, and also it’s me who doesn’t want to use Zink for benchmarking and profiling so it’s a very confusing time.
The problem with using Zink for optimizing CPU usage is that Zink keeps getting in the way. I want to profile only the Vulkan driver, but instead I’ve got all this Mesa oozing and spurting onto my CPU samples. It’s gross, it’s an untenable situation, and I’ve already taken steps to resolve it.
Behold the future: vkoverhead.
With one simple clone, build, and execute, it’s now possible to see how much the Vulkan driver you’re using sucks at any given task.
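In concrete terms, that clone, build, and execute looks something like this, assuming the project’s meson-based build (check the README if any of this has drifted):

$ git clone https://github.com/zmike/vkoverhead
$ cd vkoverhead
$ meson setup build
$ ninja -C build
$ ./build/vkoverhead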
Want to see how costly it is to bind a different pipeline? Got it covered.
Changing vertex buffers? Blam, your performance is garbage.
Starting and stopping renderpasses? Take your entire computer and throw it in the dumpster because that’s where your performance just went.
vkoverhead: Mythbusting
The obvious problem with this is that somebody has to actually dig into the vkoverhead results for each driver and figure out what can be made better. I’ll write another post about this since it’s a separate topic.
Instead, what I want to do today is use vkoverhead to delve into one of the latest and greatest myths of modern Vulkan:
Is the use of fast-linked Graphics Pipeline Libraries worse than, equivalent to, or better than VK_EXT_vertex_input_dynamic_state?
I say this is one of the great myths because, having spoken to Very Knowledgeable Khronos Insiders as well as Experienced Application Developers, I’ve been told repeatedly that VK_EXT_vertex_input_dynamic_state is just a convenience feature that should not be used or relied upon, and proper use of GPL with fast-linking provides the same functionality and performance with broader adoption. But is this really true?
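For reference, here’s roughly what the maligned dynamic-state path looks like. This is a minimal sketch rather than vkoverhead’s actual code: cmdbuf is assumed to be a command buffer in the recording state, the 16-float attribute layout is invented for illustration, and the function pointer has to be fetched with vkGetDeviceProcAddr since it comes from an extension.

#include <vulkan/vulkan.h>

/* The bound pipeline declares VK_DYNAMIC_STATE_VERTEX_INPUT_EXT, so swapping
 * all 16 vertex attributes between draws is one command instead of a
 * pipeline bind. */
static void
draw_with_dynamic_vertex_input(VkCommandBuffer cmdbuf,
                               PFN_vkCmdSetVertexInputEXT set_vertex_input)
{
   VkVertexInputBindingDescription2EXT binding = {
      .sType = VK_STRUCTURE_TYPE_VERTEX_INPUT_BINDING_DESCRIPTION_2_EXT,
      .binding = 0,
      .stride = 16 * sizeof(float),
      .inputRate = VK_VERTEX_INPUT_RATE_VERTEX,
      .divisor = 1,
   };
   VkVertexInputAttributeDescription2EXT attrs[16];
   for (uint32_t i = 0; i < 16; i++) {
      attrs[i] = (VkVertexInputAttributeDescription2EXT){
         .sType = VK_STRUCTURE_TYPE_VERTEX_INPUT_ATTRIBUTE_DESCRIPTION_2_EXT,
         .location = i,
         .binding = 0,
         .format = VK_FORMAT_R32_SFLOAT,
         .offset = i * sizeof(float),
      };
   }
   /* one call replaces an entire pipeline swap */
   set_vertex_input(cmdbuf, 1, &binding, 16, attrs);
   vkCmdDraw(cmdbuf, 3, 1, 0, 0);
}

The GPL alternative bakes the same state into a vertex-input pipeline library and links it into a complete pipeline at draw time, which is exactly what the cases below measure.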
Well, now that the tools exist, it’s possible to say definitively that this sort of wishful thinking does not reflect reality. Let’s check out the numbers. As of the latest vkoverhead 1.1 release, the following cases are available:
$ ./vkoverhead -list
0, draw
1, draw_multi
2, draw_vertex
3, draw_multi_vertex
4, draw_index_change
5, draw_index_offset_change
6, draw_rp_begin_end
7, draw_rp_begin_end_dynrender
8, draw_rp_begin_end_dontcare
9, draw_rp_begin_end_dontcare_dynrender
10, draw_multirt
11, draw_multirt_dynrender
12, draw_multirt_begin_end
13, draw_multirt_begin_end_dynrender
14, draw_multirt_begin_end_dontcare
15, draw_multirt_begin_end_dontcare_dynrender
16, draw_vbo_change
17, draw_1vattrib_change
18, draw_16vattrib
19, draw_16vattrib_16vbo_change
20, draw_16vattrib_change
21, draw_16vattrib_change_dynamic
22, draw_16vattrib_change_gpl
23, draw_16vattrib_change_gpl_hashncache
24, draw_1ubo_change
25, draw_12ubo_change
26, draw_1sampler_change
27, draw_16sampler_change
28, draw_1texelbuffer_change
29, draw_16texelbuffer_change
30, draw_1ssbo_change
31, draw_8ssbo_change
32, draw_1image_change
33, draw_16image_change
34, draw_1imagebuffer_change
35, draw_16imagebuffer_change
36, submit_noop
37, submit_50noop
38, submit_1cmdbuf
39, submit_50cmdbuf
40, submit_50cmdbuf_50submit
41, descriptor_noop
42, descriptor_1ubo
43, descriptor_template_1ubo
44, descriptor_12ubo
45, descriptor_template_12ubo
46, descriptor_1sampler
47, descriptor_template_1sampler
48, descriptor_16sampler
49, descriptor_template_16sampler
50, descriptor_1texelbuffer
51, descriptor_template_1texelbuffer
52, descriptor_16texelbuffer
53, descriptor_template_16texelbuffer
54, descriptor_1ssbo
55, descriptor_template_1ssbo
56, descriptor_8ssbo
57, descriptor_template_8ssbo
58, descriptor_1image
59, descriptor_template_1image
60, descriptor_16image
61, descriptor_template_16image
62, descriptor_1imagebuffer
63, descriptor_template_1imagebuffer
64, descriptor_16imagebuffer
65, descriptor_template_16imagebuffer
66, misc_resolve
67, misc_resolve_mutable
The interesting cases for this scenario are:
21, draw_16vattrib_change_dynamic
22, draw_16vattrib_change_gpl
23, draw_16vattrib_change_gpl_hashncache
- Case 21 is changing 16 vertex attributes between draws using VK_EXT_vertex_input_dynamic_state
- Case 22 is using fast-linking GPL to compile and bind a new pipeline from precompiled partial pipelines between draws (see the sketch after this list)
- Case 23 is using fully precompiled GPL pipelines with hash-n-cache to swap pipelines between draws
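For case 22, the per-draw fast-link step looks roughly like this. Again a sketch rather than vkoverhead’s actual code: device and the four precompiled library parts (vertex input, pre-rasterization shaders, fragment shader, fragment output) are assumed to already exist.

#include <vulkan/vulkan.h>

/* Link four pipeline-library parts into a complete pipeline. Leaving
 * VK_PIPELINE_CREATE_LINK_TIME_OPTIMIZATION_BIT_EXT out of .flags is what
 * makes this the "fast" link; most other fields can stay zeroed because the
 * state comes from the libraries. Error handling omitted for brevity. */
static VkPipeline
fast_link(VkDevice device, VkPipeline libraries[4])
{
   VkPipelineLibraryCreateInfoKHR link = {
      .sType = VK_STRUCTURE_TYPE_PIPELINE_LIBRARY_CREATE_INFO_KHR,
      .libraryCount = 4,
      .pLibraries = libraries,
   };
   VkGraphicsPipelineCreateInfo info = {
      .sType = VK_STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO,
      .pNext = &link,
      .flags = 0, /* no link-time optimization: the fast path */
   };
   VkPipeline pipeline = VK_NULL_HANDLE;
   vkCreateGraphicsPipelines(device, VK_NULL_HANDLE, 1, &info, NULL, &pipeline);
   return pipeline;
}

Case 22 pays that vkCreateGraphicsPipelines call between every pair of draws; case 23 instead hashes the changed state and looks up a fully precompiled pipeline in a table, only paying compile costs on a miss.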
Running all of these tests on NVIDIA’s driver (the only hardware driver to fully support both extensions) on an AMD Ryzen 9 5900X with a 3060 Ti yields the following:
Case | Draws per second
---|---
21 draw_16vattrib_change_dynamic | 7,965,000
22 draw_16vattrib_change_gpl | 315,000
23 draw_16vattrib_change_gpl_hashncache | 4,020,000
Staggeringly, it turns out that GPL is worse in every scenario. Even the speed of the typical Vulkan hash-n-cache usage can’t make up for the fact that VK_EXT_vertex_input_dynamic_state genuinely is that much faster. And assuming the driver isn’t doing low-GPU-performance heroics, that means everyone drinking the kool-aid about not using or not implementing VK_EXT_vertex_input_dynamic_state should be reconsidering their beverage of choice.
This isn’t to say Graphics Pipeline Library is bad or should not be used.
GPL is one of the best extensions Vulkan has to offer, and it definitely provides a huge number of features that every application developer should be examining to see how they might improve performance.
But it’s not better than VK_EXT_vertex_input_dynamic_state.
More vkoverhead
The project is already in a state where at least one major GPU vendor is utilizing it to drive down CPU usage. If you’re a GPU driver engineer, or perhaps if you’re someone who does benchmarking for a popular tech news site, you should check out vkoverhead too.
Some key takeaways:
- Raw numbers can be compared between different GPUs and GPU drivers so long as the rest of the system stays the same
  - This is how I know that AMDPRO currently performs better than RADV
- If the rest of the system is not the same between drivers/GPUs, the percentage differences can still be compared
Stay tuned for an upcoming post where I teach a course in making your own spaghetti. Also some other things that give zink a 25%+ performance boost.