SP33D2

Need Another Hit

Ever stop in the middle of writing some code when it hits you? That nagging feeling. It’s in the back of your mind, relentlessly gnawing away. Making you question your decisions. Making you wonder what you’re doing with your life. You’re plugging away happily, when suddenly you remember: I haven’t optimized any code in hours.

It’s no way to work.

It’s no way to live.

Let’s get back to our roots, blog.

New Edition

For this edition of the blog, we’re hopping off the usual tracks and onto a different set entirely: Vulkan driver optimization. I can already hear what you’re thinking.

Vulkan drivers are already fast. Go back to doing something useful, like making GL work.

First: no they’re not.

Second: I’m doing the opposite of that.

Third: shut up, it’s my blog.

How does one optimize Vulkan drivers? As we all know, any great optimization needs a test case. In the case of Vulkan, everyone who’s anyone (me) uses Zink as their test case. The reasons for this are many and varied because I say they are, but the important one to keep in mind is, as always, drawoverhead.

For those who can’t remember the times I have previously blogged about the world’s premier benchmarking tool, don’t worry. As with any great form of entertainment, I’ve prepared a recap.

TL;DR drawoverhead

Suppose you are a large gaming-oriented company that sells hardware. Your hardware runs on a battery. The battery has a finite charge. Every bit of power drained from the battery powers the device so that users can play games.

Wouldn’t it be great if that battery lasted longer?

There are a number of ways to achieve this goal. Because this is my blog, I’m going to talk about the one that doesn’t involve underclocking or reducing graphical settings.

Obviously I’m talking about optimization, the process by which a smart and talented reader of StackOverflow copies and pastes code in exactly the right sequence such that, upon resolving all compilation errors, execution of a program utilizes fewer resources.

Now because everyone reading this is a GPU driver developer, we all know that optimization comes in two forms, optimizing for CPU and optimizing for GPU. But GPU optimization is easy. Anyone can do that. You just strap on your RadeonGPUProfiler or Nsight, glance at the output, mumble mumble mumble, and blammo, your GPU is running like a clock. A really fast one though. With like a trillion hands all spinning at once.

So we’re done with GPU optimization, but we’re not optimized enough yet. The battery still doesn’t last forever, and the users are still complaining on Reddit that they can’t even finish a casual playthrough of Elden Ring or a boss fight in Monster Hunter Rise without needing to plug in.

This brings us to “CPU optimization”, the process by which we use more esoteric tools like perf or dtrace or custom instrumentation to generate possibly-useful traces of where the CPU may or may not be executing optimally because it’s a filthy liar that doesn’t even know what line of code it’s executing half the time. But still, we need test cases, and unlike GPU profiling, CPU profiling typically isn’t useful with only a single frame’s worth of sample data.

Thus, drawoverhead, which provides a view of how various GL operations impact CPU utilization by executing millions of draw calls per second to provide copious samples for profiling.
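
For the morbidly curious, a profiling session here is nothing fancy. Something like this is enough to generate a mountain of samples (the binary path is illustrative; drawoverhead ships as part of piglit):

$ perf record --call-graph dwarf ./drawoverhead
$ perf report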

Why Not drawoverhead?

This is where the blog is going to take a turn for the bizarre. Some people, it seems, don’t want to use Zink for benchmarking and profiling.

I know.

I’m shocked, hurt, appalled, and also it’s me who doesn’t want to use Zink for benchmarking and profiling, so it’s a very confusing time.

The problem with using Zink for optimizing CPU usage is that Zink keeps getting in the way. I want to profile only the Vulkan driver, but instead I’ve got all this Mesa oozing and spurting onto my CPU samples. It’s gross, it’s an untenable situation, and I’ve already taken steps to resolve it.

Behold the future: vkoverhead.

With one simple clone, build, and execute, it’s now possible to see how much the Vulkan driver you’re using sucks at any given task.
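
And when I say one simple clone, build, and execute, I mean it. Something like the following (the project builds with meson; the exact binary path may vary by checkout):

$ git clone https://github.com/zmike/vkoverhead
$ cd vkoverhead
$ meson setup build
$ ninja -C build
$ ./build/vkoverhead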

Want to see how costly it is to bind a different pipeline? Got it covered.

Changing vertex buffers? Blam, your performance is garbage.

Starting and stopping renderpasses? Take your entire computer and throw it in the dumpster because that’s where your performance just went.

vkoverhead: Mythbusting

The obvious problem with this is that somebody has to actually dig into the vkoverhead results for each driver and figure out what can be made better. I’ll write another post about this since it’s a separate topic.

Instead, what I want to do today is use vkoverhead to delve into one of the latest and greatest myths of modern Vulkan:

Is the use of fast-linked Graphics Pipeline Libraries worse than, equivalent to, or better than VK_EXT_vertex_input_dynamic_state?

I say this is one of the great myths because, having spoken to Very Knowledgeable Khronos Insiders as well as Experienced Application Developers, I’ve been told repeatedly that VK_EXT_vertex_input_dynamic_state is just a convenience feature that should not be used or relied upon, and proper use of GPL with fast-linking provides the same functionality and performance with broader adoption. But is this really true?

Well, now that the tools exist, it’s possible to say definitively that this sort of wishful thinking does not reflect reality. Let’s check out the numbers. As of the latest 1.1 vkoverhead release, the following cases are available:

$ ./vkoverhead -list
   0, draw
   1, draw_multi
   2, draw_vertex
   3, draw_multi_vertex
   4, draw_index_change
   5, draw_index_offset_change
   6, draw_rp_begin_end
   7, draw_rp_begin_end_dynrender
   8, draw_rp_begin_end_dontcare
   9, draw_rp_begin_end_dontcare_dynrender
  10, draw_multirt
  11, draw_multirt_dynrender
  12, draw_multirt_begin_end
  13, draw_multirt_begin_end_dynrender
  14, draw_multirt_begin_end_dontcare
  15, draw_multirt_begin_end_dontcare_dynrender
  16, draw_vbo_change
  17, draw_1vattrib_change
  18, draw_16vattrib
  19, draw_16vattrib_16vbo_change
  20, draw_16vattrib_change
  21, draw_16vattrib_change_dynamic
  22, draw_16vattrib_change_gpl
  23, draw_16vattrib_change_gpl_hashncache
  24, draw_1ubo_change
  25, draw_12ubo_change
  26, draw_1sampler_change
  27, draw_16sampler_change
  28, draw_1texelbuffer_change
  29, draw_16texelbuffer_change
  30, draw_1ssbo_change
  31, draw_8ssbo_change
  32, draw_1image_change
  33, draw_16image_change
  34, draw_1imagebuffer_change
  35, draw_16imagebuffer_change
  36, submit_noop
  37, submit_50noop
  38, submit_1cmdbuf
  39, submit_50cmdbuf
  40, submit_50cmdbuf_50submit
  41, descriptor_noop
  42, descriptor_1ubo
  43, descriptor_template_1ubo
  44, descriptor_12ubo
  45, descriptor_template_12ubo
  46, descriptor_1sampler
  47, descriptor_template_1sampler
  48, descriptor_16sampler
  49, descriptor_template_16sampler
  50, descriptor_1texelbuffer
  51, descriptor_template_1texelbuffer
  52, descriptor_16texelbuffer
  53, descriptor_template_16texelbuffer
  54, descriptor_1ssbo
  55, descriptor_template_1ssbo
  56, descriptor_8ssbo
  57, descriptor_template_8ssbo
  58, descriptor_1image
  59, descriptor_template_1image
  60, descriptor_16image
  61, descriptor_template_16image
  62, descriptor_1imagebuffer
  63, descriptor_template_1imagebuffer
  64, descriptor_16imagebuffer
  65, descriptor_template_16imagebuffer
  66, misc_resolve
  67, misc_resolve_mutable

The interesting cases for this scenario are:

  21, draw_16vattrib_change_dynamic
  22, draw_16vattrib_change_gpl
  23, draw_16vattrib_change_gpl_hashncache
  • Case 21 is changing 16 vertex attributes between draws using VK_EXT_vertex_input_dynamic_state
  • Case 22 is using fast-linking GPL to compile and bind a new pipeline from precompiled partial pipelines between draws
  • Case 23 is using fully precompiled GPL pipelines with hash-n-cache to swap pipelines between draws
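
To make this concrete, here’s roughly what the hot loop of case 21 boils down to. This is a sketch rather than the actual vkoverhead source; the helper names and variables are hypothetical stand-ins for real setup code:

/* case 21, sketched: swap vertex input state between draws using
 * VK_EXT_vertex_input_dynamic_state; nothing is compiled or bound in
 * the loop, only dynamic state changes */
#include <vulkan/vulkan.h>

/* extension entrypoint, loaded earlier via vkGetDeviceProcAddr */
extern PFN_vkCmdSetVertexInputEXT p_vkCmdSetVertexInputEXT;

void
draw_16vattrib_change_dynamic(VkCommandBuffer cmdbuf,
                              const VkVertexInputBindingDescription2EXT *bindings,
                              const VkVertexInputAttributeDescription2EXT *attribs_a,
                              const VkVertexInputAttributeDescription2EXT *attribs_b,
                              unsigned iterations)
{
   for (unsigned i = 0; i < iterations; i++) {
      /* alternate between two sets of 16 attribute descriptions */
      const VkVertexInputAttributeDescription2EXT *attribs =
         (i & 1) ? attribs_b : attribs_a;
      p_vkCmdSetVertexInputEXT(cmdbuf, 16, bindings, 16, attribs);
      vkCmdDraw(cmdbuf, 3, 1, 0, 0);
   }
}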

Running all of these tests on NVIDIA’s driver (the only hardware driver to fully support both extensions) on an AMD Ryzen 9 5900X with a 3060 Ti yields the following:

Case                                        Draws per second
21, draw_16vattrib_change_dynamic                  7,965,000
22, draw_16vattrib_change_gpl                        315,000
23, draw_16vattrib_change_gpl_hashncache           4,020,000
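
For clarity, the hash-n-cache in case 23 is the usual trick: hash the state that would otherwise be dynamic, then reuse a previously compiled pipeline on a match. A minimal sketch, where compile_full_pipeline() is a hypothetical slow-path compile and raw byte hashing stands in for a proper per-field hash:

#include <stddef.h>
#include <stdint.h>
#include <vulkan/vulkan.h>

#define CACHE_SIZE 1024 /* power of two for cheap masking */

struct cache_entry {
   uint64_t hash;
   VkPipeline pipeline;
};

static struct cache_entry cache[CACHE_SIZE];

/* hypothetical: full (slow) pipeline compilation on a cache miss;
 * in case 23 the pipelines are all precompiled, so the benchmark
 * loop only ever hits */
extern VkPipeline compile_full_pipeline(const VkVertexInputAttributeDescription2EXT *attribs);

/* FNV-1a; real code hashes only the meaningful fields rather than
 * raw struct bytes (padding, pNext, ...) */
static uint64_t
hash_bytes(const void *data, size_t size)
{
   const uint8_t *bytes = data;
   uint64_t h = 0xcbf29ce484222325ull;
   for (size_t i = 0; i < size; i++)
      h = (h ^ bytes[i]) * 0x100000001b3ull;
   return h;
}

VkPipeline
get_pipeline(const VkVertexInputAttributeDescription2EXT attribs[16])
{
   uint64_t hash = hash_bytes(attribs, sizeof(*attribs) * 16);
   struct cache_entry *e = &cache[hash & (CACHE_SIZE - 1)];
   if (e->pipeline == VK_NULL_HANDLE || e->hash != hash) {
      e->hash = hash;
      e->pipeline = compile_full_pipeline(attribs);
   }
   return e->pipeline;
}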

Staggeringly, it turns out that GPL is worse in every scenario. Even the speed of the typical Vulkan hash-n-cache usage can’t make up for the fact that VK_EXT_vertex_input_dynamic_state genuinely is that much faster. And assuming the driver isn’t doing low-GPU-performance heroics, that means everyone drinking the koolaid about not using or not implementing VK_EXT_vertex_input_dynamic_state should be reconsidering their beverage of choice.
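
For contrast, this is roughly what case 22 pays on every state change: a full vkCreateGraphicsPipelines call that links four precompiled library pipelines. Again a sketch; creating the libraries themselves (with VK_PIPELINE_CREATE_LIBRARY_BIT_KHR and the appropriate VkGraphicsPipelineLibraryCreateInfoEXT stages) is elided:

#include <stddef.h>
#include <vulkan/vulkan.h>

VkPipeline
fast_link(VkDevice dev, VkPipelineLayout layout,
          VkPipeline vertex_input_lib, VkPipeline pre_raster_lib,
          VkPipeline fragment_lib, VkPipeline output_lib)
{
   const VkPipeline libs[] = {
      vertex_input_lib, pre_raster_lib, fragment_lib, output_lib,
   };
   const VkPipelineLibraryCreateInfoKHR link = {
      .sType = VK_STRUCTURE_TYPE_PIPELINE_LIBRARY_CREATE_INFO_KHR,
      .libraryCount = 4,
      .pLibraries = libs,
   };
   /* no VK_PIPELINE_CREATE_LINK_TIME_OPTIMIZATION_BIT_EXT here:
    * omitting it is what makes this the "fast" link */
   const VkGraphicsPipelineCreateInfo info = {
      .sType = VK_STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO,
      .pNext = &link,
      .layout = layout,
   };
   VkPipeline pipeline = VK_NULL_HANDLE;
   vkCreateGraphicsPipelines(dev, VK_NULL_HANDLE, 1, &info, NULL, &pipeline);
   return pipeline;
}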

This isn’t to say Graphics Pipeline Library is bad or should not be used.

GPL is one of the best extensions Vulkan has to offer, and it definitely provides a huge number of features that every application developer should be examining to see how they might improve performance.

But it’s not better than VK_EXT_vertex_input_dynamic_state.

More vkoverhead

The project is already in a state where at least one major GPU vendor is utilizing it to drive down CPU usage. If you’re a GPU driver engineer, or perhaps if you’re someone who does benchmarking for a popular tech news site, you should check out vkoverhead too.

Some key takeaways:

  • Raw numbers can be compared between different GPUs and GPU drivers so long as the rest of the system stays the same
    • This is how I know that AMDPRO currently performs better than RADV
  • If the rest of the system is not the same between drivers/GPUs, the percentage differences can still be compared

Stay tuned for an upcoming post where I teach a course in making your own spaghetti. Also some other things that give Zink a 25%+ performance boost.

Written on September 8, 2022