Overdue

I Meant To Write This Some Time Ago But Then I Didn’t

The blog has been there before, it’ll be there again, I’m over it.

Let’s talk about the only thing anyone cares about on this blog for the 23.1 release cycle: perf.

What does it mean? Who knows.

How does one acquire it? See above.

But everyone wants it, so I’ve gotta deliver like a hypothetical shipping company that delivers on-time and actually rings the bell when they drop something off instead of just hucking a box onto my front porch and running away to leave my stuff sitting out to get snowed on.

Unlike such a hypothetical shipping company which doesn’t exist, perf does exist, and I’ve got it now. Lots of it.

Let’s talk about what parts of my soul I had to sell to get to this point in my life.

Time-bending

Fans of the blog will recall I previously wrote a post that had great SEO for problem spaces and temporal ordering of events. This functionality, rewriting GL commands on-the-fly, has since been dubbed “time-bending” by prominent community members, so I guess that’s what we’re calling it since naming is the most important part of any technical challenge.

Initially, I implemented time-bending to be able to reorder some barriers and transfer operations and avoid splitting renderpasses, something that only the savages living in tiling GPU land care about. Is it useful on immediate-mode rendering GPUs, the ones we use in our daily lives for real work (NOT GAMING DON’T ASK ABOUT GAMING)?

Maybe? Outlook uncertain. Try again later.

So I’m implementing these great features with great names that are totally getting used by real people, and then notable perf anthropologist Emma Anholt comes to me and asks why zink is spewing out a thousand TRANSFER barriers for sequences of subdata calls. I’m sitting in front of my computer staring at the charts, checking over the data, looking at the graphs, and I don’t have an answer.

At that point in my life, three weeks ago before I embarked on a sleepless, not-leaving-my-house-until-the-perf-was-extant adventure that got me involved with dark forces too horrifying to describe, I thought I knew how to perf. It’s something anyone can do, isn’t it?

Obviously not. perf isn’t even a verb.

The First Step

I had to start by discarding any ideas I had about perf. Clearly I knew nothing based on the reports I was getting: zink was being absolutely slaughtered in benchmarks across the board to a truly embarrassing degree by both ANGLE and native drivers.

Again, NOT GAMING RELATED DON’T EVEN THINK ABOUT ASKING ABOUT GAMING I DON’T PLAY GAMES.

This question about TRANSFER barriers ended up being the spark of inspiration that would lead to perf. Eventually.

To make it simple, suppose this sequence of operations occurs:

  • copy(src_buffer, dst_buffer, src_offset=0, dst_offset=0, size=64)
  • copy(src_buffer, dst_buffer, src_offset=128, dst_offset=128, size=64)
  • copy(src_buffer, dst_buffer, src_offset=256, dst_offset=256, size=64)

Now here’s a trick question: are barriers required between these operations?

If you answered no, you’re right. But also you’re wrong if you’re on certain hardware/driver combos which are (currently) extremely broken. Let’s pretend we don’t know about those for the sake of what sanity remains.

But what did zink, as of a few weeks ago, do?

  • copy(src_buffer, dst_buffer, src_offset=0, dst_offset=0, size=64)
  • barrier(TRANSFER)
  • copy(src_buffer, dst_buffer, src_offset=128, dst_offset=128, size=64)
  • barrier(TRANSFER)
  • copy(src_buffer, dst_buffer, src_offset=256, dst_offset=256, size=64)

Brilliant. Instead of pipelining the copies, the driver is instead pointlessly stalling between each one. This is what perf looks like, right?

The simple fix for this (that I implemented) was to track “live” transfer regions on resources, i.e., the unflushed regions that have writes pending in the current batch, and then query these regions for any subsequent read/write operation to determine whether a barrier is needed. This enables omitting any and all barriers for non-overlapping regions. Thus, the command stream looks like:

  • copy(src_buffer, dst_buffer, src_offset=0, dst_offset=0, size=64)
  • copy(src_buffer, dst_buffer, src_offset=128, dst_offset=128, size=64)
  • copy(src_buffer, dst_buffer, src_offset=256, dst_offset=256, size=64)

without any barriers stalling the GPU workload.

But Wait…

Because this still isn’t perf.

Let’s look at another sequence of operations:

  • copy(src_buffer, dst_buffer, src_offset=0, dst_offset=0, size=64)
  • draw(index_buffer=dst_buffer, index_offset=0)
  • copy(src_buffer, dst_buffer, src_offset=128, dst_offset=128, size=64)
  • draw(index_buffer=dst_buffer, index_offset=128)
  • copy(src_buffer, dst_buffer, src_offset=256, dst_offset=256, size=64)
  • draw(index_buffer=dst_buffer, index_offset=256)

This is some pretty standard stream upload index/vertex buffer behavior when using glthread: glthread handles the upload to a staging buffer and then punts to the driver for a GPU copy, eliminating a CPU data copy that would otherwise happen between the eventual buffer->buffer copy.

And how was the glorious zink driver handling such a scenario?

  • copy(src_buffer, dst_buffer, src_offset=0, dst_offset=0, size=64)
  • barrier(src=TRANSFER, dst=INDEX_READ)
  • draw(index_buffer=dst_buffer, index_offset=0)
  • copy(src_buffer, dst_buffer, src_offset=128, dst_offset=128, size=64)
  • barrier(src=TRANSFER, dst=INDEX_READ)
  • draw(index_buffer=dst_buffer, index_offset=128)
  • copy(src_buffer, dst_buffer, src_offset=256, dst_offset=256, size=64)
  • barrier(src=TRANSFER, dst=INDEX_READ)
  • draw(index_buffer=dst_buffer, index_offset=256)

thonking.png

But this, too, can be optimized to give us foolish graphics experts a brief glance at perf. By leveraging the previous region tracking, it becomes possible to execute time-bending even more powerfully: instead of checking whether a buffer has been read from and disabling reordering, the driver can check whether a “live” region has been tracked for the operation. For example:

  • copy(src_buffer, dst_buffer, src_offset=0, dst_offset=0, size=64)
    • track(dst_buffer, offset=0, size=64)
  • barrier(src=TRANSFER, dst=INDEX_READ)
  • draw(index_buffer=dst_buffer, index_offset=0)
  • copy(src_buffer, dst_buffer, src_offset=128, dst_offset=128, size=64)
    • track(dst_buffer, offset=128, size=64)
  • barrier(src=TRANSFER, dst=INDEX_READ)
  • draw(index_buffer=dst_buffer, index_offset=128)
  • copy(src_buffer, dst_buffer, src_offset=256, dst_offset=256, size=64)
    • track(dst_buffer, offset=256, size=64)
  • barrier(src=TRANSFER, dst=INDEX_READ)
  • draw(index_buffer=dst_buffer, index_offset=256)

implements tracking, which can be used like:

  • check_liveness(dst_buffer, offset=0, size=64)
    • if (check_liveness==false) track(dst_buffer, offset=0, size=64)
  • copy(src_buffer, dst_buffer, src_offset=0, dst_offset=0, size=64)
  • barrier(src=TRANSFER, dst=INDEX_READ)
  • draw(index_buffer=dst_buffer, index_offset=0)
  • check_liveness(dst_buffer, offset=128, size=64)
    • if (check_liveness==false) track(dst_buffer, offset=128, size=64)
  • copy(src_buffer, dst_buffer, src_offset=128, dst_offset=128, size=64)
  • barrier(src=TRANSFER, dst=INDEX_READ)
  • draw(index_buffer=dst_buffer, index_offset=128)
  • check_liveness(dst_buffer, offset=256, size=64)
    • if (check_liveness==false) track(dst_buffer, offset=256, size=64)
  • copy(src_buffer, dst_buffer, src_offset=256, dst_offset=256, size=64)
  • barrier(src=TRANSFER, dst=INDEX_READ)
  • draw(index_buffer=dst_buffer, index_offset=256)

which can then be transformed to:

  • check_liveness(dst_buffer, offset=0, size=64) = false
    • if (check_liveness==false) track(dst_buffer, offset=0, size=64)
  • copy(src_buffer, dst_buffer, src_offset=0, dst_offset=0, size=64)
  • check_liveness(dst_buffer, offset=128, size=64)
    • if (check_liveness==false) track(dst_buffer, offset=128, size=64)
    • if (check_liveness==false) copy(src_buffer, dst_buffer, src_offset=128, dst_offset=128, size=64)
  • check_liveness(dst_buffer, offset=256, size=64)
    • if (check_liveness==false) copy(src_buffer, dst_buffer, src_offset=256, dst_offset=256, size=64)
  • barrier(src=TRANSFER, dst=INDEX_READ)
  • draw(index_buffer=dst_buffer, index_offset=0)
  • check_liveness(dst_buffer, offset=128, size=64)
    • if (check_liveness==false) track(dst_buffer, offset=128, size=64)
    • if (check_liveness==true) copy(src_buffer, dst_buffer, src_offset=128, dst_offset=128, size=64)
    • if (check_liveness==true) barrier(src=TRANSFER, dst=INDEX_READ)
  • draw(index_buffer=dst_buffer, index_offset=128)
  • check_liveness(dst_buffer, offset=256, size=64)
    • if (check_liveness==false) track(dst_buffer, offset=256, size=64)
    • if (check_liveness==true) copy(src_buffer, dst_buffer, src_offset=256, dst_offset=256, size=64)
    • if (check_liveness==true) barrier(src=TRANSFER, dst=INDEX_READ)
  • draw(index_buffer=dst_buffer, index_offset=256)

Practically unreadable, so let’s clean it up:

  • check_liveness(dst_buffer, offset=0, size=64) = false
    • if (check_liveness==false) track(dst_buffer, offset=0, size=64)
  • copy(src_buffer, dst_buffer, src_offset=0, dst_offset=0, size=64)
  • check_liveness(dst_buffer, offset=128, size=64)
    • if (check_liveness==false) track(dst_buffer, offset=128, size=64)
    • if (check_liveness==false) copy(src_buffer, dst_buffer, src_offset=128, dst_offset=128, size=64)
  • check_liveness(dst_buffer, offset=256, size=64)
    • if (check_liveness==false) track(dst_buffer, offset=256, size=64)
    • if (check_liveness==false) copy(src_buffer, dst_buffer, src_offset=256, dst_offset=256, size=64)
  • barrier(src=TRANSFER, dst=INDEX_READ)
  • draw(index_buffer=dst_buffer, index_offset=0)
  • draw(index_buffer=dst_buffer, index_offset=128)
  • draw(index_buffer=dst_buffer, index_offset=256)

Which, functionally, becomes:

  • track(dst_buffer, offset=0, size=64)
  • copy(src_buffer, dst_buffer, src_offset=0, dst_offset=0, size=64)
  • track(dst_buffer, offset=128, size=64)
  • copy(src_buffer, dst_buffer, src_offset=128, dst_offset=128, size=64)
  • track(dst_buffer, offset=256, size=64)
  • copy(src_buffer, dst_buffer, src_offset=256, dst_offset=256, size=64)
  • barrier(src=TRANSFER, dst=INDEX_READ)
  • draw(index_buffer=dst_buffer, index_offset=0)
  • draw(index_buffer=dst_buffer, index_offset=128)
  • draw(index_buffer=dst_buffer, index_offset=256)

All the copies are pipelined, all the draws are pipelined, and there’s only one barrier.

This is perf.

Stay tuned, because Mesa 23.1 is going to do things with perf that zink users have never seen before.

Written on March 17, 2023