
Today, the Khronos Group released the 1.4 specification of Vulkan, the standard graphics API. The Asahi Linux project is proud to announce the first Vulkan 1.4 driver for Apple hardware. Khronos has recognized our Honeykrisp driver as conformant to the new version since day one.
That driver is already available in our official repositories. After installing Fedora Asahi Remix, run `dnf upgrade --refresh` to get the latest drivers.
Vulkan 1.4 standardizes several important features, including timestamps and dynamic rendering local read. The industry expects that these features will become more common, and we are prepared.
Releasing a conformant driver reflects our commitment to graphics standards and software freedom. Asahi Linux is also compatible with OpenGL 4.6, OpenGL ES 3.2, and OpenCL 3.0, all conformant to the relevant specifications. For that matter, ours are the only conformant drivers on Apple hardware for any graphics standard.
Although the driver is released, you still need to build an experimental version of Vulkan-Loader to access the new Vulkan version. Nevertheless, you can immediately use all the new features as extensions in our Vulkan 1.3 driver.
For more information, see the Khronos blog post.

Gaming on Linux on M1 is here! We're thrilled to release our Asahi game playing toolkit, which integrates our Vulkan 1.3 drivers with x86 emulation and Windows compatibility. Plus a bonus: conformant OpenCL 3.0.
Asahi Linux now ships the only conformant OpenGL®, OpenCL™, and Vulkan® drivers for this hardware. As for gaming… while today's release is an alpha, Control runs well!

Installation
First, install Fedora Asahi Remix. Once installed, get the latest drivers with `dnf upgrade --refresh && reboot`. Then just `dnf install steam` and play. While all M1/M2-series systems work, most games require 16GB of memory due to emulation overhead.
The stack
Games are typically x86 Windows binaries rendering with DirectX, while our target is Arm Linux with Vulkan. We need to handle each difference:
- FEX emulates x86 on Arm.
- Wine translates Windows to Linux.
- DXVK and vkd3d-proton translate DirectX to Vulkan.
There's one curveball: page size. Operating systems allocate memory in fixed-size "pages". If an application expects smaller pages than the system uses, it will break due to insufficient alignment of allocations. That's a problem: x86 expects 4K pages but Apple systems use 16K pages.
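As a quick illustration (my example, not part of the Asahi stack), a few lines of portable C show the page size the kernel reports; under Asahi Linux this prints 16384, while typical x86 systems print 4096:
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* sysconf reports the kernel's page size: 16K on Asahi Linux, 4K on most x86 systems.
       Binaries built assuming the smaller size can misalign their allocations. */
    printf("page size: %ld bytes\n", sysconf(_SC_PAGESIZE));
    return 0;
}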
While Linux can't mix page sizes between processes, it can virtualize another Arm Linux kernel with a different page size. So we run games inside a tiny virtual machine using muvm, passing through devices like the GPU and game controllers. The hardware is happy because the system is 16K, the game is happy because the virtual machine is 4K, and you're happy because you can play Fallout 4.

Vulkan
The final piece is a grown-up Vulkan driver, since translating DirectX requires Vulkan 1.3 with many extensions. Back in April, I wrote Honeykrisp, the only Vulkan 1.3 driver for Apple hardware. I've since added DXVK support. Let's look at some new features.
Tessellation
Tessellation enables games like The Witcher 3 to generate geometry. The M1 has hardware tessellation, but it is too limited for DirectX, Vulkan, or OpenGL. We must instead tessellate with arcane compute shaders, as detailed in today's talk at XDC2024.

Geometry shaders
Geometry shaders are an older, cruder method to generate geometry. Like tessellation, the M1 lacks geometry shader hardware so we emulate with compute. Is that fast? No, but geometry shaders are slow even on desktop GPUs. They don't need to be fast — just fast enough for games like Ghostrunner.

Enhanced robustness
"Robustness" permits an application's shaders to access buffers out-of-bounds without crashing the hardware. In OpenGL and Vulkan, out-of-bounds loads may return arbitrary elements, and out-of-bounds stores may corrupt the buffer. Our OpenGL driver exploits this definition for efficient robustness on the M1.
Some games require stronger guarantees. In DirectX, out-of-bounds loads return zero, and out-of-bounds stores are ignored. DXVK therefore requires `VK_EXT_robustness2`, a Vulkan extension strengthening robustness.
Like before, we implement robustness with compare-and-select instructions. A naïve implementation would compare a loaded index with the buffer size and select a zero result if out-of-bounds. However, our GPU loads are vector while arithmetic is scalar. Even if we disabled page faults, we would need up to four compare-and-selects per load.
load R, buffer, index * 16
ulesel R[0], index, size, R[0], 0
ulesel R[1], index, size, R[1], 0
ulesel R[2], index, size, R[2], 0
ulesel R[3], index, size, R[3], 0
There's a trick: reserve 64 gigabytes of zeroes using virtual memory voodoo. Since every 32-bit index multiplied by 16 fits in 64 gigabytes, any index into this region loads zeroes. For out-of-bounds loads, we simply replace the buffer address with the reserved address while preserving the index. Replacing a 64-bit address costs just two 32-bit compare-and-selects.
ulesel buffer.lo, index, size, buffer.lo, RESERVED.lo
ulesel buffer.hi, index, size, buffer.hi, RESERVED.hi
load R, buffer, index * 16
Two instructions, not four.
Next steps
Sparse texturing is next for Honeykrisp, which will unlock more DX12 games. The alpha already runs DX12 games that don't require sparse, like Cyberpunk 2077.

While many games are playable, newer AAA titles don't hit 60fps yet. Correctness comes first. Performance improves next. Indie games like Hollow Knight do run full speed.

Beyond gaming, we're adding general purpose x86 emulation based on this stack. For more information, see the FAQ.
Today's alpha is a taste of what's to come. Not the final form, but enough to enjoy Portal 2 while we work towards "1.0".

Acknowledgements
This work has been years in the making with major contributions from…
- Alyssa Rosenzweig
- Asahi Lina
- chaos_princess
- Davide Cavalca
- Dougall Johnson
- Ella Stanforth
- Faith Ekstrand
- Janne Grunau
- Karol Herbst
- marcan
- Mary Guillemard
- Neal Gompa
- Sergio López
- TellowKrinkle
- Teoh Han Hui
- Rob Clark
- Ryan Houdek
… Plus hundreds of developers whose work we build upon, spanning the Linux, Mesa, Wine, and FEX projects. Today's release is thanks to the magic of open source.
We hope you enjoy the magic.
Happy gaming.

Finally, conformant Vulkan for the M1! The new "Honeykrisp" driver is the first conformant Vulkan® for Apple hardware on any operating system, implementing the full 1.3 spec without "portability" waivers.
Honeykrisp is not yet released for end users. We're continuing to add features, improve performance, and port to more hardware. Source code is available for developers.

Honeykrisp is not based on prior M1 Vulkan efforts, but rather Faith Ekstrand's open source NVK driver for NVIDIA GPUs. In her words:
All Vulkan drivers in Mesa trace their lineage to the Intel Vulkan driver and started by copying+pasting from it. My hope is that NVK will eventually become the driver that everyone copies and pastes from. To that end, I'm building NVK with all the best practices we've developed for Vulkan drivers over the last 7.5 years and trying to keep the code-base clean and well-organized.
Why spend years implementing features from scratch when we can reuse NVK? There will be friction starting out, given NVIDIA's desktop architecture differs from the M1's mobile roots. In exchange, we get a modern driver designed for desktop games.
We'll need to pass a half-million tests ensuring correctness, submit the results, and then we'll become conformant after 30 days of industry review. Starting from NVK and our OpenGL 4.6 driver… can we write a driver passing the Vulkan 1.3 conformance test suite faster than the 30 day review period?
It's unprecedented…
Challenge accepted.
April 2
It begins with a text.
Faith… I think I want to write a Vulkan driver.
Her advice?
Just start typing.
There's no copy-pasting yet — we just add M1 code to NVK and remove NVIDIA as we go. Since the kernel mediates our access to the hardware, we begin connecting "NVK" to Asahi Lina's kernel driver using code shared with OpenGL. Then we plug in our shader compiler and hit the hay.
April 3
To access resources, GPUs use "descriptors" containing the address, format, and size of a resource. Vulkan bundles descriptors into "sets" per the application's "descriptor set layout". When compiling shaders, the driver lowers descriptor accesses to marry the set layout with the hardware's data structures. As our descriptors differ from NVIDIA's, our next task is adapting NVK's descriptor set lowering. We start with a simple but correct approach, deleting far more code than we add.
April 4
With working descriptors, we can compile compute shaders. Now we program the fixed-function hardware to dispatch compute. We first add bookkeeping to map Vulkan command buffers to lists of M1 "control streams", then we generate a compute control stream. We copy that code from our OpenGL driver, translate the GL into Vulkan, and compute works.
That's enough to move on to "copies" of buffers and images. We implement Vulkan's copies with compute shaders, internally dispatched with Vulkan commands as if we were the application. The first copy test passes.
April 5
Fleshing out yesterday's code, all copy tests pass.
April 6
We're ready to tackle graphics. The novelty is handling graphics state like depth/stencil. That's straightforward, but there's a lot of state to handle. Faith's code collects all "dynamic state" into a single structure, which we translate into hardware control words. As usual, we grab that translation from our OpenGL driver, blend with NVK, and move on.
April 7
What makes state "dynamic"? Dynamic state can change without recompiling shaders. By contrast, static state is baked into shader binaries called "pipelines". If games create all their pipelines during a loading screen, there is no compiler "stutter" during gameplay. The idea hasn't quite panned out: many game developers don't know their state ahead-of-time so cannot create pipelines early. In response, Vulkan has made ever more state dynamic, punctuated with the `EXT_shader_object` extension that makes pipelines optional.
We want full dynamic state and shader objects. Unfortunately, the M1 bakes random state into shaders: vertex attributes, fragment outputs, blending, even linked interpolation qualifiers. Like most of the industry in the 2010s, the M1's designers bet on pipelines.
Faced with this hardware, a reasonable driver developer would double-down on pipelines. DXVK would stutter, but we'd pass conformance.
I am not reasonable.
To eliminate stuttering in OpenGL, we make state dynamic with four strategies:
- Conditional code.
- Precompiled variants.
- Indirection.
- Prologs and epilogs.
Wait, what-a-logs?
AMD also bakes state into shaders… with a twist. They divide the hardware binary into three parts: a prolog, the shader, and an epilog. Confining dynamic state to the periphery eliminates shader variants. They compile prologs and epilogs on the fly, but that's fast and doesn't stutter. Linking shader parts is a quick concatenation, or long jumps avoid linking altogether. This strategy works for the M1, too.
For Honeykrisp, let's follow NVK's lead and treat all state as dynamic. No other Vulkan driver has implemented full dynamic state and shader objects this early on, but it avoids refactoring later. Today we add the code to build, compile, and cache prologs and epilogs.
Putting it together, we get a (dynamic) triangle:
April 8
Guided by the list of failing tests, we wire up the little bits missed along the way, like translating border colours.
/* Translate an American VkBorderColor into a Canadian agx_border_colour */
enum agx_border_colour
translate_border_color(VkBorderColor color)
{
switch (color) {
case VK_BORDER_COLOR_INT_TRANSPARENT_BLACK:
return AGX_BORDER_COLOUR_TRANSPARENT_BLACK;
...
}
}
Test results are getting there.
Pass: 149770, Fail: 7741, Crash: 2396
That's good enough for vkQuake.
April 9
Lots of little fixes bring us to a 99.6% pass rate… for Vulkan 1.1. Why stop there? NVK is 1.3 conformant, so let's claim 1.3 and skip to the finish line.
Pass: 255209, Fail: 3818, Crash: 599
98.3% pass rate for 1.3 on our 1 week anniversary.
Not bad.
April 10
SuperTuxKart has a Vulkan renderer.
April 11
Zink works too.
April 12
I tracked down some fails to a test bug, where an arbitrary verification threshold was too strict to pass on some devices. I filed a bug report, and it's resolved within a few weeks.
April 16
The tests for "descriptor indexing" revealed a compiler bug affecting subgroup shuffles in non-uniform control flow. The M1's shuffle instruction is quirky, but it's easy to work around. Fixing that fixes the descriptor indexing tests.
April 17
A few tests crash inside our register allocator. Their shaders contain a peculiar construction: `condition` is always false, but the compiler doesn't know that.
Infinite loops are nominally invalid since shaders must terminate in finite time, but this shader is syntactically valid. "All loops contain a break" seems obvious for a shader, but it's false. It's straightforward to fix register allocation, but what a doozy.
April 18
Remember copies? They're slow, and every frame currently requires a copy to get on screen.
For "zero copy" rendering, we need enough Linux window system integration to negotiate an efficient surface layout across process boundaries. Linux uses "modifiers" for this purpose, so we implement the EXT_image_drm_format_modifier
extension. And by implement, I mean copy.
Copies to avoid copies.
April 20
"I'd like a 4K x86 Windows Direct3D PC game on a 16K arm64 Linux Vulkan Mac."
…
"Ma'am, this is a Wendy's."
April 22
As bug fixing slows down, we step back and check our driver architecture. Since we treat all state as dynamic, we don't pre-pack control words during pipeline creation. That adds theoretical CPU overhead.
Is that a problem? After some optimization, vkoverhead says we're pushing 100 million draws per second.
I think we're okay.
April 24
Time to light up YCbCr. If we don't use special YCbCr hardware, this feature is "software-only". However, it touches a lot of code.
It touches so much code that Mohamed Ahmed spent an entire summer adding it to NVK.
Which means he spent a summer adding it to Honeykrisp.
Thanks, Mohamed ;-)
April 25
Query copies are next. In Vulkan, the application can query the number of samples rendered, writing the result into an opaque "query pool". The result can be copied from the query pool on the CPU or GPU.
For the CPU, the driver maps the pool's internal data structure and copies the result. This may require nontrivial repacking.
For the GPU, we need to repack in a compute shader. That's harder, because we can't just run C code on the GPU, right?
…Actually, we can.
A little witchcraft makes GPU query copies as easy as C.
void copy_query(struct params *p, int i) {
uintptr_t dst = p->dest + i * p->stride;
int query = p->first + i;
if (p->available[query] || p->partial) {
int q = p->index[query];
write_result(dst, p->_64, p->results[q]);
}
...
}
April 26
The final boss: border colours, hard mode.
Direct3D lets the application choose an arbitrary border colour when creating a sampler. By contrast, Vulkan only requires three border colours:
- `(0, 0, 0, 0)` — transparent black
- `(0, 0, 0, 1)` — opaque black
- `(1, 1, 1, 1)` — opaque white
We handled these on April 8. Unfortunately, there are two problems.
First, we need custom border colours for Direct3D compatibility. Both DXVK and vkd3d-proton require the `EXT_custom_border_color` extension.
Second, there's a subtle problem with our hardware, causing dozens of fails even without custom border colours. To understand the issue, let's revisit texture descriptors, which contain a pixel format and a component reordering swizzle.
Some formats are implicitly reordered. Common "BGRA" formats swap red and blue for historical reasons. The M1 does not directly support these formats. Instead, the driver composes the swizzle with the format's reordering. If the application uses a `BARB` swizzle with a `BGRA` format, the driver uses an `RABR` swizzle with an `RGBA` format.
There's a catch: swizzles apply to the border colour, but formats do not. We need to undo the format reordering when programming the border colour for correct results after the hardware applies the composed swizzle. Our OpenGL driver implements border colours this way, because it knows the texture format when creating the sampler. Unfortunately, Vulkan doesn't give us that information.
Without custom border colour support, we "should" be okay. Swapping red and blue doesn't change anything if the colour is white or black.
There's an even subtler catch. Vulkan mandates support for a packed 16-bit format with 4-bit components. The M1 supports a similar format… but with reversed "endianness", swapping red and alpha.
That still seems okay. For transparent black (all zero) and opaque white (all one), swapping components doesn't change the result.
The problem is opaque black: `(0, 0, 0, 1)`. Swapping red and alpha gives `(1, 0, 0, 0)`. Transparent red? Uh-oh.
We're stuck. No known hardware configuration implements correct Vulkan semantics.
Is hope lost?
Do we give up?
A reasonable person would.
I am not reasonable.
Let's jump into the deep end. If we implement custom border colours, opaque black becomes a special case. But how? The M1's custom border colours entangle the texture format with the sampler. A reasonable person would skip Direct3D support.
As you know, I am not reasonable.
Although the hardware is unsuitable, we control software. Whenever a shader samples a texture, we'll inject code to fix up the border colour. This emulation is simple, correct, and slow. We'll use dirty driver tricks to speed it up later. For now, we eat the cost, advertise full custom border colours, and pass the opaque black tests.
April 27
All that's left is some last minute bug fixing, and…
Pass: 686930, Fail: 0
Success.
The future
The next task is implementing everything that DXVK and vkd3d-proton require to layer Direct3D. That includes esoteric extensions like transform feedback. Then Wine and an open source x86 emulator will run Windows games on Asahi Linux.
That's getting ahead of ourselves. In the meantime, enjoy Linux games with our conformant OpenGL 4.6 drivers… and stay tuned.


For years, the M1 has only supported OpenGL 4.1. That changes today — with our release of full OpenGL® 4.6 and OpenGL® ES 3.2! Install Fedora for the latest M1/M2-series drivers.
Already installed? Just `dnf upgrade --refresh`.
Unlike the vendor's non-conformant 4.1 drivers, our open source Linux drivers are conformant to the latest OpenGL versions, finally promising broad compatibility with modern OpenGL workloads, like Blender.
Conformant 4.6/3.2 drivers must pass over 100,000 tests to ensure correctness. The official list of conformant drivers now includes our OpenGL 4.6 and ES 3.2.
While the vendor doesn't yet support graphics standards like modern OpenGL, we do. For this Valentine's Day, we want to profess our love for interoperable open standards. We want to free users and developers from lock-in, enabling applications to run anywhere the heart wants without special ports. For that, we need standards conformance. Six months ago, we became the first conformant driver for any standard graphics API for the M1 with the release of OpenGL ES 3.1 drivers. Today, we've finished OpenGL with the full 4.6… and we're well on the road to Vulkan.
Compared to 4.1, OpenGL 4.6 adds dozens of required features, including:
- Robustness
- SPIR-V
- Clip control
- Cull distance
- Compute shaders
- Upgraded transform feedback
Regrettably, the M1 doesn't map well to any graphics standard newer than OpenGL ES 3.1. While Vulkan makes some of these features optional, the missing features are required to layer DirectX and OpenGL on top. No existing solution on M1 gets past the OpenGL 4.1 feature set.
How do we break the 4.1 barrier? Without hardware support, new features need new tricks. Geometry shaders, tessellation, and transform feedback become compute shaders. Cull distance becomes a transformed interpolated value. Clip control becomes a vertex shader epilogue. The list goes on.
For a taste of the challenges we overcame, let's look at robustness.
Built for gaming, GPUs traditionally prioritize raw performance over safety. Invalid application code, like a shader that reads a buffer out-of-bounds, can trigger undefined behaviour. Drivers exploit that to maximize performance.
For applications like web browsers, that trade-off is undesirable. Browsers handle untrusted shaders, which they must sanitize to ensure stability and security. Clicking a malicious link should not crash the browser. While some sanitization is necessary as graphics APIs are not security barriers, reducing undefined behaviour in the API can assist "defence in depth".
"Robustness" features can help. Without robustness, out-of-bounds buffer access in a shader can crash. With robustness, the application can opt for defined out-of-bounds behaviour, trading some performance for less attack surface.
All modern cross-vendor APIs include robustness. Many games even (accidentally?) rely on robustness. Strangely, the vendor's proprietary API omits buffer robustness. We must do better for conformance, correctness, and compatibility.
Let's first define the problem. Different APIs have different definitions of what an out-of-bounds load returns when robustness is enabled:
- Zero (Direct3D, Vulkan with `robustBufferAccess2`)
- Either zero or some data in the buffer (OpenGL, Vulkan with `robustBufferAccess`)
- Arbitrary values, but can't crash (OpenGL ES)
OpenGL uses the second definition: return zero or data from the buffer. One approach is to return the last element of the buffer for out-of-bounds access. Given the buffer size, we can calculate the last index. Now consider the minimum of the index being accessed and the last index. That equals the index being accessed if it is valid, and some other valid index otherwise. Loading the minimum index is safe and gives a spec-compliant result.
As an example, a uniform buffer load without robustness might look like:
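A reconstructed sketch in the pseudo-assembly used earlier in this post (the operand names are illustrative, not actual hardware mnemonics):
load R, buffer, index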
Robustness adds a single unsigned minimum (`umin`) instruction:
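Again as an illustrative sketch, where last is the last valid index the driver derives from the buffer size:
umin clamped, index, last
load R, buffer, clamped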
Is the robust version slower? It can be. The difference should be small percentage-wise, as arithmetic is faster than memory. With thousands of threads running in parallel, the arithmetic cost may even be hidden by the load's latency.
There's another trick that speeds up robust uniform buffers. Like other GPUs, the M1 supports "preambles". The idea is simple: instead of calculating the same value in every thread, it's faster to calculate once and reuse the result. The compiler identifies eligible calculations and moves them to a preamble executed before the main shader. These redundancies are common, so preambles provide a nice speed-up.
We usually move uniform buffer loads to the preamble when every thread loads the same index. Since the size of a uniform buffer is fixed, extra robustness arithmetic is also moved to the preamble. The robustness is "free" for the main shader. For robust storage buffers, the clamping might move to the preamble even if the load or store cannot.
Armed with robust uniform and storage buffers, let's consider robust "vertex buffers". In graphics APIs, the application can set vertex buffers with a base GPU address and a chosen layout of "attributes" within each buffer. Each attribute has an offset and a format, and the buffer has a "stride" indicating the number of bytes per vertex. The vertex shader can then read attributes, implicitly indexing by the vertex. To do so, the shader loads the address:
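Reconstructing the addressing math from the description above, with base the vertex buffer address, stride the bytes per vertex, and offset the attribute's offset within a vertex:
base + stride * vertex + offset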
Some hardware implements robust vertex fetch natively. Other hardware has bounds-checked buffers to accelerate robust software vertex fetch. Unfortunately, the M1 has neither. We need to implement vertex fetch with raw memory loads.
One instruction set feature helps. In addition to a 64-bit base address, the M1 GPU's memory loads also take an offset in elements. The hardware shifts the offset and adds to the 64-bit base to determine the address to fetch. Additionally, the M1 has a combined integer multiply-add instruction, `imad`. Together, these features let us implement vertex loads in two instructions. For example, a 32-bit attribute load looks like:
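An illustrative sketch rather than verified disassembly: imad computes the element index, and the hardware load scales it by the 4-byte element size and adds the base:
imad elements, vertex, stride/4, offset/4
load R, base, elements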
The hardware load can perform an additional small shift. Suppose our attribute is a vector of 4 32-bit values, densely packed into a buffer with no offset. We can load that attribute in one instruction:
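A sketch of that single load; the << 2 stands in for the hardware's extra shift, since each vertex occupies four 32-bit elements here:
load R[0:3], base, vertex << 2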
…with the hardware calculating the address:
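That is, reconstructing the arithmetic, the element offset is scaled by the 4-byte element size:
base + 4 * (vertex << 2) = base + 16 * vertex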
What about robustness?
We want to implement robustness with a clamp, like we did for uniform buffers. The problem is that the vertex buffer size is given in bytes, while our optimized load takes an index in "vertices". A single vertex buffer can contain multiple attributes with different formats and offsets, so we can't convert the size in bytes to a size in "vertices".
Let's handle the latter problem. We can rewrite the addressing equation as:
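Reconstructed from the description that follows:
(base + offset) + (stride * vertex)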
That is: one buffer with many attributes at different offsets is equivalent to many buffers with one attribute and no offset. This gives an alternate perspective on the same data layout. Is this an improvement? It avoids an addition in the shader, at the cost of passing more data — addresses are 64-bit while attribute offsets are 16-bit. More importantly, it lets us translate the vertex buffer size in bytes into a size in "vertices" for each vertex attribute. Instead of clamping the offset, we clamp the vertex index. We still make full use of the hardware addressing modes, now with robustness:
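A sketch with illustrative operands, where attribute_base is base + offset and last is the per-attribute last valid vertex index supplied by the driver:
umin clamped, vertex, last
imad elements, clamped, stride/4, 0
load R, attribute_base, elements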
We need to calculate the last valid vertex index ahead-of-time for each attribute. Each attribute has a format with a particular size. Manipulating the addressing equation, we can calculate the last byte accessed in the buffer (plus 1) relative to the base:
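Reconstructing that expression, with format_size the attribute format's size in bytes:
offset + (stride * index) + format_size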
The load is valid when that value is bounded by the buffer size in bytes. We solve the integer inequality as:
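Solving for the index (integer division; the driver must also handle the case where the numerator would go negative, as discussed below):
index <= (size - offset - format_size) / stride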
The driver calculates the right-hand side and passes it into the shader.
One last problem: what if a buffer is too small to load anything? Clamping won't save us — the code would clamp to a negative index. In that case, the attribute is entirely invalid, so we swap the application's buffer for a small buffer of zeroes. Since we gave each attribute its own base address, this determination is per-attribute. Then clamping the index to zero correctly loads zeroes.
Putting it together, a little driver math gives us robust buffers at the cost of one `umin` instruction.
In addition to buffer robustness, we need image robustness. Like its buffer counterpart, image robustness requires that out-of-bounds image loads return zero. That formalizes a guarantee that reasonable hardware already makes.
…But it would be no fun if our hardware was reasonable.
Running the conformance tests for image robustness, there is a single test failure affecting "mipmapping".
For background, mipmapped images contain multiple "levels of detail". The base level is the original image; each successive level is the previous level downscaled. When rendering, the hardware selects the level closest to matching the on-screen size, improving efficiency and visual quality.
With robustness, the specifications all agree that image loads return…
- Zero if the X- or Y-coordinate is out-of-bounds
- Zero if the level is out-of-bounds
Meanwhile, image loads on the M1 GPU return…
- Zero if the X- or Y-coordinate is out-of-bounds
- Values from the last level if the level is out-of-bounds
Uh-oh. Rather than returning zero for out-of-bounds levels, the hardware clamps the level and returns nonzero values. It's a mystery why. The vendor does not document their hardware publicly, forcing us to rely on reverse engineering to build drivers. Without documentation, we don't know if this behaviour is intentional or a hardware bug. Either way, we need a workaround to pass conformance.
The obvious workaround is to never load from an invalid level:
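In pseudocode, with levels the number of levels in the image (a sketch of the idea, not driver source):
if (level < levels)
    R = image_load(x, y, level)
else
    R = 0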
That involves branching, which is inefficient. Loading an out-of-bounds level doesn't crash, so we can speculatively load and then use a compare-and-select operation instead of branching:
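The same idea without the branch, again as pseudocode:
R = image_load(x, y, level)
R = (level < levels) ? R : 0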
This workaround is okay, but it could be improved. While the M1 GPU has combined compare-and-select instructions, the instruction set is scalar. Each thread processes one value at a time, not a vector of multiple values. However, image loads return a vector of four components (red, green, blue, alpha). While the pseudo-code looks efficient, the resulting assembly is not:
image_load R, x, y, level
ulesel R[0], level, levels, R[0], 0
ulesel R[1], level, levels, R[1], 0
ulesel R[2], level, levels, R[2], 0
ulesel R[3], level, levels, R[3], 0
Fortunately, the vendor driver has a trick. We know the hardware returns zero if either X or Y is out-of-bounds, so we can force a zero output by setting X or Y out-of-bounds. As the maximum image size is 16384 pixels wide, any X greater than 16384 is out-of-bounds. That justifies an alternate workaround:
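As pseudocode (any constant larger than the maximum image width works; 0xFFFF is an arbitrary choice here):
x = (level < levels) ? x : 0xFFFF
R = image_load(x, y, level)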
Why is this better? We only change a single scalar, not a whole vector, compiling to compact scalar assembly:
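A sketch in the same pseudo-assembly as above; only the X coordinate is touched, and the constant would ideally live in a uniform register as noted next:
ulesel x, level, levels, x, 0xFFFF
image_load R, x, y, level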
If we preload the constant to a uniform register, the workaround is a single instruction. That's optimal — and it passes conformance.
Blender "Wanderer" demo by Daniel Bystedt, licensed CC BY-SA.

After a month of reverse-engineering, we're excited to release documentation on the Valhall instruction set, available as a PDF. The findings are summarized in an XML architecture description for machine consumption. In tandem with the documentation, we've developed a Valhall assembler and disassembler as a reverse-engineering aid.
Valhall is the fourth Arm® Mali™ architecture and the fifth Mali instruction set. It is implemented in the Arm® Mali™-G78, the most recently released Mali hardware, and Valhall will continue to be implemented in Mali products yet to come.
Each architecture represents a paradigm shift from the last. Midgard generalizes the Utgard pixel processor to support compute shaders by unifying the shader stages, adding general purpose memory access, and supporting integers of various bit sizes. Bifrost scalarizes Midgard, transitioning away from the fixed 4-channel vector (`vec4`) architecture of Utgard and Midgard to instead rely on warp-based execution for parallelism, better using the hardware on modern workloads. Valhall linearizes Bifrost, removing the Very Long Instruction Word mechanisms of its predecessors. Valhall replaces the compiler's static scheduling with hardware dynamic scheduling, trading additional control hardware for higher average performance. That means padding with "no operation" instructions is no longer required, which may decrease code size, promising better instruction cache use.
All information in this post and the linked PDF and XML is published in good faith and for general information purposes only. We do not make any warranties about the completeness, reliability, and accuracy of this information. Any action you take upon the information you find here is strictly at your own risk. We will not be liable for any losses and/or damages in connection with the use of this information.
While we strive to make the information as accurate as possible, we make no claims, promises, or guarantees about its accuracy, completeness, or adequacy. We expressly disclaim liability for content, errors and omissions in this information.
Let's dig in.
Getting started
In June, Collabora procured an International edition of the Samsung Galaxy S21 phone, powered by a system-on-chip with Mali G78. Although Arm announced Valhall with the Mali G77 in May 2019, roll out has been slow due to the COVID-19 pandemic. At the time of writing, there are not yet Linux friendly devices with a Valhall chip, forcing use of a locked down Android device. There's a silver lining: we have a head start on the reverse-engineering, so by the time hacker-friendly devices arrive with Valhall GPUs, we can have open source drivers ready.
Android complicates reverse-engineering (though not as much as macOS). On Linux, we can compile a library on the device to intercept data sent to the GPU. On Android, we must cross-compile from a desktop with the Android Native Development Kit, ironically software that doesn't run on Arm processors. Further, where on Linux we can track the standard system calls, Android device drivers replace the standard `open()` system call with a complicated Android-only "binder" interface. Adapting the library to support binder would be gnarly, but do we have to? We could sprinkle in one little hack anywhere we see a file descriptor without the file name.
#define MALI0 "/dev/mali0"
bool is_mali(int fd)
{
char in[128] = { 0 }, out[128] = { 0 };
snprintf(in, sizeof(in), "/proc/self/fd/%d", fd);
int count = readlink(in, out, sizeof(out) - 1);
return count == strlen(MALI0) && strncmp(out, MALI0, count) == 0;
}
Now we can hook the Mali `ioctl()` calls without tracing binder and easily dump graphics memory.
We're interested in the new instruction set, so we're looking for the compiled shader binaries in memory. There's a chicken-and-egg problem: we need to find the shaders to reverse-engineer them, but we need to reverse-engineer the shaders to know what to look for. Fortunately, there's an escape hatch. The proprietary Mali drivers allow an OpenGL application to query the compiled binary with the `ARM_mali_program_binary` extension, returning a file in the Mali Binary Shader format. That format was reverse-engineered years ago by Connor Abbott for earlier Mali architectures, and the basic structure is unchanged in Valhall. Our task is simple: compile a test shader, dump both GPU memory and the Mali Binary Shader, and find the common section. Searching for the common bytes produces an address in executable graphics memory, in this case `0x7f0002de00`. Searching for that address in turn finds the "shader program descriptor" which references it.
18 00 00 80 00 10 00 00 00 DE 02 00 7F 00 00 00
Another search shows this descriptor's address in the payload of an index-driven vertex shading job for graphics or a compute job for OpenCL. Those jobs contain the Job Manager header introduced a decade ago for Midgard, so we understand them well: they form a linked list of jobs, and only the first job is passed to the kernel. The kernel interface has a "job chain" parameter on the submit system call taking a GPU address. We understand the kernel interface well as it is open source due to kernel licensing requirements.
With each layer identified, we teach the wrapper library to chase the pointers and dump every shader executed, enabling us to reverse-engineer the new instruction set and develop a disassembler.
Instruction set reconnaissance
Reverse-engineering in the dark is possible, but it's easier to have some light. While waiting for the Valhall phone to arrive, I read everything Arm made public about the instruction set, particularly this article from Anandtech. Without lifting a finger, that article tells us Valhall is…
- Warp-based, like Bifrost, but with 16 threads per warp instead of Bifrost's 4/8.
- Isomorphic to Bifrost on the instruction level ("operational equivalence").
- Regularly encoded.
- Flat, lacking Bifrost's clause and tuple packaging.
It also says that Valhall has a 16KB instruction cache, holding 2048 instructions. Since Valhall has a regular encoding, we divide 16384 bytes by 2048 instructions to find a Valhall instruction is 8 bytes. Our first attempt at a "disassembler" can print hex dumps of every 8 bytes on a line; our calculation ensures that is the correct segmentation.
From here on, reverse-engineering is iterative. We have a baseline level of knowledge, and we want to grow that knowledge. To do so, we input test programs into the proprietary driver to observe the output, then perturb the input program to see how the output changes.
As we discover new facts about the architecture, we update our disassembler, demonstrating new knowledge and separating the known from the unknown. Ideally, we encode these facts in a machine-readable file forming a single reference for the architecture. From this file, we can generate a disassembler, an assembler, an instruction encoder, and documentation. For Valhall, I use an XML file, resembling Bifrost's equivalent XML.
Filling out this file is usually straightforward though tedious. Modern APIs are large, so there is a great deal of effort required to map the API requirements to the hardware features.
However, some hardware features do not map to any API. Here are subtler tales from reversing Valhall.
Dependency slots
Arithmetic is faster than memory access, so modern processors execute arithmetic in parallel with pending memory accesses. Modern GPU architectures require the compiler to manage this mechanism by analyzing the program and instructing the hardware to wait for the results before they're needed.
For this purpose, Bifrost uses an explicit scoreboarding system. Bifrost groups up to 16 instructions together in a clause, and each clause has a fixed header. The compiler assigns a "dependency slot" between 0 and 7 to each clause, specified in the header. Each clause can wait on any set of slots, specified with another 8-bits in the clause header. Specifying dependencies per-clause is a compromise between precision and code size.
We expect Valhall to feature a similar scheme, but Valhall doesn't have clauses or clause headers, so where does it specify this info?
Studying compiled shaders, we see the last byte of every instruction is usually zero. But when the result of a memory access is first read, the previous instruction has a bit set in the last byte. Which bit is set depends on the number of memory accesses in flight, so it seems the last byte encodes a dependency wait. The memory access instructions themselves are often zero in their last bytes, so it doesn't look like the last byte is used to encode the dependency slot — but executing many memory access instructions at once and comparing the bits, we see a single 2-bit field stands out as differing. The dependency slot is specified inside the instruction, not in the metadata.
What makes this design practical? Two factors.
One, only the waits need to be specified in general. Arithmetic instructions don't need a dependency slot, since they complete immediately. The longest message-passing instructions are shorter than the longest arithmetic instructions, so there is room within the instruction itself to specify the dependency slot only where it is needed.
Two, the performance gain from adding extra slots levels off quickly. Valhall cuts back from Bifrost's 8 slots (6 general purpose) to 4 or 5 slots, with only 3 general purpose, saving 4 bits in every instruction.
This story exemplifies a general pattern: Valhall is a flattening of Bifrost. Alternatively, Bifrost is "Valhall with clauses", although that description is an anachronism. Why does Bifrost have clauses, and why does Valhall remove them? The pattern in this story of dependency waits generalizes to answer the question: grouping many instructions into Bifrost clauses allows the hardware to amortize operations like dependency waits and reduce the hardware gate count of the shader core. However, clauses add substantial encoding overhead, compiler complexity, and imprecision. Bifrost optimizes for die space; Valhall optimizes for performance.
The missing modifier
Hardware features that are unused by the proprietary driver are a perennial challenge for reverse-engineering. However, we have a complete Bifrost reference at our disposal, and Valhall instructions are usually equivalent to Bifrost. Special instructions and modes from Bifrost cast a shadow on Valhall, showing where there are gaps in our knowledge. Sometimes these gaps are impractical to close, short of brute-forcing the encoding space. Other times we can transfer knowledge and make good guesses.
Consider the Cross Lane PERmute instruction, `CLPER`, which takes a register and the index of another lane in the warp, and returns the value of the register in the specified lane. `CLPER` is a "subgroup operation", required for Vulkan and used to implement screen-space derivatives in fragment shaders. On Bifrost, the `CLPER` instruction is defined as:
<ins name="+CLPER.i32" mask="0xfc000" exact="0x7c000">
<src start="0" mask="0x7"/>
<src start="3"/>
<mod name="lane_op" start="6" size="2">
<opt>none</opt>
<opt>xor</opt>
<opt>accumulate</opt>
<opt>shift</opt>
</mod>
<mod name="subgroup" start="8" size="2">
<opt>subgroup2</opt>
<opt>subgroup4</opt>
<opt>subgroup8</opt>
</mod>
<mod name="inactive_result" start="10" size="4">
<opt>zero</opt>
<opt>umax</opt>
....
<opt>v2infn</opt>
<opt>v2inf</opt>
</mod>
</ins>
We expect a similar definition for Valhall. One modification is needed: Valhall warps contain 16 threads, so there should be a `subgroup16` option after `subgroup8`, with the natural binary encoding `11`. Looking at a binary Valhall CLPER instruction, we see a `11` pair corresponding to the `subgroup` field. Similarly experimenting with different subgroup operations in OpenCL lets us figure out the `lane_op` field. We end up with an instruction definition like:
<ins name="CLPER.u32" title="Cross-lane permute" dests="1" opcode="0xA0" opcode2="0xF">
<src/>
<src widen="true"/>
<subgroup/>
<lane_op/>
</ins>
Notice we do not specify the encoding in the Valhall XML, since Valhall encoding is regular. Also notice we lack the `inactive_result` modifier. On Bifrost, `inactive_result` specifies the value returned if the program attempts to access an inactive lane. We may guess Valhall has the same mechanism, but that modifier is not directly controllable by current APIs. How do we proceed?
If we can run code on the device, we can experiment with the instruction. Inactive lanes may be caused by divergent control flow, where one lane in the thread branches but another lane does not, forcing the hardware to execute only part of the warp. After reverse-engineering Valhall's branch instructions, we can construct a situation where a single lane is active and the rest are inactive. Then we insert a CLPER instruction with extra bits set, store the result to main memory, and print the result. This assembly program does the trick:
# Elect a single lane
BRANCHZ.reconverge.id lane_id, offset:3
# Try to read a value from an inactive thread
CLPER.u32 r0, r0, 0x01000000.b3, inactive_result:VALUE
# Store the value
STORE.i32.slot0.reconverge @r0, u0, offset:0
# End shader
NOP.return
With the assembler we're writing, we can assemble this compute kernel. How do we run it on the device without knowing the GPU data structures required to dispatch compute shaders? We make use of another classic reverse-engineering technique: instead of writing the initialization code ourselves, piggyback off the proprietary driver. Our wrapper library allows us to access graphics memory before the driver submits work to the hardware. We use this to read the memory, but we may also modify it. We already identified the shader program descriptor, so we can inject our own shaders. From here, we can jury-rig a script to execute arbitrary shader binaries on the device in the context of an OpenCL application running under the proprietary driver.
Putting it together, we find the `inactive_result` bits in the `CLPER` encoding and write one more script to dump all values.
for ((i = 0 ; i < 16 ; i++)); do
sed -e "s/VALUE/$i/" shader.asm | python3 asm.py shader.bin
adb push shader.bin /data/local/tmp/
adb shell 'REPLACE=/data/local/tmp/shader.bin '\
'LD_PRELOAD=/data/local/tmp/panwrap.so '\
'/data/local/tmp/test-opencl'
done
The script's output contains sixteen possibilities — and they line up perfectly with Bifrost's sixteen options. Success.
Next steps
There's more to learn about Valhall, but we've reverse-engineered enough to develop a Valhall compiler. As Valhall is a simplification of Bifrost, and we've already developed a free and open source compiler for Bifrost, this task is within reach. Indeed, adapting the Bifrost compiler to Valhall will require refactoring but little new development.
Mali G78 does bring changes beyond the instruction set. The data structures are changed to reduce Vulkan driver overhead. For example, the monolithic "Renderer State Descriptor" on Bifrost is split into a "Shader Program Descriptor" and a "Depth Stencil Descriptor", so changes to the depth/stencil state no longer require the driver to re-emit shader state. True, the changes require more reverse-engineering. Fortunately, many data structures are adapted from Bifrost requiring few changes to the Mesa driver.
Overall, supporting Valhall in Mesa is within reach. If you're designing a Linux-friendly device with Valhall and looking for open source drivers, please reach out!
Originally posted on Collabora's blog


After beginning a compiler for the Apple M1 GPU, the next step is to develop a graphics driver exercising the compiler. Since the last post two weeks ago, I've begun a Gallium driver for the M1, implementing much of the OpenGL 2.1 and ES 2.0 specifications. With the compiler and driver together, we're now able to run OpenGL workloads like glxgears and scenes from glmark2 on the M1 with an open source stack. We are passing about 75% of the OpenGL ES 2.0 tests in the drawElements Quality Program used to establish Khronos conformance. To top it off, the compiler and driver are now upstreamed in Mesa!
Gallium is a driver framework inside Mesa. It splits drivers into frontends, like OpenGL and OpenCL, and backends, like Intel and AMD. In between, Gallium has a common caching system for graphics and compute state, reducing the CPU overhead of every Gallium driver. The code sharing, central to Gallium's design, allows high-performance drivers to be written at a low cost. For us, that means we can focus on writing a Gallium backend for Apple's GPU and pick up OpenGL and OpenCL support "for free".
More sharing is possible. A key responsibility of the Gallium backend is to translate Gallium's state objects into hardware packets, so we need a good representation of hardware packets. While packed bitfields can work, C's bitfields have performance and safety (overflow) concerns. Hand-coded C structures lack pretty-printing needed for efficient debugging. Finally, while reverse-engineering, hand-coded C structures tend to accumulate random magic numbers in driver code, which is undesirable. These issues are not new; systems like Intel's GenXML and Nouveau's envytools solve them by allowing the hardware packets to be described as XML while all necessary C code is auto-generated. For Asahi, I've opted to use GenXML, providing a concise description of my reverse-engineering results and an ergonomic API for the driver.
The XML contains recently reverse-engineered packets, like those describing textures and samplers. Fortunately, these packets map closely to both Metal and Gallium, allowing a natural translation. Coupled with Dougall Johnson's latest notes on texture instructions, adding texture support to the Gallium driver and NIR compiler was a cinch.
The resulting XML is somewhat smaller than that of other reverse-engineered drivers of similar maturity. As discussed in the previous post, Apple relies on shader code in lieu of fixed-function graphics hardware for tasks like vertex attribute fetch and blending. Apple's design reduces the hardware surface area, in turn reducing the number of packets in the XML at the expense of extra driver code to produce the needed shader variants. Here is yet another win for code sharing: the most complex code needed is for blending and logic operations, for which Mesa already has lowering code. Mali (Panfrost) needs some blending lowered to shader code, and all Mesa drivers need advanced blending equations lowered (modes like overlay, colour dodge, and screen). As a result, it will be a simple task to wire in the "load tilebuffer to register" instruction and Mesa's blending code and see a thousand more tests pass.
Although Apple culled a great deal of legacy functionality from their GPU, some features are retained to support older APIs like compatibility contexts of desktop OpenGL, even when the features are inaccessible from Metal. Index buffers and primitive types exhibit this phenomenon. In Metal, an application can draw using a 16-bit or 32-bit index buffer, selecting primitives like triangles and triangle strips, with primitive restart always enabled. Most graphics developers want to use this fast path; by only supporting the fast path, Metal prevents a game developer from accidentally introducing slow code. However, this limits our ability to implement Khronos APIs like OpenGL and Vulkan on top of the Apple hardware… or does it?
In addition to the subset supported by Metal, the Khronos APIs also support 8-bit index buffers, additional primitive types like triangle fans, and an option to disable primitive restart. True, some of this functionality is unnecessary. Real geometry usually requires more than 256 vertices, so 8-bit index buffers are a theoretical curiosity. Primitives like triangle fans can bring a hardware penalty relative to triangle strips due to poor cache locality while offering no generality over indexed triangles. Well-written apps generally require primitive restart, and it's almost free for apps that don't need it. Even so, real applications (and the Khronos conformance tests) do use these features, so we need to support them in our OpenGL and Vulkan drivers. The issue does not only affect us — drivers layered on top of Metal like MoltenVK struggle with these exact gaps.
If Apple left these features in the hardware but never wired them into Metal, we'd like to know. But clean room reverse-engineering the hardware requires observing the output of the proprietary driver and looking for patterns, so if the proprietary (Metal) driver doesn't use the functionality, how can we ever know it exists?
This is a case for an underappreciated reverse-engineering technique: guesswork. Hardware designers are logical engineers. If we can understand how Apple approached the hardware design, we can figure out where these hidden features should be in the command stream. We're looking for conspicuous gaps in our understanding of the hardware, like looking for black holes by observing the absence of light. Consider index buffers, which are configured by an indexed draw packet with a field describing the size. After trying many indexed draws in Metal, I was left with the following reverse-engineered fragment:
<enum name="Index size">
<value name="U16" value="1"/>
<value name="U32" value="2"/>
</enum>
<struct name="Indexed draw" size="32">
...
<field name="Index size" size="2" start="2:17" type="Index size"/>
...
</struct>
If the hardware only supported what Metal supports, this fragment would be unusual. Note 16-bit and 32-bit index buffers are encoded as 1 and 2. In binary, that's 01 and 10, occupying two different bits, so we know the index size is at least 2 bits wide. That leaves 00 (0) and 11 (3) as possible unidentified index sizes. Now observe index sizes are multiples of 8 bits and powers of two, so there is a natural encoding as the base-2 logarithm of the index size in bytes. This matches our XML fragment, since `log2(16 / 8) = 1` and `log2(32 / 8) = 2`. From here, we make our leap of faith: if the hardware supports 8-bit index buffers, it should be encoded with value `log2(8 / 8) = 0`. Sure enough, if we try out this guess, we'll find it passes the relevant OpenGL tests. Success.
Finding the missing primitive types works the same way. The primitive type field is 4-bits in the hardware, allowing for 16 primitive types to be encoded, while Metal only uses 5, leaving only 11 to brute force with the tests in hand. Likewise, few bits vary between an indexed draw of triangles (no primitive restart) and an indexed draw of triangle strips (with primitive restart), giving us a natural candidate for a primitive restart enable bit. Our understanding of the hardware is coming together.
One outstanding difficulty is ironically specific to macOS: the IOGPU interface with the kernel. Traditionally, open source drivers on Linux use simple kernel space drivers together with complex userspace drivers. The userspace (Mesa) handles all 3D state, and the kernel simply handles memory management and scheduling. However, macOS does not follow this model; the IOGPU kernel extension, common to all GPUs on macOS, is made aware of graphics state like surface dimensions and even details about mipmapping. Many of these mechanisms can be ignored in Mesa, but there is still an uncomfortably large volume of "magic" to interface with the kernel, like the memory mapping descriptors. The good news is many of these elements can be simplified when we write a Linux kernel driver. The bad news is that they do need to be reverse-engineered and implemented in Mesa if we would like native Vulkan support on Macs. Still, we know enough to drive the GPU from macOS… and hey, soon enough, we'll all be running Linux on our M1 machines anyway :-)


After a few weeks of investigating the Apple M1 GPU in January, I was able to draw a triangle with my own open source code. Although I began dissecting the instruction set, the shaders there were specified as machine code. A real graphics driver needs a compiler from high-level shading languages (GLSL or Metal) to a native binary. Our understanding of the M1 GPU's instruction set has advanced over the past few months. Last week, I began writing a free and open source shader compiler targeting the Apple GPU. Progress has been rapid: at the end of its first week, it can compile both basic vertex and fragment shaders, sufficient to render 3D scenes. The spinning cube pictured above has its shaders written in idiomatic GLSL, compiled with the nascent free software compiler, and rendered with native code like the first triangle in January. No proprietary blobs here!
Over the past few months, Dougall Johnson has investigated the instruction set in-depth, building on my initial work. His findings on the architecture are outstanding, focusing on compute kernels to complement my focus on graphics. Armed with his notes and my command stream tooling, I could chip away at a compiler.
The compiler's design must fit into the development context. Asahi Linux aims to run a Linux desktop on Apple Silicon, so our driver should follow Linux's best practices like upstream development. That includes using the New Intermediate Representation (NIR) in Mesa, the home for open source graphics drivers. NIR is a lightweight library for shader compilers, with a GLSL frontend and backend targets including Intel and AMD. NIR is an alternative to LLVM, the compiler framework used by Apple. Just because Apple prefers LLVM doesn't mean we have to. A team at Valve famously rewrote AMD's LLVM backend as a NIR compiler, improving performance. If it's good enough for Valve, it's good enough for me.
Supporting NIR as input does not dictate our compiler's own intermediate representation, which reflects the hardware's design. The instruction set of AGX2 (Apple's GPU) has:
- Scalar arithmetic
- Vectorized input/output
- 16-bit types
- Free conversions between 16-bit and 32-bit
- Free floating-point absolute value, negate, saturate
- 256 registers (16-bits each)
- Register usage / thread occupancy trade-off
- Some form of multi-issue or out-of-order (superscalar) execution
Each hardware property induces a compiler property:
- Scalar sources. Don't complicate the compiler by allowing unrestricted vectors.
- Vector values at the periphery separated with vector combine and extract pseudo-instructions, optimized out during register allocation.
- 16-bit units.
- Sources and destinations are sized. The optimizer folds size conversion instructions into uses and definitions.
- Sources have absolute value and negate bits; instructions have a saturate bit. Again, the optimizer folds these away.
- A large register file means running out of registers is rare, so don't optimize for register spilling performance.
- Minimizing register pressure is crucial. Use static single assignment (SSA) form to facilitate pressure estimates, informing optimizations.
- The scheduler simply reorders instructions without leaking details to the rest of the backend. Scheduling is feasible both before and after register allocation.
Putting it together, a design for an AGX compiler emerges: a code generator translating NIR to an SSA-based intermediate representation, optimized by instruction combining passes, scheduled to minimize register pressure, register allocated while going out of SSA, scheduled again to maximize instruction-level parallelism, and finally packed to binary instructions.
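To make that design concrete, here is a minimal sketch of what a single instruction record in such an SSA-based IR might look like, reflecting the scalar sized sources, the free absolute-value/negate modifiers, and the saturate bit listed above. The names and layout are illustrative only, not the backend's actual data structures.

```c
#include <stdbool.h>
#include <stdint.h>

/* One possible shape for an SSA instruction in the backend IR. Illustrative. */
enum op_size { SIZE_16, SIZE_32 };

struct ssa_src {
    uint32_t value;       /* index of the SSA value feeding this source */
    enum op_size size;    /* 16-bit or 32-bit read, converted for free */
    bool abs, neg;        /* folded floating-point modifiers */
};

struct ssa_instr {
    uint16_t op;              /* opcode */
    uint32_t def;             /* SSA value defined by this instruction */
    enum op_size def_size;    /* sized destination */
    bool saturate;            /* clamp the result to [0, 1] for free */
    struct ssa_src src[3];    /* scalar sources only */
};
```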
These decisions reflect the hardware traits visible to software, which are themselves "shadows" cast by the hardware design. Investigating these traits offers insight into the hardware itself. Consider the register file. While every thread can access up to 256 half-word registers, there is a performance penalty: the more registers used, the fewer concurrent threads possible, since threads share a register file. The number of threads allowed in a given shader is reported in Metal as the `maxTotalThreadsPerThreadgroup` property. So, we can study the register pressure versus occupancy trade-off by varying the register pressure of Metal shaders (confirmed via our disassembler) and correlating with the value of `maxTotalThreadsPerThreadgroup`:
Registers | Threads |
---|---|
<= 104 | 1024 |
112 | 896 |
120, 128 | 832 |
136 | 768 |
144 | 704 |
152, 160 | 640 |
168-184 | 576 |
192-208 | 512 |
216-232 | 448 |
240-256 | 384 |
From the table, it's clear that up until a threshold, it doesn't matter how many registers the program uses; occupancy is unaffected. Most well-written shaders fall in this bracket and need not worry. After hitting the threshold, other GPUs might spill registers to memory, but Apple doesn't need to spill until more than 256 registers are required. Between 112 and 256 registers, the number of threads decreases in an almost linear fashion, in increments of 64 threads. Carefully considering rounding, it's easy to recover the formula Metal uses to map register usage to thread count.
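In fact, the whole table is reproduced by one simple rule, assuming a 208 KiB register file per threadgroup (104 half-word registers × 1024 threads) and thread counts rounded down to a multiple of 64. This is a reconstruction from the measurements above, not a documented formula:

```c
/* Reconstructed mapping from register count to maximum thread count,
 * assuming a 208 KiB register file and 64-thread granularity. */
static unsigned
max_threads_for_registers(unsigned registers)
{
    const unsigned file_halfwords = 104 * 1024;   /* 208 KiB at 2 bytes each */

    unsigned threads = (file_halfwords / registers) & ~63u;
    return threads > 1024 ? 1024 : threads;
}
```

For example, 136 registers gives ⌊106496 / 136⌋ = 783, rounded down to 768 threads, matching the table.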
What's less obvious is that we can infer the size of the machine's register file. On one hand, if 256 registers are used, the machine can still support 384 threads, so the register file must be at least 256 half-words * 2 bytes per half-word * 384 threads = 192 KiB large. Likewise, to support 1024 threads at 104 registers requires at least 104 * 2 * 1024 = 208 KiB. If the file were any bigger, we would expect more threads to be possible at higher pressure, so we guess each threadgroup has exactly 208 KiB in its register file.
The story does not end there. From Apple's public specifications, the M1 GPU supports 24576 = 1024 * 24 simultaneous threads. Since the table shows a maximum of 1024 threads per threadgroup, we infer 24 threadgroups may execute in parallel across the chip, each with its own register file. Putting it together, the GPU has 208 KiB * 24 = 4.875 MiB of register file! This size puts it in league with desktop GPUs.
For all the visible hardware features, it's equally important to consider what hardware features are absent. Intriguingly, the GPU lacks some fixed-function graphics hardware ubiquitous among competitors. For example, I have not encountered hardware for reading vertex attributes or uniform buffer objects. The OpenGL and Vulkan specifications assume dedicated hardware for each, so what's the catch?
Simply put — Apple doesn't need to care about Vulkan or OpenGL performance. Their only properly supported API is their own Metal, which they may shape to fit the hardware rather than contorting the hardware to match the API. Indeed, Metal de-emphasizes vertex attributes and uniform buffers, favouring general constant buffers, a compute-focused design. The compiler is responsible for translating the fixed-function attribute and uniform state to shader code. In theory, this has a slight runtime cost; conventional wisdom says dedicated hardware is faster and lower power than software emulation. In practice, the code is so simple it may make no difference, although application developers should be mindful of the vertex formats used in case conversion code is inserted. As always, there is a trade-off: omitting features allows Apple to squeeze more arithmetic logic units (or register file!) onto the chip, speeding up everything else.
The more significant concern is the increased time on the CPU spent compiling shaders. If changing fixed-function attribute state can affect the shader, the compiler could be invoked at inopportune times during random OpenGL calls. Here, Apple has another trick: Metal requires the layout of vertex attributes to be specified when the pipeline is created, allowing the compiler to specialize formats at no additional cost. The OpenGL driver pays the price of the design decision; Metal is exempt from shader recompile tax.
The silver lining is that there is nothing to reverse-engineer for "missing" features like attributes and uniform buffers. As long as we know how to access memory in compute kernels, we can write the lowering code ourselves with no hardware mysteries. So far, I've implemented enough to spin a cube.
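As a flavour of what that lowering amounts to, a vertex attribute fetch reduces to address arithmetic the shader can perform itself before issuing a plain memory load. The base/stride/offset layout below is an assumption for illustration; the real descriptor format is still unknown:

```c
#include <stdint.h>

/* Compute the address of one vertex's attribute, assuming the conventional
 * base + vertex_id * stride + offset layout. Illustrative sketch only. */
static uint64_t
attribute_address(uint64_t buffer_base, uint32_t stride,
                  uint32_t offset, uint32_t vertex_id)
{
    return buffer_base + (uint64_t)vertex_id * stride + offset;
}
```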
At present, the in-progress compiler supports most arithmetic and input/output instructions found in OpenGL ES 2.0, with a simple optimizer and native instruction packing. Support for control flow, textures, scheduling, and register allocation will come further down the line as we work towards a real driver.

Less than a month ago, I began investigating the Apple M1 GPU in hopes of developing a free and open-source driver. This week, I've reached a second milestone: drawing a triangle with my own open-source code. The vertex and fragment shaders are handwritten in machine code, and I interface with the hardware via the IOKit kernel driver in an identical fashion to the system's Metal userspace driver.

The bulk of the new code is responsible for constructing the various command buffers and descriptors resident in shared memory, used to control the GPU's behaviour. Any state accessible from Metal corresponds to bits in these buffers, so understanding them will be the next major task. So far, I have focused less on the content and more on the connections between them. In particular, the structures contain pointers to one another, sometimes nested multiple layers deep. The bring-up process for the project's triangle provides a bird's eye view of how all these disparate pieces in memory fit together.
As an example, the application-provided vertex data are in their own buffers. An internal table in yet another buffer points to each of these vertex buffers. That internal table is passed directly as input to the vertex shader, specified in another buffer. That description of the vertex shader, including the address of the code in executable memory, is pointed to by another buffer, itself referenced from the main command buffer, which is referenced by a handle in the IOKit call to submit a command buffer. Whew!
In other words, the demo code is not yet intended to demonstrate an understanding of the fine-grained details of the command buffers, but rather to demonstrate there is "nothing missing". Since GPU virtual addresses change from run to run, the demo validates that all of the pointers required are identified and can be relocated freely in memory using our own (trivial) allocator. As there is a bit of "magic" around memory and command buffer allocation on macOS, having this code working at an early stage gives peace of mind going forward.
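To visualize the chain described above, here is a purely illustrative sketch of the nesting; every structure and field name is invented, and only the shape of the pointer graph reflects what was observed:

```c
#include <stdint.h>

typedef uint64_t gpu_va_t;   /* GPU virtual address in shared memory */

/* Invented names; only the nesting mirrors the observed layout. */
struct vertex_buffer_table {
    gpu_va_t vertex_buffers[8];      /* -> application-provided vertex data */
};

struct vertex_shader_descriptor {
    gpu_va_t code;                   /* -> shader machine code (executable) */
    gpu_va_t buffer_table;           /* -> struct vertex_buffer_table */
};

struct render_commands {
    gpu_va_t vertex_shader;          /* -> struct vertex_shader_descriptor */
    /* ... many more unknown fields ... */
};

struct command_buffer {
    gpu_va_t commands;               /* -> struct render_commands; the buffer
                                        itself is referenced by handle in the
                                        IOKit submit call */
};
```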
I employed a piecemeal bring-up process. Since my IOKit wrapper exists in the same address space as the Metal application, the wrapper may modify command buffers just before submission to the GPU. As an early "hello world", I identified the encoding of the render target's clear colour in memory, and demonstrated that I could modify the colour as I pleased. Similarly, while learning about the instruction set to bring up the disassembler, I replaced shaders with handwritten equivalents and confirmed I could execute code on the GPU, provided I wrote out the machine code. But it's not necessary to stop at these "leaf nodes" of the system; after modifying the shader code, I tried uploading shader code to a different part of the executable buffer while modifying the command buffer's pointer to the code to compensate. After that, I could try uploading the commands for the shader myself. Iterating in this fashion, I could build up every structure needed while testing each in isolation.
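A sketch of that "hello world" patch: from inside the wrapper, overwrite the clear colour bytes in the mapped command memory right before forwarding the submit call. The offset and the four-float RGBA encoding are assumptions here; in practice they were found by diffing memory dumps:

```c
#include <stddef.h>
#include <string.h>

/* Overwrite the render target's clear colour in mapped command memory just
 * before the wrapped submit call forwards it to the kernel. */
static void
patch_clear_colour(void *mapped_commands, size_t colour_offset)
{
    const float hot_pink[4] = { 1.0f, 0.0f, 0.5f, 1.0f };

    memcpy((char *)mapped_commands + colour_offset, hot_pink, sizeof(hot_pink));
}
```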
Despite curveballs, this procedure worked out far better than the alternative of jumping straight to constructing buffers, perhaps via a "replay". I had used that alternate technique to bring up Mali a few years back, but it comes with the substantial drawback of fiendishly difficult debugging. If there were a single typo in five hundred lines of magic numbers, there would be no feedback beyond an error from the GPU. However, by working one bit at a time, errors could be pinpointed and fixed immediately, providing a faster turnaround and a more pleasant bring-up experience.
But curveballs there were! My momentary elation at modifying the clear colours disappeared when I attempted to allocate a buffer for the colours. Despite encoding the same bits as before, the GPU would fail to clear correctly. Wondering if there was something wrong with the way I modified the pointer, I tried placing the colour in an unused part of memory that was already created by the Metal driver — that worked. The contents were the same, the way I modified the pointers was the same, but somehow the GPU didn't like my memory allocation. I wondered if there was something wrong with the way I allocated memory, but the arguments I used to invoke the memory allocation IOKit call were bit-identical to those used by Metal, as confirmed by `wrap`. My last-ditch effort was checking if GPU memory had to be mapped explicitly via some side channel, like the `mmap` system call. IOKit does feature a device-independent memory map call, but no amount of fortified tracing found any evidence of side-channel system call mappings.
Trouble was brewing. Feeling delirious after so much time chasing an "impossible" bug, I wondered if the "magic" wasn't in the system call at all, but rather in the GPU memory itself. It was a silly theory, since if true it produces a serious chicken-and-egg problem: if a GPU allocation has to be blessed by another GPU allocation, who blesses the first allocation?
But feeling silly and perhaps desperate, I pressed forward to test the theory by inserting a memory allocation call in the middle of the application flow, such that every subsequent allocation would be at a different address. Dumping GPU memory before and after this change and checking for differences revealed my first horror: an auxiliary buffer in GPU memory tracked all of the required allocations. In particular, I noticed values in this buffer increasing by one at a predictable offset (every `0x40` bytes), suggesting that the buffer contained an array of handles to allocations. Indeed, these values corresponded exactly to handles returned from the kernel on GPU memory allocation calls.
Putting aside the obvious problems with this theory, I tested it anyway, modifying this table to include an extra entry at the end with the handle of my new allocation, and modifying the header data structure to bump the number of entries by one. Still no dice. Discouraging as it was, that did not sink the theory entirely. In fact, I noticed something peculiar about the entries: contrary to what I thought, not all of them corresponded to valid handles. No, all but the last entry were valid. The handles from the kernel are 1-indexed, yet in each memory dump, the final handle was always `0`, nonexistent. Perhaps this acts as a sentinel value, analogous to NULL-terminated strings in C. But that explanation raises the question: why? If the header already contains a count of entries, a sentinel value is redundant.
I pressed on. Instead of adding an extra entry with my handle, I copied the last entry `n` to the extra entry `n + 1` and overwrote the (now second-to-last) entry `n` with the new handle.
Suddenly my clear colour showed up.
Is the mystery solved? I got the code working, so in some sense, the answer must be yes. But this is hardly a satisfying explanation; at every step, the unlikely solution only raises more questions. The chicken-and-egg problem is the easiest to resolve: this mapping table, along with the root command buffer, is allocated via a special IOKit selector independent from the general buffer allocation, and the handle to the mapping table is passed along with the submit command buffer selector. Further, the idea of passing required handles with command buffer submission is not unheard of; a similar mechanism is used on mainline Linux drivers. Nevertheless, the rationale for using 64-byte table entries in shared memory, as opposed to a simple CPU-side array, remains totally elusive.
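For concreteness, here is an illustrative sketch of the manipulation that finally worked. The 0x40-byte stride, the 1-indexed handles, and the trailing zero sentinel come from the memory dumps; the structure and field names are guesses:

```c
#include <stdint.h>

/* Illustrative layout of the allocation table observed in GPU memory:
 * 0x40-byte entries, 1-indexed kernel handles, and a final handle of 0
 * acting as a sentinel. Everything beyond the handle is a placeholder. */
struct map_entry {
    uint32_t handle;                           /* 0 = sentinel */
    uint8_t  unknown[0x40 - sizeof(uint32_t)];
};

struct map_header {
    uint32_t num_entries;                      /* includes the sentinel */
    /* ... unknown fields ... */
};

/* Insert a new handle just before the sentinel, as described above. */
static void
map_insert_handle(struct map_header *hdr, struct map_entry *entries,
                  uint32_t new_handle)
{
    uint32_t n = hdr->num_entries;

    entries[n] = entries[n - 1];        /* move the sentinel down one slot */
    entries[n - 1].handle = new_handle; /* take its old place */
    hdr->num_entries = n + 1;
}
```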
Putting memory allocation woes behind me, the road ahead was not without bumps (and potholes), but with patience, I iterated until I had constructed the entirety of GPU memory myself in parallel to Metal, relying on the proprietary userspace only to initialize the device. Finally, all that remained was a leap of faith to kick off the IOKit handshake myself, and I had my first triangle.
These changes amount to around 1700 lines of code since the last blog post, available on GitHub. I've pieced together a simple demo animating a triangle with the GPU on-screen. The window system integration is effectively nonexistent at this point: XQuartz is required and detiling the (64x64 Morton-order interleaved) framebuffer occurs in software with naive scalar code. Nevertheless, the M1's CPU is more than fast enough to cope.
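For the curious, the software detiling boils down to undoing the Morton (Z-order) interleaving inside each 64x64 tile. A naive scalar sketch follows; the exact bit order (whether x or y takes the even bits) and the 32-bit pixel size are assumptions for illustration:

```c
#include <stdint.h>

/* Interleave the low 6 bits of x and y to form a Morton index within a
 * 64x64 tile (4096 pixels per tile). */
static uint32_t
morton_interleave6(uint32_t x, uint32_t y)
{
    uint32_t out = 0;

    for (unsigned bit = 0; bit < 6; ++bit) {
        out |= ((x >> bit) & 1u) << (2 * bit);
        out |= ((y >> bit) & 1u) << (2 * bit + 1);
    }

    return out;
}

/* Copy one 32-bit pixel from the tiled framebuffer to a linear image.
 * Assumes the framebuffer width is a multiple of 64. */
static void
detile_pixel(uint32_t *linear, const uint32_t *tiled,
             uint32_t width, uint32_t x, uint32_t y)
{
    uint32_t tiles_per_row = width / 64;
    uint32_t tile = (y / 64) * tiles_per_row + (x / 64);
    uint32_t offset = tile * 64 * 64 + morton_interleave6(x % 64, y % 64);

    linear[y * width + x] = tiled[offset];
}
```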
Now that each part of the userspace driver is bootstrapped, going forward we can iterate on the instruction set and the command buffers in isolation. We can tease apart the little details and bit-by-bit transform the code from hundreds of inexplicable magic constants to a real driver. Onwards!

Apple's latest line of Macs includes their in-house "M1" system-on-chip, featuring a custom GPU. This poses a problem for those of us in the Asahi Linux project who wish to run Linux on our devices, as this custom Apple GPU has neither public documentation nor open source drivers. Some speculate it might descend from PowerVR GPUs, as used in older iPhones, while others believe the GPU to be completely custom. But rumours and speculations are no fun when we can peek under the hood ourselves!
A few weeks ago, I purchased a Mac Mini with an M1 GPU as a development target to study the instruction set and command stream, to understand the GPU's architecture at a level not previously publicly understood, and ultimately to accelerate the development of a Mesa driver for the hardware. Today I've reached my first milestone: I now understand enough of the instruction set to disassemble simple shaders with a free and open-source tool chain, released on GitHub here.
The process for decoding the instruction set and command stream of the GPU parallels the process I used for reverse-engineering Mali GPUs in the Panfrost project, originally pioneered by the Lima, Freedreno, and Nouveau free software driver projects. Typically, for Linux or Android driver reverse-engineering, a small wrap library is written and injected into a test application via `LD_PRELOAD`; it hooks key system calls like `ioctl` and `mmap` in order to analyze user-kernel interactions. Once the "submit command buffer" call is issued, the library can dump all (mapped) shared memory for offline analysis.
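As a generic sketch of such a wrap library on Linux (not the exact tracer used here), intercepting `ioctl` looks something like this:

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdarg.h>
#include <stdio.h>

/* Minimal LD_PRELOAD-style wrapper: log every ioctl, then forward it to the
 * real implementation. A real tracer would decode driver-specific requests
 * and dump mapped memory on the submit call. */
int ioctl(int fd, unsigned long request, ...)
{
    static int (*real_ioctl)(int, unsigned long, ...);
    if (!real_ioctl)
        real_ioctl = (int (*)(int, unsigned long, ...))dlsym(RTLD_NEXT, "ioctl");

    va_list ap;
    va_start(ap, request);
    void *argp = va_arg(ap, void *);
    va_end(ap);

    fprintf(stderr, "ioctl(%d, 0x%lx, %p)\n", fd, request, argp);
    return real_ioctl(fd, request, argp);
}
```

Compile it to a shared object (`gcc -shared -fPIC wrap.c -o wrap.so -ldl`) and run the test application with `LD_PRELOAD=./wrap.so`.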
The same overall process will work for the M1, but there are some macOSisms that need to be translated. First, there is no `LD_PRELOAD` on macOS; the equivalent is `DYLD_INSERT_LIBRARIES`, which has some extra security features that are easy enough to turn off for our purposes. Second, while the standard Linux/BSD system calls do exist on macOS, they are not used for graphics drivers. Instead, Apple's own `IOKit` framework is used for both kernel and userspace drivers, with the critical entry point of `IOConnectCallMethod`, an analogue of `ioctl`. These differences are easy enough to paper over, but they do add a layer of distance from the standard Linux tooling.
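One common way to achieve the same interception on macOS is dyld's interposing mechanism, used from a library loaded via `DYLD_INSERT_LIBRARIES`. The sketch below is generic, not the project's actual tracer; it simply logs the selector before forwarding to the real `IOConnectCallMethod`:

```c
#include <stdio.h>
#include <IOKit/IOKitLib.h>

/* Log every IOConnectCallMethod selector, then forward to the real call. */
static kern_return_t
wrap_IOConnectCallMethod(mach_port_t connection, uint32_t selector,
                         const uint64_t *input, uint32_t inputCnt,
                         const void *inputStruct, size_t inputStructCnt,
                         uint64_t *output, uint32_t *outputCnt,
                         void *outputStruct, size_t *outputStructCnt)
{
    fprintf(stderr, "IOConnectCallMethod(selector = %u)\n", selector);

    return IOConnectCallMethod(connection, selector, input, inputCnt,
                               inputStruct, inputStructCnt, output, outputCnt,
                               outputStruct, outputStructCnt);
}

/* dyld reads this section and substitutes our wrapper for the real symbol. */
__attribute__((used, section("__DATA,__interpose")))
static struct { const void *replacement; const void *replacee; } interpose_entry = {
    (const void *)wrap_IOConnectCallMethod,
    (const void *)IOConnectCallMethod,
};
```

Build it as a dylib and load it with `DYLD_INSERT_LIBRARIES`; as noted above, the extra security checks have to be disabled first.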
The bigger issue is orienting ourselves in the IOKit world. Since Linux is under a copyleft license, (legal) kernel drivers are open source, so the `ioctl` interface is public, albeit vendor-specific. macOS's kernel (XNU) being under a permissive license brings no such obligations; the kernel interface is proprietary and undocumented. Even after wrapping `IOConnectCallMethod`, it took some elbow grease to identify the three critical calls: memory allocation, command buffer creation, and command buffer submission. Wrapping the allocation and creation calls is essential for tracking GPU-visible memory (what we are interested in studying), and wrapping the submission call is essential for timing the memory dump.
With those obstacles cleared, we can finally get to the shader binaries, black boxes in themselves. However, the process from here on out is standard: start with the simplest fragment or compute shader possible, make a small change in the input source code, and compare the output binaries. Iterating on this process is tedious but will quickly reveal key structures, including opcode numbers.
The findings of the process documented in the free software disassembler confirm a number of traits of the GPU:
One, the architecture is scalar. Unlike some GPUs that are scalar for 32-bits but vectorized for 16-bits, the M1's GPU is scalar at all bit sizes. Yet Metal optimization resources imply 16-bit arithmetic should be significantly faster, in addition to a reduction of register usage leading to higher thread count (occupancy). This suggests the hardware is superscalar, with more 16-bit ALUs than 32-bit ALUs, allowing the part to benefit from low-precision graphics shaders much more than competing chips can, while removing a great deal of complexity from the compiler.
Two, the architecture seems to handle scheduling in hardware, common among desktop GPUs but less so in the embedded space. This again makes the compiler simpler at the expense of more hardware. Instructions seem to have minimal encoding overhead, unlike other architectures which need to pad out instructions with nops to accommodate highly constrained instruction sets.
Three, various modifiers are supported. Floating-point ALUs can do clamps (saturate), negates, and absolute value modifiers "for free", a common shader architecture trait. Further, most (all?) instructions can type-convert between 16-bit and 32-bit "for free" on both the destination and the sources, which allows the compiler to be much more aggressive about using 16-bit operations without risking conversion overheads. On the integer side, various bitwise complements and shifts are allowed on certain instructions for free. None of this is unique to Apple's design, but it's worth noting all the same.
Finally, not all ALU instructions have the same timing. Instructions like `imad`, used to multiply two integers and add a third, are avoided in favour of repeated `iadd` integer addition instructions where possible. This also suggests a superscalar architecture; software-scheduled designs like those I work on for my day job cannot exploit differences in pipeline length, inadvertently slowing down simple instructions to match the speed of complex ones.
From my prior experience working with GPUs, I expected to find some eldritch horror waiting in the instruction set, ready to balloon compiler complexity. Though the work above currently covers only a small surface area of the instruction set, so far everything seems sound. There are no convoluted optimization tricks, but doing away with the trickery creates a streamlined, efficient design that does one thing and does it well. Maybe Apple's hardware engineers discovered it's hard to beat simplicity.
Alas, a shader tool chain isn't much use without an open source userspace driver. Next up: dissecting the command stream!
Disclaimer: This work is a hobby project conducted based on public information. Opinions expressed may not reflect those of my employer.