HLSL pitfalls

May 24 2020

I've run into several HLSL issues and, luckily, found workarounds or solutions for most of them on the internet.
It doesn't feel right to expect anyone to remember all of these.

Denormalized float, small float/large uint

This is an FXC-only issue.

First, there is asfloat(0x7FFFFF) => 0 from
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson.
That is, asfloat always returns 0 when the bits you feed it form a denormalized pattern.
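To see why 0x7FFFFF counts as a denormalized pattern, here is a minimal CPU-side sketch in Python. The helper names (bits_to_float, is_denormal, asfloat_flushed) are hypothetical, and asfloat_flushed only models the observed result; it says nothing about how FXC arrives at it.

```python
import struct

def bits_to_float(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE-754 float (what asfloat would ideally do)."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def is_denormal(bits: int) -> bool:
    """A float32 is denormal when its exponent field is 0 and its mantissa is not."""
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return exponent == 0 and mantissa != 0

def asfloat_flushed(bits: int) -> float:
    """Hypothetical model of the observed behavior: denormal patterns come out as 0."""
    return 0.0 if is_denormal(bits) else bits_to_float(bits)
```

With these helpers, 0x7FFFFF has a zero exponent field and a fully set mantissa, so it is the largest denormal: a tiny but non-zero value under IEEE-754 that the flushed model turns into 0.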

Then, floats smaller than 1e-6 won't make it into the output.
return 1e-7; => return 0;
I wonder how one can write to an R32_FLOAT correctly.
ShaderPlayground

Similarly, for a uint equal to or larger than 0x00800000 (24 bits),
return 0x00800000; => return 0;
ShaderPlayground

Unlike the denormalized case,
small floats and large uints stay alive during the calculation.
They only become a problem when they are exported to a render target or UAV.

Thank god compile-time constants at least behave the same way.


NonUniformResourceIndex

This is rather simple. From Microsoft Docs

You need to wrap the index in NonUniformResourceIndex when accessing a resource array with a non-uniform index.

tex1[NonUniformResourceIndex(myMaterialID)].Sample(samp[NonUniformResourceIndex(samplerID)], texCoords);

The evil part is that the compiler won't complain if you don't follow this rule.
Then you'll find it breaks on some GPUs.
Further details can be found here:
Direct3D 12 - Watch out for non-uniform resource index! by Adam Sawicki

Update:
PORTING DETROIT: BECOME HUMAN FROM PLAYSTATION® 4 TO PC – PART 2
also mentions this keyword (with annotated ISA!).
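As a rough illustration of why a missing NonUniformResourceIndex can break silently, here is a small Python emulation of a 4-lane wave. Both function names are hypothetical. The "scalarized" behavior (every lane using the first lane's index) is just one plausible failure mode that I assume for illustration; actual behavior depends on the GPU and driver.

```python
def sample_scalarized(indices, textures):
    """Model a GPU that treats the index as uniform: every lane ends up
    using the first lane's index (an assumed failure mode)."""
    first = indices[0]
    return [textures[first] for _ in indices]

def sample_non_uniform(indices, textures):
    """Model what NonUniformResourceIndex asks for: each lane gets the
    resource selected by its own index."""
    return [textures[i] for i in indices]

lane_material_ids = [0, 0, 2, 1]        # divergent index across 4 lanes
textures = ["brick", "wood", "metal"]   # stand-ins for tex1[0..2]

wrong = sample_scalarized(lane_material_ids, textures)
right = sample_non_uniform(lane_material_ids, textures)
```

In this model the last two lanes quietly sample "brick" instead of "metal" and "wood", which matches the kind of bug that only shows up on some hardware.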


WaveReadLaneAt

Since Microsoft updated the documentation for WaveReadLaneAt in June 2020,
I've split this section into two parts, to keep up with the latest version while also keeping some old observations.

After

The resulting value is the result of expr. It will be uniform if laneIndex is uniform.

Now it's clear that WaveReadLaneAt supports both uniform and non-uniform laneIndex. Here's the commit message.

The HLSL compiler team has agreed to this change, which is more consistent with the later specifications, but failed to update this document.

It is hard to say whether having one function work in both scenarios is good or bad.
It relates to the fact that uniformity is already quite unclear in DXIL,
and that lane count is hardware-dependent, if not driver-dependent.
Which Values Are Scalar in a Shader? by Adam Sawicki has a wider discussion of this.
(I actually learned about the update from that article.)

Before

Note that this part was written before the update.

The input lane index must be uniform across the wave.
This function is effectively a broadcast of the value in the laneIndex’th lane.

But what happens if we pass a non-uniform index to it? It works!
Now that's confusing.
Here's a sample on Shader Playground.

DXC simply emits

%6 = call i32 @dx.op.waveReadLaneAt.i32(i32 117, i32 %2, i32 %5)

My guess is that it doesn't actually check whether a variable is uniform, so no error is reported.
So why the restriction in the first place?

https://twitter.com/SebAaltonen/status/1095183824290484226

WaveReadLaneAt must have wave uniform index. Docs say so. SM 6.0 wave intrinsics were modeled after GCN2 hardware (original Xbox One). GCN3 additions (DS_PERMUTE) and Nvidia/Intel equivalents are not exposed. GCN3+ and all Nvidia/Intel DX12 HW support full per lane stuffle.

It turns out that a non-uniform index needs a specific instruction,
and the operation becomes a shuffle rather than a broadcast.
In GLSL/Vulkan, subgroupBroadcast and subgroupShuffle are well defined as separate operations,
as described in the HLSL/GLSL/SPIR-V mapping.
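The difference between the two operations can be sketched with a tiny Python emulation of a wave, where a list holds one value per lane. The function names mirror the GLSL built-ins but are plain Python stand-ins, not real bindings.

```python
def subgroup_broadcast(values, lane_id):
    """Every lane receives the value held by one lane; lane_id is a single, uniform index."""
    return [values[lane_id] for _ in values]

def subgroup_shuffle(values, lane_ids):
    """Each lane receives the value held by the lane it names; lane_ids may differ per lane."""
    return [values[src] for src in lane_ids]

values = [10, 20, 30, 40]                              # one value per lane
broadcast = subgroup_broadcast(values, 2)              # all lanes read lane 2
shuffled = subgroup_shuffle(values, [3, 2, 1, 0])      # each lane reads a different lane
```

When every entry of lane_ids is the same, the shuffle degenerates into a broadcast, which is why a single function can cover both cases at the API level even if the hardware uses different instructions.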

I checked on my Intel HD Graphics 620 (26.20.100.7639) and NVIDIA GeForce RTX 2070 (27.21.14.5638).
Both behave like a shuffle and the results seem correct.
I'm curious what the driver does if the hardware doesn't support shuffle at all.
Luckily, for AMD we have the Radeon GPU Analyzer.
I ran RGA on an AMD Radeon R9 Fury; I'm not sure whether the results vary by GPU or driver, though.
For gfx1010, it generates ds_bpermute_b32, which is a shuffle, as mentioned in the Twitter thread.
For gfx702 (GCN2), the generated code is much longer. Pastebin.

// The HLSL might look like this (SV_GroupThreadID stands for the thread's ID value)
// Every lane keeps looping until it becomes the first active lane, then does the actual work. Nice trick
while (true)
{
	if (WaveReadLaneFirst(SV_GroupThreadID) == SV_GroupThreadID)
	{
		uint uniform_index = WaveReadLaneFirst(divergent_index);
		output = WaveReadLaneAt(input, uniform_index);
		break;
	}
}

A full sample in Shader Playground
Somehow an extra if is needed for DXC to generate a loop correctly…
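To convince myself that the loop really produces a shuffle, it can be emulated on the CPU. This is a minimal Python sketch with a hypothetical function name. It assumes that the value of a lane that has already left the loop is still readable, which matches my reading of the generated code but is not something the HLSL documentation promises.

```python
def shuffle_via_broadcast(values, lane_ids):
    """Emulate a per-lane shuffle using only uniform-index broadcasts."""
    lane_count = len(values)
    active = list(range(lane_count))   # lanes still inside the loop
    output = [None] * lane_count
    broadcasts = 0
    while active:
        first = active[0]                          # WaveReadLaneFirst picks the first active lane
        uniform_index = lane_ids[first]            # its divergent index, now uniform
        broadcast_value = values[uniform_index]    # WaveReadLaneAt with a uniform index
        broadcasts += 1
        output[first] = broadcast_value            # only the first lane keeps the result...
        active.pop(0)                              # ...and breaks out of the loop
    return output, broadcasts

result, steps = shuffle_via_broadcast([10, 20, 30, 40], [3, 2, 1, 0])
```

The result equals a direct shuffle, at the cost of one broadcast per lane, which fits the much longer code generated for GCN2.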

The NVIDIA equivalent seems to be NvShfl/shuffleNV, described as Indexed Any-to-Any Shuffle in Reading Between The Threads: Shader Intrinsics.


globallycoherent

The documentation describes this keyword as

RWStructuredBuffer objects can be prefixed with the storage class globallycoherent. This storage class causes memory barriers and syncs to flush data across the entire GPU such that other groups can see writes. Without this specifier, a memory barrier or sync will only flush a UAV within the current group.

More details can be found in (DirectX-Specs) Global vs Group/Local Coherency on Non-Atomic UAV Reads

Since it is not intuitive when the prefix should be used,
I made a sample to check when it is required for correctness.

The sample works as,

I assume the underlying issue is that UAV access is only coherent at the L1 level by default.
Since there is no other way to flush L1 during shader execution, bypassing it is a reasonable choice.
Test results on my RTX 2070 even suggest that
globallycoherent makes UAV access bypass L1 even when no barrier is present.
Strictly speaking, L1 is per SM on RTX and does not map directly to a shader group either,
but it is still more intuitive to understand globallycoherent in terms of the cache hierarchy.
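The cache-hierarchy intuition can be written down as a toy model. This Python sketch is only my mental model, not how any specific GPU works: it assumes one private L1 per thread group, a shared L2, write-through stores, and that globallycoherent accesses skip L1 entirely. The names ToyGpuMemory and reader_sees are hypothetical.

```python
class ToyGpuMemory:
    """Toy model: one shared L2 plus a private L1 per thread group."""

    def __init__(self, group_count):
        self.l2 = {}
        self.l1 = [dict() for _ in range(group_count)]

    def write(self, group, address, value, globally_coherent):
        self.l2[address] = value                 # writes always reach L2 (write-through assumption)
        if not globally_coherent:
            self.l1[group][address] = value      # and also the writer's own L1

    def read(self, group, address, globally_coherent):
        if globally_coherent:
            return self.l2.get(address, 0)       # bypass L1 entirely
        cache = self.l1[group]
        if address not in cache:
            cache[address] = self.l2.get(address, 0)   # miss: fill L1 from L2
        return cache[address]

def reader_sees(globally_coherent):
    mem = ToyGpuMemory(group_count=2)
    mem.read(1, 0, globally_coherent)            # group 1 touches address 0 first
    mem.write(0, 0, 42, globally_coherent)       # group 0 then writes 42
    return mem.read(1, 0, globally_coherent)     # what does group 1 read now?
```

In this model the second group reads a stale 0 from its own L1 without the prefix, and sees 42 with it.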

In practice, globallycoherent might not be necessary in most cases;
for example, atomic operations ensure global coherence by themselves.
But multi-phase processing, like the mip pyramid generation in the reference,
is where globallycoherent should be taken into consideration.

Reference