HLSL pitfalls

2020/05/24

Here are some pitfalls to watch for when working with HLSL, along with workarounds where available.

Contents

DXC
  1.1  Uninitialized variables
    1.1.1  Vertex / Pixel Shader Output
    1.1.2  UAV Output
FXC / DXC
  2.1  NonUniformResourceIndex
  2.2  WaveReadLaneAt
    2.2.1  After Document Update
    2.2.2  Before Document Update
  2.3  globallycoherent
    2.3.1  Reference
FXC-only
  3.1  Denormalized float, small float/large uint

   

DXC

   

Uninitialized variables

Whether the compiler emits a warning depends on how the variables are used.
Control flow and structs can complicate things further.

   

Vertex / Pixel Shader Output

Normally a warning like the one below is displayed, and the output operation is simply skipped.
-WX (treat warnings as errors) should catch most of these cases.

float4 main() : SV_TARGET0
{
    float4 x;
    return x;
}

warning: Declared output SV_Target0 not fully written in shader. 

When the uninitialized variable goes through some calculation before being output,
the compiler might no longer detect the problem, and leaves an undef in the DXIL.
The actual behavior may vary by GPU and driver; here is a sample.
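
A minimal sketch of such a case, modeled on the UAV example later on this page. The arithmetic is only illustrative, and whether DXC still warns may depend on the compiler version, so treat this as something to verify rather than a guaranteed repro.

```hlsl
float4 main() : SV_TARGET0
{
    float4 x;   // never initialized
    x += 1;     // arithmetic on an undefined value
    return x;   // the output is written, but from an undef-derived value
}
```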

   

UAV Output

Unlike the vertex / pixel shader case, a validation error is triggered for UAV writes.

RWBuffer<float> Output : register(u0);
[numthreads(64, 1, 1)]
void main(uint3 inDispatchThreadID : SV_DispatchThreadID)
{
    float x;
    Output[inDispatchThreadID.x] = x;
}

error: validation errors
<source>:6:34: error: Assignment of undefined values to UAV.
note: at 'call void @dx.op.bufferStore.f32(i32 69, %dx.types.Handle %1, i32 %2, i32 undef, float undef, float undef, float undef, float undef, i8 15)' in block '#0' of function 'main'.
Validation failed.

Again, when the uninitialized variable goes through some calculation before being output,
the shader compiles and leaves an undef in the DXIL.

RWBuffer<float> Output : register(u0);
[numthreads(64, 1, 1)]
void main(uint3 inDispatchThreadID : SV_DispatchThreadID)
{
    float x;
    x += 1;
    Output[inDispatchThreadID.x] = x;
}

call void @llvm.dbg.value(metadata float fadd (float undef, float 1.000000e+00), i64 0, metadata !47, metadata !48), !dbg !49 ; var:"x" !DIExpression() func:"main"
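
The straightforward workaround is to initialize every local at its declaration, so that no path can reach the store with an undefined value. A minimal sketch of the sample above with that fix applied (the initial value 0 is an arbitrary choice for illustration):

```hlsl
RWBuffer<float> Output : register(u0);

[numthreads(64, 1, 1)]
void main(uint3 inDispatchThreadID : SV_DispatchThreadID)
{
    float x = 0;    // explicit initial value, so x is never undef
    x += 1;
    Output[inDispatchThreadID.x] = x;
}
```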
   

FXC / DXC

   

NonUniformResourceIndex

According to Microsoft Docs,
NonUniformResourceIndex is needed when indexing a resource array with a non-uniform index.

However, the compiler won't complain if you omit it,
and you might then get incorrect results.
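
As a rough illustration, here is a minimal pixel shader sketch that indexes an unbounded texture array (Shader Model 5.1 or later). The names Textures, LinearSampler, materialIndex and the MATERIAL_INDEX semantic are hypothetical and chosen only for this example.

```hlsl
// Hypothetical bindings, for illustration only.
Texture2D<float4> Textures[] : register(t0, space0);
SamplerState LinearSampler : register(s0);

float4 main(float2 uv : TEXCOORD0,
            nointerpolation uint materialIndex : MATERIAL_INDEX) : SV_TARGET0
{
    // materialIndex may differ between pixels of the same draw (and the same wave),
    // so the index is marked as non-uniform.
    return Textures[NonUniformResourceIndex(materialIndex)].Sample(LinearSampler, uv);
}
```

Dropping the NonUniformResourceIndex wrapper still compiles, which is exactly why the mistake is easy to miss.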

Further detail can be found at
Direct3D 12 - Watch out for non-uniform resource index! by Adam Sawicki.

Also, PORTING DETROIT: BECOME HUMAN FROM PLAYSTATION® 4 TO PC
mentions this keyword (with annotated ISA!).

   

WaveReadLaneAt

This entry is no longer relevant after this MicrosoftDocs update.

   

After Document Update

The resulting value is the result of expr. It will be uniform if laneIndex is uniform.
Now it's clear that WaveReadLaneAt supports both uniform and non-uniform laneIndex.
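
A minimal sketch of both usages under Shader Model 6.0. The buffer name and the rotate-by-one pattern are assumptions made up for this example, not taken from the documentation.

```hlsl
RWStructuredBuffer<uint> Output : register(u0);

[numthreads(64, 1, 1)]
void main(uint3 inDispatchThreadID : SV_DispatchThreadID)
{
    uint value     = inDispatchThreadID.x;
    uint laneIndex = WaveGetLaneIndex();
    uint laneCount = WaveGetLaneCount();

    // Uniform laneIndex: every lane reads lane 0, i.e. a broadcast.
    uint broadcast = WaveReadLaneAt(value, 0);

    // Non-uniform laneIndex: every lane reads its neighbor, i.e. a shuffle.
    uint neighbor = WaveReadLaneAt(value, (laneIndex + 1) % laneCount);

    Output[inDispatchThreadID.x] = broadcast + neighbor;
}
```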

Here's the commit message:

The HLSL compiler team has agreed to this change, which is more consistent with the later specifications, but failed to update this document.
It is hard to say whether having one function work in both scenarios is good or bad. It relates to the facts that uniformity is already quite unclear in DXIL, and that lane count is hardware-dependent, if not driver-dependent.

Which Values Are Scalar in a Shader? by Adam Sawicki has a wider discussion of this. It's also where I learned about this update.

   

Before Document Update

Before the update, the documentation page said:

“The input lane index must be uniform across the wave.” “This function is effectively a broadcast of the value in the laneIndex’th lane.”
But what happens if we pass a non-uniform index to it? It works! Now that's confusing. Here's a sample on Shader Playground.

DXC just emits

%6 = call i32 @dx.op.waveReadLaneAt.i32(i32 117, i32 %2, i32 %5)

My guess is that it doesn't actually check whether a value is uniform, so no error is reported. So why the restriction in the first place?

https://twitter.com/SebAaltonen/status/1095183824290484226

WaveReadLaneAt must have wave uniform index. Docs say so. SM 6.0 wave intrinsics were modeled after GCN2 hardware (original Xbox One). GCN3 additions (DS_PERMUTE) and Nvidia/Intel equivalents are not exposed. GCN3+ and all Nvidia/Intel DX12 HW support full per lane stuffle.
It turns out that a specific instruction is needed to handle a non-uniform index, and the operation then becomes a shuffle rather than a broadcast. In GLSL/Vulkan, subgroupBroadcast and subgroupShuffle are well defined, as described in the HLSL/GLSL/SPIR-V mapping.

I checked on my Intel HD Graphics 620 (26.20.100.7639) and NVIDIA GeForce RTX 2070 (27.21.14.5638). Both behave like a shuffle and the results seem all right.

I'm curious what the driver does if the hardware doesn't support shuffle. Luckily, for AMD we have the Radeon GPU Analyzer. I ran RGA on an AMD Radeon R9 Fury; I'm not sure whether the results vary by GPU or driver. For gfx1010, it generates ds_bpermute_b32, which is a shuffle as mentioned in the Twitter thread. For gfx702 (GCN2), the generated code is much longer. Pastebin.

// The equivalent HLSL might look like this
// (pseudocode: SV_GroupThreadID stands for the thread's own ID).
// Every lane keeps looping until it becomes the first active lane, then does the actual work. Nice trick.
while (true)
{
    if (WaveReadLaneFirst(SV_GroupThreadID) == SV_GroupThreadID)
    {
        uint uniform_index = WaveReadLaneFirst(divergent_index);
        output = WaveReadLaneAt(input, uniform_index);
        break;
    }
}

A full sample is on Shader Playground. Somehow an extra if is needed for DXC to generate the loop correctly...

The NVIDIA equivalent seems to be NvShfl/shuffleNV, described as “Indexed Any-to-Any Shuffle” in Reading Between The Threads: Shader Intrinsics.

   

globallycoherent

The documentation describes this keyword as:

RWStructuredBuffer objects can be prefixed with the storage class globallycoherent. This storage class causes memory barriers and syncs to flush data across the entire GPU such that other groups can see writes. Without this specifier, a memory barrier or sync will only flush a UAV within the current group.
More detail can be found in Global vs Group/Local Coherency on Non-Atomic UAV Reads.

Since it is not intuitive when the prefix should be used, I made a sample to check when it is required for correctness.

The sample works as,

I assume the underlying issue is that UAV access is only L1-coherent by default. Since there is no other way to flush L1 during shader execution, bypassing it is a reasonable choice.

Test results on my RTX 2070 even suggest that globallycoherent is likely to make UAV access bypass L1 even when no barrier is present. Strictly speaking, L1 is per SM on RTX and does not map directly to a thread group, but thinking in terms of the cache hierarchy is still the more intuitive way to understand globallycoherent.

In practice, globallycoherent might not be necessary in most cases; for example, atomic operations ensure global coherence by themselves. But multi-phase processing, like the mip pyramid generation in the reference, is where globallycoherent should be considered.
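
A hedged sketch of such a multi-phase pattern, loosely following the "last group finishes the job" idea used in single-pass reductions. All resource names, the GroupCount constant, and the assumption that Counter[0] is cleared to 0 before the dispatch are hypothetical; this is not the sample mentioned above.

```hlsl
// Written by every group, read by the last group, hence globallycoherent.
globallycoherent RWStructuredBuffer<float> PartialSums : register(u0);
RWStructuredBuffer<uint>  Counter : register(u1);   // assumed cleared to 0 before dispatch
RWStructuredBuffer<float> Result  : register(u2);

cbuffer Constants : register(b0)
{
    uint GroupCount;    // hypothetical: number of groups in the dispatch
};

groupshared uint IsLastGroup;

[numthreads(64, 1, 1)]
void main(uint3 groupID : SV_GroupID, uint groupIndex : SV_GroupIndex)
{
    if (groupIndex == 0)
    {
        // Phase 1: publish this group's (placeholder) result.
        PartialSums[groupID.x] = (float)groupID.x;
        DeviceMemoryBarrier();  // order the write before the counter increment

        // The atomic tells us how many groups finished before this one.
        uint finishedBefore;
        InterlockedAdd(Counter[0], 1, finishedBefore);
        IsLastGroup = (finishedBefore == GroupCount - 1) ? 1 : 0;
    }
    GroupMemoryBarrierWithGroupSync();

    // Phase 2: only the last group to finish reads what the others wrote.
    if (IsLastGroup != 0 && groupIndex == 0)
    {
        float sum = 0;
        for (uint i = 0; i < GroupCount; ++i)
        {
            sum += PartialSums[i];
        }
        Result[0] = sum;
    }
}
```

Without the globallycoherent prefix on PartialSums, the phase 2 reads might observe stale data written by other groups, which is the situation the quoted documentation describes.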

   

Reference

   

FXC-only

   

Denormalized float, small float/large uint

This is an FXC-only issue.

As described in Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson,
asfloat(0x7FFFFF) => 0
That is, asfloat always gives 0 if you feed it bits that match a denormalized pattern.

Also, floats smaller than 1e-6 won't make their way into the output.
return 1e-7; => return 0;
I wonder how one can write to an R32_FLOAT target correctly.
ShaderPlayground

Similarly for a uint equal to or larger than 0x00800000 (24 bits),
return 0x00800000; => return 0;
ShaderPlayground

Unlike the denormalized case,
small floats and large uints stay alive during calculation.
They only become a problem when exported to a render target or UAV.

At least, this is how compile-time constants behave.
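
To check the three cases in one place, a small compute shader along these lines could be compiled with FXC and its output inspected. The buffer names are hypothetical, and the expected values in the comments are the behavior reported above, not something this sketch guarantees.

```hlsl
RWBuffer<float> FloatOutput : register(u0);
RWBuffer<uint>  UintOutput  : register(u1);

[numthreads(1, 1, 1)]
void main()
{
    FloatOutput[0] = asfloat(0x7FFFFF); // denormal bit pattern, reported to become 0
    FloatOutput[1] = 1e-7;              // small float constant, reported to be exported as 0
    UintOutput[0]  = 0x00800000;        // large uint constant, reported to be exported as 0
}
```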
