**HLSL pitfalls** 2020/05/24 Here are pitfalls when dealing with HLSL, along with workarounds if available.
# DXC ## Uninitialized variables Whether compiler gives warning depends on how the variables are used.
Control flow and struct might also complicate things a bit sometimes.
### Vertex / Pixel Shader Output Normally warning like this is displayed, and output operation is just skipped.
`-WX` should be able to catch most of the cases.
``` float4 main() : SV_TARGET0 { float4 x; return x; } warning: Declared output SV_Target0 not fully written in shader. ``` When the uninitialized variable is used in some calculation before output,
the compiler might no longer be able to detect the usage, and leave an `undef` in DXIL.
The actual behavior may varies on GPU and driver, and here is a sample.
![](1.png width="400px") ### UAV Output Slightly different from vertex / pixel shader, validation error is triggered for UAV. ``` RWBuffer Output : register(u0); [numthreads(64, 1, 1)] void main(uint3 inDispatchThreadID : SV_DispatchThreadID) { float x; Output[inDispatchThreadID.x] = x; } error: validation errors :6:34: error: Assignment of undefined values to UAV. note: at 'call void @dx.op.bufferStore.f32(i32 69, %dx.types.Handle %1, i32 %2, i32 undef, float undef, float undef, float undef, float undef, i8 15)' in block '#0' of function 'main'. Validation failed. ``` Again, when the uninitialized variable is used in some calculation before output,
it compiles and leave an `undef` in DXIL.
``` RWBuffer Output : register(u0); [numthreads(64, 1, 1)] void main(uint3 inDispatchThreadID : SV_DispatchThreadID) { float x; x+=1; Output[inDispatchThreadID.x] = x; } call void @llvm.dbg.value(metadata float fadd (float undef, float 1.000000e+00), i64 0, metadata !47, metadata !48), !dbg !49 ; var:"x" !DIExpression() func:"main" ``` # FXC / DXC ## NonUniformResourceIndex According to [Microsoft Docs](https://docs.microsoft.com/en-us/windows/win32/direct3d12/resource-binding-in-hlsl#resource-types-and-arrays),
`NonUniformResourceIndex` is needed when accessing array with non-uniform index. However, even if you don't follow the instruction,
compiler won't complain about it. Then you might get incorrect results.
Further detail can be found at
[Direct3D 12 - Watch out for non-uniform resource index! by Adam Sawicki](https://www.asawicki.info/news_1608_direct3d_12_-_watch_out_for_non-uniform_resource_index).
Also, [PORTING DETROIT: BECOME HUMAN FROM PLAYSTATION® 4 TO PC](https://gpuopen.com/learn/porting-detroit-2/)
mentioned this keyword (with annotated ISA!).
## WaveReadLaneAt This entry is no longer relevant after this [MicrosoftDocs](https://github.com/MicrosoftDocs/win32/commit/803a98ea702c4fcf8ba2b80e5ab1bca54ebfa05a#diff-64753bb1ca6424ec22c87245f67a8a9b) update. ### After Document Update > "The resulting value is the result of expr. It will be uniform if laneIndex is uniform." Now it's clear `WaveReadLaneAt` supports both uniform or non-uniform laneIndex. Here's the commit message, > "The HLSL compiler team has agreed to this change, which is more consistent with the later specifications, but failed to update this document." It is hard to say having one function to work in both scenarios is good or bad. This relates to the facts that the uniformness is already quite unclear in DXIL, and lane count is hardware-dependent if not driver-dependent. [Which Values Are Scalar in a Shader? by Adam Sawicki](https://asawicki.info/news_1734_which_values_are_scalar_in_a_shader) have a wider discussion over it. It's also where I learned about this udpate. ### Before Document Update Before, the document page says, > "The input lane index must be uniform across the wave." > "This function is effectively a broadcast of the value in the laneIndex’th lane." But what happens if we pass non-uniform index to it? It works! Now That's confusing. Here's a sample on [Shader Playground](http://shader-playground.timjones.io/60ec03f77dfcbfa9f1951327cf80cede). As DXC just gives ``` %6 = call i32 @dx.op.waveReadLaneAt.i32(i32 117, i32 %2, i32 %5) ``` My guess is it doesn't actually check whether a variable is uniform or not, thus no error is reported. So why the restriction in the first place? https://twitter.com/SebAaltonen/status/1095183824290484226 > "WaveReadLaneAt must have wave uniform index. Docs say so. SM 6.0 wave intrinsics were modeled after GCN2 hardware (original Xbox One). GCN3 additions (DS_PERMUTE) and Nvidia/Intel equivalents are not exposed. GCN3+ and all Nvidia/Intel DX12 HW support full per lane stuffle." Turn out it needs a specific instruction to perform on non-uniform index. And it becomes a shuffle rather than a broadcast. In GLSL/Vulkan, `subgroupBroadcast` and `subgroupShuffle` are well defined. As described in [HLSL/GLSL/SPRI-V mapping](https://www.khronos.org/assets/uploads/developers/library/2018-vulkan-devday/06-subgroups.pdf) I checked on my Intel HD Graphics 620 (26.20.100.7639) and NVIDIA GeForce RTX 2070 (27.21.14.5638). They behave like a shuffle and the results seem alright. I'm curious about what will driver do if hardware just don't support shuffle. Luckily for AMD we have the [Radeon GPU Analyzer](https://github.com/GPUOpen-Tools/radeon_gpu_analyzer). I ran RGA on a AMD Radeon R9 Fury, not sure if the results will vary on GPU or driver though. For gfx1010, it generate `ds_bpermute_b32`, which is a shuffle as mentioned in the twitter thread. For gfx702, GCN2, the generated code is much longer. [Pastebin](https://pastebin.com/gr0ivzg3). ``` // The HLSL might look like this // Every lane keeps checking till itself becomes the first lane, then do the actual work. Nice trick while (true) { if (WaveReadLaneFirst(SV_GroupThreadID) == SV_GroupThreadID) { uint uniform_index = WaveReadLaneFirst(divergent_index); output = WaveReadLaneAt(input, uniform_index); break; } } ``` A full sample in [Shader Playground](http://shader-playground.timjones.io/2fc43f77f626c4883e1b1ed23fd8b2e1) Somehow an extra if is needed for DXC to generate a loop correctly... NVIDIA equivalent seems to be `NvShfl/shuffleNV` described as "Indexed Any-to-Any Shuffle" in [Reading Between The Threads: Shader Intrinsics](https://developer.nvidia.com/reading-between-threads-shader-intrinsics). ## globallycoherent The [document](https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/sm5-object-rwstructuredbuffer) describes this this keyword as > "RWStructuredBuffer objects can be prefixed with the storage class globallycoherent. This storage class causes memory barriers and syncs to flush data across the entire GPU such that other groups can see writes. Without this specifier, a memory barrier or sync will only flush a UAV within the current group." More detail can be found in [Global vs Group/Local Coherency on Non-Atomic UAV Reads](https://microsoft.github.io/DirectX-Specs/d3d/archive/D3D11_3_FunctionalSpec.htm#7.14.4%20Global%20vs%20Group/Local%20Coherency%20on%20Non-Atomic%20UAV%20Reads) As it is not so intuitive on when the prefix should be used, I made [a sample](https://github.com/silvesthu/ComputeShaderPlayground/blob/globallycoherent/Shader.hlsl) to check when it must be used to ensure correctness. The sample works as, - Group 0 write uav - Group 1 do some heavy lifting to wait for Group 0 - Group 1 write uav (this write will be ignored by later read if globallycoherent is not used) - Group 0 wait on Group 1 with a counter - Group 0 read uav I assume the underlying issue is, UAV access is L1 coherent by default, and during shader execution, there is no other way to flush L1, bypass is a reasonable choice. Test results on my RTX 2070 even suggested that, globallycoherent is likely to make UAV access bypass L1, even if barrier is absent. Strictly speaking, L1 is per SM on RTX and is not directly mapped to shader group either, but it is still more intuitive to understand globallycoherent from cache hierarchy. Practically, globallycoherent might not be necessary in most cases, e.g. Atomic operations ensure global coherence by themselves. But multi-phase processing, like mip pyramid generation in reference, is where globallycoherent should be taken into consideration. ### Reference - [Related discussion on twitter](https://twitter.com/sebaaltonen/status/1115717152244420608) - [Single pass globallycoherent mip pyramid generation by @SebAaltonen](https://gist.github.com/sebbbi/6cfbec7ab343924dad9b7ee48ef3ba6c) # FXC-only ## Denormalized float, small float/large uint **This is a FXC only issue.** As described in [Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson](https://www.slideshare.net/DevCentralAMD/lowlevel-shader-optimization-for-nextgen-and-dx11-by-emil-persson),
`asfloat(0x7FFFFF) => 0`
That is, `asfloat` always gives 0 if you feed bits fit denormalized pattern.
Also, floats smaller than 1e-6 won't make their way into output.
`return 1e-7; => return 0;`
I wonder how one can write to a R32_FLOAT correctly.
[ShaderPlayground](http://shader-playground.timjones.io/a0d2300aa58d50ccf28afdd8b66edf7a)
Similarly, uint equal or larger than 0x00800000 (24 bit),
`return 0x00800000; => return 0;`
[ShaderPlayground](http://shader-playground.timjones.io/195e9f0e3b92cd603fd4a604fdebe72f)
Unlike the denormalized case,
small float/large uint stay alive during the calculation.
It only becomes a problem when exporting them, to render target or UAV.
At least compile-time constants acts in the same way.
[Back to Index](./../../../../index.html)