**Tools for debugging Device Removed** 2020/10/03 Here's the list of tools for debugging in a Device Removed (gpu crash) condition, along with some related links and tips I've gathered. # Introduction - [Youtube | Aftermath: Advances in GPU Crash Debugging](https://www.youtube.com/watch?v=VaGcs5-W6S4) # Tools ## NVIDIA Aftermath - https://developer.nvidia.com/nsight-aftermath - SDK provided for integration - Generate GPU dump .nv-gpudmp on crash - Show running shaders with sources associated (if available) - Show GPU state (Graphics Pipeline, Vertex/Tessellation/Geometry, etc.) - Show CommandList checkpoints (similar to breadcrumbs) - Show MMU fault information (if available) - Access type (read/write/atomic), originating unit - Failing warp and instruction - nv-nsight-remote-monitor.exe allows to enable Aftermath os-wide with hook - Works only on NVIDIA Pascal architecture GPUs and newer - Documentation - https://developer.nvidia.com/nsight-aftermath - https://github.com/NVIDIA/nsight-aftermath-samples ## AMD Radeon GPU Detective - https://github.com/GPUOpen-Tools/radeon_gpu_detective - Documentation - [Quickstart Guide](https://github.com/GPUOpen-Tools/radeon_gpu_detective/blob/master/documentation/source/quickstart.rst) - [Help Manual](https://github.com/GPUOpen-Tools/radeon_gpu_detective/blob/master/documentation/source/help_manual.rst) ## DebugLayer & GPU Based Validation (GBV) - Prevention is better than cure! - Both can be enabled on a PIX GPU capture - Both slow down application a lot, which may affect reproducibility - dxcpl.exe allows os-wide control to enable DebugLayer - GBV takes a long time to convert DXBC to DXIL internally. Might be not for practical use - Documentation - https://docs.microsoft.com/en-us/windows/win32/direct3d12/using-d3d12-debug-layer-gpu-based-validation - https://docs.microsoft.com/en-us/windows/win32/api/d3d12sdklayers/ne-d3d12sdklayers-d3d12_debug_feature - https://docs.microsoft.com/en-us/windows/win32/api/d3d12sdklayers/ns-d3d12sdklayers-d3d12_debug_device_gpu_based_validation_settings - https://docs.microsoft.com/en-us/windows/win32/api/d3d12sdklayers/ns-d3d12sdklayers-d3d12_debug_device_gpu_slowdown_performance_factor ## PIX - Debug TDR if capturable - Documentation - https://devblogs.microsoft.com/pix/tdr-debugging/ ## GPUView - Show "GPU Hardware Queue" and "Context CPU Queue" if capturable - Good way to find out which queue caused TDR - Documentation - https://docs.microsoft.com/ja-jp/windows-hardware/drivers/display/using-gpuview ## OpenExistingHeapFromAddress & WriteBufferImmediate - Heap stay alive after a crash - Could be used to implement breadcrumbs - Documentation - https://devblogs.microsoft.com/directx/announcing-new-directx-12-features/ - https://channel9.msdn.com/Events/GDC/GDC-2018/10?ocid=player ## D3D12 Device Removed Extended Data (DRED) - API provided for integration - Auto-Breadcrumbs helps to narrow down problematic command in CommandList - D3DDred.js can extract DRED information from a full dump. Works with both old and new WinDbg - DRED 1.1 (Windows 10 1903) - Page Fault Data provides information on the resource that causes the page fault - DRED 1.2 (Windows 10 2004) - Breadcrumb Context Strings allows annotating breadcrumb with user-defined names - D3DConfig.exe allows os-wide control to enable DRED. Use with D3DDred.js then no integration is needed - Documentation - https://docs.microsoft.com/en-us/windows/win32/direct3d12/use-dred - https://microsoft.github.io/DirectX-Specs/d3d/DeviceRemovedExtendedData.html - https://devblogs.microsoft.com/directx/dred-v1-2-supports-pix-marker-and-event-strings-in-auto-breadcrumbs/ - https://github.com/microsoft/DirectX-Debugging-Tools # Reference - Device Removalの処方箋 by Masaya Takeshige - Talk on CEDEC 2020, in Japanese - Detailed information to tools, methods to categorize crash, and also corresponding strategies - [Slide](https://cedil.cesa.or.jp/cedil_sessions/view/2258) - [Supplementary (blog)](https://shikihuiku.github.io/post/cedec2020_prescriptions_for_deviceremoval/) - Tips - Error may also cause TDR - Reproducibility may be affected by Synchronized Command Queue Validation of DebugLayer - GBV need DebugLayer enabled, DRED don't # Related topics - Callback on resource destruction - ID3DDestructionNotifier - [Tweet from @_Humus_](https://twitter.com/_Humus_/status/1141602950948761600) - https://github.com/MicrosoftDocs/sdk-api/pull/393 - Device Removed detection with fence - https://github.com/MicrosoftDocs/sdk-api/pull/391/commits/e88a8b9157ba7c7ce0e883b9aec485c7dae5fe98 [Back to Index](./../../../../index.html)