author | Dave Airlie <airlied@redhat.com> | 2024-04-17 07:48:59 +0200 |
---|---|---|
committer | Dave Airlie <airlied@redhat.com> | 2024-04-17 07:48:59 +0200 |
commit | 34633158b8eb8fca145c9a73f8fe4f98c7275b06 (patch) | |
tree | a8e0e2d55dff19f68a1c6842142255e9deaf2d7d /Documentation/gpu | |
parent | Merge tag 'drm-misc-next-2024-04-10' of https://gitlab.freedesktop.org/drm/mi... (diff) | |
parent | drm/amd/display: Add a function for checking tmds mode (diff) | |
Merge tag 'amd-drm-next-6.10-2024-04-13' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-6.10-2024-04-13:
amdgpu:
- HDCP fixes
- ODM fixes
- RAS fixes
- Devcoredump improvements
- Misc code cleanups
- Expose VCN activity via sysfs
- SMU 13.0.x updates
- Enable fast updates on DCN 3.1.4
- Add dclk and vclk reporting on additional devices
- Add ACA RAS infrastructure
- Implement TLB flush fence
- EEPROM handling fixes
- SMUIO 14.0.2 support
- SMU 14.0.1 Updates
- Sync page table freeing with TLB flushes
- DML2 refactor
- DC debug improvements
- SR-IOV fixes
- Suspend and Resume fixes
- DCN 3.5.x Updates
- Z8 fixes
- UMSCH fixes
- GPU reset fixes
- HDP fix for second GFX pipe on GC 10.x
- Enable secondary GFX pipe on GC 10.3
- Refactor and clean up BACO/BOCO/BAMACO handling
- VCN partitioning fix
- DC DWB fixes
- VSC SDP fixes
- DCN 3.1.6 fix
- GC 11.5 fixes
- Remove invalid TTM resource start check
- DCN 1.0 fixes
amdkfd:
- MQD handling cleanup
- Preemption handling fixes for XCDs
- TLB flush fix for GC 9.4.2
- Properly clean up workqueue during module unload
- Fix memory leak in process create failure
- Range check CP bad op exception targets to avoid reporting invalid exceptions to userspace
radeon:
- Misc code cleanups
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240413213708.3427038-1-alexander.deucher@amd.com
Signed-off-by: Dave Airlie <airlied@redhat.com>
Diffstat (limited to 'Documentation/gpu')
-rw-r--r-- | Documentation/gpu/amdgpu/debugging.rst | 80 |
-rw-r--r-- | Documentation/gpu/amdgpu/display/display-contributing.rst | 2 |
-rw-r--r-- | Documentation/gpu/amdgpu/index.rst | 1 |
3 files changed, 82 insertions, 1 deletions
diff --git a/Documentation/gpu/amdgpu/debugging.rst b/Documentation/gpu/amdgpu/debugging.rst
new file mode 100644
index 000000000000..e75f97d0e4ea
--- /dev/null
+++ b/Documentation/gpu/amdgpu/debugging.rst
@@ -0,0 +1,80 @@
+===============
+ GPU Debugging
+===============
+
+GPUVM Debugging
+===============
+
+To aid in debugging GPU virtual memory related problems, the driver supports a
+number of options module parameters:
+
+`vm_fault_stop` - If non-0, halt the GPU memory controller on a GPU page fault.
+
+`vm_update_mode` - If non-0, use the CPU to update GPU page tables rather than
+the GPU.
+
+
+Decoding a GPUVM Page Fault
+===========================
+
+If you see a GPU page fault in the kernel log, you can decode it to figure
+out what is going wrong in your application. A page fault in your kernel
+log may look something like this:
+
+::
+
+ [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32777, for process glxinfo pid 2424 thread glxinfo:cs0 pid 2425)
+   in page starting at address 0x0000800102800000 from IH client 0x1b (UTCL2)
+ VM_L2_PROTECTION_FAULT_STATUS:0x00301030
+     Faulty UTCL2 client ID: TCP (0x8)
+     MORE_FAULTS: 0x0
+     WALKER_ERROR: 0x0
+     PERMISSION_FAULTS: 0x3
+     MAPPING_ERROR: 0x0
+     RW: 0x0
+
+First you have the memory hub, gfxhub and mmhub. gfxhub is the memory
+hub used for graphics, compute, and sdma on some chips. mmhub is the
+memory hub used for multi-media and sdma on some chips.
+
+Next you have the vmid and pasid. If the vmid is 0, this fault was likely
+caused by the kernel driver or firmware. If the vmid is non-0, it is generally
+a fault in a user application. The pasid is used to link a vmid to a system
+process id. If the process is active when the fault happens, the process
+information will be printed.
+
+The GPU virtual address that caused the fault comes next.
+
+The client ID indicates the GPU block that caused the fault.
+Some common client IDs:
+
+- CB/DB: The color/depth backend of the graphics pipe
+- CPF: Command Processor Frontend
+- CPC: Command Processor Compute
+- CPG: Command Processor Graphics
+- TCP/SQC/SQG: Shaders
+- SDMA: SDMA engines
+- VCN: Video encode/decode engines
+- JPEG: JPEG engines
+
+PERMISSION_FAULTS describe what faults were encountered:
+
+- bit 0: the PTE was not valid
+- bit 1: the PTE read bit was not set
+- bit 2: the PTE write bit was not set
+- bit 3: the PTE execute bit was not set
+
+Finally, RW, indicates whether the access was a read (0) or a write (1).
+
+In the example above, a shader (cliend id = TCP) generated a read (RW = 0x0) to
+an invalid page (PERMISSION_FAULTS = 0x3) at GPU virtual address
+0x0000800102800000. The user can then inspect their shader code and resource
+descriptor state to determine what caused the GPU page fault.
+
+UMR
+===
+
+`umr <https://gitlab.freedesktop.org/tomstdenis/umr>`_ is a general purpose
+GPU debugging and diagnostics tool. Please see the umr
+`documentation <https://umr.readthedocs.io/en/main/>`_ for more information
+about its capabilities.
diff --git a/Documentation/gpu/amdgpu/display/display-contributing.rst b/Documentation/gpu/amdgpu/display/display-contributing.rst
index fdb2bea01d53..36f3077eee00 100644
--- a/Documentation/gpu/amdgpu/display/display-contributing.rst
+++ b/Documentation/gpu/amdgpu/display/display-contributing.rst
@@ -135,7 +135,7 @@ Enable underlay
 ---------------
 
 AMD display has this feature called underlay (which you can read more about at
-'Documentation/GPU/amdgpu/display/mpo-overview.rst') which is intended to
+'Documentation/gpu/amdgpu/display/mpo-overview.rst') which is intended to
 save power when playing a video. The basic idea is to put a video in the
 underlay plane at the bottom and the desktop in the plane above it with a hole
 in the video area. This feature is enabled in ChromeOS, and from our data
diff --git a/Documentation/gpu/amdgpu/index.rst b/Documentation/gpu/amdgpu/index.rst
index 912e699fd373..847e04924030 100644
--- a/Documentation/gpu/amdgpu/index.rst
+++ b/Documentation/gpu/amdgpu/index.rst
@@ -15,4 +15,5 @@ Next (GCN), Radeon DNA (RDNA), and Compute DNA (CDNA) architectures.
    ras
    thermal
    driver-misc
+   debugging
    amdgpu-glossary
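
Note on the new debugging.rst: the decoding rules it gives for PERMISSION_FAULTS and RW can be sketched as a small script. This is only an illustration, not part of the patch or of any amdgpu tooling. The function names (decode_permission_faults, decode_rw, describe_fault) are hypothetical, and the input values are assumed to be copied by hand from a kernel log entry like the one quoted in the document.

```python
# Illustrative sketch only: these helpers are hypothetical and are not
# part of the kernel, amdgpu, or umr.

# Bit meanings for PERMISSION_FAULTS, as listed in debugging.rst.
PERMISSION_FAULT_BITS = {
    0: "the PTE was not valid",
    1: "the PTE read bit was not set",
    2: "the PTE write bit was not set",
    3: "the PTE execute bit was not set",
}


def decode_permission_faults(value):
    """Return the description of every bit set in PERMISSION_FAULTS."""
    return [text for bit, text in PERMISSION_FAULT_BITS.items() if value & (1 << bit)]


def decode_rw(value):
    """RW is 0 for a read and 1 for a write."""
    return "write" if value & 1 else "read"


def describe_fault(permission_faults, rw, client, address):
    """Build a one-line summary from values copied out of the kernel log."""
    reasons = "; ".join(decode_permission_faults(permission_faults))
    if not reasons:
        reasons = "no permission fault bits set"
    return f"{client} {decode_rw(rw)} at {address:#018x}: {reasons}"


if __name__ == "__main__":
    # Values from the example fault in debugging.rst.
    print(describe_fault(0x3, 0x0, "TCP", 0x0000800102800000))
```

For the example fault in the document (PERMISSION_FAULTS = 0x3, RW = 0x0, client TCP), bits 0 and 1 are set, so the summary reports a read through a PTE that was not valid and had no read bit.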