| .. SPDX-License-Identifier: GPL-2.0+ |
| |
| ============================ |
| DRM RAS over Generic Netlink |
| ============================ |
| |
| The DRM RAS (Reliability, Availability, Serviceability) interface provides a |
| standardized way for GPU/accelerator drivers to expose error counters and |
| other reliability nodes to user space via Generic Netlink. This allows |
| diagnostic tools, monitoring daemons, or test infrastructure to query hardware |
| health in a uniform way across different DRM drivers. |
| |
| Key Goals: |
| |
| * Provide a standardized RAS solution for GPU and accelerator drivers, enabling |
|   data center monitoring and reliability operations. |
| * Implement a single ``drm-ras`` Generic Netlink family that follows the modern |
|   Netlink YAML specification format and centralizes all RAS-related |
|   communication in one namespace. |
| * Support a basic error counter interface that addresses the immediate, |
|   essential monitoring needs. |
| * Offer a flexible interface that can be extended to support additional types |
|   of RAS data. |
| * Allow multiple nodes per driver, enabling drivers to register separate |
|   nodes for different IP blocks, sub-blocks, or other logical subdivisions |
|   as applicable. |
| |
| Nodes |
| ===== |
| |
| Nodes are logical abstractions representing an error type or error source within |
| the device. Currently, only error counter nodes are supported. |
| |
| Drivers are responsible for registering and unregistering nodes via the |
| drm_ras_node_register() and drm_ras_node_unregister() APIs. |
| |
| Node Management |
| --------------- |
| |
| .. kernel-doc:: drivers/gpu/drm/drm_ras.c |
|    :doc: DRM RAS Node Management |
| |
| .. kernel-doc:: drivers/gpu/drm/drm_ras.c |
|    :internal: |
| |
| Generic Netlink Usage |
| ===================== |
| |
| The interface is implemented as a Generic Netlink family named ``drm-ras``. |
| User space tools can: |
| |
| * List registered nodes with the ``list-nodes`` command. |
| * List all error counters in a node with the ``get-error-counter`` command, |
|   passing ``node-id`` as a parameter. |
| * Query a specific error counter value with the ``get-error-counter`` command, |
|   passing both ``node-id`` and ``error-id`` as parameters. |
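
The three query forms above can be sketched as a small user space workflow.
This is only an illustration under stated assumptions: ``FakeRasClient`` and
its ``do()``/``dump()`` methods are hypothetical stand-ins for a real Generic
Netlink client, and the reply dictionaries simply mirror the attribute names
used in the examples in this document.

```python
# Hypothetical in-memory stand-in for a Generic Netlink client. A real tool
# would send these operations to the kernel instead of reading local data.
class FakeRasClient:
    def __init__(self, nodes, counters):
        self._nodes = nodes        # replies for the list-nodes dump
        self._counters = counters  # node-id -> list of counter replies

    def dump(self, op, attrs):
        if op == "list-nodes":
            return list(self._nodes)
        if op == "get-error-counter":
            return list(self._counters[attrs["node-id"]])
        raise ValueError(f"unsupported dump operation: {op}")

    def do(self, op, attrs):
        if op != "get-error-counter":
            raise ValueError(f"unsupported do operation: {op}")
        for counter in self._counters[attrs["node-id"]]:
            if counter["error-id"] == attrs["error-id"]:
                return dict(counter)
        raise KeyError(attrs["error-id"])


def collect_counters(client):
    """Enumerate nodes, then dump every error counter of each node."""
    result = {}
    for node in client.dump("list-nodes", {}):
        counters = client.dump("get-error-counter",
                               {"node-id": node["node-id"]})
        key = (node["device-name"], node["node-name"])
        result[key] = {c["error-name"]: c["error-value"] for c in counters}
    return result


client = FakeRasClient(
    nodes=[
        {"device-name": "0000:03:00.0", "node-id": 0,
         "node-name": "correctable-errors", "node-type": "error-counter"},
    ],
    counters={0: [
        {"error-id": 1, "error-name": "error_name1", "error-value": 0},
        {"error-id": 2, "error-name": "error_name2", "error-value": 0},
    ]},
)

# Dump form: all counters of every node.
summary = collect_counters(client)
# Do form: one counter, selected by node-id and error-id.
single = client.do("get-error-counter", {"node-id": 0, "error-id": 1})
print(summary)
print(single)
```

Keeping the transport behind a small client object lets the enumeration logic
be exercised without hardware; only the client would need to change to talk to
a real ``drm-ras`` family.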
| |
| YAML-based Interface |
| -------------------- |
| |
| The interface is described in the YAML specification |
| ``Documentation/netlink/specs/drm_ras.yaml``. |
| |
| This specification is used to auto-generate user space bindings via |
| ``tools/net/ynl/pyynl/ynl_gen_c.py``, and it drives the structure of the |
| netlink attributes and operations. |
| |
| Usage Notes |
| ----------- |
| |
| * User space must first enumerate nodes to obtain their IDs. |
| * Node IDs or node names can be used for all further queries, such as error counters. |
| * Error counters can be queried by either the error ID or the error name. |
| * Query parameters should be defined as part of the uAPI to ensure user interface stability. |
| * The interface supports future extension by adding new node types and |
|   additional attributes. |
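
As a sketch of the enumerate-first rule, a tool that prefers to issue queries
by numeric ``node-id`` could map a node name to its ID from the ``list-nodes``
reply. The helper name ``resolve_node_id`` is hypothetical, and the reply
layout is assumed to match the examples in this document.

```python
def resolve_node_id(nodes, node_name, device_name=None):
    """Return the node-id whose name (and, optionally, device) matches."""
    matches = [
        node["node-id"]
        for node in nodes
        if node["node-name"] == node_name
        and (device_name is None or node["device-name"] == device_name)
    ]
    if not matches:
        raise LookupError(f"no node named {node_name!r}")
    if len(matches) > 1:
        # The same node name may exist on several devices.
        raise LookupError(f"node name {node_name!r} is ambiguous")
    return matches[0]


# Sample data shaped like a list-nodes dump reply.
nodes = [
    {"device-name": "0000:03:00.0", "node-id": 0,
     "node-name": "correctable-errors", "node-type": "error-counter"},
    {"device-name": "0000:03:00.0", "node-id": 1,
     "node-name": "uncorrectable-errors", "node-type": "error-counter"},
]

node_id = resolve_node_id(nodes, "uncorrectable-errors")
print(node_id)
```

Passing ``device_name`` narrows the search when several devices register nodes
with the same name.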
| |
| Example: List nodes using ynl |
| |
| .. code-block:: bash |
| |
|     sudo ynl --family drm_ras --dump list-nodes |
|     [{'device-name': '0000:03:00.0', |
|       'node-id': 0, |
|       'node-name': 'correctable-errors', |
|       'node-type': 'error-counter'}, |
|      {'device-name': '0000:03:00.0', |
|       'node-id': 1, |
|       'node-name': 'uncorrectable-errors', |
|       'node-type': 'error-counter'}] |
| |
| Example: List all error counters using ynl |
| |
| .. code-block:: bash |
| |
|     sudo ynl --family drm_ras --dump get-error-counter --json '{"node-id":0}' |
|     [{'error-id': 1, 'error-name': 'error_name1', 'error-value': 0}, |
|      {'error-id': 2, 'error-name': 'error_name2', 'error-value': 0}] |
| |
| Example: Query an error counter for a given node |
| |
| .. code-block:: bash |
| |
|     sudo ynl --family drm_ras --do get-error-counter --json '{"node-id":0, "error-id":1}' |
|     {'error-id': 1, 'error-name': 'error_name1', 'error-value': 0} |
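
A monitoring tool would typically compare successive replies rather than look
at a single value. The following is a minimal sketch of that idea, assuming
replies shaped like the ``get-error-counter`` dump above; the function name
``counter_deltas`` is hypothetical and the values in the second snapshot are
made up for illustration.

```python
def counter_deltas(previous, current):
    """Return {error-name: increase} for counters that grew between snapshots."""
    before = {c["error-id"]: c["error-value"] for c in previous}
    deltas = {}
    for counter in current:
        # Counters absent from the previous snapshot are treated as starting at 0.
        increase = counter["error-value"] - before.get(counter["error-id"], 0)
        if increase > 0:
            deltas[counter["error-name"]] = increase
    return deltas


# First snapshot: the values shown in the dump example above.
previous = [
    {"error-id": 1, "error-name": "error_name1", "error-value": 0},
    {"error-id": 2, "error-name": "error_name2", "error-value": 0},
]
# Second snapshot: illustrative values only, not real hardware data.
current = [
    {"error-id": 1, "error-name": "error_name1", "error-value": 0},
    {"error-id": 2, "error-name": "error_name2", "error-value": 3},
]

deltas = counter_deltas(previous, current)
print(deltas)
```

This sketch ignores counter resets; a decrease between snapshots is simply not
reported.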
| |