 ==================
  User Mode Queues
 ==================

 Introduction
 ============

 Similar to the KFD, GPU engine queues move into userspace.  The idea is to let
 user processes manage their submissions to the GPU engines directly, bypassing
 IOCTL calls to the driver to submit work.  This reduces overhead and also allows
 the GPU to submit work to itself.  Applications can set up work graphs of jobs
 across multiple GPU engines without needing trips through the CPU.

 UMDs interface directly with the firmware via per-application shared memory
 areas.  The main vehicle for this is the queue.  A queue is a ring buffer with a
 read pointer (rptr) and a write pointer (wptr).  The UMD writes IP-specific
 packets into the queue and the firmware processes those packets, kicking off
 work on the GPU engines.  The CPU in the application (or another queue or
 device) updates the wptr to tell the firmware how far into the ring buffer to
 process packets, and the rptr provides feedback to the UMD on how far the
 firmware has progressed in executing those packets.  When the wptr and the rptr
 are equal, the queue is idle.
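
 As an illustration only, the rptr/wptr relationship can be modeled in plain C.
 The structure and function names below are hypothetical and are not part of the
 driver or firmware interface.  The sketch also assumes that both pointers count
 dwords, increase monotonically, and are masked with a power-of-two ring size
 when indexing the buffer.

 .. code-block:: c

     #include <stdbool.h>
     #include <stdint.h>

     #define RING_DWORDS 256u /* hypothetical size; must be a power of two */

     /* Hypothetical model of a queue; not the driver's actual layout. */
     struct ring_model {
         uint32_t buf[RING_DWORDS];
         uint64_t wptr; /* advanced by the producer (UMD), in dwords */
         uint64_t rptr; /* advanced by the consumer (firmware), in dwords */
     };

     /* The queue is idle when the consumer has caught up with the producer. */
     static bool ring_idle(const struct ring_model *r)
     {
         return r->wptr == r->rptr;
     }

     /* Dwords that can be written without overwriting unprocessed packets. */
     static uint32_t ring_free_dwords(const struct ring_model *r)
     {
         return RING_DWORDS - (uint32_t)(r->wptr - r->rptr);
     }

     /* Producer side: copy a packet in and advance wptr; false if it won't fit. */
     static bool ring_write(struct ring_model *r, const uint32_t *pkt, uint32_t n)
     {
         if (n > ring_free_dwords(r))
             return false;
         for (uint32_t i = 0; i < n; i++)
             r->buf[(r->wptr + i) & (RING_DWORDS - 1)] = pkt[i];
         r->wptr += n;
         return true;
     }

     /* Consumer side: stand-in for the firmware processing up to wptr. */
     static void ring_consume_all(struct ring_model *r)
     {
         r->rptr = r->wptr;
     }

 In this model a freshly zeroed ``struct ring_model`` is idle, becomes busy after
 ``ring_write()``, and is idle again once ``ring_consume_all()`` has run.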

 Theory of Operation
 ===================

 The various engines on modern AMD GPUs support multiple queues per engine with a
 scheduling firmware which handles dynamically scheduling user queues on the
 available hardware queue slots.  When the number of user queues outnumbers the
 available hardware queue slots, the scheduling firmware dynamically maps and
 unmaps queues based on priority and time quanta.  The state of each user queue
 is managed in the kernel driver in an MQD (Memory Queue Descriptor).  This is a
 buffer in GPU accessible memory that stores the state of a user queue.  The
 scheduling firmware uses the MQD to load the queue state into an HQD (Hardware
 Queue Descriptor) when a user queue is mapped.  Each user queue requires a
 number of additional buffers which represent the ring buffer and any metadata
 needed by the engine for runtime operation.  On most engines this consists of
 the ring buffer itself, a rptr buffer (where the firmware will shadow the rptr
 to userspace), a wptr buffer (where the application will write the wptr for the
 firmware to fetch it), and a doorbell.  A doorbell is a piece of one of the
 device's MMIO BARs which can be mapped to specific user queues.  When the
 application writes to the doorbell, it will signal the firmware to take some
 action. Writing to the doorbell wakes the firmware and causes it to fetch the
 wptr and start processing the packets in the queue. Each 4K page of the doorbell
 BAR supports specific offset ranges for specific engines.  The doorbell of a
 queue must be mapped into the aperture aligned to the IP used by the queue
 (e.g., GFX, VCN, SDMA, etc.).  These doorbell apertures are set up via NBIO
 registers.  Doorbells are 32-bit or 64-bit (depending on the engine) chunks of
 the doorbell BAR.  A 4K doorbell page provides 512 64-bit doorbells for up to
 512 user queues.  A subset of each page is reserved for each IP type supported
 on the device.  The user can query the doorbell ranges for each IP via the INFO
 IOCTL.  See the IOCTL Interfaces section for more information.
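
 The doorbell page arithmetic above can be checked with a small sketch.  The
 helper names are hypothetical, and the per-IP range passed to
 ``doorbell_in_range()`` is a placeholder for whatever the INFO IOCTL reports on
 a given device; no real range values are implied here.

 .. code-block:: c

     #include <stdbool.h>
     #include <stdint.h>

     #define DOORBELL_PAGE_SIZE 4096u /* one 4K doorbell page */

     /* Number of doorbells of the given width (in bytes) in one page. */
     static uint32_t doorbells_per_page(uint32_t doorbell_bytes)
     {
         return DOORBELL_PAGE_SIZE / doorbell_bytes;
     }

     /* Byte offset of doorbell 'index' from the start of its page. */
     static uint32_t doorbell_byte_offset(uint32_t index, uint32_t doorbell_bytes)
     {
         return index * doorbell_bytes;
     }

     /* Hypothetical check that an index lies in an IP's range [first, last]. */
     static bool doorbell_in_range(uint32_t index, uint32_t first, uint32_t last)
     {
         return index >= first && index <= last;
     }

 With 64-bit (8 byte) doorbells, ``doorbells_per_page(8)`` is 512, matching the
 figure above, and the last doorbell in the page sits at byte offset 4088.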

 When an application wants to create a user queue, it allocates the necessary
 buffers for the queue (ring buffer, wptr and rptr, context save areas, etc.).
 These can be separate buffers or all part of one larger buffer.  The application
 would map the buffer(s) into its GPUVM and use the GPU virtual addresses of
 the areas of memory it wants to use for the user queue.  It would also
 allocate a doorbell page for the doorbells used by the user queues.  The
 application would then populate the MQD in the USERQ IOCTL structure with the
 GPU virtual addresses and doorbell index it wants to use.  The user can also
 specify the attributes for the user queue (priority, whether the queue is secure
 for protected content, etc.).  The application would then call the USERQ
 CREATE IOCTL to create the queue using the specified MQD details in the IOCTL.
 The kernel driver then validates the MQD provided by the application and
 translates the MQD into the engine specific MQD format for the IP.  The IP
 specific MQD would be allocated and the queue would be added to the run list
 maintained by the scheduling firmware.  Once the queue has been created, the
 application can write packets directly into the queue, update the wptr, and
 write to the doorbell offset to kick off work in the user queue.
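
 The submission order described above (packets, then wptr, then doorbell) can be
 sketched as follows.  This is a model using ordinary memory, not driver or UMD
 code: ``struct userq_view`` and ``userq_submit()`` are hypothetical names, the
 value written to the doorbell is assumed to be the new wptr, and the C11 fences
 merely stand in for whatever ordering a real implementation needs.  The sketch
 does not check for free space in the ring.

 .. code-block:: c

     #include <stdatomic.h>
     #include <stdint.h>

     /* Hypothetical CPU-side view of one user queue; names are illustrative. */
     struct userq_view {
         uint32_t *ring;              /* mapped ring buffer */
         uint32_t ring_dwords;        /* ring size in dwords, power of two */
         volatile uint64_t *wptr_buf; /* wptr buffer fetched by the firmware */
         volatile uint64_t *doorbell; /* doorbell slot assigned to this queue */
         uint64_t wptr;               /* local copy of the write pointer */
     };

     static void userq_submit(struct userq_view *q, const uint32_t *pkt,
                              uint32_t n)
     {
         /* 1. Write the packet into the ring at the current wptr. */
         for (uint32_t i = 0; i < n; i++)
             q->ring[(q->wptr + i) & (q->ring_dwords - 1)] = pkt[i];
         q->wptr += n;

         /* Packet contents must be visible before the new wptr is published. */
         atomic_thread_fence(memory_order_release);

         /* 2. Update the wptr buffer that the firmware will fetch. */
         *q->wptr_buf = q->wptr;

         /* The wptr must be visible before the firmware is woken. */
         atomic_thread_fence(memory_order_release);

         /* 3. Ring the doorbell (assumed here to carry the new wptr). */
         *q->doorbell = q->wptr;
     }

 Keeping the three steps in this order means that by the time the doorbell write
 wakes the firmware, both the packets and the wptr it fetches are already in
 place.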

 When the application is done with the user queue, it would call the USERQ
 FREE IOCTL to destroy it.  The kernel driver would preempt the queue and
 remove it from the scheduling firmware's run list.  Then the IP specific MQD
 would be freed and the user queue state would be cleaned up.

 Some engines may require the aggregated doorbell too if the engine does not
 support doorbells from unmapped queues.  The aggregated doorbell is a special
 page of doorbell space which wakes the scheduler.  In cases where the engine may
 be oversubscribed, some queues may not be mapped.  If the doorbell is rung when
 the queue is not mapped, the engine firmware may miss the request.  Some
 scheduling firmware may work around this by polling wptr shadows when the
 hardware is oversubscribed; other engines may support doorbell updates from
 unmapped queues.  In the event that neither of these options is available, the
 kernel driver will map a page of aggregated doorbell space into each GPUVM
 space.  The UMD will then update the doorbell and wptr as normal and then write
 to the aggregated doorbell as well.

 Special Packets
 ---------------

 In order to support legacy implicit synchronization, as well as mixed user and
 kernel queues, we need a synchronization mechanism that is secure.  Because
 kernel queues or memory management tasks depend on kernel fences, we need a way
 for user queues to update memory that the kernel can use for a fence, that can't
 be messed with by a bad actor.  To support this, we've added a protected fence
 packet.  This packet works by writing a monotonically increasing value to
 a memory location that only privileged clients have write access to. User
 queues only have read access.  When this packet is executed, the memory location
 is updated and other queues (kernel or user) can see the results.  The
 user application would submit this packet in their command stream.  The actual
 packet format varies from IP to IP (GFX/Compute, SDMA, VCN, etc.), but the
 behavior is the same.  The packet submission is handled in userspace.  The
 kernel driver sets up the privileged memory used for each user queue when the
 application creates the queue.
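
 The fence semantics can be modeled with a short sketch.  The names below are
 hypothetical, and the write protection of the memory location is not modeled at
 all.  The sketch only shows how a monotonically increasing value lets any
 reader decide whether a given point in the command stream has been reached.

 .. code-block:: c

     #include <stdbool.h>
     #include <stdint.h>

     /* Hypothetical model of the protected fence memory location. */
     struct fence_slot {
         uint64_t value; /* last sequence number written by a fence packet */
     };

     /*
      * Stand-in for the effect of executing a protected fence packet.  The
      * model ignores lower values so that the slot never moves backwards.
      */
     static void fence_signal(struct fence_slot *f, uint64_t seq)
     {
         if (seq > f->value)
             f->value = seq;
     }

     /* Reader side: work tagged 'seq' is done once the slot has reached it. */
     static bool fence_is_signaled(const struct fence_slot *f, uint64_t seq)
     {
         return f->value >= seq;
     }

 Because the value only increases, a reader that observes sequence number 3 also
 knows that the work tagged 1 and 2 has completed.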


 Memory Management
 =================

 It is assumed that all buffers mapped into the GPUVM space for the process are
 valid when engines on the GPU are running.  The kernel driver will only allow
 user queues to run when all buffers are mapped.  If there is a memory event that
 requires buffer migration, the kernel driver will preempt the user queues,
 migrate buffers to where they need to be, update the GPUVM page tables and
 invalidate the TLB, and then resume the user queues.

 Interaction with Kernel Queues
 ==============================

 Depending on the IP and the scheduling firmware, you can enable kernel queues
 and user queues at the same time; however, you are limited by the HQD slots.
 Kernel queues are always mapped so any work that goes into kernel queues will
 take priority.  This limits the available HQD slots for user queues.

 Not all IPs will support user queues on all GPUs.  As such, UMDs will need to
 support both user queues and kernel queues depending on the IP.  For example, a
 GPU may support user queues for GFX, compute, and SDMA, but not for VCN, JPEG,
 and VPE.  The kernel driver provides a way to determine if user queues and
 kernel queues are supported on a per-IP basis.  UMDs can query this information
 via the INFO IOCTL and determine whether to use kernel queues or user queues
 for each IP.
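
 A UMD's per-IP decision could look roughly like the sketch below.  The types
 and the function are hypothetical, the two flags stand in for whatever the INFO
 IOCTL actually reports, and preferring user queues when both are available is
 an assumption made for the example rather than a driver requirement.

 .. code-block:: c

     #include <stdbool.h>

     enum queue_kind {
         USE_NONE,         /* the IP cannot be used at all */
         USE_KERNEL_QUEUE, /* submit through kernel queues */
         USE_USER_QUEUE,   /* submit through user queues */
     };

     /* Hypothetical per-IP capabilities as reported by the kernel driver. */
     struct ip_queue_caps {
         bool kernel_queues;
         bool user_queues;
     };

     /* Assumed policy: prefer user queues, fall back to kernel queues. */
     static enum queue_kind pick_queue_kind(struct ip_queue_caps caps)
     {
         if (caps.user_queues)
             return USE_USER_QUEUE;
         if (caps.kernel_queues)
             return USE_KERNEL_QUEUE;
         return USE_NONE;
     }

 For the example above, GFX would report user queue support and select
 ``USE_USER_QUEUE``, while VCN would fall back to ``USE_KERNEL_QUEUE``.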

 Queue Resets
 ============

 For most engines, queues can be reset individually; this is the case for GFX,
 compute, and SDMA queues.  When a hung queue is detected, it can be reset
 either via the scheduling firmware or MMIO.  Since there are no kernel fences
 for most user queues, hangs will usually only be detected when some other
 event happens; e.g., a memory event which requires migration of buffers.  When
 the queues are preempted, if a queue is hung, the preemption will fail.  The
 driver will then look up the queues that failed to preempt, reset them, and
 record which queues are hung.

 On the UMD side, we will add a USERQ QUERY_STATUS IOCTL to query the queue
 status.  The UMD will provide the queue id in the IOCTL and the kernel driver
 will check if it has already recorded the queue as hung (e.g., due to failed
 preemption) and report back the status.
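
 The detection and reporting flow can be sketched with a small model.  All names
 are hypothetical, the ``responsive`` flag stands in for whether a real
 preemption request would succeed, and the reset itself is reduced to recording
 the queue as hung.

 .. code-block:: c

     #include <stdbool.h>
     #include <stddef.h>

     enum queue_state { QUEUE_MAPPED, QUEUE_PREEMPTED, QUEUE_HUNG };

     /* Hypothetical per-queue bookkeeping. */
     struct queue_model {
         enum queue_state state;
         bool responsive; /* stand-in: would a preemption request succeed? */
     };

     /* Try to preempt one queue; fails if the queue is not responding. */
     static bool try_preempt(struct queue_model *q)
     {
         if (!q->responsive)
             return false;
         q->state = QUEUE_PREEMPTED;
         return true;
     }

     /*
      * Preempt every queue (e.g., ahead of a buffer migration).  Queues that
      * fail to preempt are recorded as hung; a real driver would also reset
      * them here.  Returns the number of hung queues found.
      */
     static size_t preempt_all(struct queue_model *qs, size_t n)
     {
         size_t hung = 0;

         for (size_t i = 0; i < n; i++) {
             if (!try_preempt(&qs[i])) {
                 qs[i].state = QUEUE_HUNG;
                 hung++;
             }
         }
         return hung;
     }

     /* What a status query would report back for one queue in this model. */
     static bool queue_is_hung(const struct queue_model *q)
     {
         return q->state == QUEUE_HUNG;
     }

 Note that in this model, as in the description above, a hang is only noticed
 when something triggers a preemption; an idle system may carry a hung queue for
 some time before it is recorded.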

 IOCTL Interfaces
 ================

 GPU virtual addresses used for queues and related data (rptrs, wptrs, context
 save areas, etc.) should be validated by the kernel mode driver to prevent the
 user from specifying invalid GPU virtual addresses.  If the user provides
 invalid GPU virtual addresses or doorbell indices, the IOCTL should return an
 error.  These buffers should also be tracked in the kernel driver so that if
 the user attempts to unmap the buffer(s) from the GPUVM, the unmap call
 returns an error.

 INFO
 ----
 There are several new INFO queries related to user queues in order to query the
 size of user queue metadata needed for a user queue (e.g., context save areas
 or shadow buffers), whether kernel or user queues or both are supported
 for each IP type, and the offsets for each IP type in each doorbell page.

 USERQ
 -----
 The USERQ IOCTL is used for creating, freeing, and querying the status of user
 queues.  It supports 3 opcodes:

 1. CREATE - Create a user queue.  The application provides an MQD-like structure
    that defines the type of queue and associated metadata and flags for that
    queue type.  Returns the queue id.
 2. FREE - Free a user queue.
 3. QUERY_STATUS - Query the status of a queue.  Used to check if the queue is
    healthy or not, e.g., if the queue has been reset. (WIP)

 USERQ_SIGNAL
 ------------
 The USERQ_SIGNAL IOCTL is used to provide a list of sync objects to be signaled.

 USERQ_WAIT
 ----------
 The USERQ_WAIT IOCTL is used to provide a list of sync objects to be waited on.

 Kernel and User Queues
 ======================

 In order to properly validate and test performance, we have a driver option to
 select what type of queues are enabled (kernel queues, user queues or both).
 The user_queue driver parameter allows you to enable kernel queues only (0),
 user queues and kernel queues (1), or user queues only (2).  Enabling user
 queues only will free up static queue assignments that would otherwise be used
 by kernel queues for use by the scheduling firmware.  Some kernel queues are
 required for kernel driver operation and they will always be created.  When the
 kernel queues are not enabled, they are not registered with the drm scheduler
 and the CS IOCTL will reject any incoming command submissions which target those
 queue types.  Kernel queues only (0) mirrors the behavior on all existing GPUs.
 Enabling both queue types allows for backwards compatibility with old userspace
 while still supporting user queues.
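
 The three values of the user_queue parameter can be summarized with a small
 sketch.  The enum and helper names are hypothetical and do not correspond to
 symbols in the driver; only the numeric values 0, 1, and 2 come from the
 description above.

 .. code-block:: c

     #include <stdbool.h>

     /* Values of the user_queue parameter as described above. */
     enum user_queue_mode {
         MODE_KERNEL_ONLY = 0, /* kernel queues only */
         MODE_BOTH = 1,        /* user queues and kernel queues */
         MODE_USER_ONLY = 2,   /* user queues only */
     };

     static bool kernel_queues_enabled(int mode)
     {
         return mode == MODE_KERNEL_ONLY || mode == MODE_BOTH;
     }

     static bool user_queues_enabled(int mode)
     {
         return mode == MODE_BOTH || mode == MODE_USER_ONLY;
     }

     /*
      * The CS IOCTL only accepts submissions for queue types whose kernel
      * queues are enabled.  Kernel queues that the driver needs for its own
      * operation are created regardless and are not modeled here.
      */
     static bool cs_ioctl_accepts(int mode)
     {
         return kernel_queues_enabled(mode);
     }

 In this model only mode 2 causes ``cs_ioctl_accepts()`` to return false, which
 is why mode 1 keeps older userspace working.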