|  | .. SPDX-License-Identifier: (GPL-2.0+ OR MIT) | 
|  |  | 
|  | =============== | 
|  | VM_BIND locking | 
|  | =============== | 
|  |  | 
|  | This document attempts to describe what's needed to get VM_BIND locking right, | 
|  | including the userptr mmu_notifier locking. It also discusses some | 
|  | optimizations to get rid of the looping through of all userptr mappings and | 
|  | external / shared object mappings that is needed in the simplest | 
|  | implementation. In addition, there is a section describing the VM_BIND locking | 
|  | required for implementing recoverable pagefaults. | 
|  |  | 
|  | The DRM GPUVM set of helpers | 
|  | ============================ | 
|  |  | 
There is a set of helpers for drivers implementing VM_BIND, and this
set of helpers implements much, but not all, of the locking described
in this document. In particular, it is currently lacking a userptr
implementation. This document does not intend to describe the DRM GPUVM
implementation in detail, but it is covered in :ref:`its own
documentation <drm_gpuvm>`. It is highly recommended for any driver
implementing VM_BIND to use the DRM GPUVM helpers and to extend them if
common functionality is missing.
|  |  | 
|  | Nomenclature | 
|  | ============ | 
|  |  | 
|  | * ``gpu_vm``: Abstraction of a virtual GPU address space with | 
|  | meta-data. Typically one per client (DRM file-private), or one per | 
|  | execution context. | 
* ``gpu_vma``: Abstraction of a GPU address range within a gpu_vm with
associated meta-data. The backing storage of a gpu_vma can either be
a GEM object or anonymous or page-cache pages that are also mapped into
the CPU address space of the process.
|  | * ``gpu_vm_bo``: Abstracts the association of a GEM object and | 
|  | a VM. The GEM object maintains a list of gpu_vm_bos, where each gpu_vm_bo | 
|  | maintains a list of gpu_vmas. | 
* ``userptr gpu_vma or just userptr``: A gpu_vma whose backing store
is anonymous or page-cache pages as described above.
|  | * ``revalidating``: Revalidating a gpu_vma means making the latest version | 
|  | of the backing store resident and making sure the gpu_vma's | 
|  | page-table entries point to that backing store. | 
|  | * ``dma_fence``: A struct dma_fence that is similar to a struct completion | 
|  | and which tracks GPU activity. When the GPU activity is finished, | 
|  | the dma_fence signals. Please refer to the ``DMA Fences`` section of | 
|  | the :doc:`dma-buf doc </driver-api/dma-buf>`. | 
* ``dma_resv``: A struct dma_resv (a.k.a. reservation object) that is used
|  | to track GPU activity in the form of multiple dma_fences on a | 
|  | gpu_vm or a GEM object. The dma_resv contains an array / list | 
|  | of dma_fences and a lock that needs to be held when adding | 
|  | additional dma_fences to the dma_resv. The lock is of a type that | 
|  | allows deadlock-safe locking of multiple dma_resvs in arbitrary | 
|  | order. Please refer to the ``Reservation Objects`` section of the | 
|  | :doc:`dma-buf doc </driver-api/dma-buf>`. | 
|  | * ``exec function``: An exec function is a function that revalidates all | 
|  | affected gpu_vmas, submits a GPU command batch and registers the | 
|  | dma_fence representing the GPU command's activity with all affected | 
|  | dma_resvs. For completeness, although not covered by this document, | 
|  | it's worth mentioning that an exec function may also be the | 
|  | revalidation worker that is used by some drivers in compute / | 
|  | long-running mode. | 
|  | * ``local object``: A GEM object which is only mapped within a | 
|  | single VM. Local GEM objects share the gpu_vm's dma_resv. | 
* ``external object``: a.k.a. shared object: A GEM object which may be shared
|  | by multiple gpu_vms and whose backing storage may be shared with | 
|  | other drivers. | 
|  |  | 
|  | Locks and locking order | 
|  | ======================= | 
|  |  | 
|  | One of the benefits of VM_BIND is that local GEM objects share the gpu_vm's | 
|  | dma_resv object and hence the dma_resv lock. So, even with a huge | 
|  | number of local GEM objects, only one lock is needed to make the exec | 
|  | sequence atomic. | 
|  |  | 
|  | The following locks and locking orders are used: | 
|  |  | 
|  | * The ``gpu_vm->lock`` (optionally an rwsem). Protects the gpu_vm's | 
|  | data structure keeping track of gpu_vmas. It can also protect the | 
|  | gpu_vm's list of userptr gpu_vmas. With a CPU mm analogy this would | 
|  | correspond to the mmap_lock. An rwsem allows several readers to walk | 
|  | the VM tree concurrently, but the benefit of that concurrency most | 
|  | likely varies from driver to driver. | 
|  | * The ``userptr_seqlock``. This lock is taken in read mode for each | 
|  | userptr gpu_vma on the gpu_vm's userptr list, and in write mode during mmu | 
|  | notifier invalidation. This is not a real seqlock but described in | 
|  | ``mm/mmu_notifier.c`` as a "Collision-retry read-side/write-side | 
|  | 'lock' a lot like a seqcount. However this allows multiple | 
|  | write-sides to hold it at once...". The read side critical section | 
|  | is enclosed by ``mmu_interval_read_begin() / | 
|  | mmu_interval_read_retry()`` with ``mmu_interval_read_begin()`` | 
|  | sleeping if the write side is held. | 
|  | The write side is held by the core mm while calling mmu interval | 
|  | invalidation notifiers. | 
|  | * The ``gpu_vm->resv`` lock. Protects the gpu_vm's list of gpu_vmas needing | 
|  | rebinding, as well as the residency state of all the gpu_vm's local | 
|  | GEM objects. | 
|  | Furthermore, it typically protects the gpu_vm's list of evicted and | 
|  | external GEM objects. | 
|  | * The ``gpu_vm->userptr_notifier_lock``. This is an rwsem that is | 
|  | taken in read mode during exec and write mode during a mmu notifier | 
|  | invalidation. The userptr notifier lock is per gpu_vm. | 
* The ``gem_object->gpuva_lock``. This lock protects the GEM object's
list of gpu_vm_bos. This is usually the same lock as the GEM
object's dma_resv, but some drivers protect this list differently;
see below.
* The ``gpu_vm list spinlocks``. Some implementations need these
to be able to update the gpu_vm's evicted- and external-object
lists. For those implementations, the spinlocks are grabbed when the
|  | lists are manipulated. However, to avoid locking order violations | 
|  | with the dma_resv locks, a special scheme is needed when iterating | 
|  | over the lists. | 
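
To make the above concrete, here is a minimal sketch of where these
locks and lists might live. It is illustrative only; the structure
layout and field names are not taken from any particular driver:

.. code-block:: C

    struct gpu_vm {
            /* Protects the gpu_vma tree and the userptr gpu_vma list. */
            struct rw_semaphore lock;
            /* Reservation object shared by all local GEM objects. */
            struct dma_resv *resv;
            struct rw_semaphore userptr_notifier_lock;
            /* Optional spinlock for the evict- and external-object lists. */
            spinlock_t list_lock;
            struct list_head rebind_list;   /* Protected by @resv. */
            struct list_head evict_list;    /* @resv or @list_lock. */
            struct list_head extobj_list;   /* @resv or @list_lock. */
            struct list_head userptr_list;  /* Protected by @lock. */
    };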
|  |  | 
|  | .. _gpu_vma lifetime: | 
|  |  | 
|  | Protection and lifetime of gpu_vm_bos and gpu_vmas | 
|  | ================================================== | 
|  |  | 
The GEM object's list of gpu_vm_bos and the gpu_vm_bo's list of gpu_vmas
are protected by the ``gem_object->gpuva_lock``, which is typically the
|  | same as the GEM object's dma_resv, but if the driver | 
|  | needs to access these lists from within a dma_fence signalling | 
|  | critical section, it can instead choose to protect it with a | 
|  | separate lock, which can be locked from within the dma_fence signalling | 
|  | critical section. Such drivers then need to pay additional attention | 
|  | to what locks need to be taken from within the loop when iterating | 
|  | over the gpu_vm_bo and gpu_vma lists to avoid locking-order violations. | 
|  |  | 
The DRM GPUVM set of helpers provides lockdep asserts that this lock is
held in relevant situations and also provides a means of making itself
aware of which lock is actually used: :c:func:`drm_gem_gpuva_set_lock`.
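
For example, a driver that protects the gpuva list with a
driver-private mutex could register that lock with the helpers at
object-creation time. The following is a sketch only; the
``driver_gem_object`` structure and its ``gpuva_lock`` field are
hypothetical:

.. code-block:: C

    struct driver_gem_object {
            struct drm_gem_object base;
            /* Separate lock protecting base.gpuva.list, lockable from
             * within dma_fence signalling critical sections. */
            struct mutex gpuva_lock;
    };

    static void driver_gem_object_init(struct driver_gem_object *obj)
    {
            mutex_init(&obj->gpuva_lock);
            /* Let lockdep assert that gpuva_lock is held whenever the
             * DRM GPUVM helpers access the gpuva list. */
            drm_gem_gpuva_set_lock(&obj->base, &obj->gpuva_lock);
    }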
|  |  | 
|  | Each gpu_vm_bo holds a reference counted pointer to the underlying GEM | 
|  | object, and each gpu_vma holds a reference counted pointer to the | 
|  | gpu_vm_bo. When iterating over the GEM object's list of gpu_vm_bos and | 
over the gpu_vm_bo's list of gpu_vmas, the ``gem_object->gpuva_lock`` must
not be dropped; otherwise, gpu_vmas attached to a gpu_vm_bo may
disappear without notice, since those are not reference-counted. A
|  | driver may implement its own scheme to allow this at the expense of | 
|  | additional complexity, but this is outside the scope of this document. | 
|  |  | 
|  | In the DRM GPUVM implementation, each gpu_vm_bo and each gpu_vma | 
|  | holds a reference count on the gpu_vm itself. Due to this, and to avoid circular | 
|  | reference counting, cleanup of the gpu_vm's gpu_vmas must not be done from the | 
gpu_vm's destructor. Drivers typically implement a gpu_vm close
|  | function for this cleanup. The gpu_vm close function will abort gpu | 
|  | execution using this VM, unmap all gpu_vmas and release page-table memory. | 
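
A sketch of such a close function follows; the helper names are
hypothetical, in the style of the examples used throughout this
document:

.. code-block:: C

    void gpu_vm_close(struct gpu_vm *gpu_vm)
    {
            /* Stop and wait for GPU execution using this VM. */
            abort_gpu_execution(gpu_vm);

            down_write(&gpu_vm->lock);
            dma_resv_lock(gpu_vm->resv);
            /* Unmapping and unlinking the gpu_vmas drops their
             * references on the gpu_vm_bos, which in turn drop their
             * references on the gpu_vm, breaking the reference cycle
             * before the gpu_vm destructor runs. */
            for_each_gpu_vma_of_gpu_vm(gpu_vm, &gpu_vma)
                    unmap_and_unlink_gpu_vma(&gpu_vma);
            release_page_table_memory(gpu_vm);
            dma_resv_unlock(gpu_vm->resv);
            up_write(&gpu_vm->lock);
    }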
|  |  | 
|  | Revalidation and eviction of local objects | 
|  | ========================================== | 
|  |  | 
|  | Note that in all the code examples given below we use simplified | 
|  | pseudo-code. In particular, the dma_resv deadlock avoidance algorithm | 
|  | as well as reserving memory for dma_resv fences is left out. | 
|  |  | 
|  | Revalidation | 
|  | ____________ | 
|  | With VM_BIND, all local objects need to be resident when the gpu is | 
|  | executing using the gpu_vm, and the objects need to have valid | 
|  | gpu_vmas set up pointing to them. Typically, each gpu command buffer | 
submission is therefore preceded by a re-validation section:
|  |  | 
|  | .. code-block:: C | 
|  |  | 
    dma_resv_lock(gpu_vm->resv);

    // Validation section starts here.
    for_each_gpu_vm_bo_on_evict_list(&gpu_vm->evict_list, &gpu_vm_bo) {
            validate_gem_bo(&gpu_vm_bo->gem_bo);

            // The following list iteration needs the GEM object's
            // dma_resv to be held (it protects the gpu_vm_bo's list of
            // gpu_vmas), but since local GEM objects share the gpu_vm's
            // dma_resv, it is already held at this point.
            for_each_gpu_vma_of_gpu_vm_bo(&gpu_vm_bo, &gpu_vma)
                    move_gpu_vma_to_rebind_list(&gpu_vma, &gpu_vm->rebind_list);
    }

    for_each_gpu_vma_on_rebind_list(&gpu_vm->rebind_list, &gpu_vma) {
            rebind_gpu_vma(&gpu_vma);
            remove_gpu_vma_from_rebind_list(&gpu_vma);
    }
    // Validation section ends here, and job submission starts.

    add_dependencies(&gpu_job, &gpu_vm->resv);
    job_dma_fence = gpu_submit(&gpu_job);

    add_dma_fence(job_dma_fence, &gpu_vm->resv);
    dma_resv_unlock(gpu_vm->resv);
|  |  | 
The reason for having a separate gpu_vm rebind list is that there
might be userptr gpu_vmas that do not map a buffer object but still
need rebinding.
|  |  | 
|  | Eviction | 
|  | ________ | 
|  |  | 
|  | Eviction of one of these local objects will then look similar to the | 
|  | following: | 
|  |  | 
|  | .. code-block:: C | 
|  |  | 
    obj = get_object_from_lru();

    dma_resv_lock(obj->resv);
    for_each_gpu_vm_bo_of_obj(obj, &gpu_vm_bo)
            add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list);

    add_dependencies(&eviction_job, &obj->resv);
    job_dma_fence = gpu_submit(&eviction_job);
    add_dma_fence(job_dma_fence, &obj->resv);

    dma_resv_unlock(obj->resv);
    put_object(obj);
|  |  | 
|  | Note that since the object is local to the gpu_vm, it will share the gpu_vm's | 
|  | dma_resv lock such that ``obj->resv == gpu_vm->resv``. | 
|  | The gpu_vm_bos marked for eviction are put on the gpu_vm's evict list, | 
|  | which is protected by ``gpu_vm->resv``. During eviction all local | 
|  | objects have their dma_resv locked and, due to the above equality, also | 
|  | the gpu_vm's dma_resv protecting the gpu_vm's evict list is locked. | 
|  |  | 
|  | With VM_BIND, gpu_vmas don't need to be unbound before eviction, | 
|  | since the driver must ensure that the eviction blit or copy will wait | 
|  | for GPU idle or depend on all previous GPU activity. Furthermore, any | 
|  | subsequent attempt by the GPU to access freed memory through the | 
|  | gpu_vma will be preceded by a new exec function, with a revalidation | 
section which will make sure all gpu_vmas are rebound. Since the
object's dma_resv is held both by the eviction code and by the exec
function while revalidating, a new exec function cannot race with the
eviction.
|  |  | 
|  | A driver can be implemented in such a way that, on each exec function, | 
|  | only a subset of vmas are selected for rebind.  In this case, all vmas that are | 
|  | *not* selected for rebind must be unbound before the exec | 
|  | function workload is submitted. | 
|  |  | 
|  | Locking with external buffer objects | 
|  | ==================================== | 
|  |  | 
Since external buffer objects may be shared by multiple gpu_vms, they
can't share their reservation object with a single gpu_vm. Instead
|  | they need to have a reservation object of their own. The external | 
|  | objects bound to a gpu_vm using one or many gpu_vmas are therefore put on a | 
|  | per-gpu_vm list which is protected by the gpu_vm's dma_resv lock or | 
|  | one of the :ref:`gpu_vm list spinlocks <Spinlock iteration>`. Once | 
|  | the gpu_vm's reservation object is locked, it is safe to traverse the | 
|  | external object list and lock the dma_resvs of all external | 
|  | objects. However, if instead a list spinlock is used, a more elaborate | 
|  | iteration scheme needs to be used. | 
|  |  | 
|  | At eviction time, the gpu_vm_bos of *all* the gpu_vms an external | 
|  | object is bound to need to be put on their gpu_vm's evict list. | 
|  | However, when evicting an external object, the dma_resvs of the | 
|  | gpu_vms the object is bound to are typically not held. Only | 
the object's private dma_resv can be guaranteed to be held. If there
is a ww_acquire context at hand at eviction time, we could grab those
dma_resvs, but that could cause expensive ww_mutex rollbacks. A simple
option is to just mark the gpu_vm_bos of the evicted gem object with
an ``evicted`` bool that is inspected before the next time the
corresponding gpu_vm's evict list needs to be traversed, for example when
traversing the list of external objects and locking them. At that time,
both the gpu_vm's dma_resv and the object's dma_resv are held, and a
gpu_vm_bo marked evicted can then be added to the gpu_vm's list of
evicted gpu_vm_bos. The ``evicted`` bool is formally protected by the
object's dma_resv.
|  |  | 
The exec function becomes:
|  |  | 
|  | .. code-block:: C | 
|  |  | 
    dma_resv_lock(gpu_vm->resv);

    // External object list is protected by the gpu_vm->resv lock.
    for_each_gpu_vm_bo_on_extobj_list(gpu_vm, &gpu_vm_bo) {
            dma_resv_lock(gpu_vm_bo.gem_obj->resv);
            if (gpu_vm_bo_marked_evicted(&gpu_vm_bo))
                    add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list);
    }

    for_each_gpu_vm_bo_on_evict_list(&gpu_vm->evict_list, &gpu_vm_bo) {
            validate_gem_bo(&gpu_vm_bo->gem_bo);

            for_each_gpu_vma_of_gpu_vm_bo(&gpu_vm_bo, &gpu_vma)
                    move_gpu_vma_to_rebind_list(&gpu_vma, &gpu_vm->rebind_list);
    }

    for_each_gpu_vma_on_rebind_list(&gpu_vm->rebind_list, &gpu_vma) {
            rebind_gpu_vma(&gpu_vma);
            remove_gpu_vma_from_rebind_list(&gpu_vma);
    }

    add_dependencies(&gpu_job, &gpu_vm->resv);
    job_dma_fence = gpu_submit(&gpu_job);

    add_dma_fence(job_dma_fence, &gpu_vm->resv);
    for_each_external_obj(gpu_vm, &obj)
            add_dma_fence(job_dma_fence, &obj->resv);
    dma_resv_unlock_all_resv_locks();
|  |  | 
|  | And the corresponding shared-object aware eviction would look like: | 
|  |  | 
|  | .. code-block:: C | 
|  |  | 
    obj = get_object_from_lru();

    dma_resv_lock(obj->resv);
    for_each_gpu_vm_bo_of_obj(obj, &gpu_vm_bo)
            if (object_is_vm_local(obj))
                    add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list);
            else
                    mark_gpu_vm_bo_evicted(&gpu_vm_bo);

    add_dependencies(&eviction_job, &obj->resv);
    job_dma_fence = gpu_submit(&eviction_job);
    add_dma_fence(job_dma_fence, &obj->resv);

    dma_resv_unlock(obj->resv);
    put_object(obj);
|  |  | 
|  | .. _Spinlock iteration: | 
|  |  | 
|  | Accessing the gpu_vm's lists without the dma_resv lock held | 
|  | =========================================================== | 
|  |  | 
|  | Some drivers will hold the gpu_vm's dma_resv lock when accessing the | 
|  | gpu_vm's evict list and external objects lists. However, there are | 
|  | drivers that need to access these lists without the dma_resv lock | 
|  | held, for example due to asynchronous state updates from within the | 
|  | dma_fence signalling critical path. In such cases, a spinlock can be | 
|  | used to protect manipulation of the lists. However, since higher level | 
|  | sleeping locks need to be taken for each list item while iterating | 
|  | over the lists, the items already iterated over need to be | 
|  | temporarily moved to a private list and the spinlock released | 
|  | while processing each item: | 
|  |  | 
.. code-block:: C
|  |  | 
    struct list_head still_in_list;

    INIT_LIST_HEAD(&still_in_list);

    spin_lock(&gpu_vm->list_lock);
    do {
            struct list_head *entry = list_first_entry_or_null(&gpu_vm->list, head);

            if (!entry)
                    break;

            list_move_tail(entry, &still_in_list);
            list_entry_get_unless_zero(entry);
            spin_unlock(&gpu_vm->list_lock);

            process(entry);

            spin_lock(&gpu_vm->list_lock);
            list_entry_put(entry);
    } while (true);

    list_splice_tail(&still_in_list, &gpu_vm->list);
    spin_unlock(&gpu_vm->list_lock);
|  |  | 
Due to the additional locking and atomic operations, drivers that *can*
avoid accessing the gpu_vm's list outside of the dma_resv lock
might want to avoid this iteration scheme as well, particularly if the
driver anticipates a large number of list items. For lists where the
anticipated number of list items is small, where list iteration doesn't
happen very often, or where there is a significant additional cost
associated with each iteration anyway, the atomic operation overhead
associated with this type of iteration is most likely negligible. Note that
if this scheme is used, it is necessary to make sure this list
iteration is protected by an outer-level lock or semaphore, since list
items are temporarily pulled off the list while iterating. It is
also worth mentioning that the local list ``still_in_list`` should
be considered protected by the ``gpu_vm->list_lock``, and it is
thus possible for items to be removed from the local list
concurrently with list iteration.
|  |  | 
|  | Please refer to the :ref:`DRM GPUVM locking section | 
|  | <drm_gpuvm_locking>` and its internal | 
|  | :c:func:`get_next_vm_bo_from_list` function. | 
|  |  | 
|  |  | 
|  | userptr gpu_vmas | 
|  | ================ | 
|  |  | 
A userptr gpu_vma is a gpu_vma that, instead of mapping a buffer object to a
GPU virtual address range, directly maps a CPU mm range of anonymous
or file page-cache pages.
A very simple approach would be to just pin the pages using
pin_user_pages() at bind time and unpin them at unbind time, but this
creates a Denial-Of-Service vector, since a single user-space process
would be able to pin down all of system memory, which is not
desirable. (For special use-cases, and assuming proper accounting, pinning
might still be a desirable feature, though.) What we need to do in the
general case is to obtain a reference to the desired pages, make sure
we are notified using an MMU notifier just before the CPU mm unmaps the
pages, dirty them if they are not mapped read-only to the GPU, and
then drop the reference.
When we are notified by the MMU notifier that the CPU mm is about to drop the
pages, we need to stop GPU access to the pages by waiting for VM idle
in the MMU notifier and make sure that, before the next time the GPU
tries to access whatever is now present in the CPU mm range, we unmap
the old pages from the GPU page tables and repeat the process of
obtaining new page references. (See the :ref:`notifier example
<Invalidation example>` below.) Note that when the core mm decides to
launder pages, we get such an unmap MMU notification and can mark the
pages dirty again before the next GPU access. We also get similar MMU
notifications for NUMA accounting, which the GPU driver doesn't really
need to care about, but so far it has proven difficult to exclude
certain notifications.
|  |  | 
Using an MMU notifier for device DMA (and other methods) is described in
|  | :ref:`the pin_user_pages() documentation <mmu-notifier-registration-case>`. | 
|  |  | 
|  | Now, the method of obtaining struct page references using | 
|  | get_user_pages() unfortunately can't be used under a dma_resv lock | 
|  | since that would violate the locking order of the dma_resv lock vs the | 
|  | mmap_lock that is grabbed when resolving a CPU pagefault. This means | 
|  | the gpu_vm's list of userptr gpu_vmas needs to be protected by an | 
|  | outer lock, which in our example below is the ``gpu_vm->lock``. | 
|  |  | 
|  | The MMU interval seqlock for a userptr gpu_vma is used in the following | 
|  | way: | 
|  |  | 
|  | .. code-block:: C | 
|  |  | 
    // Exclusive locking mode here is strictly needed only if there are
    // invalidated userptr gpu_vmas present, to avoid concurrent userptr
    // revalidations of the same userptr gpu_vma.
    down_write(&gpu_vm->lock);
    retry:

    // Note: mmu_interval_read_begin() blocks until there is no
    // invalidation notifier running anymore.
    seq = mmu_interval_read_begin(&gpu_vma->userptr_interval);
    if (seq != gpu_vma->saved_seq) {
            obtain_new_page_pointers(&gpu_vma);
            dma_resv_lock(gpu_vm->resv);
            add_gpu_vma_to_revalidate_list(&gpu_vma, &gpu_vm);
            dma_resv_unlock(gpu_vm->resv);
            gpu_vma->saved_seq = seq;
    }

    // The usual revalidation goes here.

    // Final userptr sequence validation may not happen before the
    // submission dma_fence is added to the gpu_vm's resv, from the POV
    // of the MMU invalidation notifier. Hence the
    // userptr_notifier_lock that will make them appear atomic.

    add_dependencies(&gpu_job, &gpu_vm->resv);
    down_read(&gpu_vm->userptr_notifier_lock);
    if (mmu_interval_read_retry(&gpu_vma->userptr_interval,
                                gpu_vma->saved_seq)) {
            up_read(&gpu_vm->userptr_notifier_lock);
            goto retry;
    }

    job_dma_fence = gpu_submit(&gpu_job);

    add_dma_fence(job_dma_fence, &gpu_vm->resv);

    for_each_external_obj(gpu_vm, &obj)
            add_dma_fence(job_dma_fence, &obj->resv);

    dma_resv_unlock_all_resv_locks();
    up_read(&gpu_vm->userptr_notifier_lock);
    up_write(&gpu_vm->lock);
|  |  | 
|  | The code between ``mmu_interval_read_begin()`` and the | 
|  | ``mmu_interval_read_retry()`` marks the read side critical section of | 
|  | what we call the ``userptr_seqlock``. In reality, the gpu_vm's userptr | 
|  | gpu_vma list is looped through, and the check is done for *all* of its | 
|  | userptr gpu_vmas, although we only show a single one here. | 
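
As a sketch, the final sequence validation for all userptr gpu_vmas
might look like the following, assuming a ``gpu_vm->userptr_list``
protected by the ``gpu_vm->lock`` and a hypothetical iteration helper:

.. code-block:: C

    down_read(&gpu_vm->userptr_notifier_lock);
    for_each_userptr_gpu_vma(&gpu_vm->userptr_list, &gpu_vma) {
            if (mmu_interval_read_retry(&gpu_vma->userptr_interval,
                                        gpu_vma->saved_seq)) {
                    up_read(&gpu_vm->userptr_notifier_lock);
                    goto retry;
            }
    }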
|  |  | 
|  | The userptr gpu_vma MMU invalidation notifier might be called from | 
|  | reclaim context and, again, to avoid locking order violations, we can't | 
|  | take any dma_resv lock nor the gpu_vm->lock from within it. | 
|  |  | 
|  | .. _Invalidation example: | 
|  | .. code-block:: C | 
|  |  | 
    bool gpu_vma_userptr_invalidate(userptr_interval, cur_seq)
    {
            // Make sure the exec function either sees the new sequence
            // and backs off or we wait for the dma-fence:

            down_write(&gpu_vm->userptr_notifier_lock);
            mmu_interval_set_seq(userptr_interval, cur_seq);
            up_write(&gpu_vm->userptr_notifier_lock);

            // At this point, the exec function can't succeed in
            // submitting a new job, because cur_seq is an invalid
            // sequence number and will always cause a retry. When all
            // invalidation callbacks are done, the mmu notifier core
            // will flip the sequence number to a valid one. However,
            // we need to stop gpu access to the old pages here.

            dma_resv_wait_timeout(&gpu_vm->resv, DMA_RESV_USAGE_BOOKKEEP,
                                  false, MAX_SCHEDULE_TIMEOUT);
            return true;
    }
|  |  | 
When this invalidation notifier returns, the GPU can no longer access
the old pages of the userptr gpu_vma and needs to redo the
page-binding before a new GPU submission can succeed.
|  |  | 
|  | Efficient userptr gpu_vma exec_function iteration | 
|  | _________________________________________________ | 
|  |  | 
If the gpu_vm's list of userptr gpu_vmas becomes large, it's
inefficient to iterate through the complete list of userptrs on each
exec function to check whether each userptr gpu_vma's saved
sequence number is stale. A solution to this is to put all
*invalidated* userptr gpu_vmas on a separate gpu_vm list and
only check the gpu_vmas present on this list on each exec
function. This list will then lend itself very well to the spinlock
locking scheme that is
:ref:`described in the spinlock iteration section <Spinlock iteration>`, since
in the mmu notifier, where we add the invalidated gpu_vmas to the
list, it's not possible to take any outer locks like the
``gpu_vm->lock`` or the ``gpu_vm->resv`` lock. Note that the
``gpu_vm->lock`` still needs to be taken while iterating to ensure the list is
complete, as also mentioned in that section.
|  |  | 
|  | If using an invalidated userptr list like this, the retry check in the | 
|  | exec function trivially becomes a check for invalidated list empty. | 
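
A sketch of that check, assuming a hypothetical
``invalidated_userptr_list`` protected by a gpu_vm list spinlock:

.. code-block:: C

    down_read(&gpu_vm->userptr_notifier_lock);
    spin_lock(&gpu_vm->list_lock);
    invalidated = !list_empty(&gpu_vm->invalidated_userptr_list);
    spin_unlock(&gpu_vm->list_lock);
    if (invalidated) {
            up_read(&gpu_vm->userptr_notifier_lock);
            goto retry;
    }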
|  |  | 
|  | Locking at bind and unbind time | 
|  | =============================== | 
|  |  | 
|  | At bind time, assuming a GEM object backed gpu_vma, each | 
|  | gpu_vma needs to be associated with a gpu_vm_bo and that | 
|  | gpu_vm_bo in turn needs to be added to the GEM object's | 
|  | gpu_vm_bo list, and possibly to the gpu_vm's external object | 
|  | list. This is referred to as *linking* the gpu_vma, and typically | 
|  | requires that the ``gpu_vm->lock`` and the ``gem_object->gpuva_lock`` | 
are held. When unlinking a gpu_vma the same locks should be held,
which ensures that when iterating over ``gpu_vmas``, either under
the ``gpu_vm->resv`` or the GEM object's dma_resv, the gpu_vmas
stay alive as long as the lock under which we iterate is not released. For
|  | userptr gpu_vmas it's similarly required that during vma destroy, the | 
|  | outer ``gpu_vm->lock`` is held, since otherwise when iterating over | 
|  | the invalidated userptr list as described in the previous section, | 
|  | there is nothing keeping those userptr gpu_vmas alive. | 
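
A sketch of the linking step at bind time, using the hypothetical
helper names of the previous examples:

.. code-block:: C

    down_write(&gpu_vm->lock);
    dma_resv_lock(obj->resv);       // Or the separate gpuva_lock.
    gpu_vm_bo = get_or_create_gpu_vm_bo(gpu_vm, obj);
    add_gpu_vma_to_gpu_vm_bo(&gpu_vma, gpu_vm_bo);
    if (is_external_object(obj))
            add_obj_to_extobj_list(obj, &gpu_vm->extobj_list);
    dma_resv_unlock(obj->resv);
    up_write(&gpu_vm->lock);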
|  |  | 
|  | Locking for recoverable page-fault page-table updates | 
|  | ===================================================== | 
|  |  | 
|  | There are two important things we need to ensure with locking for | 
|  | recoverable page-faults: | 
|  |  | 
|  | * At the time we return pages back to the system / allocator for | 
|  | reuse, there should be no remaining GPU mappings and any GPU TLB | 
|  | must have been flushed. | 
|  | * The unmapping and mapping of a gpu_vma must not race. | 
|  |  | 
Since the unmapping (or zapping) of GPU ptes typically takes place
where it is hard or even impossible to take any outer level locks, we
must either introduce a new lock that is held at both mapping and
unmapping time, or look at the locks we do hold at unmapping time and
make sure that they are held also at mapping time. For userptr
gpu_vmas, the ``userptr_seqlock`` is held in write mode in the mmu
invalidation notifier where zapping happens. Hence, if the
``userptr_seqlock`` as well as the ``gpu_vm->userptr_notifier_lock``
are held in read mode during mapping, the mapping will not race with the
zapping. For GEM object backed gpu_vmas, zapping will take place under
the GEM object's dma_resv, and ensuring that the dma_resv is held also
when populating the page-tables for any gpu_vma pointing to the GEM
object will similarly ensure we are race-free.
|  |  | 
|  | If any part of the mapping is performed asynchronously | 
|  | under a dma-fence with these locks released, the zapping will need to | 
|  | wait for that dma-fence to signal under the relevant lock before | 
|  | starting to modify the page-table. | 
|  |  | 
|  | Since modifying the | 
|  | page-table structure in a way that frees up page-table memory | 
|  | might also require outer level locks, the zapping of GPU ptes | 
|  | typically focuses only on zeroing page-table or page-directory entries | 
|  | and flushing TLB, whereas freeing of page-table memory is deferred to | 
|  | unbind or rebind time. |
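
As a sketch, a zapping function for a GEM object backed gpu_vma under
the constraints above, with hypothetical helpers for the page-table
manipulation and TLB flush:

.. code-block:: C

    void zap_gpu_vma(struct gpu_vma *gpu_vma, struct gem_object *obj)
    {
            dma_resv_assert_held(obj->resv);

            // If page-table updates for this gpu_vma may still be in
            // flight asynchronously under a dma-fence, wait for them
            // under the same lock before modifying the page-table.
            dma_resv_wait_timeout(obj->resv, DMA_RESV_USAGE_KERNEL,
                                  false, MAX_SCHEDULE_TIMEOUT);

            // Only zero the entries and flush the TLB here; freeing
            // page-table memory might require outer level locks and is
            // deferred to unbind or rebind time.
            zero_page_table_entries(gpu_vma);
            flush_gpu_tlb_range(gpu_vma->gpu_vm, gpu_vma->start, gpu_vma->end);
    }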