IB/uverbs: Add support for passing memory region invalidations to userspace

As discussed in <http://article.gmane.org/gmane.linux.drivers.openib/61925>
and follow-up messages, libraries using RDMA would like to track
precisely when application code changes memory mappings via free(),
munmap(), etc.  Current pure-userspace solutions using malloc hooks
and other tricks are not robust, and the feeling among experts is that
the issue is unfixable without kernel help.

We solve this not by implementing the full API proposed in the email
linked above but rather with a simpler and more generic interface.
Unlike previous attempts at this, which implemented a new generic
character device, this patch works within the existing RDMA userspace
verbs support.  Specifically, we implement three new userspace
operations:

 1. A new version of the "register MR" operation, which creates an MR
    for which the kernel will notify userspace when the virtual
    mapping changes.  We need a new operation for this (rather than,
    say, simply adding another flag to the access_flags field of the
    existing reg_mr operation) because we need to extend the ABI to
    allow userspace to pass in a cookie that will be returned as part
    of invalidation events.

 2. A new version of the "deregister MR" operation that returns the
    number of invalidation events passed to userspace.  Now that we
    generate events for MRs, we need this event count in the destroy
    operation to avoid unfixable races in userspace, for reasons
    exactly analogous to the event counts returned by the existing
    destroy CQ, QP and SRQ operations.

 3. A new command to create an MMU notification file descriptor for a
    userspace verbs context.  We require this FD to be created before
    allowing any MRs with invalidation notification to be registered.
    When an invalidation event occurs, the kernel queues an event on
    this FD that userspace can retrieve with read().

    We also allow userspace to mmap() one page at offset 0 to map a
    kernel page that contains a generation counter that is incremented
    each time an event is queued.  This allows userspace to have a
    fast path that checks that no events have occurred, without
    needing to do a system call.
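
As an illustration only, the generation-counter fast path described in
item 3 might look roughly like the sketch below on the userspace side.
The page layout (a single 64-bit counter at offset 0), the struct names
and the helper functions are all assumptions made for this example and
are not taken from the patch; in real use the page pointer would come
from mmap() on the notification FD, and the slow path would read()
events from that FD and invalidate the MRs identified by the returned
cookies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout of the mapped kernel page.  The commit message only
 * says the page "contains a generation counter"; the 64-bit width and the
 * offset within the page are assumptions for this sketch.
 */
struct mmu_notify_page {
	volatile uint64_t generation;
};

/* Hypothetical userspace registration cache guarded by the counter. */
struct mr_cache {
	const struct mmu_notify_page *page;	/* would come from mmap() on the FD */
	uint64_t last_seen;			/* generation at the last check */
	unsigned int slow_path_calls;		/* how often we fell off the fast path */
};

static void mr_cache_init(struct mr_cache *c, const struct mmu_notify_page *page)
{
	assert(page != NULL);
	c->page = page;
	c->last_seen = page->generation;
	c->slow_path_calls = 0;
}

/*
 * Returns true if no events have been queued since the last check, so
 * cached registrations can be reused without a system call.  Returns
 * false after taking the slow path.
 */
static bool mr_cache_check(struct mr_cache *c)
{
	uint64_t now = c->page->generation;

	if (now == c->last_seen)
		return true;

	/*
	 * Slow path: a real library would read() the queued events from the
	 * notification FD here and drop the affected cache entries.  This
	 * sketch only records that the slow path ran.
	 */
	c->slow_path_calls++;
	c->last_seen = now;
	return false;
}
```

A caller would run mr_cache_check() before reusing a cached registration
and fall back to draining events only when it returns false.  A real
implementation would also have to consider memory ordering between the
counter read and later use of the cache, which this sketch ignores.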

Thanks to Jason Gunthorpe for suggestions on the interface design.  Also
thanks to Jeff Squyres for prototyping support for this in Open MPI, which
helped find several bugs during development.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
7 files changed