| |
| When managing a RAID1 array which uses metadata other than the |
| "native" metadata understood by the kernel, mdadm makes use of a |
| partner program named 'mdmon' to manage some aspects of updating |
| that metadata and synchronising the metadata with the array state. |
| |
| This document provides some details on how mdmon works. |
| |
| Containers |
| ---------- |
| |
| As background: mdadm makes a distinction between an 'array' and a |
| 'container'. Other sources sometimes use the term 'volume' or |
| 'device' for an 'array', and may use the term 'array' for a |
| 'container'. |
| |
| For our purposes: |
| - a 'container' is a collection of devices which are described by a |
| single set of metadata. The metadata may be stored equally |
| on all devices, or different devices may have quite different |
| subsets of the total metadata. But there is conceptually one set |
| of metadata that unifies the devices. |
| |
| - an 'array' is a set of data blocks from various devices which |
| together are used to present the abstraction of a single linear |
| sequence of blocks, which may provide data redundancy or enhanced |
| performance. |
| |
| So a container has some metadata and provides a number of arrays which |
| are described by that metadata. |
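| |
| The relationship between a container, its devices and its arrays can |
| be sketched as a data structure. This is only an illustration: the |
| type names, field names, fixed limits and the helper |
| arrays_using_dev() are hypothetical and are not taken from mdadm. |

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEVS   8   /* illustrative limits, not mdadm's */
#define MAX_ARRAYS 4

/* One array: a RAID level plus the container devices it draws blocks from. */
struct member_array {
    const char *name;
    int level;                 /* RAID level, e.g. 1 */
    int ndevs;
    int dev_index[MAX_DEVS];   /* indices into container.devs */
};

/* One container: a set of devices unified by a single set of metadata,
 * which in turn describes zero or more arrays. */
struct container {
    int ndevs;
    const char *devs[MAX_DEVS];
    int narrays;
    struct member_array arrays[MAX_ARRAYS];
};

/* Count how many arrays in the container use blocks from device 'dev'.
 * A device used by no array is available as a spare. */
int arrays_using_dev(const struct container *c, int dev)
{
    int count = 0;

    assert(dev >= 0 && dev < c->ndevs);
    for (int i = 0; i < c->narrays; i++) {
        for (int j = 0; j < c->arrays[i].ndevs; j++) {
            if (c->arrays[i].dev_index[j] == dev) {
                count++;
                break;   /* count each array at most once */
            }
        }
    }
    return count;
}
```

| In this sketch one device may contribute blocks to several arrays, |
| which is why arrays hold indices into the container's device list |
| rather than owning devices themselves. |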
| |
| Sometimes this model doesn't work perfectly. For example, global |
| spares may have their own metadata which is quite different from the |
| metadata from any device that participates in one or more arrays. |
| Such a global spare might still need to belong to some container so |
| that it is available to be used should a failure arise. In that case |
| we consider the 'metadata' to be the union of the metadata on the |
| active devices which describes the arrays, and the metadata on the |
| global spares which only describes the spares. In this case different |
| devices in the one container will have quite different metadata. |
| |
| |
| Purpose |
| ------- |
| |
| The main purpose of mdmon is to update the metadata in response to |
| changes to the array which need to be reflected in the metadata before |
| future writes to the array can safely be performed. |
| These include: |
| - transitions from 'clean' to 'dirty'. |
| - recording that devices have failed. |
| - recording the progress of a 'reshape'. |
| |
| This requires mdmon to be running at any time that the array is |
| writable (a read-only array does not require mdmon to be running). |
| |
| Because mdmon must be able to process these metadata updates at any |
| time, it must (when running) have exclusive write access to the |
| metadata. Any other changes (e.g. reconfiguration of the array) must |
| go through mdmon. |
| |
| A secondary role for mdmon is to activate spares when a device fails. |
| This role is much less time-critical than the other metadata updates, |
| so it could be performed by a separate process, possibly |
| "mdadm --monitor" which has a related role of moving devices between |
| arrays. A main reason for including this functionality in mdmon is |
| that in the native-metadata case this function is handled in the |
| kernel, and mdmon's reason for existence is to provide functionality |
| which is otherwise handled by the kernel. |
| |
| |
| Design overview |
| --------------- |
| |
| mdmon is structured as two threads with a common address space and |
| common data structures. These threads are known as the 'monitor' and |
| the 'manager'. |
| |
| The 'monitor' has the primary role of monitoring the array for |
| important state changes and updating the metadata accordingly. As |
| writes to the array can be blocked until 'monitor' completes and |
| acknowledges the update, it must be very careful not to block itself. |
| In particular it must not block waiting for any write to complete, |
| else it could deadlock. This means that it must not allocate memory, |
| as doing this can require dirty memory to be written out, and if the |
| system chooses to write to the array that mdmon is monitoring, the |
| memory allocation could deadlock. |
| |
| So 'monitor' must never allocate memory and must limit the number of |
| other system calls it performs. It may: |
| - use select (or poll) to wait for activity on a file descriptor |
| - read from a sysfs file descriptor |
| - write to a sysfs file descriptor |
| - write the metadata out to the block devices using O_DIRECT |
| - send a signal (kill) to the manager thread |
| |
| It must not e.g. open files or do anything similar that might allocate |
| resources. |
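| |
| As a minimal sketch of these restrictions, the routine below re-reads |
| a state file through a descriptor that was opened earlier (by the |
| manager), using only lseek() and read() and a buffer reserved at |
| compile time. The function and enum names are hypothetical. The state |
| strings are modelled on the md sysfs 'array_state' values and should |
| be treated as an assumption here, not as a statement of what mdmon |
| actually parses. |

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

enum array_state {
    STATE_UNKNOWN,
    STATE_CLEAN,
    STATE_ACTIVE,
    STATE_WRITE_PENDING
};

/* Fixed buffer reserved up front, so this path never calls malloc(). */
static char state_buf[32];

/* Map a state string to an enum value (assumed state names). */
enum array_state parse_state(const char *s)
{
    if (strcmp(s, "clean") == 0)
        return STATE_CLEAN;
    if (strcmp(s, "active") == 0)
        return STATE_ACTIVE;
    if (strcmp(s, "write-pending") == 0)
        return STATE_WRITE_PENDING;
    return STATE_UNKNOWN;
}

/* Re-read the state through an already-open descriptor.
 * No open(), no allocation: only lseek() and read(). */
enum array_state read_array_state(int fd)
{
    ssize_t n;

    assert(fd >= 0);
    /* Attribute files are re-read from offset 0; the result is ignored
     * so that the sketch also works on a pipe, where lseek() fails. */
    (void)lseek(fd, 0, SEEK_SET);
    n = read(fd, state_buf, sizeof(state_buf) - 1);
    if (n <= 0)
        return STATE_UNKNOWN;
    state_buf[n] = '\0';
    state_buf[strcspn(state_buf, "\n")] = '\0';   /* strip newline */
    return parse_state(state_buf);
}
```

| The descriptor is passed in, not opened here, because opening files |
| is the manager's job. |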
| |
| The 'manager' thread does everything else that is needed. If any |
| files are to be opened (e.g. because a device has been added to the |
| array), the manager does that. If any memory needs to be allocated |
| (e.g. to hold data about a new array as can happen when one set of |
| metadata describes several arrays), the manager performs that |
| allocation. |
| |
| The 'manager' is also responsible for communicating with mdadm and |
| assigning spares to replace failed devices. |
| |
| |
| Handling metadata updates |
| ------------------------- |
| |
| There are a number of cases in which mdadm needs to update the |
| metadata which mdmon is managing. These include: |
| - creating a new array in an active container |
| - adding a device to a container |
| - reconfiguring an array |
| etc. |
| |
| To complete these updates, mdadm must send a message to mdmon which |
| will merge the update into the metadata as it is at that moment. |
| |
| To achieve this, mdmon creates a Unix Domain Socket which the manager |
| thread listens on. mdadm sends a message over this socket. The |
| manager thread examines the message to see if it will require |
| allocating any memory and, if so, allocates it. This is done in the |
| 'prepare_update' metadata method. |
| |
| The update message is then queued for handling by the monitor thread |
| which it will do when convenient. The monitor thread calls |
| ->process_update which should atomically make the required changes to |
| the metadata, making use of the pre-allocated memory as required. Any |
| memory that is no longer needed can be placed back in the request and |
| the manager thread will free it. |
| |
| The exact format of a metadata update is up to the implementer of the |
| metadata handlers. It will simply describe a change that needs to be |
| made. It will sometimes contain fragments of the metadata to be |
| copied into place. However, the ->process_update routine must make |
| sure not to over-write any field that the monitor thread might have |
| updated, such as a 'device failed' or 'array is dirty' state. |
| |
| When the monitor thread has completed the update and written it to the |
| devices, an acknowledgement message is sent back over the socket so |
| that mdadm knows it is complete. |
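| |
| The request/acknowledgement exchange can be sketched with a pair of |
| connected Unix domain sockets. The wire format used here (a 32-bit |
| length followed by the payload, then a single acknowledgement byte) |
| and all function names are assumptions made for illustration. They |
| are not mdmon's actual protocol. |

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define ACK_BYTE 0x06   /* hypothetical acknowledgement value */

/* Write exactly len bytes, retrying on short writes. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read exactly len bytes, retrying on short reads. */
static int read_all(int fd, void *buf, size_t len)
{
    char *p = buf;

    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* mdadm side: send one update as <length><payload>. */
int send_update(int fd, const void *payload, uint32_t len)
{
    if (write_all(fd, &len, sizeof(len)) < 0)
        return -1;
    return write_all(fd, payload, len);
}

/* mdmon side: receive one update; returns its length or -1. */
int recv_update(int fd, void *buf, uint32_t max)
{
    uint32_t len;

    assert(buf != NULL);
    if (read_all(fd, &len, sizeof(len)) < 0 || len > max)
        return -1;
    if (read_all(fd, buf, len) < 0)
        return -1;
    return (int)len;
}

/* mdmon side: acknowledge once the update is on the devices. */
int send_ack(int fd)
{
    unsigned char ack = ACK_BYTE;
    return write_all(fd, &ack, 1);
}

/* mdadm side: block until the acknowledgement arrives. */
int wait_ack(int fd)
{
    unsigned char ack;

    if (read_all(fd, &ack, 1) < 0)
        return -1;
    return ack == ACK_BYTE ? 0 : -1;
}
```

| In this sketch the sender blocks in wait_ack() until the other side |
| calls send_ack(), which mirrors mdadm waiting until the update has |
| been written to the devices. |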