| The (New) Linux Kernel Driver Model |
| |
| Version 0.04 |
| |
| Patrick Mochel <mochel@osdl.org> |
| |
| 03 December 2001 |
| |
| |
| Overview |
| ~~~~~~~~ |
| |
| This driver model is a unification of all the current, disparate driver models |
| that are currently in the kernel. It is intended is to augment the |
| bus-specific drivers for bridges and devices by consolidating a set of data |
| and operations into globally accessible data structures. |
| |
| Current driver models implement some sort of tree-like structure (sometimes |
| just a list) for the devices they control. But, there is no linkage between |
| the different bus types. |
| |
| A common data structure can provide this linkage with little overhead: when a |
| bus driver discovers a particular device, it can insert it into the global |
| tree as well as its local tree. In fact, the local tree becomes just a subset |
| of the global tree. |
| |
| Common data fields can also be moved out of the local bus models into the |
| global model. Some of the manipulation of these fields can also be |
| consolidated. Most likely, manipulation functions will become a set |
| of helper functions, which the bus drivers wrap around to include any |
| bus-specific items. |
| |
| The common device and bridge interface currently reflects the goals of the |
| modern PC: namely the ability to do seamless Plug and Play, power management, |
| and hot plug. (The model dictated by Intel and Microsoft (read: ACPI) ensures |
| us that any device in the system may fit any of these criteria.) |
| |
| In reality, not every bus will be able to support such operations. But, most |
| buses will support a majority of those operations, and all future buses will. |
| In other words, a bus that doesn't support an operation is the exception, |
| instead of the other way around. |
| |
| |
| Drivers |
| ~~~~~~~ |
| |
| The callbacks for bridges and devices are intended to be singular for a |
| particular type of bus. For each type of bus that has support compiled in the |
| kernel, there should be one statically allocated structure with the |
| appropriate callbacks that each device (or bridge) of that type share. |
| |
| Each bus layer should implement the callbacks for these drivers. It then |
| forwards the calls on to the device-specific callbacks. This means that |
| device-specific drivers must still implement callbacks for each operation. |
| But, they are not called from the top level driver layer. [So for example |
| PCI devices will not call device_register but pci_device_register.] |
| |
| This does add another layer of indirection for calling one of these functions, |
| but there are benefits that are believed to outweigh this slowdown. |
| |
| First, it prevents device-specific drivers from having to know about the |
| global device layer. This speeds up integration time incredibly. It also |
| allows drivers to be more portable across kernel versions. Note that the |
| former was intentional, the latter is an added bonus. |
| |
| Second, this added indirection allows the bus to perform any additional logic |
| necessary for its child devices. A bus layer may add additional information to |
| the call, or translate it into something meaningful for its children. |
| |
| This could be done in the driver, but if it happens for every object of a |
| particular type, it is best done at a higher level. |
| |
| Recap |
| ~~~~~ |
| |
| Instances of devices and bridges are allocated dynamically as the system |
| discovers their existence. Their fields describe the individual object. |
| Drivers - in the global sense - are statically allocated and singular for a |
| particular type of bus. They describe a set of operations that every type of |
| bus could implement, the implementation following the bus's semantics. |
| |
| |
| Downstream Access |
| ~~~~~~~~~~~~~~~~~ |
| |
| Common data fields have been moved out of individual bus layers into a common |
| data structure. But, these fields must still be accessed by the bus layers, |
| and sometimes by the device-specific drivers. |
| |
| Other bus layers are encouraged to do what has been done for the PCI layer. |
| struct pci_dev now looks like this: |
| |
| struct pci_dev { |
| ... |
| |
| struct device device; |
| }; |
| |
| Note first that it is statically allocated. This means only one allocation on |
| device discovery. Note also that it is at the _end_ of struct pci_dev. This is |
| to make people think about what they're doing when switching between the bus |
| driver and the global driver; and to prevent against mindless casts between |
| the two. |
| |
| The PCI bus layer freely accesses the fields of struct device. It knows about |
| the structure of struct pci_dev, and it should know the structure of struct |
| device. PCI devices that have been converted generally do not touch the fields |
| of struct device. More precisely, device-specific drivers should not touch |
| fields of struct device unless there is a strong compelling reason to do so. |
| |
| This abstraction is prevention of unnecessary pain during transitional phases. |
| If the name of the field changes or is removed, then every downstream driver |
| will break. On the other hand, if only the bus layer (and not the device |
| layer) accesses struct device, it is only those that need to change. |
| |
| |
| User Interface |
| ~~~~~~~~~~~~~~ |
| |
| By virtue of having a complete hierarchical view of all the devices in the |
| system, exporting a complete hierarchical view to userspace becomes relatively |
| easy. This has been accomplished by implementing a special purpose virtual |
| file system named driverfs. It is hence possible for the user to mount the |
| whole driverfs on a particular mount point in the unified UNIX file hierarchy. |
| |
| This can be done permanently by providing the following entry into the |
| /dev/fstab (under the provision that the mount point does exist, of course): |
| |
| none /devices driverfs defaults 0 0 |
| |
| Or by hand on the command line: |
| |
| ~: mount -t driverfs none /devices |
| |
| Whenever a device is inserted into the tree, a directory is created for it. |
| This directory may be populated at each layer of discovery - the global layer, |
| the bus layer, or the device layer. |
| |
| The global layer currently creates two files - 'status' and 'power'. The |
| former only reports the name of the device and its bus ID. The latter reports |
| the current power state of the device. It also be used to set the current |
| power state. |
| |
| The bus layer may also create files for the devices it finds while probing the |
| bus. For example, the PCI layer currently creates 'wake' and 'resource' files |
| for each PCI device. |
| |
| A device-specific driver may also export files in its directory to expose |
| device-specific data or tunable interfaces. |
| |
| These features were initially implemented using procfs. However, after one |
| conversation with Linus, a new filesystem - driverfs - was created to |
| implement these features. It is an in-memory filesystem, based heavily off of |
| ramfs, though it uses procfs as inspiration for its callback functionality. |
| |
| Each struct device has a 'struct driver_dir_entry' which encapsulates the |
| device's directory and the files within. |
| |
| Device Structures |
| ~~~~~~~~~~~~~~~~~ |
| |
| struct device { |
| struct list_head bus_list; |
| struct iobus *parent; |
| struct iobus *subordinate; |
| |
| char name[DEVICE_NAME_SIZE]; |
| char bus_id[BUS_ID_SIZE]; |
| |
| struct driver_dir_entry * dir; |
| |
| spinlock_t lock; |
| atomic_t refcount; |
| |
| struct device_driver *driver; |
| void *driver_data; |
| void *platform_data; |
| |
| u32 current_state; |
| unsigned char *saved_state; |
| }; |
| |
| bus_list: |
| List of all devices on a particular bus; i.e. the device's siblings |
| |
| parent: |
| The parent bridge for the device. |
| |
| subordinate: |
| If the device is a bridge itself, this points to the struct io_bus that is |
| created for it. |
| |
| name: |
| Human readable (descriptive) name of device. E.g. "Intel EEPro 100" |
| |
| bus_id: |
| Parsable (yet ASCII) bus id. E.g. "00:04.00" (PCI Bus 0, Device 4, Function |
| 0). It is necessary to have a searchable bus id for each device; making it |
| ASCII allows us to use it for its directory name without translating it. |
| |
| dir: |
| Driver's driverfs directory. |
| |
| lock: |
| Driver specific lock. |
| |
| refcount: |
| Driver's usage count. |
| When this goes to 0, the device is assumed to be removed. It will be removed |
| from its parent's list of children. It's remove() callback will be called to |
| inform the driver to clean up after itself. |
| |
| driver: |
| Pointer to a struct device_driver, the common operations for each device. See |
| next section. |
| |
| driver_data: |
| Private data for the driver. |
| Much like the PCI implementation of this field, this allows device-specific |
| drivers to keep a pointer to a device-specific data. |
| |
| platform_data: |
| Data that the platform (firmware) provides about the device. |
| For example, the ACPI BIOS or EFI may have additional information about the |
| device that is not directly mappable to any existing kernel data structure. |
| It also allows the platform driver (e.g. ACPI) to a driver without the driver |
| having to have explicit knowledge of (atrocities like) ACPI. |
| |
| current_state: |
| Current power state of the device. For PCI and other modern devices, this is |
| 0-3, though it's not necessarily limited to those values. |
| |
| saved_state: |
| Pointer to driver-specific set of saved state. |
| Having it here allows modules to be unloaded on system suspend and reloaded |
| on resume and maintain state across transitions. |
| It also allows generic drivers to maintain state across system state |
| transitions. |
| (I've implemented a generic PCI driver for devices that don't have a |
| device-specific driver. Instead of managing some vector of saved state |
| for each device the generic driver supports, it can simply store it here.) |
| |
| |
| |
| struct device_driver { |
| int (*probe) (struct device *dev); |
| int (*remove) (struct device *dev); |
| |
| int (*suspend) (struct device *dev, u32 state, u32 level); |
| int (*resume) (struct device *dev, u32 level); |
| } |
| |
| probe: |
| Check for device existence and associate driver with it. In case of device |
| insertion, *all* drivers are called. Struct device has parent and bus_id |
| valid at this point. probe() may only be called from process context. Returns |
| 0 if it handles that device, -ESRCH if this driver does not know how to handle |
| this device, valid error otherwise. |
| |
| remove: |
| Dissociate driver with device. Releases device so that it could be used by |
| another driver. Also, if it is a hotplug device (hotplug PCI, Cardbus), an |
| ejection event could take place here. remove() can be called from interrupt |
| context. [Fixme: Is that good?] Returns 0 on success. [Can we recover from |
| failed remove or should I define that remove() never fails?] |
| |
| suspend: |
| Perform one step of the device suspend process. Returns 0 on success. |
| |
| resume: |
| Perform one step of the device resume process. Returns 0 on success. |
| |
| The probe() and remove() callbacks are intended to be much simpler than the |
| current PCI correspondents. |
| |
| probe() should do the following only: |
| |
| - Check if hardware is present |
| - Register device interface |
| - Disable DMA/interrupts, etc, just in case. |
| |
| Some device initialisation was done in probe(). This should not be the case |
| anymore. All initialisation should take place in the open() call for the |
| device. [FIXME: How do you "open" uhci?] |
| |
| Breaking initialisation code out must also be done for the resume() callback, |
| as most devices will have to be completely reinitialised when coming back from |
| a suspend state. |
| |
| remove() should simply unregister the device interface. |
| |
| |
| Device power management can be quite complicated, based exactly what is |
| desired to be done. Four operations sum up most of it: |
| |
| - OS directed power management. |
| The OS takes care of notifying all drivers that a suspend is requested, |
| saving device state, and powering devices down. |
| - Firmware controlled power management. |
| The OS only wants to notify devices that a suspend is requested. |
| - Device power management. |
| A user wants to place only one device in a low power state, and maybe save |
| state. |
| - System reboot. |
| The system wants to place devices in a quiescent state before the system is |
| reset. |
| |
| In an attempt to please all of these scenarios, the power management |
| transition for any device is broken up into several stages - notify, save |
| state, and power down. The disable stage, which should happen after notify and |
| before save state has been considered and may be implemented in the future. |
| |
| Depending on what the system-wide policy is (usually dictated by the power |
| management scheme present), each driver's suspend callback may be called |
| multiple times, each with a different stage. |
| |
| On all power management transitions, the stages should be called sequentially |
| (notify before save state; save state before power down). However, drivers |
| should not assume that any stage was called before hand. (If a driver gets a |
| power down call, it shouldn't assume notify or save state was called first.) |
| This allows the framework to be used seamlessly by all power management |
| actions. Hopefully. |
| |
| Resume transitions happen in a similar manner. They are broken up into two |
| stages currently (power on and restore state), though a third stage (enable) |
| may be added later. |
| |
| For suspend and resume transitions, the following values are defined to denote |
| the stage: |
| |
| enum{ |
| SUSPEND_NOTIFY, |
| SUSPEND_DISABLE, |
| SUSPEND_SAVE_STATE, |
| SUSPEND_POWER_DOWN, |
| }; |
| |
| enum { |
| RESUME_POWER_ON, |
| RESUME_RESTORE_STATE, |
| RESUME_ENABLE, |
| }; |
| |
| |
| During a system power transition, the device tree must be walked in order, |
| calling the suspend() or resume() callback for each node. This may happen |
| several times. |
| |
| Initially, this was done in kernel space. However, it has occurred to me that |
| doing recursion to a non-bounded depth is dangerous, and that there are a lot |
| of inherent race conditions in such an operation. |
| |
| Non-recursive walking of the device tree is possible. However, this makes for |
| convoluted code. |
| |
| No matter what, if the transition happens in kernel space, it is difficult to |
| gracefully recover from errors or to implement a policy that prevents one from |
| shutting down the device(s) you want to save state to. |
| |
| Instead, the walking of the device tree has been moved to userspace. When a |
| user requests the system to suspend, it will walk the device tree, as exported |
| via driverfs, and tell each device to go to sleep. It will do this multiple |
| times based on what the system policy is. [Not possible. Take ACPI enabled |
| system, with battery critically low. In such state, you want to suspend-to-disk, |
| *fast*. User maybe is not even running powerd (think system startup)!] |
| |
| Device resume should happen in the same manner when the system awakens. |
| |
| Each suspend stage is described below: |
| |
| SUSPEND_NOTIFY: |
| |
| This level to notify the driver that it is going to sleep. If it knows that it |
| cannot resume the hardware from the requested level, or it feels that it is |
| too important to be put to sleep, it should return an error from this function. |
| |
| It does not have to stop I/O requests or actually save state at this point. Called |
| from process context. |
| |
| SUSPEND_DISABLE: |
| |
| The driver should stop taking I/O requests at this stage. Because the save |
| state stage happens afterwards, the driver may not want to physically disable |
| the device; only mark itself unavailable if possible. Called from process |
| context. |
| |
| SUSPEND_SAVE_STATE: |
| |
| The driver should allocate memory and save any device state that is relevant |
| for the state it is going to enter. Called from process context. |
| |
| SUSPEND_POWER_DOWN: |
| |
| The driver should place the device in the power state requested. May be called |
| from interrupt context. |
| |
| |
| For resume, the stages are defined as follows: |
| |
| RESUME_POWER_ON: |
| |
| Devices should be powered on and reinitialised to some known working state. |
| Called from process context. |
| |
| RESUME_RESTORE_STATE: |
| |
| The driver should restore device state to its pre-suspend state and free any |
| memory allocated for its saved state. Called from process context. |
| |
| RESUME_ENABLE: |
| |
| The device should start taking I/O requests again. Called from process context. |
| |
| |
| Each driver does not have to implement each stage. But, it if it does |
| implement a stage, it should do what is described above. It should not assume |
| that it performed any stage previously, or that it will perform any stage |
| later. [Really? It makes sense to support SAVE_STATE only after DISABLE]. |
| |
| It is quite possible that a driver can fail during the suspend process, for |
| whatever reason. In this event, the calling process must gracefully recover |
| and restore everything to their states before the suspend transition began. |
| [Suspend may not fail, think battery low.] |
| |
| If a driver knows that it cannot suspend or resume properly, it should fail |
| during the notify stage. Properly implemented power management schemes should |
| make sure that this is the first stage that is called. |
| |
| If a driver gets a power down request, it should obey it, as it may very |
| likely be during a reboot. |
| |
| |
| Bus Structures |
| ~~~~~~~~~~~~~~ |
| |
| struct iobus { |
| struct list_head node; |
| struct iobus *parent; |
| struct list_head children; |
| struct list_head devices; |
| |
| struct list_head bus_list; |
| |
| spinlock_t lock; |
| atomic_t refcount; |
| |
| struct device *self; |
| struct driver_dir_entry * dir; |
| |
| char name[DEVICE_NAME_SIZE]; |
| char bus_id[BUS_ID_SIZE]; |
| |
| struct bus_driver *driver; |
| }; |
| |
| node: |
| Bus's node in sibling list (its parent's list of child buses). |
| |
| parent: |
| Pointer to parent bridge. |
| |
| children: |
| List of subordinate buses. |
| In the children, this correlates to their 'node' field. |
| |
| devices: |
| List of devices on the bus this bridge controls. |
| This field corresponds to the 'bus_list' field in each child device. |
| |
| bus_list: |
| Each type of bus keeps a list of all bridges that it finds. This is the |
| bridges entry in that list. |
| |
| self: |
| Pointer to the struct device for this bridge. |
| |
| lock: |
| Lock for the bus. |
| |
| refcount: |
| Usage count for the bus. |
| |
| dir: |
| Driverfs directory. |
| |
| name: |
| Human readable ASCII name of bus. |
| |
| bus_id: |
| Machine readable (though ASCII) description of position on parent bus. |
| |
| driver: |
| Pointer to operations for bus. |
| |
| |
| struct iobus_driver { |
| char name[16]; |
| struct list_head node; |
| |
| int (*scan) (struct io_bus*); |
| int (*add_device) (struct io_bus*, char*); |
| }; |
| |
| name: |
| ASCII name of bus. |
| |
| node: |
| List of buses of this type in system. |
| |
| scan: |
| Search the bus for new devices. This may happen either at boot - where every |
| device discovered will be new - or later on - in which there may only be a few |
| (or no) new devices. |
| |
| add_device: |
| Trigger a device insertion at a particular location. |
| |
| |
| |
| The API |
| ~~~~~~~ |
| |
| There are several functions exported by the global device layer, including |
| several optional helper functions, written solely to try and make your life |
| easier. |
| |
| void device_init_dev(struct device * dev); |
| |
| Initialise a device structure. It first zeros the device, the initialises all |
| of the lists. (Note that this would have been called device_init(), but that |
| name was already taken. :/) |
| |
| |
| struct device * device_alloc(void) |
| |
| Allocate memory for a device structure and initialise it. |
| First, allocates memory, then calls device_init_dev() with the new pointer. |
| |
| |
| int device_register(struct device * dev); |
| |
| Register a device with the global device layer. |
| The bus layer should call this function upon device discovery, e.g. when |
| probing the bus. |
| dev should be fully initialised when this is called. |
| If dev->parent is not set, it sets its parent to be the device root. |
| It then does the following: |
| - inserts it into its parent's list of children |
| - creates a driverfs directory for it |
| - creates a set of default files for the device in its directory |
| - calls platform_notify() to notify the firmware driver of its existence. |
| |
| |
| void get_device(struct device * dev); |
| |
| Increment the refcount for a device. |
| |
| |
| int valid_device(struct device * dev); |
| |
| Check if reference count is positive for a device (it's not waiting to be |
| freed). If it is positive, it increments the reference count for the device. |
| It returns whether or not the device is usable. |
| |
| |
| void put_device(struct device * dev); |
| |
| Decrement the reference count for the device. If it hits 0, it removes the |
| device from its parent's list of children and calls the remove() callback for |
| the device. |
| |
| |
| void lock_device(struct device * dev); |
| |
| Take the spinlock for the device. |
| |
| |
| void unlock_device(struct device * dev); |
| |
| Release the spinlock for the device. |
| |
| |
| |
| void iobus_init(struct iobus * iobus); |
| struct iobus * iobus_alloc(void); |
| int iobus_register(struct iobus * iobus); |
| void get_iobus(struct iobus * iobus); |
| int valid_iobus(struct iobus * iobus); |
| void put_iobus(struct iobus * iobus); |
| void lock_iobus(struct iobus * iobus); |
| void unlock_iobus(struct iobus * iobus); |
| |
| These functions provide the same functionality as the device_* |
| counterparts, only operating on a struct iobus. One important thing to note, |
| though is that iobus_register() and iobus_unregister() operate recursively. It |
| is possible to add an entire tree in one call. |
| |
| |
| |
| int device_driver_init(void); |
| |
| Main initialisation routine. |
| |
| This makes sure driverfs is up and running and initialises the device tree. |
| |
| |
| void device_driver_exit(void); |
| |
| This frees up the device tree. |
| |
| |
| |
| |
| Credits |
| ~~~~~~~ |
| |
| The following people have been extremely helpful in solidifying this document |
| and the driver model. |
| |
| Randy Dunlap rddunlap@osdl.org |
| Jeff Garzik jgarzik@mandrakesoft.com |
| Ben Herrenschmidt benh@kernel.crashing.org |
| |
| |