=========================================================
Cluster-wide Power-up/power-down race avoidance algorithm
=========================================================

This file documents the algorithm which is used to coordinate CPU and
cluster setup and teardown operations and to manage hardware coherency
controls safely.

The section "Rationale" explains what the algorithm is for and why it is
needed.  "Basic model" explains general concepts using a simplified view
of the system.  The other sections explain the actual details of the
algorithm in use.


Rationale
---------

In a system containing multiple CPUs, it is desirable to have the
ability to turn off individual CPUs when the system is idle, reducing
power consumption and thermal dissipation.

In a system containing multiple clusters of CPUs, it is also desirable
to have the ability to turn off entire clusters.

Turning entire clusters off and on is a risky business, because it
involves performing potentially destructive operations affecting a group
of independently running CPUs, while the OS continues to run.  This
means that we need some coordination in order to ensure that critical
cluster-level operations are only performed when it is truly safe to do
so.

Simple locking may not be sufficient to solve this problem, because
mechanisms like Linux spinlocks may rely on coherency mechanisms which
are not immediately enabled when a cluster powers up.  Since enabling or
disabling those mechanisms may itself be a non-atomic operation (such as
writing some hardware registers and invalidating large caches), other
methods of coordination are required in order to guarantee safe
power-down and power-up at the cluster level.

This document describes a coherent-memory-based protocol for performing
the needed coordination.  It aims to be as lightweight as possible,
while providing the required safety properties.


Basic model
-----------

Each cluster and CPU is assigned a state, as follows:

	- DOWN
	- COMING_UP
	- UP
	- GOING_DOWN

::

	    +---------> UP ----------+
	    |                        v

	COMING_UP                GOING_DOWN

	    ^                        |
	    +--------- DOWN <--------+


DOWN:
	The CPU or cluster is not coherent, and is either powered off or
	suspended, or is ready to be powered off or suspended.

COMING_UP:
	The CPU or cluster has committed to moving to the UP state.
	It may be part way through the process of initialisation and
	enabling coherency.

UP:
	The CPU or cluster is active and coherent at the hardware
	level.  A CPU in this state is not necessarily being used
	actively by the kernel.

GOING_DOWN:
	The CPU or cluster has committed to moving to the DOWN
	state.  It may be part way through the process of teardown and
	coherency exit.

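The cycle above can be sketched in C.  The following fragment is
illustrative only (the enum and helper are hypothetical, not part of the
kernel sources); it encodes the rule that each state may only advance to
its successor in the cycle:

```c
#include <stdbool.h>

enum basic_state { DOWN, COMING_UP, UP, GOING_DOWN };

/* Each state may only advance to its successor in the cycle
 * DOWN -> COMING_UP -> UP -> GOING_DOWN -> DOWN. */
static bool transition_valid(enum basic_state from, enum basic_state to)
{
	switch (from) {
	case DOWN:		return to == COMING_UP;
	case COMING_UP:		return to == UP;
	case UP:		return to == GOING_DOWN;
	case GOING_DOWN:	return to == DOWN;
	}
	return false;
}
```

Any transition that skips a step (for example DOWN directly to UP) is
rejected; the protocol only ever moves around the cycle.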

Each CPU has one of these states assigned to it at any point in time.
The CPU states are described in the "CPU state" section, below.

Each cluster is also assigned a state, but it is necessary to split the
state value into two parts (the "cluster" state and "inbound" state) and
to introduce additional states in order to avoid races between different
CPUs in the cluster simultaneously modifying the state.  The
cluster-level states are described in the "Cluster state" section.

To help distinguish the CPU states from cluster states in this
discussion, the state names are given a `CPU_` prefix for the CPU states,
and a `CLUSTER_` or `INBOUND_` prefix for the cluster states.


CPU state
---------

In this algorithm, each individual core in a multi-core processor is
referred to as a "CPU".  CPUs are assumed to be single-threaded:
therefore, a CPU can only be doing one thing at a single point in time.

This means that CPUs fit the basic model closely.

The algorithm defines the following states for each CPU in the system:

	- CPU_DOWN
	- CPU_COMING_UP
	- CPU_UP
	- CPU_GOING_DOWN

::

	 cluster setup and
	CPU setup complete          policy decision
	      +-----------> CPU_UP ------------+
	      |                                v

	CPU_COMING_UP                   CPU_GOING_DOWN

	      ^                                |
	      +----------- CPU_DOWN <----------+
	 policy decision           CPU teardown complete
	or hardware event


The definitions of the four states correspond closely to the states of
the basic model.

Transitions between states occur as follows.

A trigger event (spontaneous) means that the CPU can transition to the
next state as a result of making local progress only, with no
requirement for any external event to happen.


CPU_DOWN:
	A CPU reaches the CPU_DOWN state when it is ready for
	power-down.  On reaching this state, the CPU will typically
	power itself down or suspend itself, via a WFI instruction or a
	firmware call.

	Next state:
		CPU_COMING_UP
	Conditions:
		none

	Trigger events:
		a) an explicit hardware power-up operation, resulting
		   from a policy decision on another CPU;

		b) a hardware event, such as an interrupt.

CPU_COMING_UP:
	A CPU cannot start participating in hardware coherency until the
	cluster is set up and coherent.  If the cluster is not ready,
	then the CPU will wait in the CPU_COMING_UP state until the
	cluster has been set up.

	Next state:
		CPU_UP
	Conditions:
		The CPU's parent cluster must be in CLUSTER_UP.
	Trigger events:
		Transition of the parent cluster to CLUSTER_UP.

	Refer to the "Cluster state" section for a description of the
	CLUSTER_UP state.

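The condition above can be sketched as follows.  This is an illustrative
C fragment using C11 atomics (the names are hypothetical, and a real
inbound CPU would typically wait using WFE rather than polling): the CPU
may only leave CPU_COMING_UP once the parent cluster's advertised state,
read from coherent memory, is CLUSTER_UP.

```c
#include <stdatomic.h>

enum cluster_state { CLUSTER_DOWN, CLUSTER_UP, CLUSTER_GOING_DOWN };
enum cpu_state { CPU_DOWN, CPU_COMING_UP, CPU_UP, CPU_GOING_DOWN };

/* One polling step: a CPU in CPU_COMING_UP advances to CPU_UP only
 * once the parent cluster has advertised CLUSTER_UP; otherwise it
 * keeps waiting (WFE/retry in real code). */
static enum cpu_state try_enter_up(enum cpu_state self,
				   _Atomic enum cluster_state *parent)
{
	if (self == CPU_COMING_UP &&
	    atomic_load_explicit(parent, memory_order_acquire) == CLUSTER_UP)
		return CPU_UP;	/* safe to resume the kernel */
	return self;
}
```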

CPU_UP:
	When a CPU reaches the CPU_UP state, it is safe for the CPU to
	start participating in local coherency.

	This is done by jumping to the kernel's CPU resume code.

	Note that the definition of this state is slightly different
	from the basic model definition: CPU_UP does not mean that the
	CPU is coherent yet, but it does mean that it is safe to resume
	the kernel.  The kernel handles the rest of the resume
	procedure, so the remaining steps are not visible as part of the
	race avoidance algorithm.

	The CPU remains in this state until an explicit policy decision
	is made to shut down or suspend the CPU.

	Next state:
		CPU_GOING_DOWN
	Conditions:
		none
	Trigger events:
		explicit policy decision

 |  | 
 | CPU_GOING_DOWN: | 
 | 	While in this state, the CPU exits coherency, including any | 
 | 	operations required to achieve this (such as cleaning data | 
 | 	caches). | 
 |  | 
 | 	Next state: | 
 | 		CPU_DOWN | 
 | 	Conditions: | 
 | 		local CPU teardown complete | 
 | 	Trigger events: | 
 | 		(spontaneous) | 
 |  | 
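These CPU-side transitions amount to publishing a state value in
coherent memory with suitable ordering.  As a rough sketch (the types
are hypothetical; C11 release stores stand in for the barriers and cache
maintenance that real helpers such as __mcpm_cpu_going_down() must
perform):

```c
#include <stdatomic.h>

enum cpu_state { CPU_DOWN, CPU_COMING_UP, CPU_UP, CPU_GOING_DOWN };

struct cpu_sync {
	_Atomic enum cpu_state state;	/* lives in coherent memory */
};

/* Commit to power-down: observers may no longer assume this CPU
 * will remain coherent. */
static void cpu_going_down(struct cpu_sync *c)
{
	atomic_store_explicit(&c->state, CPU_GOING_DOWN,
			      memory_order_release);
}

/* Teardown (e.g. cache cleaning, coherency exit) is complete; the
 * CPU is now ready to be powered off. */
static void cpu_down(struct cpu_sync *c)
{
	atomic_store_explicit(&c->state, CPU_DOWN, memory_order_release);
}
```

The release ordering ensures that any CPU which observes CPU_DOWN also
observes the effects of the teardown work done before the store.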

Cluster state
-------------

A cluster is a group of connected CPUs with some common resources.
Because a cluster contains multiple CPUs, it can be doing multiple
things at the same time.  This has some implications.  In particular, a
CPU can start up while another CPU is tearing the cluster down.

In this discussion, the "outbound side" is the view of the cluster state
as seen by a CPU tearing the cluster down.  The "inbound side" is the
view of the cluster state as seen by a CPU setting the cluster up.

In order to enable safe coordination in such situations, it is important
that a CPU which is setting up the cluster can advertise its state
independently of the CPU which is tearing down the cluster.  For this
reason, the cluster state is split into two parts:

	"cluster" state: The global state of the cluster; or the state
	on the outbound side:

		- CLUSTER_DOWN
		- CLUSTER_UP
		- CLUSTER_GOING_DOWN

	"inbound" state: The state of the cluster on the inbound side.

		- INBOUND_NOT_COMING_UP
		- INBOUND_COMING_UP


	The different pairings of these states result in six possible
	states for the cluster as a whole::


	                            CLUSTER_UP
	          +==========> INBOUND_NOT_COMING_UP -------------+
	          #                                               |
	                                                          |
	     CLUSTER_UP     <----+                                |
	  INBOUND_COMING_UP      |                                v

	          ^             CLUSTER_GOING_DOWN       CLUSTER_GOING_DOWN
	          #              INBOUND_COMING_UP <=== INBOUND_NOT_COMING_UP

	    CLUSTER_DOWN         |                                |
	  INBOUND_COMING_UP <----+                                |
	                                                          |
	          ^                                               |
	          +===========     CLUSTER_DOWN      <------------+
	                       INBOUND_NOT_COMING_UP

	Transitions -----> can only be made by the outbound CPU, and
	only involve changes to the "cluster" state.

	Transitions ===##> can only be made by the inbound CPU, and only
	involve changes to the "inbound" state, except where there is no
	further transition possible on the outbound side (i.e., the
	outbound CPU has put the cluster into the CLUSTER_DOWN state).

	The race avoidance algorithm does not provide a way to determine
	which exact CPUs within the cluster play these roles.  This must
	be decided in advance by some other means.  Refer to the section
	"Last man and first man selection" for more explanation.


	CLUSTER_DOWN/INBOUND_NOT_COMING_UP is the only state where the
	cluster can actually be powered down.

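The split state can be modelled as a pair of fields, each owned by one
side.  This is a hypothetical C sketch, not the kernel's actual mcpm
data structures:

```c
#include <stdbool.h>

enum cluster_state { CLUSTER_DOWN, CLUSTER_UP, CLUSTER_GOING_DOWN };
enum inbound_state { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };

/* The cluster's full state is the pair of the two sub-states. */
struct cluster_sync {
	enum cluster_state cluster;	/* owned by the outbound side */
	enum inbound_state inbound;	/* owned by the inbound side */
};

/* Power-down is only safe when nothing is up and nothing is on its
 * way in: CLUSTER_DOWN/INBOUND_NOT_COMING_UP. */
static bool can_power_down(const struct cluster_sync *s)
{
	return s->cluster == CLUSTER_DOWN &&
	       s->inbound == INBOUND_NOT_COMING_UP;
}
```

Because each side only writes its own field, neither CPU can clobber the
other's advertised state.
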
	The parallelism of the inbound and outbound CPUs is observed by
	the existence of two different paths from CLUSTER_GOING_DOWN/
	INBOUND_NOT_COMING_UP (corresponding to GOING_DOWN in the basic
	model) to CLUSTER_DOWN/INBOUND_COMING_UP (corresponding to
	COMING_UP in the basic model).  The second path avoids cluster
	teardown completely.

	CLUSTER_UP/INBOUND_COMING_UP is equivalent to UP in the basic
	model.  The final transition to CLUSTER_UP/INBOUND_NOT_COMING_UP
	is trivial and merely resets the state machine ready for the
	next cycle.

	Details of the allowable transitions follow.

	The next state in each case is notated

		<cluster state>/<inbound state> (<transitioner>)

	where the <transitioner> is the side on which the transition
	can occur; either the inbound or the outbound side.


CLUSTER_DOWN/INBOUND_NOT_COMING_UP:
	Next state:
		CLUSTER_DOWN/INBOUND_COMING_UP (inbound)
	Conditions:
		none

	Trigger events:
		a) an explicit hardware power-up operation, resulting
		   from a policy decision on another CPU;

		b) a hardware event, such as an interrupt.

 |  | 
 | CLUSTER_DOWN/INBOUND_COMING_UP: | 
 |  | 
 | 	In this state, an inbound CPU sets up the cluster, including | 
 | 	enabling of hardware coherency at the cluster level and any | 
 | 	other operations (such as cache invalidation) which are required | 
 | 	in order to achieve this. | 
 |  | 
 | 	The purpose of this state is to do sufficient cluster-level | 
 | 	setup to enable other CPUs in the cluster to enter coherency | 
 | 	safely. | 
 |  | 
 | 	Next state: | 
 | 		CLUSTER_UP/INBOUND_COMING_UP (inbound) | 
 | 	Conditions: | 
 | 		cluster-level setup and hardware coherency complete | 
 | 	Trigger events: | 
 | 		(spontaneous) | 
 |  | 
 |  | 
 | CLUSTER_UP/INBOUND_COMING_UP: | 
 |  | 
 | 	Cluster-level setup is complete and hardware coherency is | 
 | 	enabled for the cluster.  Other CPUs in the cluster can safely | 
 | 	enter coherency. | 
 |  | 
 | 	This is a transient state, leading immediately to | 
 | 	CLUSTER_UP/INBOUND_NOT_COMING_UP.  All other CPUs on the cluster | 
 | 	should consider treat these two states as equivalent. | 
 |  | 
 | 	Next state: | 
 | 		CLUSTER_UP/INBOUND_NOT_COMING_UP (inbound) | 
 | 	Conditions: | 
 | 		none | 
 | 	Trigger events: | 
 | 		(spontaneous) | 
 |  | 
 |  | 
 | CLUSTER_UP/INBOUND_NOT_COMING_UP: | 
 |  | 
 | 	Cluster-level setup is complete and hardware coherency is | 
 | 	enabled for the cluster.  Other CPUs in the cluster can safely | 
 | 	enter coherency. | 
 |  | 
 | 	The cluster will remain in this state until a policy decision is | 
 | 	made to power the cluster down. | 
 |  | 
 | 	Next state: | 
 | 		CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP (outbound) | 
 | 	Conditions: | 
 | 		none | 
 | 	Trigger events: | 
 | 		policy decision to power down the cluster | 
 |  | 

CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP:

	An outbound CPU is tearing the cluster down.  The selected CPU
	must wait in this state until all CPUs in the cluster are in the
	CPU_DOWN state.

	When all CPUs are in the CPU_DOWN state, the cluster can be torn
	down, for example by cleaning data caches and exiting
	cluster-level coherency.

	To avoid unnecessary teardown operations, the outbound CPU
	should check the inbound cluster state for asynchronous
	transitions to INBOUND_COMING_UP.  Alternatively, individual
	CPUs can be checked for entry into CPU_COMING_UP or CPU_UP.


	Next states:

	CLUSTER_DOWN/INBOUND_NOT_COMING_UP (outbound)
		Conditions:
			cluster torn down and ready to power off
		Trigger events:
			(spontaneous)

	CLUSTER_GOING_DOWN/INBOUND_COMING_UP (inbound)
		Conditions:
			none

		Trigger events:
			a) an explicit hardware power-up operation,
			   resulting from a policy decision on another
			   CPU;

			b) a hardware event, such as an interrupt.

 |  | 
 | CLUSTER_GOING_DOWN/INBOUND_COMING_UP: | 
 |  | 
 | 	The cluster is (or was) being torn down, but another CPU has | 
 | 	come online in the meantime and is trying to set up the cluster | 
 | 	again. | 
 |  | 
 | 	If the outbound CPU observes this state, it has two choices: | 
 |  | 
 | 		a) back out of teardown, restoring the cluster to the | 
 | 		   CLUSTER_UP state; | 
 |  | 
 | 		b) finish tearing the cluster down and put the cluster | 
 | 		   in the CLUSTER_DOWN state; the inbound CPU will | 
 | 		   set up the cluster again from there. | 
 |  | 
 | 	Choice (a) permits the removal of some latency by avoiding | 
 | 	unnecessary teardown and setup operations in situations where | 
 | 	the cluster is not really going to be powered down. | 
 |  | 

	Next states:

	CLUSTER_UP/INBOUND_COMING_UP (outbound)
		Conditions:
			cluster-level setup and hardware coherency
			complete
		Trigger events:
			(spontaneous)

	CLUSTER_DOWN/INBOUND_COMING_UP (outbound)
		Conditions:
			cluster torn down and ready to power off

		Trigger events:
			(spontaneous)

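The outbound CPU's decision between choices (a) and (b) can be sketched
as follows.  This hypothetical helper always prefers choice (a) whenever
an inbound CPU has been observed; the names are illustrative, and the
real inter-CPU coordination is handled by __mcpm_outbound_enter_critical()
and its callers:

```c
#include <stdbool.h>

enum cluster_state { CLUSTER_DOWN, CLUSTER_UP, CLUSTER_GOING_DOWN };
enum inbound_state { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };

struct cluster_sync {
	enum cluster_state cluster;
	enum inbound_state inbound;
};

/* Outbound CPU in CLUSTER_GOING_DOWN: if an inbound CPU has appeared,
 * choice (a) aborts the teardown and restores CLUSTER_UP; otherwise
 * the teardown completes and the state becomes CLUSTER_DOWN.
 * Returns true if the cluster ended up ready to power off. */
static bool outbound_finish(struct cluster_sync *s)
{
	if (s->inbound == INBOUND_COMING_UP) {
		s->cluster = CLUSTER_UP;	/* choice (a): abort */
		return false;
	}
	/* ... clean data caches, exit cluster-level coherency ... */
	s->cluster = CLUSTER_DOWN;		/* choice (b): complete */
	return true;
}
```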

Last man and First man selection
--------------------------------

The CPU which performs cluster tear-down operations on the outbound side
is commonly referred to as the "last man".

The CPU which performs cluster setup on the inbound side is commonly
referred to as the "first man".

The race avoidance algorithm documented above does not provide a
mechanism to choose which CPUs should play these roles.


Last man:

When shutting down the cluster, all the CPUs involved are initially
executing Linux and hence coherent.  Therefore, ordinary spinlocks can
be used to select a last man safely, before the CPUs become
non-coherent.

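For illustration, a last-man election might look like this in C.  The
atomic counter below stands in for a spinlock-protected count of running
CPUs, and all names are hypothetical:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* While all CPUs are still coherent, ordinary locking suffices; here
 * an atomic counter models a lock-protected count of running CPUs. */
struct cluster_refs {
	_Atomic int cpus_running;
};

/* Called by each CPU as it leaves.  The CPU that drops the count to
 * zero is the last man and must perform the cluster teardown. */
static bool leave_cluster(struct cluster_refs *c)
{
	return atomic_fetch_sub(&c->cpus_running, 1) == 1;
}
```

Exactly one caller sees a true return, so exactly one CPU takes on the
teardown role.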

First man:

Because CPUs may power up asynchronously in response to external wake-up
events, a dynamic mechanism is needed to make sure that only one CPU
attempts to play the first man role and do the cluster-level
initialisation: any other CPUs must wait for this to complete before
proceeding.

Cluster-level initialisation may involve actions such as configuring
coherency controls in the bus fabric.

The current implementation in mcpm_head.S uses a separate mutual exclusion
mechanism to do this arbitration.  This mechanism is documented in
detail in vlocks.txt.

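For contrast, a naive first-man election using compare-and-swap might
look like the following.  This is illustration only: the inbound side
cannot rely on atomic read-modify-write operations like this before
coherency is enabled, which is precisely why the real implementation
uses the vlock mechanism instead.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative compare-and-swap election of the first man.  NOT how
 * mcpm_head.S does it: pre-coherency code cannot assume atomics
 * work, hence the vlock algorithm described in vlocks.txt. */
static _Atomic int first_man = -1;	/* -1: role not yet claimed */

static bool try_become_first_man(int cpu)
{
	int expected = -1;

	return atomic_compare_exchange_strong(&first_man, &expected, cpu);
}
```

Only the first caller's exchange succeeds; every other CPU sees a claimed
slot and must wait for the first man to finish cluster setup.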

Features and Limitations
------------------------

Implementation:

	The current ARM-based implementation is split between
	arch/arm/common/mcpm_head.S (low-level inbound CPU operations) and
	arch/arm/common/mcpm_entry.c (everything else):

	__mcpm_cpu_going_down() signals the transition of a CPU to the
	CPU_GOING_DOWN state.

	__mcpm_cpu_down() signals the transition of a CPU to the CPU_DOWN
	state.

	A CPU transitions to CPU_COMING_UP and then to CPU_UP via the
	low-level power-up code in mcpm_head.S.  This could involve
	CPU-specific setup code, but in the current implementation it
	does not.

	__mcpm_outbound_enter_critical() and __mcpm_outbound_leave_critical()
	handle transitions from CLUSTER_UP to CLUSTER_GOING_DOWN and from
	there to CLUSTER_DOWN or back to CLUSTER_UP (in the case of an
	aborted cluster power-down).

	These functions are more complex than the __mcpm_cpu_*()
	functions due to the extra inter-CPU coordination which is needed
	for safe transitions at the cluster level.

	A cluster transitions from CLUSTER_DOWN back to CLUSTER_UP via
	the low-level power-up code in mcpm_head.S.  This typically
	involves platform-specific setup code, provided by the
	platform-specific power_up_setup function registered via
	mcpm_sync_init.

Deep topologies:

	As currently described and implemented, the algorithm does not
	support CPU topologies involving more than two levels (i.e.,
	clusters of clusters are not supported).  The algorithm could be
	extended by replicating the cluster-level states for the
	additional topological levels, and modifying the transition
	rules for the intermediate (non-outermost) cluster levels.


Colophon
--------

Originally created and documented by Dave Martin for Linaro Limited, in
collaboration with Nicolas Pitre and Achin Gupta.

Copyright (C) 2012-2013  Linaro Limited
Distributed under the terms of Version 2 of the GNU General Public
License, as defined in linux/COPYING.