======================================
vlocks for Bare-Metal Mutual Exclusion
======================================

Voting Locks, or "vlocks", provide a simple low-level mutual exclusion
mechanism, with reasonable but minimal requirements on the memory
system.

These are intended to be used to coordinate critical activity among CPUs
which are otherwise non-coherent, in situations where the hardware
provides no other mechanism to support this and ordinary spinlocks
cannot be used.


vlocks make use of the atomicity provided by the memory system for
writes to a single memory location.  To arbitrate, every CPU "votes for
itself", by storing a unique number to a common memory location.  The
final value seen in that memory location when all the votes have been
cast identifies the winner.

In order to make sure that the election produces an unambiguous result
in finite time, a CPU will only enter the election in the first place if
no winner has been chosen and the election does not appear to have
started yet.


Algorithm
---------

The easiest way to explain the vlocks algorithm is with some pseudo-code::


	int currently_voting[NR_CPUS] = { 0, };
	int last_vote = -1; /* no votes yet */

	bool vlock_trylock(int this_cpu)
	{
		/* signal our desire to vote */
		currently_voting[this_cpu] = 1;
		if (last_vote != -1) {
			/* someone already volunteered himself */
			currently_voting[this_cpu] = 0;
			return false; /* not ourself */
		}

		/* let's suggest ourself */
		last_vote = this_cpu;
		currently_voting[this_cpu] = 0;

		/* then wait until everyone else is done voting */
		for_each_cpu(i) {
			while (currently_voting[i] != 0)
				/* wait */;
		}

		/* result */
		if (last_vote == this_cpu)
			return true; /* we won */
		return false;
	}

	void vlock_unlock(void)
	{
		last_vote = -1;
	}


The currently_voting[] array provides a way for the CPUs to determine
whether an election is in progress, and plays a role analogous to the
"entering" array in Lamport's bakery algorithm [1].

However, once the election has started, the underlying memory system
atomicity is used to pick the winner.  This avoids the need for a static
priority rule to act as a tie-breaker, or any counters which could
overflow.

As long as the last_vote variable is globally visible to all CPUs, it
will contain only one value that won't change once every CPU has cleared
its currently_voting flag.
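The pseudo-code above can be exercised as plain C on a single host.  The
sketch below is sequential, so the memory-ordering and visibility
subtleties that matter on real hardware disappear; it only demonstrates
the voting logic itself (NR_CPUS and the plain C loop standing in for
for_each_cpu() are our own simplifications):

```c
#include <stdbool.h>

#define NR_CPUS 4

static int currently_voting[NR_CPUS];
static int last_vote = -1; /* no votes yet */

bool vlock_trylock(int this_cpu)
{
	/* signal our desire to vote */
	currently_voting[this_cpu] = 1;
	if (last_vote != -1) {
		/* someone already volunteered himself */
		currently_voting[this_cpu] = 0;
		return false;
	}

	/* let's suggest ourself */
	last_vote = this_cpu;
	currently_voting[this_cpu] = 0;

	/*
	 * Then wait until everyone else is done voting; in this
	 * sequential sketch nobody else is ever mid-vote.
	 */
	for (int i = 0; i < NR_CPUS; i++)
		while (currently_voting[i] != 0)
			/* wait */;

	/* result */
	return last_vote == this_cpu;
}

void vlock_unlock(void)
{
	last_vote = -1;
}
```

Run sequentially, the first caller wins, later callers lose, and
unlocking reopens the election, matching the behaviour described above.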


Features and limitations
------------------------

 * vlocks are not intended to be fair.  In the contended case, it is
   the _last_ CPU to attempt to take the lock which is most likely to
   win.

   vlocks are therefore best suited to situations where it is necessary
   to pick a unique winner, but it does not matter which CPU actually
   wins.

 * Like other similar mechanisms, vlocks will not scale well to a large
   number of CPUs.

   vlocks can be cascaded in a voting hierarchy to permit better scaling
   if necessary, as in the following hypothetical example for 4096 CPUs::

	/* first level: local election */
	my_town = towns[(this_cpu >> 4) & 0xf];
	I_won = vlock_trylock(my_town, this_cpu & 0xf);
	if (I_won) {
		/* we won the town election, let's go for the state */
		my_state = states[(this_cpu >> 8) & 0xf];
		I_won = vlock_trylock(my_state, this_cpu & 0xf);
		if (I_won) {
			/* and so on */
			I_won = vlock_trylock(the_whole_country, this_cpu & 0xf);
			if (I_won) {
				/* ... */
			}
			vlock_unlock(the_whole_country);
		}
		vlock_unlock(my_state);
	}
	vlock_unlock(my_town);
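Filled out with a concrete lock type, the same cascade idea can be made
compilable.  The sketch below is a hypothetical, sequential stand-in
(the struct, names, and trivial trylock are ours, not the bare-metal
implementation): a two-level election among 256 CPUs, 16 towns of 16
CPUs each, where each town winner then contends for the country using
its town index as its ID at the next level:

```c
#include <stdbool.h>

/*
 * Minimal sequential stand-in for a vlock: 0 means "no winner yet",
 * otherwise it holds the winner's ID + 1 (so that CPU 0 is
 * representable in zero-initialised static storage).
 */
struct vlock { int last_vote; };

static struct vlock towns[16];		/* one per group of 16 CPUs */
static struct vlock the_whole_country;

static bool vlock_trylock(struct vlock *v, int id)
{
	if (v->last_vote == 0)
		v->last_vote = id + 1;	/* first voter wins outright */
	return v->last_vote == id + 1;
}

/* Two-level cascade: win the town, then go for the country. */
bool elect(int this_cpu)
{
	struct vlock *my_town = &towns[(this_cpu >> 4) & 0xf];
	bool I_won = vlock_trylock(my_town, this_cpu & 0xf);

	if (I_won)
		I_won = vlock_trylock(&the_whole_country,
				      (this_cpu >> 4) & 0xf);
	return I_won;
}
```

Only one CPU per town can reach the country-level election, so at most
16 CPUs ever contend the top-level lock, which is the point of the
hierarchy.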


ARM implementation
------------------

The current ARM implementation [2] contains some optimisations beyond
the basic algorithm:

 * By packing the members of the currently_voting array close together,
   we can read the whole array in one transaction (providing the number
   of CPUs potentially contending the lock is small enough).  This
   reduces the number of round-trips required to external memory.

   In the ARM implementation, this means that we can use a single load
   and comparison::

	LDR	Rt, [Rn]
	CMP	Rt, #0

   ...in place of code equivalent to::

	LDRB	Rt, [Rn]
	CMP	Rt, #0
	LDRBEQ	Rt, [Rn, #1]
	CMPEQ	Rt, #0
	LDRBEQ	Rt, [Rn, #2]
	CMPEQ	Rt, #0
	LDRBEQ	Rt, [Rn, #3]
	CMPEQ	Rt, #0
   This cuts down on the fast-path latency, as well as potentially
   reducing bus contention in contended cases.

   The optimisation relies on the fact that the ARM memory system
   guarantees coherency between overlapping memory accesses of
   different sizes, similarly to many other architectures.  Note that
   we do not care which element of currently_voting appears in which
   bits of Rt, so there is no need to worry about endianness in this
   optimisation.

   If there are too many CPUs to read the currently_voting array in
   one transaction then multiple transactions are still required.  The
   implementation uses a simple loop of word-sized loads for this
   case.  The number of transactions is still fewer than would be
   required if bytes were loaded individually.

   In principle, we could aggregate further by using LDRD or LDM, but
   to keep the code simple this was not attempted in the initial
   implementation.
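An illustrative C analogue of both cases (the function name and layout
are ours, not the kernel's): the packed byte array is scanned one 32-bit
word at a time, as the ARM fast path does with LDR in place of four
LDRBs.  memcpy expresses the wide load portably; since we only compare
against zero, which byte lands in which bits of the word is irrelevant,
so endianness does not matter:

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

/* Is any CPU still voting?  Word-at-a-time scan of the packed array. */
bool any_cpu_voting(const uint8_t *currently_voting, int nr_cpus)
{
	int i;

	/* one word-sized "transaction" covers four flags at once */
	for (i = 0; i + 4 <= nr_cpus; i += 4) {
		uint32_t word;

		memcpy(&word, &currently_voting[i], sizeof word);
		if (word != 0)
			return true;
	}

	/* tail bytes, if nr_cpus is not a multiple of four */
	for (; i < nr_cpus; i++)
		if (currently_voting[i])
			return true;

	return false;
}
```

For small CPU counts the first loop runs exactly once, matching the
single LDR/CMP fast path shown above.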


 * vlocks are currently only used to coordinate between CPUs which are
   unable to enable their caches yet.  This means that the
   implementation removes many of the barriers which would be required
   when executing the algorithm in cached memory.

   Packing of the currently_voting array does not work with cached
   memory unless all CPUs contending the lock are cache-coherent, due
   to cache writebacks from one CPU clobbering values written by other
   CPUs.  (Though if all the CPUs are cache-coherent, you should
   probably be using proper spinlocks instead anyway.)

 * The "no votes yet" value used for the last_vote variable is 0 (not
   -1 as in the pseudocode).  This allows statically-allocated vlocks
   to be implicitly initialised to an unlocked state simply by putting
   them in .bss.

   An offset is added to each CPU's ID for the purpose of setting this
   variable, so that no CPU uses the value 0 for its ID.
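The encoding can be sketched as follows (the constant names are ours,
for illustration; the point is only that zero-initialised .bss storage
reads back as "unlocked", so every real vote must avoid the value 0):

```c
#include <stdbool.h>

#define VLOCK_OWNER_NONE	0	/* .bss default == unlocked */
#define VLOCK_OWNER_OFFSET	1	/* keeps CPU 0's vote nonzero */

static int last_vote;	/* static storage: implicitly 0 == no votes yet */

void vote(int this_cpu)
{
	/* store ID + offset so that no CPU votes with the value 0 */
	last_vote = this_cpu + VLOCK_OWNER_OFFSET;
}

bool is_unlocked(void)
{
	return last_vote == VLOCK_OWNER_NONE;
}

int winner(void)
{
	return last_vote - VLOCK_OWNER_OFFSET;	/* -1 if no votes yet */
}
```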


Colophon
--------

Originally created and documented by Dave Martin for Linaro Limited, for
use in ARM-based big.LITTLE platforms, with review and input gratefully
received from Nicolas Pitre and Achin Gupta.  Thanks to Nicolas for
grabbing most of this text out of the relevant mail thread and writing
up the pseudocode.

Copyright (C) 2012-2013  Linaro Limited
Distributed under the terms of Version 2 of the GNU General Public
License, as defined in linux/COPYING.


References
----------

[1] Lamport, L. "A New Solution of Dijkstra's Concurrent Programming
    Problem", Communications of the ACM 17, 8 (August 1974), 453-455.

    https://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm

[2] linux/arch/arm/common/vlock.S, www.kernel.org.