util: Add ALIGN_POWER2

Add a static inline function to align a value to its next power of 2.
This is commonly done with a SWAR algorithm like the one in:

http://aggregate.org/MAGIC/#Next Largest Power of 2

However, a microbenchmark shows that the implementation here is
slightly faster. It doesn't really impact the possible users of this
function, but it's interesting nonetheless.

On an x86_64 i7 Ivy Bridge, using clz instead of the OR and SHL chain
shows a ~4% advantage. And this is with BSR, since Ivy Bridge doesn't
have LZCNT. Newer Haswell processors have the LZCNT instruction, which
can make this even better. ARM also has a CLZ instruction, so it should
be better there, too.
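
The two approaches compared above can be sketched as follows. This is a
minimal illustration, not necessarily the code added by this commit:
the names align_power2_clz and align_power2_swar are hypothetical,
__builtin_clz is the GCC/Clang builtin, and a 32-bit unsigned int is
assumed. The SWAR variant here decrements first so that both functions
leave exact powers of 2 unchanged.

```c
#include <limits.h>

/*
 * Hypothetical name: CLZ-based round-up to a power of 2.
 * Valid for 2 <= u <= 2^31 only: __builtin_clz(0) is undefined and
 * larger inputs would need a shift by the full width of the type.
 */
static inline unsigned int align_power2_clz(unsigned int u)
{
	return 1U << (sizeof(u) * CHAR_BIT - __builtin_clz(u - 1));
}

/*
 * Hypothetical name: SWAR round-up, assuming 32-bit unsigned int.
 * Smear the highest set bit of (x - 1) into every lower position
 * with the OR/shift chain, then add one.
 */
static inline unsigned int align_power2_swar(unsigned int x)
{
	x--;
	x |= x >> 1;
	x |= x >> 2;
	x |= x >> 4;
	x |= x >> 8;
	x |= x >> 16;
	return x + 1;
}
```

On x86 the builtin compiles to BSR or LZCNT depending on the target
flags, which replaces the five shift and OR pairs of the SWAR chain.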

Code used to test:

	v = val[i];
	t1 = get_cycles(0);
	a = ALIGN_POWER2(v);
	t1 = get_cycles(t1);

	t2 = get_cycles(0);
	v = nlpo2(v);
	t2 = get_cycles(t2);

	printf("%u\t%llu\t%llu\t%d\n", v, t1, t2, v == a);

Here val is an array of 20 random unsigned ints, nlpo2 is the SWAR
implementation, and get_cycles uses RDTSC to measure the performance.

	ALIGN_POWER2:	30 cycles
	nlpo2:		31.4 cycles
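
The get_cycles helper used in the test is not shown. A minimal sketch
that matches the way it is called (0 to read the counter, a previous
reading to get the elapsed cycles) could look like the code below. The
body is an assumption: it relies on the __rdtsc() intrinsic from
<x86intrin.h>, so it builds only for x86 with GCC or Clang.

```c
#include <x86intrin.h>	/* __rdtsc(): x86 with GCC or Clang only */

/*
 * Hypothetical helper: returns the current TSC value minus `since`.
 * get_cycles(0) reads the counter; get_cycles(t) returns the cycles
 * elapsed since the earlier reading t.
 */
static inline unsigned long long get_cycles(unsigned long long since)
{
	return __rdtsc() - since;
}
```

RDTSC is not a serializing instruction, so a single short measurement
like this is noisy and is best repeated over many inputs.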
1 file changed