o https://arstechnica.com/science/2017/06/crystal-stacks-up-qubits-in-first-in-first-out-memory/
http://spectrum.ieee.org/tech-talk/computing/hardware/qudits-the-real-future-of-quantum-computing
o Also "Quantum Volume":
https://phys.org/news/2017-05-ibm-powerful-universal-quantum-processors.html
https://dal.objectstorage.open.softlayer.com/v1/AUTH_039c3bf6e6e54d76b8e66152e2f87877/community-documents/quatnum-volumehp08co1vbo0cc8fr.pdf
o Julia language as parallel panacea? https://julialang.org/ and
https://en.wikipedia.org/wiki/Julia_(programming_language)
@@@
o Consider adding kfifo to NBS section. Perhaps also
atomic queues and stacks. Reference hazard pointers.
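A minimal sketch of the kind of atomic stack that might go in, using
Linux-kernel-style cmpxchg() (struct node and the names below are made
up, and this is not kfifo; pop would also need hazard pointers or RCU
to avoid ABA and use-after-free):
    struct node {
        struct node *next;
    };

    /* Lock-free (Treiber-style) push onto the stack headed by *top. */
    void stack_push(struct node **top, struct node *newnode)
    {
        struct node *old;

        do {
            old = READ_ONCE(*top);
            newnode->next = old;
        } while (cmpxchg(top, old, newnode) != old);
    }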
o Remove ACCESS_ONCE() in favor of READ_ONCE() and WRITE_ONCE().
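The conversion pattern, roughly (need_stop, shared_state, and do_work()
are hypothetical):
    /* Old style: ACCESS_ONCE() for both loads and stores. */
    while (!ACCESS_ONCE(need_stop))
        do_work();
    ACCESS_ONCE(shared_state) = new_state;

    /* New style: READ_ONCE() for loads, WRITE_ONCE() for stores. */
    while (!READ_ONCE(need_stop))
        do_work();
    WRITE_ONCE(shared_state, new_state);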
o Add verbiage on relative success of transactional lock elision.
o Update Test/perf.gpl set of tests for atomic operations and
cache-line movement and add to CodeSamples/cpu, then rerun
this to create a new table of atomic operations.
o Single writer principle. Mechanical sympathy.
("Disruptor" presentation at LCA2013.)
Martin Thompson: mechanical-sympathy.blogspot.com
Mike Barker: bad-concurrency.blogspot.com
github.com/LMAX-Exchange/disruptor
o Add gcc padding and alignment directives to toolsoftrade.
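Candidate examples for that section, assuming a 64-byte cache line
(the real value is CPU-dependent):
    #define CACHE_LINE_SIZE 64  /* Assumed; varies by CPU. */

    /* Give each per-thread counter its own cache line to avoid false sharing. */
    struct aligned_counter {
        unsigned long count;
    } __attribute__((aligned(CACHE_LINE_SIZE)));

    /* Or pad explicitly between independently updated hot fields. */
    struct shared_indexes {
        unsigned long head;
        char pad[CACHE_LINE_SIZE - sizeof(unsigned long)];
        unsigned long tail;
    };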
o Add RACES paper material, perhaps as another "future" section.
Also add notes from RACES workshop.
o In cpu/hwfreelunch.tex, add asynchronous-hardware citation
from Ivan Sutherland or some such.
o In intro/intro.tex in the "Use Existing Parallel Software"
section, add citations for web-application serving and
map-reduce.
o Reorganize memory-ordering introduction with Michel Dagenais's
feedback in mind. See email dated August 20, 2012.
o Consider refactoring test code in CodeSamples directory tree.
Probably some opportunity for common measurement code and
also common argument-parsing code.
o Add discussion of reader-writer locking to the locking chapter
implementation discussion. Ticket rwlocks, brlock/lglock,
unfair locks (and reasons for them), forward reference to
seqlock, RCU lock, and barrier lock.
o Consider a barrier section in defer chapter.
o Consider expanding on dequeue discussion, especially on queues,
in data-structures chapter (e.g., limits of ordering guarantees).
Need hash tables there, of course!
o Add the various livejournal posts on memory order, especially
the one on transitivity. Consider also adding Will Deacon's
January 6 2011 QoS story.
o Data-race detection [Bart Van Assche January 2010]
Helgrind, DRD, ThreadSanitizer, Intel Thread Checker,
Intel Parallel Studio, DIOTA, ...
Valgrind-based tools allow suppression patterns that
span multiple call stack frames; other tools reportedly
do not.
Citation:
J. Maebe, M. Ronsse, and K. De Bosschere. *DIOTA: Dynamic
instrumentation, optimization and transformation
of applications*. In Proceedings of WBT-2002,
Charlottesville, Virginia, USA, September 2002.
o Consider adding discussion of Amdahl's Law and Gustafson's
Law to the introduction. [Kay Muller]
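If added, the standard statements would be something like the
following, with p the parallelizable fraction, s = 1 - p the serial
fraction, and N the number of CPUs:
    % Amdahl's Law: speedup for a fixed-size problem.
    S_{\mathrm{Amdahl}}(N) = \frac{1}{(1 - p) + p/N}

    % Gustafson's Law: scaled speedup when the problem grows with N.
    S_{\mathrm{Gustafson}}(N) = s + N(1 - s) = N - (N - 1)\,s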
o Consider adding something about VLIW and SIMD to hwhabits?
Maybe something more about memory hierarchy? [Kay Muller]
o Need some list somewhere of environments that allow you to
avoid worrying about locking and threading: BOINC, map-reduce,
SQL, sequential-program optimizations, ...
o Update "SMP Synchronization Design" to cover parallelism in
general. Discuss partitioning up front. Describe two major
cases, pipelining and data parallelism. Code locking can be
analogous to pipelining, but probably needs a separate section.
o What is the form of the need for added performance?
o Throughput in face of many relatively small
units of work. (multiple instances.
data parallelism. DBMS. Web serving.)
o Throughput of large block(s) of work.
(pipelining, or "series". parallel decomposition,
or "parallel".)
o Latency of units of work that are large relative
to the communications overhead. (overlapped
pipelining, parallel decomposition.)
o Latency of an aggregate of many units of work
that are small relative to the communications
overhead. (batch the units of work into larger
units and then re-ask the question relative to
the batches)
o Latency of units of work that are small relative
to the communications overhead. (sequential
programming optimizations. full-up redesign.
approximations/heuristic techniques.
overclocking. implement in hardware, e.g.,
FPGAs.)
o Large programs or systems might have different needs
in different places -- thus might need combination of
techniques.
o Move data-structures discussion to this chapter.
o Add quote "weak ordering requires strong minds" and dissection
thereof to memory-ordering discussion.
o Add section to intro/hwhabits.tex giving intuitive notion of
how cache coherence protocol and write buffer makes atomic
instructions expensive. Draw from URochester slide set
(in paper/scalability/TR-TMeval or thereabouts).
o Add more terms to vocabulary. More TM stuff, for example.
Also RCU nomenclature [done].
o Find Herlihy's topology-analogy stuff and add hints to the
formal-methods section.
o Fill out material.
o GPGPUs in Data Ownership chapter.
o Fill out "defer" section, starting with fork-join examples.
Relate to state-space complexity.
o Add material from "is parallel programming hard" position
paper. (After publication.)
o Pull in material from LJ2003 RCU and dissertation
demonstrating locking performance issues.
o Need citation for Matlab*p.
x	Add Steve's and my dyntick paper to the analysis discussion as
	one of the simpler examples.
x Add stuff from differential profiling paper.
x Memory barriers -- get principles lined out, then show
"bookend" implementations.
x Add appendix with mathematics for hard realtime response
using locks. Or just cite Bjoern Brandenburg's dissertation.
x Create code examples for SMPdesign stuff, fill out
Hierarchical Locking, Resource Allocation Caches,
Performance Summary.
MOSTLY DONE...
x Refocus the SMPdesign chapter to partitioning.
o Fill out RCU patterns, also corresponding code and
performance summary.
o RCU and atomic list movement (Josh?)
o RCU and reference counts
o Converting rwlocks to RCU
x Consider adding history of RCU from OSR paper (low priority)
o Other stuff Paul McKenney has from previous publications.
o Other stuff that must be generated or harvested from
elsewhere.
o Note that pthread_mutex permits static initialization.
(Windows apparently does not.)
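For example (the Windows contrast is with CRITICAL_SECTION, which needs
a run-time InitializeCriticalSection() call; my_lock and counter are
made-up names):
    #include <pthread.h>

    /* POSIX allows compile-time initialization, no init call needed. */
    static pthread_mutex_t my_lock = PTHREAD_MUTEX_INITIALIZER;
    static long counter;

    void inc_counter(void)
    {
        pthread_mutex_lock(&my_lock);
        counter++;
        pthread_mutex_unlock(&my_lock);
    }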
x Finish whymemorybarriers.tex
o Come up with suitable API for example code, so that a single
program can be used for each example, covering the pthreads,
Linux-kernel-module, and other C-language environments.
double dgettimeofday(void);
void spin_lock_init(spinlock_t *sp); /* Must be called. */
void spin_lock(spinlock_t *sp);
int spin_trylock(spinlock_t *sp);
void spin_unlock(spinlock_t *sp);
thread_id_t create_thread(void *(*func)(void *), void *arg);
void *wait_thread(thread_id_t tid);
-- See CodeSamples/pthreads/api-pthreads.h for a start.
Clearly Java will need its own code.
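A rough sketch of how the pthreads flavor might look (not the actual
api-pthreads.h contents; error handling omitted):
    #include <pthread.h>
    #include <sys/time.h>

    typedef pthread_mutex_t spinlock_t;
    typedef pthread_t thread_id_t;

    static inline double dgettimeofday(void)
    {
        struct timeval tv;

        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1000000.0;
    }

    static inline void spin_lock_init(spinlock_t *sp)
    {
        pthread_mutex_init(sp, NULL);
    }

    static inline void spin_lock(spinlock_t *sp)
    {
        pthread_mutex_lock(sp);
    }

    static inline int spin_trylock(spinlock_t *sp)
    {
        return pthread_mutex_trylock(sp) == 0;
    }

    static inline void spin_unlock(spinlock_t *sp)
    {
        pthread_mutex_unlock(sp);
    }

    static inline thread_id_t create_thread(void *(*func)(void *), void *arg)
    {
        thread_id_t tid;

        pthread_create(&tid, NULL, func, arg);
        return tid;
    }

    static inline void *wait_thread(thread_id_t tid)
    {
        void *retval;

        pthread_join(tid, &retval);
        return retval;
    }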
------------------------------------------------------------------------
x Fix the references in defer/rcu*.tex to "green" and "red" cells
in the various tables. Change to refer to the symbols actually
used.
x Verify versions of cartoons. (Need whippersnapper in intro)
x Upgrade Quick Quiz scripts to remove material in second parameter
when aggregating into "answers" section. Trivialize by changing
required format to put closing "}" before end tag. Then get rid
of the special cases for code formatting in QuickQuiz answers.
x Move API description to an appendix.
x Fill out material.
x Finish converting WhyRCU LWN series.
x Add cautions to analysis section. Steve's and
my "Integrating and Validating dynticks and Preemptable RCU"
could be a good start, comparing to the rcutree.c
dynticks implementation.
x Also add "what to do instead of parallel programming"
section.
x Move the formal-verification sections to an appendix.
x Add the dyntick verification. (~/paper/KernelJanitor/RCU/dyntick/LWN)
x Fix rcutorture.h to correctly compute from time. Allow the
test time as a third argument.
x Reference counting: basic problem: make data structure stick
around while you are incrementing the reference count.
Strategies: lock, pre-existing reference, atomic check for
zero reference count coupled with GC/RCU, various
non-blocking-synchronization tricks.
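A hedged sketch of the "atomic check for zero plus RCU" strategy using
Linux-kernel primitives (lookup_object(), put_object(), struct obj, and
->refcnt are hypothetical):
    struct obj *get_object(int key)
    {
        struct obj *p;

        rcu_read_lock();                /* Keep p from being freed out from under us. */
        p = lookup_object(key);         /* RCU-protected lookup (hypothetical). */
        if (p && !atomic_inc_not_zero(&p->refcnt))
            p = NULL;                   /* Lost race with final put: treat as not found. */
        rcu_read_unlock();
        return p;                       /* Caller does put_object() when finished. */
    }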
x Copy rcutree full data-structure diagram to section describing
force_quiescent_state() callee traversal of the leaf rcu_node
structures. Decorate with right-facing arrow to show search
direction. Or perhaps copy diagram that shows initialization
state.
x Add \RevisedFrom to origpub.tex to allow something like:
The ideas discussed in Section... were originally presented
in XXX\cite{}.
Allow ranges of sections for OriginallyFrom.
x Discuss other textbooks in intro.
x Figure out some way to get section headers between "Answers to
Quick Quizzes" for different chapters. Want to avoid any
section header for a chapter that has no "quick quizzes".
Not a big deal, nice to have.
x Add a "Tools of the Trade" chapter before "SMP Synchronization
Design" giving an overview of how locking and atomic operations
work. Need simple fork/join, along with foreshadowing of how
good design can simplify implementation.
x Add variables to rest of examples in toyrcu.tex.
x In the double-ended-queue code, consider renaming the
"enqueue" operations to "push" and the "dequeue" operations
to "pop".
x Need a chapter on counting.
x Add the separate-thread approach suggested unwittingly by
Christoph Lameter to the count.tex chapter.
x Embed fonts from .eps files: http://www.wkiri.com/today/?p=60
See the sed script:
cat original_graphics.eps | sed 's+Times-Bold+NimbusSanL-Bold+g' |\
sed 's+Times-Roman+NimbusSanL-Regu+g' |\
sed 's+Times+NimbusSanL-Regu+g' |\
sed 's+Helvetica-BoldOblique+NimbusSanL-BoldItal+g' |\
sed 's+Helvetica-Oblique+NimbusSanL-ReguItal+g' |\
sed 's+Helvetica-Bold-iso+NimbusSanL-Bold+g' |\
sed 's+Helvetica-Bold+NimbusSanL-Bold+g' |\
sed 's+Helvetica-iso+NimbusSanL-Regu+g' |\
sed 's+Helvetica+NimbusSanL-Regu+g' |\
sed 's+Symbol+StandardSymL+g' > new_graphics.eps
And the additional conversions:
Courier NimbusMonL-Regu
Courier-Bold NimbusMonL-Bold
Courier-Oblique NimbusMonL-ReguObli
Courier-BoldOblique NimbusMonL-BoldObli
x Change calls to bzero() to memset(). [Horst]
x dvipdf is quite slow, especially on 1-up setups:
time dvips perfbook
real 0m3.405s
user 0m3.184s
sys 0m0.224s
time psnup -2 perfbook.ps perfbook-2up.ps
real 0m0.852s
user 0m0.248s
sys 0m0.308s
time ps2pdf perfbook-2up.ps perfbook-2up.pdf
real 0m25.383s
user 0m24.646s
sys 0m0.276s
time dvipdf perfbook
real 1m31.456s
user 0m48.347s
sys 0m45.599s
Could parallelism be profitably applied here? ;-)
Or perhaps just careful avoidance of quadratic algorithms?
Use pdflatex instead, much faster!
x Explicitly permit reformatted copies (single-column, ebook, etc.).
x Change from "preemptable" to "preemptible". [Horst]
x Produce an HTML version. One possibility is latex2html
and scripting, and tamasrepus suggests
http://johnmacfarlane.net/pandoc/.
Challenges include the scripting and special latex macros
that handle the quick quizzes and the illustration credits.
x Add discussion of seqlock to defer chapter between refcnt and RCU.
Use trivial example showing consistency. Note inapplicability
to pointer-based data structures.
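The trivial consistency example might look like this, using the
Linux-kernel seqlock API (the x/y coordinate pair is a made-up example):
    static DEFINE_SEQLOCK(xy_lock);
    static int x, y;        /* Invariant: always updated together. */

    void xy_update(int new_x, int new_y)
    {
        write_seqlock(&xy_lock);
        x = new_x;
        y = new_y;
        write_sequnlock(&xy_lock);
    }

    void xy_read(int *xp, int *yp)
    {
        unsigned int seq;

        do {            /* Retry if a writer intervened. */
            seq = read_seqbegin(&xy_lock);
            *xp = x;
            *yp = y;
        } while (read_seqretry(&xy_lock, seq));
    }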
x Update chain of RCU implementations to add http://lttng.org/urcu.
x Line up reviewers. Need a range of viewpoints, and especially
a range of familiarity with SMP coding. [Now that the book is
public, just let them line themselves up!]
x Work out who else to invite to contribute. Possibilities include:
[Now that the book is public, just let them line themselves up!]
o Real-life use of RCU: Dipankar Sarma, Steve Hemminger,
Maneesh Soni, David Howells, ...
o Nick Piggin: radix tree.
o Robert Ohlsson: fib trie.
o Steve Hemminger: brlock replacement.
o SELinux guys.
o Extreme scalability: Christoph Lameter, Jesse Barnes,
other SGI and ex-SGI folks.
o Various architecture maintainers for issues on abstract
CPUs and other hardware issues.
o Chinese angle: guys from Tsinghua U. in Beijing.
o Stuff from Rusty's Unreliable Guides.
o Rusty Commentary version.
o Tom Hart on RCU performance measurement.
Perhaps also on Hazard Pointers (maybe Maged Michael
as well or instead).
o Josh Triplett on RCU error checking in sparse.
o Val Henson on academic research on parallel SMP.
o http://globaltext.org/ -- mailto:rwatson@terry.uga.edu
+1-706-542-3706. "Wiki-based" textbooks. This is
the Terry College of Business at U. of Georgia,
but what is wrong with a little cross-pollination?
The usual experience is that somewhere between 20% and 50%
of the people who get excited about contributing really will
do so.
x Need to make a chapter on locking (as opposed to the current
one on synchronization). [IN PROGRESS]
x Consider using parallel maze solving as an example. See:
http://www.futurechips.org/tips-for-power-coders/parallel-programming-tutorial-parallelizing-somewhat-complex-code-part-1.html
A better approach might assign one processor to each end of
the maze. Chopping the maze up into blocks might help, but
this would likely encounter pathological cases with lots
of boundary crossing.
Another approach is to start each CPU at a random point in
the maze (in addition to the one each at start and end).
Give each CPU a color, and when there is a path from beginning
to end, you are done. Possibly an application for parallel
union-find. ;-)
x Update TM section based on OSR "grass is greener" paper and
relevant paragraph from urcu paper.
x Move OSR RCU history paper to appendix.
x Add mechanism to automatically display authorhood.
x My local view adds a modified book.cls that adds the bibliography
to the table of contents. However, the license on book.cls forbids
distributing it without also distributing latex. Need to come up
with a better way!
In the meantime, edit your local copy of book.cls and remove the
"*" following the "chapter" directive in "thebibliography" macro.
x Add support for real 64-bit x86 builds.
x Use "inkscape --export-pdf=fn.pdf fn.svg" to create PDFs from
inkscape .svg files. Alternatively:
inkscape --export-pdf=fn.pdf --export-latex fn.svg
gets an fn.pdf_tex file that allows text in the diagram to
be typeset by Latex, allowing things like references and
citations to be automatically generated within inkscape drawings.
x Add Hazard-Pointer section between refcnt and seqlock sections
of Deferred Processing chapter.
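The core acquire loop for that section could be sketched roughly as
follows with Linux-style primitives (per-thread hazard-slot management
and NULL handling are elided):
    struct node *hp_acquire(struct node **pp, struct node **hazard_slot)
    {
        struct node *p;

        for (;;) {
            p = READ_ONCE(*pp);
            WRITE_ONCE(*hazard_slot, p);    /* Publish the hazard pointer. */
            smp_mb();                       /* Order publication before recheck. */
            if (READ_ONCE(*pp) == p)
                return p;                   /* Still reachable: safe until *hazard_slot cleared. */
            /* Pointer changed underneath us: retry. */
        }
    }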
x Add SNZI to the counting chapter.
x In the intro/intro.tex quick-quiz answer citing Herb
Sutter's "Effective Concurrency" column, also cite
Anthony Williams's book.
x Put RCUApplicability.eps into the defer chapter.
x Change "non-idempotent" to "irrevocable" and "idempotent"
to "revocable" in TM discussion.
x Add a chapter on performance analysis and benchmarking.
X Move memory-allocation code from "Partitioning and Synchronization
Design" to "Data Ownership"?
x Cite http://heather.cs.ucdavis.edu/~matloff/158/PLN/ParProcBook.pdf
for reference on CUDA, MPI, etc.
x Change ASCII double quotes to TeX quotes. [Horst]
x Use the existence problem as motivation for RCU. Show some
solutions in the locking chapter. See the day-2 ACACES slideset.
See also the RCU tutorial and the SC22WG14 email.
o spin_lock_irqsave(): Critical sections must guarantee forward
progress against everything except NMI handlers.
o raw_spin_lock(): Critical sections must guarantee forward
progress against everything except IRQ (including softirq)
and NMI handlers.
o spin_lock(): Critical sections must guarantee forward
progress against everything except IRQ (again including
softirq) and NMI handlers and (given CONFIG_PREEMPT_RT)
higher-priority realtime tasks.
o mutex_lock(): Critical sections need not guarantee
forward progress, as general blocking is permitted.
The other issue is the scope of the lock. The Linux kernel has
the following:
o BKL: global scope.
o Everything else: scope defined by the use of the underlying
lock variable.
One of the many reasons that we are trying to get rid of
BKL is that it combines global scope with relatively weak
forward-progress guarantees.
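By way of illustration, data shared with an IRQ handler usually calls
for the first item above (the device, handler, and counter below are
hypothetical):
    #include <linux/spinlock.h>
    #include <linux/interrupt.h>

    static DEFINE_SPINLOCK(dev_lock);
    static unsigned long dev_events;

    /* Process context: must also exclude this CPU's IRQ handler. */
    unsigned long dev_read_and_clear(void)
    {
        unsigned long flags;
        unsigned long n;

        spin_lock_irqsave(&dev_lock, flags);
        n = dev_events;
        dev_events = 0;
        spin_unlock_irqrestore(&dev_lock, flags);
        return n;
    }

    /* IRQ context: interrupts already disabled, plain spin_lock() suffices. */
    static irqreturn_t dev_irq(int irq, void *arg)
    {
        spin_lock(&dev_lock);
        dev_events++;
        spin_unlock(&dev_lock);
        return IRQ_HANDLED;
    }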
x First edition changes:
X Move "Data Structures" to precede "Applying RCU".
Add a tree or two.
x Change "Applying RCU" name to "Putting It All Together".
Add examples from dcache, semaphore mapping, ...
Publish v0.9-yyyy.mm.dd, call for comments. Best 3? sets of comments
get a dead-tree signed first edition.
Which will be v1.0-yyyy.mm.dd.
x Consider linking quick quizzes to the answers.
X Convert bloatwatch RCU and add to appendix/rcuimpl.
X RCU implementations appendix:
x RCU requirements/desiderata. (See paper on
desk at home for start, along with the usual
papers.) Also look in the advsync/rcu.tex section
on RCU, which needs to be folded into the
sections in defer/
o Uniprocessor RCU implementation. Simplify
the current one so that the rcu and rcu_bh
implementations are identical, as appropriate
for a device not subject to networking DoS
attacks.
x Toy implementations. Follow wikipedia, but add
reference-count approach.
x Performance and scalability for toy versions.
x User-level RCU implementations (after linux.conf.au
2008)
o Non-realtime kernel implementations. Take from
dissertation and from LWN articles. Or, given
treercu, just cite dissertation.
x Realtime kernel implementations.
o RCU priority boosting. I suppose getting a good
boosting implementation would help...
x Hierarchical implementation (after November 20, 2008
due to LWN subscriber-only period).
Also propagate references into
appendix/formal/dyntickrcu.tex.
Also add code walkthrough. You will need to do
the walkthrough to find rcutree bugs anyway...
o Derive RCU requirements from list of examples,
propagate list up to defer chapter.
(rcufundamental.tex)
o Crisply present advantages and disadvantages of
RCU in defer chapter.
x https://lwn.net/Articles/667593/ to QC futures. Plus
https://lwn.net/Articles/667720/.
x	Expand SAT-solver section with recent experience.
x Move intro/hwhabits.tex to cpu/. Add sections with increasingly
detailed pictures of what a CPU does, including speed-of-light
round-trip time (~3 cm, or a bit more than an inch) at 5GHz.
3D fabrication might help, but watch the fabrication cost and
the heat dissipation!!! Also, electrons travel from 3-30 times
more slowly in silicon than light does in a vacuum, and even
more slowly if there are levels of clocked logic in the way.
Include actual numbers/graphs from real machines. Show effects
of various forms of misbehavior.
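For the record, the arithmetic behind the ~3 cm round-trip figure above:
    % Distance light travels in one 5 GHz clock period, and the
    % resulting one-way limit if the round trip must fit in that period.
    c\,T = 3 \times 10^{8}\,\mathrm{m/s} \times \frac{1}{5 \times 10^{9}\,\mathrm{Hz}} = 6\,\mathrm{cm},
    \qquad \frac{c\,T}{2} = 3\,\mathrm{cm} \approx 1.2\,\mathrm{in}.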
x Add Makefile (and comments) to CodeSamples directories.
(defer and SMPdesign done.)
------------------------------------------------------------------------
PROGRESS
August 5, 2006: 76 pages
August 12, 2006: 84 pages
August 20, 2006: 90 pages
September 3, 2006: 113 pages
September 17, 2006: 123 pages
-- LJ paper, discussions on memory barriers --
October 11, 2006: 111 pages
November 5, 2006: 135 pages
October 20, 2008: 176 pages
October 30, 2008: 199 pages
November 7, 2008: 205 pages
Change page-layout parameters...
November 7, 2008: 168 pages
November 9, 2008: 200 pages
November 16, 2008: 220 pages
November 30, 2008: 232 pages
December 8, 2008: 250 pages
December 23, 2008: 266 pages
January 13, 2009: 276 pages
January 31, 2009: 288 pages
February 28, 2009: 292 pages
May 24, 2009: 302 pages
July 3, 2009: 308 pages
February 21, 2010: 342 pages
February 21, 2010: 354 pages
Change page-layout parameters...
February 20, 2011: 379 pages
March 23, 2011: Test edition posted on lulu.com: http://www.lulu.com/content/paperback-book/is-parallel-programming-hard-and-if-so-what-can-you-do-about-it/10357070