| o https://arstechnica.com/science/2017/06/crystal-stacks-up-qubits-in-first-in-first-out-memory/ |
| http://spectrum.ieee.org/tech-talk/computing/hardware/qudits-the-real-future-of-quantum-computing |
| |
| o Also "Quantum Volume": |
| https://phys.org/news/2017-05-ibm-powerful-universal-quantum-processors.html |
| https://dal.objectstorage.open.softlayer.com/v1/AUTH_039c3bf6e6e54d76b8e66152e2f87877/community-documents/quatnum-volumehp08co1vbo0cc8fr.pdf |
| |
| o Julia language as parallel panacea? https://julialang.org/ and |
| https://en.wikipedia.org/wiki/Julia_(programming_language |
| |
| @@@ |
| |
| o Consider adding kfifo to NBS section. Perhaps also |
| atomic queues and stacks. Reference hazard pointers. |
| |
| o Remove ACCESS_ONCE() in favor of READ_ONCE() and WRITE_ONCE(). |
| |
| o Add verbiage on relative success of transactional lock elision. |
| |
| o Update Test/perf.gpl set of tests for atomic operations and |
| cache-line movement and add to CodeSamples/cpu, then rerun |
| this to create a new table of atomic operations. |
| |
| o Single writer principle. Mechanical sympathy. |
| ("Disruptor" presentation at LCA2013.) |
| |
| Martin Thompson: mechanical-sympathy.blogspot.com |
| Mike Barker: bad-concurrency.blogspot.com |
| github.com/LMAX-Exchange/disruptor |
| |
| o Add gcc padding and alignment directives to toolsoftrade. |
| |
| o Add RACES paper material, perhaps as another "future" section. |
| Also add notes from RACES workshop. |
| |
| o In cpu/hwfreelunch.tex, add asynchronous-hardware citation |
| from Ivan Sutherland or some such. |
| |
| o In intro/intro.tex in the "Use Existing Parallel Software" |
| section, add citations for web-application serving and |
| map-reduce. |
| |
| o Reorganize memory-ordering introduction with Michel Dagenais's |
| feedback in mind. See email dated August 20, 2012. |
| |
| o Consider refactoring test code in CodeSamples directory tree. |
| Probably some opportunity for common measurement code and |
| also common argument-parsing code. |
| |
| o Add discussion of reader-writer locking to the locking chapter |
| implementation discussion. Ticket rwlocks, brlock/lglock, |
| unfair locks (and reasons for them), forward reference to |
| seqlock, RCU lock, and barrier lock. |
| |
| o Consider a barrier section in defer chapter. |
| |
| o Consider expanding on dequeue discussion, especially on queues, |
| in data-structures chapter (e.g., limits of ordering guarantees). |
| Need hash tables there, of course! |
| |
| o Add the various livejournal posts on memory order, especially |
| the one on transitivity. Consider also adding Will Deacon's |
| January 6 2011 QoS story. |
| |
| o Data-race detection [Bart Van Assche January 2010] |
| Helgrind, DRD, ThreadSanitizer, Intel Thread Checker, |
| Intel Parallel Studio, DIOTA, ... |
| |
| Valgrind-based tools allow suppression patterns that |
| span multiple call stack frames, other tools reportedly |
| do not. |
| |
| Citation: |
| |
| J. Maebe, M. Ronsse, and K. De Bosschere. *DIOTA: Dynamic |
| instrumentation, optimization and transformation |
| of applications*. In Proceedings of WBT-2002, |
| Charlottesville, Virginia, USA, September 2002. |
| |
| o Consider adding discussion of Amdahl's Law and Gustafson's |
| Law to the introduction. [Kay Muller] |
| |
| o Consider adding something about VLIW and SIMD to hwhabits? |
| Maybe something more about memory hierarchy? [Kay Muller] |
| |
| o Need some list somewhere of environments that allow you to |
| avoid worrying about locking and threading: BOINC, map-reduce, |
| SQL, sequential-program optimizations, ... |
| |
| o Update "SMP Synchronization Design" to cover parallelism in |
| general. Discuss partitioning up front. Describe two major |
| cases, pipelining and data parallelism. Code locking can be |
| analogous to pipelining, but probably needs a separate section. |
| |
| o What is the form of the need for added performance? |
| |
| o Throughput in face of many relatively small |
| units of work. (multiple instances. |
| data parallelism. DBMS. Web serving.) |
| |
| o Throughput of large block(s) of work. |
| (pipelining, or "series". parallel decompostion, |
| or "parallel".) |
| |
| o Latency of units of work that are large relative |
| to the communications overhead. (overlapped |
| pipelining, parallel decomposition.) |
| |
| o Latency of an aggregate of many units of work |
| that are small relative to the communications |
| overhead. (batch the units of work into larger |
| units and then re-ask the question relative to |
| the batches) |
| |
| o Latency of units of work that are small relative |
| to the communications overhead. (sequential |
| programming optimizations. full-up redesign. |
| approximations/heuristic techniques. |
| overclocking. implement in hardware, e.g., |
| FPGAs.) |
| |
| o Large programs or systems might have different needs |
| in different places -- thus might need combination of |
| techniques. |
| |
| o Move data-structures discussion to this chapter. |
| |
| o Add quote "weak ordering requires strong minds" and dissection |
| thereof to memory-ordering discussion. |
| |
| o Add section to intro/hwhabits.tex giving intuitive notion of |
| how cache coherence protocol and write buffer makes atomic |
| instructions expensive. Draw from URochester slide set |
| (in paper/scalability/TR-TMeval or thereabouts). |
| |
| o Add more terms to vocabulary. More TM stuff, for example. |
| Also RCU nomenclature [done]. |
| |
| o Find Herlihy's topology-analogy stuff and add hints to the |
| formal-methods section. |
| |
| o Fill out material. |
| |
| o GPGPUs in Data Ownership chapter. |
| |
| o Fill out "defer" section, starting with fork-join examples. |
| Relate to state-space complexity. |
| |
| o Add material from "is parallel programming hard" position |
| paper. (After publication.) |
| |
| o Pull in material from LJ2003 RCU and dissertation |
| demonstrating locking performance issues. |
| |
| o Need citation for Matlab*p. |
| |
| x Add Steve's and my dyntick paper to analysis as an example |
| of simpler examples. |
| |
| x Add stuff from differential profiling paper. |
| |
| x Memory barriers -- get principles lined out, then show |
| "bookend" implementations. |
| |
| x Add appendix with mathematics for hard realtime response |
| using locks. Or just cite Bjoern Brandenberg's dissertation. |
| |
| x Create code examples for SMPdesign stuff, fill out |
| Hierarchical Locking, Resource Allocation Caches, |
| Performance Summary. |
| |
| MOSTLY DONE... |
| |
| x Refocus the SMPdesign chapter to partitioning. |
| |
| o Fill out RCU patterns, also corresponding code and |
| performance summary. |
| |
| o RCU and atomic list movement (Josh?) |
| |
| o RCU and reference counts |
| |
| o Converting rwlocks to RCU |
| |
| x Consider adding history of RCU from OSR paper (low priority) |
| |
| o Other stuff Paul McKenney has from previous publications. |
| |
| o Other stuff that must be generated or harvested from |
| elsewhere. |
| |
| o Note that pthread_mutex permits static initialization. |
| (Windows apparently does not.) |
| |
| x Finish whymemorybarriers.tex |
| |
| o Come up with suitable API for example code, so that a single |
| program can be used for each example, covering the pthreads, |
| Linux-kernel-module, and other C-language environments. |
| |
| double dgettimeofday(void); |
| void spin_lock_init(spinlock_t *sp); /* Must be called. */ |
| void spin_lock(spinlock_t *sp); |
| int spin_trylock(spinlock_t *sp); |
| void spin_unlock(spinlock_t *sp); |
| thread_id_t create_thread(void *(*func), void *arg); |
| void *wait_thread(pthread_id_t tid); |
| |
| -- See CodeSamples/pthreads/api-pthreads.h for a start. |
| |
| Clearly Java will need its own code. |
| |
| ------------------------------------------------------------------------ |
| |
| x Fix the references in defer/rcu*.tex to "green" and "red" cells |
| in the various tables. Change to refer to the symbols actually |
| used. |
| |
| x Verify versions of cartoons. (Need whippersnapper in intro) |
| |
| x Upgrade Quick Quiz scripts to remove material in second parameter |
| when aggregating into "answers" section. Trivialize by changing |
| required format to put closing "}" before end tag. Then get rid |
| of the special cases for code formatting in QuickQuiz answers. |
| |
| x Move API description to an appendix. |
| |
| x Fill out material. |
| |
| x Finish converting WhyRCU LWN series. |
| |
| x Add cautions to analysis section. Steve's and |
| my "Integrating and Validating dynticks and Preemptable RCU" |
| could be a good start, comparing to the rcutree.c |
| dynticks implementation. |
| |
| x Also add "what to do instead of parallel programming" |
| section. |
| |
| x Move the formal-verification sections to an appendix. |
| |
| x Add the dyntick verification. (~/paper/KernelJanitor/RCU/dyntick/LWN) |
| |
| x Fix rcutorture.h to correctly compute from time. Allow the |
| test time as a third argument. |
| |
| x Reference counting: basic problem: make data structure stick |
| around while you are incrementing the reference count. |
| |
| Strategies: lock, pre-existing reference, atomic check for |
| zero reference count coupled with GC/RCU, various |
| non-blocking-synchronization tricks. |
| |
| x Copy rcutree full data-structure diagram to section describing |
| force_quiescent_state() callee traversal of the leaf rcu_node |
| structures. Decorate with right-facing arrow to show search |
| direction. Or perhaps copy diagram that shows initialization |
| state. |
| |
| x Add \RevisedFrom to origpub.tex to allow something like: |
| |
| The ideas discussed in Section... were originally presented |
| in XXX\cite{}. |
| |
| Allow ranges of sections for OriginallyFrom. |
| |
| x Discuss other textbooks in intro. |
| |
| x Figure out some way to get section headers between "Answers to |
| Quick Quizzes" for different chapters. Want to avoid any |
| section header for a chapter that has no "quick quizzes". |
| Not a big deal, nice to have. |
| |
| x Add a "Tools of the Trade" chapter before "SMP Synchronization |
| Design" giving an overview of how locking and atomic operations |
| work. Need simple fork/join, along with foreshadowing of how |
| good design can simplify implementation. |
| |
| x Add variables to rest of examples in toyrcu.tex. |
| |
| x In the double-ended-queue code, consider renaming the |
| "enqueue" operations to "push" and the "dequeue" operations |
| to "pop". |
| |
| x Need a chapter on counting. |
| |
| x Add the separate-thread approach suggested unwittingly by |
| Christoph Lameter to the count.tex chapter. |
| |
| x Embed fonts from .eps files: http://www.wkiri.com/today/?p=60 |
| See the sed script: |
| |
| cat original_graphics.eps | sed 's+Times-Bold+NimbusSanL-Bold+g' |\ |
| sed 's+Times-Roman+NimbusSanL-Regu+g' |\ |
| sed 's+Times+NimbusSanL-Regu+g' |\ |
| sed 's+Helvetica-BoldOblique+NimbusSanL-BoldItal+g' |\ |
| sed 's+Helvetica-Oblique+NimbusSanL-ReguItal+g' |\ |
| sed 's+Helvetica-Bold+NimbusSanL-Bold+g' |\ |
| sed 's+Helvetica-Bold-iso+NimbusSanL-Bold+g' |\ |
| sed 's+Helvetica+NimbusSanL-Regu+g' |\ |
| sed 's+Helvetica-iso+NimbusSanL-Regu+g' |\ |
| sed 's+Symbol+StandardSymL+g' > new_graphics.eps |
| |
| And the additional conversions: |
| |
| Courier NimbusMonL-Regu |
| Courier-Bold NimbusMonL-Bold |
| Courier-Oblique NimbusMonL-ReguObli |
| Courier-BoldOblique NimbusMonL-BoldObli |
| |
| x Change calls to bzero() to memset(). [Horst] |
| |
| x dvipdf is quite slow, especially on 1-up setups: |
| |
| time dvips perfbook |
| real 0m3.405s |
| user 0m3.184s |
| sys 0m0.224s |
| time psnup -2 perfbook.ps perfbook-2up.ps |
| real 0m0.852s |
| user 0m0.248s |
| sys 0m0.308s |
| time ps2pdf perfbook-2up.ps perfbook-2up.pdf |
| real 0m25.383s |
| user 0m24.646s |
| sys 0m0.276s |
| |
| time dvipdf perfbook |
| real 1m31.456s |
| user 0m48.347s |
| sys 0m45.599s |
| |
| Could parallelism be profitably applied here? ;-) |
| Or perhaps just careful avoidance of quadratic algorithms? |
| |
| Use pdflatex instead, much faster! |
| |
| x Explicitly permit reformatted copies (single-column, ebook, etc.). |
| |
| x Change from "preemptable" to "preemptible". [Horst] |
| |
| x Produce an HTML version. One possibility is latex2html |
| and scripting, and tamasrepus suggests |
| http://johnmacfarlane.net/pandoc/. |
| |
| Challenges include the scripting and special latex macros |
| that hendle the quick quizzes and the illustration credits. |
| |
| x Add discussion of seqlock to defer chapter between refcnt and RCU. |
| Use trivial example showing consistency. Note inapplicability |
| to pointer-based data structures. |
| |
| x Update chain of RCU implementations to add http://lttng.org/urcu. |
| |
| x Line up reviewers. Need a range of viewpoints, and especially |
| a range of familiarity with SMP coding. [Now that the book is |
| public, just let them line themselves up!] |
| |
| x Work out who else to invite to contribute. Possibilities include: |
| [Now that the book is public, just let them line themselves up!] |
| |
| o Real-life use of RCU: Dipankar Sarma, Steve Hemminger, |
| Maneesh Soni, David Howells, ... |
| |
| o Nick Piggin: radix tree. |
| |
| o Robert Ohlsson: fib trie. |
| |
| o Steve Hemminger: brlock replacement. |
| |
| o SELinux guys. |
| |
| o Extreme scalability: Christoph Lameter, Jesse Barnes, |
| other SGI and ex-SGI folks. |
| |
| o Various architecture maintainers for issues on abstract |
| CPUs and other hardware issues. |
| |
| o Chinese angle: guys from Tsinghua U. in Beijing. |
| |
| o Stuff from Rusty's Unreliable Guides. |
| |
| o Rusty Commentary version. |
| |
| o Tom Hart on RCU performance measurement. |
| Perhaps also on Hazard Pointers (maybe Maged Michael |
| as well or instead). |
| |
| o Josh Triplett on RCU error checking in sparse. |
| |
| o Val Henson on academic research on parallel SMP. |
| |
| o http://globaltext.org/ -- mailto:rwatson@terry.uga.edu |
| +1-706-542-3706. "Wiki-based" textbooks. This is |
| the Terry College of Business at U. of Georgia, |
| but what is wrong with a little cross-pollenation? |
| |
| The usual experience is that somewhere between 20% and 50% |
| of the people who get excited about contributing really will |
| do so. |
| |
| x Need to make a chapter on locking (as opposed to the current |
| one on synchronization). [IN PROGRESS] |
| |
| x Consider using parallel maze solving as an example. See: |
| |
| http://www.futurechips.org/tips-for-power-coders/parallel-programming-tutorial-parallelizing-somewhat-complex-code-part-1.html |
| |
| A better approach might assign one processor to each end of |
| the maze. Chopping the maze up into blocks might help, but |
| this would likely encounter pathological cases with lots |
| of boundary crossing. |
| |
| Another approach is to start each CPU at a random point in |
| the maze (in addition to the one each at start and end). |
| Give each CPU a color, and when there is a path from beginning |
| to end, you are done. Possibly an application for parallel |
| union-find. ;-) |
| |
| x Update TM section based on OSR "grass is greener" paper and |
| relevant paragraph from urcu paper. |
| |
| x Move OSR RCU history paper to appendix. |
| |
| x Add mechanism to automatically display authorhood. |
| |
| x My local view adds a modified book.cls that adds the bibliography |
| to the table of contents. However, the license on book.cls forbids |
| distributing it without also distributing latex. Need to come up |
| with a better way! |
| |
| In the meantime, edit your local copy of book.cls and remove the |
| "*" following the "chapter" directive in "thebibliography" macro. |
| |
| x Add support for real 64-bit x86 builds. |
| |
| x Use "inkscape --export-pdf=fn.pdf fn.svg" to create PDFs from |
| inkscape .svg files. Alternatively: |
| |
| inkscape --export-pdf=fn.pdf --export-latex fn.svg |
| |
| gets an fn.pdf_tex file that allows text in the diagram to |
| be typeset by Latex, allowing things like references and |
| citations to be automatically generated within inkscape drawings. |
| |
| x Add Hazard-Pointer section between refcnt and seqlock sections |
| of Deferred Processing chapter. |
| |
| x Add SNZI to the counting chapter. |
| |
| x In the intro/intro.tex quick-quiz answer citing Herb |
| Sutter's "Effective Concurrency" column, also cite |
| Anthony Williams's book. |
| |
| x Put RCUApplicability.eps into the defer chapter. |
| |
| x Change "non-idempotent" to "irrevocable" and "idempotent" |
| to "revocable" in TM discussion. |
| |
| x Add a chapter on performance analysis and benchmarking. |
| |
| X Move memory-allocation code from "Partitioning and Synchronization |
| Design" to "Data Ownership"? |
| |
| x Cite http://heather.cs.ucdavis.edu/~matloff/158/PLN/ParProcBook.pdf |
| for reference on CUDA, MPI, etc. |
| |
| x Change ASCII double quotes to TeX quotes. [Horst] |
| |
| x Use the existence problem as motivation for RCU. Show some |
| solutions in the locking chapter. See the day-2 ACACES slideset. |
| See also the RCU tutorial and the SC22WG14 email. |
| |
| o spin_lock_irqsave(): Critical sections must guarantee forward |
| progress against everything except NMI handlers. |
| |
| o raw_spin_lock(): Critical sections must guarantee forward |
| progress against everything except IRQ (including softirq) |
| and NMI handlers. |
| |
| o spin_lock(): Critical sections must guarantee forward |
| progress against everything except IRQ (again including |
| softirq) and NMI handlers and (given CONFIG_PREEMPT_RT) |
| higher-priority realtime tasks. |
| |
| o mutex_lock(): Critical sections need not guarantee |
| forward progress, as general blocking is permitted. |
| |
| The other issue is the scope of the lock. The Linux kernel has |
| the following: |
| |
| o BKL: global scope. |
| |
| o Everything else: scope defined by the use of the underlying |
| lock variable. |
| |
| One of the many reasons that we are trying to get rid of |
| BKL is because it combines global scope with relatively weak |
| forward-progress guarantees. |
| |
| x First edition changes: |
| |
| X Move "Data Structures" to precede "Applying RCU". |
| |
| Add a tree or two. |
| |
| x Change "Applying RCU" name to "Putting It All Together". |
| |
| Add examples from dcache, semaphore mapping, ... |
| |
| Publish v0.9-yyyy.mm.dd, call for comments. Best 3? sets of comments |
| get a dead-tree signed first edition. |
| |
| Which will be v1.0-yyyy.mm.dd. |
| |
| x Consider linking quick quizzes to the answers. |
| |
| X Convert bloatwatch RCU and add to appendix/rcuimpl. |
| |
| X RCU implementations appendix: |
| |
| x RCU requirements/desiderata. (See paper on |
| desk at home for start, along with the usual |
| papers.) Also look in the advsync/rcu.tex section |
| on RCU, which needs to be folded into the |
| sections in defer/ |
| |
| o Uniprocessor RCU implementation. Simplify |
| the current one so that the rcu and rcu_bh |
| implementations are identical, as appropriate |
| for a device not subject to networking DoS |
| attacks. |
| |
| x Toy implementations. Follow wikipedia, but add |
| reference-count approach. |
| |
| x Performance and scalability for toy versions. |
| |
| x User-level RCU implementations (after linux.conf.au |
| 2008) |
| |
| o Non-realtime kernel implementations. Take from |
| dissertation and from LWN articles. Or, given |
| treercu, just cite dissertation. |
| |
| x Realtime kernel implementations. |
| |
| o RCU priority boosting. I suppose getting a good |
| boosting implementation would help... |
| |
| x Hierarchical implementation (after November 20, 2008 |
| due to LWN subscriber-only period). |
| Also propagate references into |
| appendix/formal/dyntickrcu.tex. |
| |
| Also add code walkthrough. You will need to do |
| the walkthrough to find rcutree bugs anyway... |
| |
| o Derive RCU requirements from list of examples, |
| propagate list up to defer chapter. |
| (rcufundamental.tex) |
| |
| o Crisply present advantages and disadvantages of |
| RCU in defer chapter. |
| |
| x https://lwn.net/Articles/667593/ to QC futures. Plus |
| https://lwn.net/Articles/667720/. |
| |
| x Expant SAT-solver section with recent experience. |
| |
| x Move intro/hwhabits.tex to cpu/. Add sections with increasingly |
| detailed pictures of what a CPU does, including speed-of-light |
| round-trip time (~3 cm, or a bit more than an inch) at 5GHz. |
| 3D fabrication might help, but watch the fabrication cost and |
| the heat dissipation!!! Also, electrons travel from 3-30 times |
| more slowly in silicon than light does in a vacuum, and even |
| more slowly if there are levels of clocked logic in the way. |
| |
| Include actual numbers/graphs from real machines. Show effects |
| of various forms of misbehavior. |
| |
| x Add Makefile (and comments) to CodeSamples directories. |
| (defer and SMPdesign done.) |
| |
| ------------------------------------------------------------------------ |
| |
| PROGRESS |
| |
| August 5, 2006: 76 pages |
| August 12, 2006: 84 pages |
| August 20, 2006: 90 pages |
| September 3, 2006: 113 pages |
| September 17, 2006: 123 pages |
| -- LJ paper, discussions on memory barriers -- |
| October 11, 2006: 111 pages |
| November 5, 2006: 135 pages |
| October 20, 2008: 176 pages |
| October 30, 2008: 199 pages |
| November 7, 2008: 205 pages |
| Change page-layout parameters... |
| November 7, 2008: 168 pages |
| November 9, 2008: 200 pages |
| November 16, 2008: 220 pages |
| November 30, 2008: 232 pages |
| December 8, 2008: 250 pages |
| December 23, 2008: 266 pages |
| January 13, 2009: 276 pages |
| January 31, 2009: 288 pages |
| February 28, 2009: 292 pages |
| May 24, 2009: 302 pages |
| July 3, 2009: 308 pages |
| February 21, 2010: 342 pages |
| February 21, 2010: 354 pages |
| Change page-layout parameters... |
| February 20, 2011: 379 pages |
| March 23, 2011: Test edition posted on lulu.com: http://www.lulu.com/content/paperback-book/is-parallel-programming-hard-and-if-so-what-can-you-do-about-it/10357070 |