Monday, October 15, 2012

On Parallel Programming

"Sequential programming is really hard, and parallel programming is a step beyond that." -- Andrew Tannenbaum, USENIX 2008 Keynote Speech

A colleague quoted this in a recent email, and I feel obliged to respond to it.  From practical experience, it is easy to nod in assent and let it go at that.  An application that is not designed to run in a multithreaded environment is quite likely to fail when simulated or actual parallelism is added.  And even some applications that are carefully designed to run under parallel execution fail because the designer or coder neglected an important detail.
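
To make that failure mode concrete, here is a minimal sketch (in Python, purely for illustration -- the names and counts are made up) of the classic lost-update race: a read-modify-write of a shared counter in code that was never designed for concurrent callers.

    import threading

    counter = 0

    def worker(iterations):
        global counter
        for _ in range(iterations):
            # The increment is a read, an add, and a write; a thread switch
            # between those steps can silently discard another thread's update.
            counter += 1

    threads = [threading.Thread(target=worker, args=(100000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Frequently prints something less than the expected 400000,
    # depending on the interpreter and on how the threads are scheduled.
    print(counter)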

But the fact that I am working on a programming language intended to make the job of programming parallel applications easier means that I cannot agree with this assessment wholeheartedly.  Of all professions, software engineering is the one before which the promise that automation can rescue practitioners from tedium dangles closest.  Early programmers struggled with machine code until the first symbolic assembler was created; later coders wrestled with assembly language until usable high-level languages were introduced.

Even though the concepts used to produce code that will run correctly in a parallel processing environment have been around for decades, it is reasonable to assume that parallel programming is still in its infancy, and that software tools will again come to the rescue.[1]  The main reason is that multicore processors have only become common within the last half-decade or so.  The redundant cores are currently an underused resource, but already applications must find ways to utilize this resource to remain competitive.

As with any change in the technology used to program computers, it will take some time for new programming models to become widely used.  There is a lot of legacy code and many entrenched coders.  The old code bases may need to be recoded or encapsulated to run efficiently in the new environment.  Experienced coders will have to learn a new skill set to be effective with new coding paradigms.  However, we have ample examples from the past in which this kind of paradigm shift has occurred.  The new programming model not only shapes how software is written to run on existing machines, it affects how the machines themselves are designed.  Hardware advances in turn impinge upon the tools and programming model used, and the cycle continues.

In the present turn, the presence of parallel hardware is the motive force; tools and especially the programming model are obliged to catch up.  The programming model is central to the discussion, and carries the burden of the perceived difficulty.  Serial programming is hard, but in my view it is harder than it needs to be.  If the process of creating a parallel program involves first writing a serial program and then parallelizing it, it is no wonder that parallel programming is thought to be more difficult.  But the excess complexity of serial programming and the extra step of parallelization can both be dispensed with.

The inessential complexity in serial programming arises from the expression of a linear flow of control.  Early single-processor machines could execute only one instruction at a time, so a complete ordering of the instructions fed to the processor was an accurate representation of what the machine was doing.  The state of the machine between each instruction could be predicted with precision.

With the advent of superscalar architectures, however, the connection between instruction sequence and execution sequence became more tenuous.  Superscalar architectures can dispatch more than one instruction at a time, perhaps blurring the exact state across several instructions.  Such processors contain internal circuitry to ensure that the observable state of the machine matches what is expressed by the code.  A hazard occurs when two simultaneously executing instructions intend to read and write (or both write) the same register or memory location.  The circuitry enforces a stricter ordering of the execution of instructions when hazards are present.  A programming model that fundamentally assumes parallel execution would not require this extra circuitry.
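
The same categories show up at the source level.  As a rough illustration (Python used only as convenient notation for the statements, not as a claim about any particular processor), the pairs below correspond to the classic hazard types:

    x, y = 3, 4

    # Read-after-write: the second statement reads what the first writes,
    # so their relative order must be preserved.
    a = x + y
    b = a * 2

    # Write-after-write: both statements write the same location; swapping
    # them changes which value survives.
    c = x
    c = y

    # No hazard: the statements touch disjoint data, so they could be
    # dispatched in the same cycle with no interlock at all.
    d = x + 1
    e = y + 1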

If, in the absence of hazards, instructions can be executed in any order, then it becomes clear that the serial expression of execution carries extra information that is not essential to the expression of the algorithm.  It is convenient to list the work that needs to be done linearly.  But often, the exact ordering of the statements in a program is incidental.  I use this flexibility to group statements that are related and thus enhance the readability of the code.  But the very fact that I can do this reflects the presence of ordering information that is unimportant.

In fact, some compilers will perform instruction reordering to reduce execution time.  They perform a dependency analysis among the variables described in the program, and determine the order in which operations must be performed to preserve the original intent.  In effect, that optimization throws out most of the ordering information supplied by the programmer, preserving only those partial orderings which are essential to the algorithm.  It then builds a new total ordering from scratch.
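
Here is a small, hypothetical picture of what that analysis recovers (Python again, with stand-in functions playing the role of memory loads):

    def load_a(): return 1   # stand-ins for memory reads
    def load_b(): return 2
    def load_c(): return 3

    # As written, this is one particular total order:
    t1 = load_a()        # (1)
    t2 = load_b()        # (2)
    s = t1 + t2          # (3) needs (1) and (2)
    u = load_c()         # (4) needs nothing above
    v = u * 2            # (5) needs (4)

    # The essential constraints are only 1->3, 2->3, and 4->5.  Any total
    # order consistent with that partial order -- (4)(1)(5)(2)(3), say --
    # computes the same values, which is exactly the freedom a reordering
    # compiler (or a second execution unit) exploits.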

Given that so much of the ordering in a serial program representation is incidental, we are encouraged to consider a programming model that avoids this overhead.  The language must provide the means to express important sequence constraints while ignoring or discouraging the inclusion of incidental ordering information.

It turns out that a number of programming languages already exhibit this property to some extent.  The goal-dependency representation in makefiles is a familiar example.  The goals express the dependence of a given target on a number of source files, which is in effect a sequence constraint:  A given source file must be present before the target can be produced.  In a serial processing environment, a feasible total ordering can be derived from an initial goal and the constraints provided.  In a parallel processing environment, each goal represents a task; a task can be added to the ready queue as soon as all of its constraints are satisfied; any available processor can execute any task from the ready queue at any time; completion of a task changes the state of each file touched by the task.
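
To make the parallel reading concrete, here is a rough sketch of that ready-queue discipline in Python.  The goal graph, the pool size, and the actions are invented for illustration, and the sketch assumes the graph is acyclic; a real build tool would also compare timestamps and handle failures.

    from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

    # A made-up goal graph: each target lists the targets it depends on.
    tasks = {
        "lex.o":   {"deps": [],                   "action": lambda: print("compile lex.c")},
        "parse.o": {"deps": [],                   "action": lambda: print("compile parse.c")},
        "program": {"deps": ["lex.o", "parse.o"], "action": lambda: print("link program")},
    }

    def run_parallel(tasks, workers=4):
        done = set()
        pending = {}                      # future -> target name
        with ThreadPoolExecutor(max_workers=workers) as pool:
            while len(done) < len(tasks):
                # A task joins the ready queue as soon as its constraints are met.
                for name, task in tasks.items():
                    if name not in done and name not in pending.values() \
                            and all(dep in done for dep in task["deps"]):
                        pending[pool.submit(task["action"])] = name
                # Any completion may unblock further tasks.
                finished, _ = wait(pending, return_when=FIRST_COMPLETED)
                for future in finished:
                    done.add(pending.pop(future))

    run_parallel(tasks)

With a single worker the loop degenerates into ordinary serial, make-style execution; with several, independent goals such as lex.o and parse.o proceed at the same time.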

This declarative (as opposed to "procedural") style of programming is represented by many languages -- some of which have been around for decades.  Prolog is an early example, XSLT a more recent contribution.  What these languages have in common is that -- at least at a high level -- they specify the "what", not the "how".  This approach allows the implementation considerable latitude in scheduling tasks so long as the explicit sequence constraints are met.  In fact, the algorithm for executing on a single processor is just the degenerate case of the algorithm used for parallel execution.
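
For a toy contrast (Python again, and admittedly a loose analogy rather than Prolog or XSLT), compare a step-by-step loop with an expression that only states the result:

    data = [3, 1, 4, 1, 5, 9, 2, 6]

    # The "how": a prescribed sequence of steps over mutable state.
    total = 0
    for x in data:
        if x % 2 == 0:
            total += x * x

    # The "what": the result is defined in terms of the input.  Nothing in
    # the specification itself requires the independent terms to be computed
    # one after another; a parallel implementation could divide them up.
    total = sum(x * x for x in data if x % 2 == 0)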

If declarative languages are more suitable for parallel applications and implementations exist today, why haven't they enjoyed more widespread use?  One reason is historical: When declarative languages were introduced, available computers were almost exclusively single-processor.  Quite a bit of overhead is involved in scheduling the tasks to be run, and all of this had to be simulated in software.  Once scheduled, the tasks were run on a single processor, so no speedup was possible.  It was impossible for an algorithm specified in a declarative language to run faster than the same algorithm specified in a procedural language.  Hence, declarative languages were labelled as "slow" and dismissed.

With an abundance of processors sitting idle and hardware-assisted task scheduling, the landscape has changed.  Perhaps it is time to give declarative programming another chance.  In future posts, I'll provide a more thorough review of existing programming languages, with an eye to their applicability in a parallel execution environment.  Then, if need be, I'll describe the characteristics of a new language that is well suited to bridging the gap between serial and parallel programming.

[1] Myths About Scalable Parallel Programming Languages Part 6: Performance of Higher-Level Languages