Parallelism /= Concurrency

October 6, 2009

If you want to make programs go faster on parallel hardware, that you need some kind of concurrency.  Right?

In this article I’d like to explain why the above statement is false, and why we should be very clear about the distinction between concurrency and parallelism.  I should stress that these ideas are not mine, and are by no means new, but I think it’s important that this issue is well understood if we’re to find a way to enable everyday programmers to use multicore CPUs.  I was moved to write about this after reading Tim Bray’s articles on Concur.next: while I agree with a lot of what’s said there, particularly statements like

Exposing real pre-emptive threading with shared mutable data structures to application programmers is wrong

it seems that parallelism and concurrency are still being conflated. Yes we need concurrency in our languages, but if all we want to do is make programs run faster on a multicore, concurrency should be a last resort.

First, I’ll try to establish the terminology.

A concurrent program is one with multiple threads of control.  Each thread of control has effects on the world, and those threads are interleaved in some arbitrary way by the scheduler.  We say that a concurrent programming language is non-deterministic, because the total effect of the program may depend on the particular interleaving at runtime.  The programmer has the tricky task of controlling this non-determinism using synchronisation, to make sure that the program ends up doing what it was supposed to do regardless of the scheduling order.  And that’s no mean feat, because there’s no reasonable way to test that you have covered all the cases.  This is regardless of what synchronisation technology you’re using: yes, STM is better than locks, and message passing has its advantages, but all of these are just ways to communicate between threads in a non-deterministic language.

A parallel program, on the other hand, is one that merely runs on multiple processors, with the goal of hopefully running faster than it would on a single CPU.

So where did this dangerous assumption that Parallelism == Concurrency come from?  It’s a natural consequence of languages with side-effects: when your language has side-effects everywhere, then any time you try to do more than one thing at a time you essentially have non-determinism caused by the interleaving of the effects from each operation.  So in side-effecty languages, the only way to get parallelism is concurrency; it’s therefore not surprising that we often see the two conflated.

However, in a side-effect-free language, you are free to run different parts of the program at the same time without observing any difference in the result.  This is one reason that our salvation lies in programming languages with controlled side-effects.  The way forward for those side-effecty languages is to start being more explicit about the effects, so that the effect-free parts can be identified and exploited.

It pains me to see Haskell’s concurrency compared against the concurrency support in other languages, when the goal is simply to make use of multicore CPUs (Edit: Ted followed up with a clarification).   It’s missing the point: yes of course Haskell has the best concurrency support :-) , but for this problem domain it has something even better: deterministic parallelism.  In Haskell you can use multicore CPUs without getting your hands dirty with concurrency and non-determinism, without having to get the synchronisation right, and with a guarantee that the parallel program gives the same answer every time, just more quickly.

There are two facets to Haskell’s determinstic parallelism support:

  • par/pseq and Strategies.  These give you a way to add parallelism to an existing program, usually without requiring much restructuring.  For instance, there’s a parallel version of ‘map’.    Support for this kind of parallelism is maturing with the soon to be released GHC 6.12.1, where we made some significant performance improvements over previous versions.
  • Nested Data Parallelism.  This is for taking advantage of parallelism in algorithms that are best expressed by composing operations on (possibly nested) arrays.  The compiler takes care of flattening the array structure, fusing array operations, and dividing the work amongst the available CPUs.  Data-Parallel Haskell will let us take advantage of GPUs and many-core machines for large-scale data-parallelism in the future.  Right now, DPH support in GHC is experimental, but work on it continues.

That’s not to say that concurrency doesn’t have its place.  So when should you use concurrency?  Concurrency is most useful as a method for structuring a program that needs to communicate with multiple external clients simultaneously, or respond to multiple asynchronous inputs.  It’s perfect for a GUI that needs to respond to user input while talking to a database and updating the display at the same time, for a network application that talks to multiple clients simultaneously, or a program that communicates with multiple hardware devices, for example.  Concurrency lets you structure the program as if each individual communication is a sequential task, or a thread, and in these kinds of settings it’s often the ideal abstraction.  STM is vitally important for making this kind of programming more tractable.

As luck would have it, we can run concurrent programs in parallel without changing their semantics.  However, concurrent programs are often not compute-bound, so there’s not a great deal to be gained by actually running them in parallel, except perhaps for lower latency.

Having said all this, there is some overlap between concurrency and parallelism.  Some algorithms use multiple threads for parallelism deliberately; for example, search-type problems in which multiple threads search branches of a problem space, where knowledge gained in one branch may be exploited in other concurrent searches.  SAT-solvers and game-playing algorithms are good examples.  An open problem is how to incorporate this kind of non-deterministic parallelism in a safe way: in Haskell these algorithms would end up in the IO monad, despite the fact that the result could be deterministic.  Still, I believe these kinds of problems are in the minority, and we can get a long way with purely deterministic parallelism.

You’ll be glad to know that with GHC you can freely mix parallelism and concurrency on multicore CPUs to your heart’s content.  Knock yourself out :-)


Heads up: what you need to know about Unicode I/O in GHC 6.12.1

September 30, 2009

The GHC 6.12.1 release candidate will be out shortly, and it includes a newly rewritten I/O library including Unicode support.  Here’s what you need to know to make sure your applications/libraries continue to work with GHC 6.12.1.

We expect the release candidate phase to last a couple of weeks or so, depending on how many problems arise, after which 6.12.1 will be released.  However, 6.12 is not currently scheduled to become part of the Haskell Platform until the next platform release, due around February 2010, so package authors have a grace period for testing before 6.12.1 becomes more widely used.

The new System.IO docs can be found here, in particular the unicode-related functionality is  here.

Console and text I/O

If you are reading or writing to/from the console, or  reading/writing text files in the local encoding, then use the System.IO functions for doing text I/O (openFile, readFile, hGetContents, putStr, etc.), and you will automatically benefit from  the new Unicode support.  Text written will be encoded according to the current locale, or code page on Windows, and text read will be decoded accordingly.

If you need to use a particular encoding (e.g. UTF-8), then the  hSetEncoding function lets you set the encoding on a Handle, e.g.

  hSetEncoding stdout utf8

Binary I/O

If you’re reading or writing binary data, or for some other reason you want to bypass the Unicode encoding/decoding that the IO library now does, you have two options:

  • Use openBinaryFile or hSetBinaryMode to put the Handle into binary  mode.  No encoding/decoding or newline translation will be done.
  • Use hGetBuf/hPutBuf, or the I/O operations provided by Data.ByteString, which all operate with binary data.

Using utf8-string

If you’re using utf8-string in certain ways then you might get incorrect results.

  • The operations in System.IO.UTF8 add a UTF8 wrapper around the  corresponding System.IO operation.  Unless the underlying Handle is in binary  mode, these operations will result in garbage being read or  written.  For example, if you want to use System.IO.UTF8.print,  then call hSetBinaryMode stdout True first.  Better still, just use System.IO.print directly.  f you need to fix the encoding to UTF-8 rather than using the locale encoding, then call hSetEncoding handle utf8.
  • The rest of the operations in utf8-string will continue to work as before.

Newline handling

There is a new API for newline translation in System.IO.  By default, Handles in text mode translate newlines to or from the native representation for the current platform, that is “\r\n” on Windows and “\n” on other platforms.  You can change this default using hSetNewlineMode, for example to be able to read a file with either Windows or Unix line-ending conventions:

 hSetNewlineMode handle universalNewlineMode

where universalNewlineMode translates from “\r\n” to “\n” on input, leaving “\n” alone, and translates “\n” to the native newline representation on output.


GHC Status Update (from the Haskell Implementors Workshop)

September 15, 2009

The video of Simon Peyton Jones’ GHC Status Update presentation at the Haskell Implementors Workshop is now online.  Lots of details about the goodies that will shortly be arriving in GHC 6.12.1.


Visualising the Haskell package dependency graph

July 6, 2009
Package dependency tree

Package dependency graph

This is a graph showing the dependencies between the packages that come with GHC.  I just added some (trivial) support to the ghc-pkg tool to generate the output in dot format, and generated the above graph with

ghc-pkg dot | tred | dot -Tsvg >pkgs.svg

Note the “tred” filter, which eliminates clutter from transitive edges. The ‘ghc-pkg dot’ command should be in GHC 6.12.1.


New paper: Parallel Performance Tuning for Haskell

June 22, 2009

Here’s our Haskell Symposium paper about parallel profiling with GHC and ThreadScope:

Parallel Performance Tuning for Haskell (Don Jones Jr., Simon Marlow, Satnam Singh) Haskell ‘09: Proceedings of the second ACM SIGPLAN symposium on Haskell, Edinburgh, Scotland, ACM, 2009

Abstract:

Parallel Haskell programming has entered the mainstream with support now included in GHC for multiple parallel programming models, along with multicore execution support in the runtime. However, tuning programs for parallelism is still something of a black art. Without much in the way of feedback provided by the runtime system, it is a matter of trial and error combined with experience to achieve good parallel speedups.

This paper describes an early prototype of a parallel profiling system for multicore programming with GHC. The system comprises three parts: fast event tracing in the runtime, a Haskell library for reading the resulting trace files, and a number of tools built on this library for presenting the information to the programmer. We focus on one tool in particular, a graphical timeline browser called ThreadScope.

The paper illustrates the use of ThreadScope through a number of case studies, and describes some useful methodologies for parallelizing Haskell programs.


The new GHC build system is here!

April 28, 2009

The new GHC build system has been now been merged in.  GHC developers can look forward to increases in productivity and faster build times thanks to the new non-recursive make design.

Here are some quick stats:

Lines of build-system code (including Makefile and Haskell code):

  • old build system: 7793
  • new build system: 5766 (about 2000 fewer lines, or a 26% reduction)

Furthermore, this doesn’t count the code for ‘cabal make’, which is still in Cabal but is no longer used by GHC.

Time to validate with -j2 (the default; test suite is still single-threaded):

  • old:  28 mins
  • new: 28 mins

Single and dual-core builds don’t see much difference.  However, adding more cores starts to demonstrate the improved parallelism: validate with -j4 (still single-threaded test suite):

  • old: 25.3 mins
  • new: 24.0 mins

Parallelism in the new build system is a lot better. It can build libraries in parallel with each other, profiled libraries in parallel with non-profiled libraries, and even libraries in parallel with stage 2. There’s very little explicit ordering in the new build system, we only tell make about dependencies.

Time to do ‘make’ when the tree is fully up-to-date:

  • old: 2m 41s
  • new: 4.1s

Time to do ‘make distclen’:

  • old: 5.7s
  • new: 1.0s

We also have all-new build-system documentation.

The biggest change you’ll notice is that the build system now expresses all the dependencies, so whatever you change, you should be able to say ‘make’ to bring everything up to date. Sometimes this can result in more rebuilding than you were expecting, or more than is strictly necessary, but it should save time in the long run as we run
into fewer problems caused by things being inconsistent or out-of-date in the build.

We stretched GNU make to its limits. On the whole it performed pretty well: even for a build of this size, the time and memory consumed by make itself is negligible. The most annoying problem we encountered was the need to split the build into phases to work around GNU make’s lack of support for dependencies between included makefiles.

On the whole I’m now convinced that non-recursive make is not only useful but practical, provided you stick to some clear idioms in your build-system design.


New paper: Runtime Support for Multicore Haskell

March 3, 2009

Here’s a paper on the internals of GHC’s parallelism support, showing some nice improvements in parallel performance over GHC 6.10.1: “Runtime Support for Multicore Haskell“.   Abstract:

Purely functional programs should run well on parallel hardware because of the absence of side effects, but it has proved hard to realise this potential in practice. Plenty of papers describe promising ideas, but vastly fewer describe real implementations with good wall-clock performance. We describe just such an implementation, and quantitatively explore some of the complex design tradeoffs that make such implementations hard to build. Our measurements are necessarily detailed and specific, but
they are reproducible, and we believe that they offer some general insights.


Ever wondered how big a closure is?

February 12, 2009

Somebody asked me yesterday how to find out how big the runtime representation of a type is. I hacked this up using the internal unpackClosure# primitive:

{-# LANGUAGE MagicHash,UnboxedTuples #-}
module Size where

import GHC.Exts
import Foreign

unsafeSizeof :: a -> Int
unsafeSizeof a =
  case unpackClosure# a of
    (# x, ptrs, nptrs #) ->
      sizeOf (undefined::Int) + -- one word for the header
        I# (sizeofByteArray# (unsafeCoerce# ptrs)
             +# sizeofByteArray# nptrs)

Try it in GHCi:

Prelude> :!ghc -c Size.hs
Prelude> :l Size
Ok, modules loaded: Size.
Prelude Size> unsafeSizeof 3.3
16
Prelude Size> unsafeSizeof "a"
24
Prelude Size> unsafeSizeof (1,2,3,4)
40
Prelude Size> unsafeSizeof True
8
Prelude Size> unsafeSizeof $! Data.Complex.(:+) 2 3
24
Prelude Size> unsafeSizeof $! 3
16
Prelude Size> unsafeSizeof $! (3::Integer)
16
Prelude Size> unsafeSizeof $! (3::Int)
16
Prelude Size> unsafeSizeof $! (2^64::Integer)
24

I’m on a 64-bit machine, obviously. It doesn’t always do the right thing, but for ordinary algebraic types it should work most of the time. Remember to use $!, as the size returned for an unevaluated thunk is always just one word (unpackClosure# doesn’t work for thunks).


Benchmarking recent improvements in parallelism

January 9, 2009

Over the last few months we’ve been making various improvements to the performance of parallel programs with GHC. I thought I’d post a few benchmarks so you can see where we’ve got to. This is a fairly random collection of 6 parallel benchmarks (all using par/seq-style parallelism rather than explicit threading with forkIO). The main point here is that I just took the programs unchanged – I haven’t made any attempt to modify the programs themselves to  make them parallelize better (although other people might have done so in the past), the focus here has been on changing the GHC runtime to optimize these existing programs. The programs come mostly from old benchmarks for the GUM implementation of Parallel Haskell, and the sources can be found here.

  • matmult: matrix multiply
  • parfib: our old friend fibonacci, in parallel
  • partree: some operations on a tree in parallel
  • prsa: decode an RSA-encoded message in parallel
  • ray: a ray-tracer
  • sumeuler:  sum . map euler

Here are the results.  The first column of numbers is the time taken for GHC 6.10.1 to run the programs on one CPU, and the following three columns are the difference in elapsed time when the programs are run on 4 CPUs (actually 4 cores of my 8-core x86_64 box) with respectively GHC 6.8.3, 6.10.1, and my current working version (HEAD + a couple of patches).

------------------------------------------------------------
  Program   6.10.1   6.8.3 -N4  6.10.1 -N4  ghc-simonmar -N4
------------------------------------------------------------
  matmult     8.55   -60.0%     -63.7%       -72.0%
   parfib     9.65   -72.6%     -70.2%       -76.3%
  partree     8.03   +26.4%     +52.7%       -40.7%
     prsa     9.52   +13.8%     -44.1%       -68.2%
      ray     7.04   +16.5%     +11.8%       +28.0%
 sumeuler     9.64   -71.2%     -73.1%       -74.0%
  -1 s.d.    -----   -68.8%     -71.6%       -78.0%
  +1 s.d.    -----   +20.1%      +6.4%       -27.1%
  Average    -----   -38.8%     -45.0%       -59.9%

The target is -75%: that’s a speedup of 4 on 4 cores. As you can see, 6.10.1 is already doing better than 6.8.3, but the current version has made some dramatic improvements and is getting close to the ideal speedup on several of the programs.  Something odd is going on with ray, I don’t know what yet!

Here’s a summary of the improvements we made:

  • Lock-free work-stealing queues for load-balancing of sparks (par). This work was originally done by Jost Berthold during his internship at MSR in the summer of 2008, and after further improvements was merged into the GHC mainline after the 6.10.1 release.
  • Improvements to parallel GC: we now use the same threads for GC as for executing the program, and have made improvements to the barrier (stopping threads to do GC), and improvements to affinity (making sure each GC thread traverses data local to that CPU).  Some of this has yet to hit the mainline, but it will shortly.
  • Eager blackholing: this reduces the chance that multiple threads repeat the same work in a parallel program.  It’s  a compile-time option (-feager-blackholing in the HEAD) and it costs a little execution time to turn it on, but it can improve parallelism quite a lot.
  • Running sparks in batches. Previously, each time we run a spark we created a new thread for it. Threads are lightweight, but the cost can still be high relative to the size of the spark. So now we have Haskell threads that repeatedly run sparks (stealing from other CPUs if necessary) until there are no more sparks to run, eliminating the context-switch and thread-creation overhead for sparks. This means we can push the granularity quite a lot: parfib speeds up even with a very low threshold now.

We’re on the lookout for more parallel benchmarks: each new program we find tends to stress the runtime in a different way, so the more code we have, the better.  Even if (or especially if) your program doesn’t go faster on a multicore – send it to us and we’ll look into it.


Explicit Stack Traces

December 4, 2008
ghc.exe: panic! (the 'impossible' happened)
  (GHC version 6.11.20081202 for i386-unknown-mingw32):
        idInfo

Argh!  When working on GHC, or in any Haskell in general, it is occasionally useful to know something about what was going on when a program crashed.  Currently we have several ways of doing this; for example the ghci-debugger or using the cost-center-stacks that are part of the profiling tools in GHC.  For my internship here at MSR, I’ve been looking at putting another tool in the GHC-users toolbelt, that of selective explicit-call-stack (ecs) traces (in a theme of things described on this wiki page); and today I’ve had a breakthrough in terms of using the tool I’ve been working on to actually help debug a real issue I was having in GHC! So, our panic above was coming from this bit of source:

varIdInfo :: Var -> IdInfo
varIdInfo (GlobalId {idInfo_ = info}) = info
varIdInfo (LocalId {idInfo_ = info}) = info
varIdInfo other_var = pprPanic "idInfo" (ppr other_var)

What the ecs-extension I’m working on does is allow you to add an annotation (an awesome new feature from that’s been recently folded into GHC Head) onto this function to say that I want to debug this function.

import GHC.ExplicitCallStack.Annotations
{-# ANN varIdInfo Debug #-}

And because there’s a few things that need neatening up in the implementation, the pprPanic needs rewriting slightly:

varIdInfo :: Var -> IdInfo
varIdInfo (GlobalId {idInfo_ = info}) = info
varIdInfo (LocalId {idInfo_ = info}) = info
varIdInfo other_var = throwStack (\s -> pprPanic ("idInfo\n" ++ show s) (ppr other_var) :: SomeException)

Recompiling this source (for the curious, recompiling meant building a stage3 ghc), passing in a -fexplicit-call-stack argument to GHC will essentially create a version of this function that tells all callers of this function to pass in their source locations, and will make the pprPanic print out this location.

What this really means is that the original panic becomes slightly more helpful:

ghc.exe: panic! (the 'impossible' happened)
  (GHC version 6.11.20081202 for i386-unknown-mingw32):
        idInfo
in varIdInfo, basicTypes/Var.lhs:238,30

I’ve been told that just having one level of stack information is enough to help debug many problems; in this case it wasn’t quite, but we can just go an add a Debug annotation to varIdInfo to get more stack information.  If a function that has a Debug annotation calls a function with a Debug annotation, the stack is grown an extra level.  So, lather-rinse-repeat enough times, and it’s possible to get quite a lot of information.

ghc.exe: panic! (the 'impossible' happened)
  (GHC version 6.11.20081202 for i386-unknown-mingw32):
        idInfo
in varIdInfo, basicTypes/Var.lhs:238,30
in idInfo, basicTypes/Id.lhs:168,10
in idInlinePragma, basicTypes/Id.lhs:633,37
in preInlineUnconditionally, simplCore/SimplUtils.lhs:619,12
in simplNonRecE, simplCore/Simplify.lhs:964,5
in simplLam, simplCore/Simplify.lhs:925,13
in simplExprF', simplCore/Simplify.lhs:754,5
in simplExprF, simplCore/Simplify.lhs:741,5
in completeCall, simplCore/Simplify.lhs:1120,24
in simplVar, simplCore/Simplify.lhs:1032,29
in simplExprF', simplCore/Simplify.lhs:746,39
...
in simplExprF', simplCore/Simplify.lhs:750,39
...
in simplLazyBind, simplCore/Simplify.lhs:339,33
in simplRecOrTopPair, simplCore/Simplify.lhs:295,5
in simplTopBinds, simplCore/Simplify.lhs:237,35
in simplifyPgmIO, simplCore/SimplCore.lhs:629,5
in simplifyPgm, simplCore/SimplCore.lhs:562,22
in doCorePass, simplCore/SimplCore.lhs:156,40

But what, I hear you cry, are those “…”’s?  Well since Haskell uses recursion to implement iteration, it would be a very bad thing if we tried to keep a stack that included all sites recursively entered.  Instead, we have a rule that any call site may appear in the stack only once, and if it would appear multiple times, only the newest one is kept, with any others being replaced by “…”s.  Here we can see that simplExprF’ recursively called itself a few times.

Anyways, there are several issues to solve before this is completely usable for prime time, so I can’t promise when it’ll be available; but just thought I’d share what I’m working on;

Tristan (ToRA|MSR)