## Wednesday, April 18, 2012

In my previous post I described my work to support for SIMD instructions in GHC and to exploit this support in the high-level vector library. Despite some issues with the code generated by LLVM (still unresolved), SIMD instructions did lead to a significant performance increase on some small floating-point intensive microbenchmarks.

This performance gain comes at a syntactic cost: code must be re-written to use new SIMD operations. For consumers of the vector library, this is a small cost because high-level SIMD operators are available, but it would be nice not to have to pay any syntactic cost at all. There is an additional "deficiency" in the vector approach to computing with arrays—a lack of parallelism. It's nice to be able to leverage SIMD operations from Haskell code, but it would be even better if we could leverage them in parallel code! There are two primary frameworks for computing with parallel arrays in Haskell: Repa and Data Parallel Haskell (DPH). DPH includes a vectorizer that converts scalar code to vector code, so it seems like there's a good chance we could support SIMD operations in DPH without requiring users to rewrite code. The rest of this post describes (positive) preliminary results in this direction.

My running example will be computing the dot product of two vectors, this time using DPH instead of the vector library. Here is the code:

import Data.Array.Parallel
import qualified Data.Array.Parallel.Prelude.Double as D

dotp :: PArray Double -> PArray Double -> Double
{-# NOINLINE dotp #-}
dotp v w = dotp' (fromPArrayP v) (fromPArrayP w)

dotp' :: [:Double:] -> [:Double:] -> Double
dotp' v w = D.sumP (zipWithP (D.*) v w)


DPH introduces a new parallel array data type (PArray), some new syntax for parallel arrays ([:...:]), and some new combinators for operating on parallel arrays (zipWithP). DPH doesn't quite yet support type classes, so we have to import a special version of the Prelude specialized to the Double type. You can read more about the details on the DPH wiki page.

Using the simd branches of ghc, the vector library, and the dph library, we can change a single import statement and our code will use SIMD instructions to compute the dot product:

1:  import Data.Array.Parallel
2:  import qualified Data.Array.Parallel.Prelude.MultiDouble as D
3:
4:  dotp :: PArray Double -> PArray Double -> Double
5:  {-# NOINLINE dotp #-}
6:  dotp v w = dotp' (fromPArrayP v) (fromPArrayP w)
7:
8:  dotp' :: [:Double:] -> [:Double:] -> Double
9:  dotp' v w = D.sumP (zipWithP (D.*) v w)


Note than only the import statement on line 2 changed. If I had simply rolled SIMD support directly into Data.Array.Parallel.Prelude.Double, exploitation of SIMD instructions would be automatic, but for now I've put my changes into a separate module (this also makes comparisons easier).

Here are benchmarks taken on a 2.70GHz Intel® Core™ i7-2620M CPU with frequency scaling disabled. I'm using a perf-llvm build of the simd branch of GHC (compiled with GHC 7.4.1—earlier versions have trouble using LLVM to compile GHC). The benchmark program was compiled with the options -rtsopts -threaded -Odph -O2 -fllvm -optlo-O3 -optc-O3 -optc-march=corei7. I did not use -fcpr-off or -fno-liberate-case as suggested by comments in the README in the dph library because these options prevented fusion of SIMD operations. It looks like we really need Constructed product result analysis for efficient code, which is what I would expect. Errors bars mark one standard deviation, and times are averaged over 100 runs.

The C and vector versions both take advantage of SIMD instructions—in the former case, via intrinsics. I've included three DPH versions: the two versions I already gave code for, and a version that uses DPH's parallel array syntax. The code for this version is:

dotp :: PArray Double -> PArray Double -> Double
{-# NOINLINE dotp #-}
dotp v w = dotp' (fromPArrayP v) (fromPArrayP w)

dotp' :: [:Double:] -> [:Double:] -> Double
dotp' xs ys = D.sumP [:x D.* y | x <- xs | y <- ys:]


Strangely, using syntactic sugar results in a significant performance hit. This is a pre-release version of DPH though, and I am told that this performance gap should be eliminated in time for the next release.

Using SIMD instructions in DPH gives us a big gain over the baseline DPH code, but it looks like we're limited by memory bandwidth, so we're not taking advantage of our extra cores. Fortunately, I have access to a 32-core 2.0GHz AMD Opteron™ 6128. Here is the same benchmark run there:

Now we're cooking—much better scaling behavior! With DPH we can take advantage of multiple cores and we don't have to re-write our code to make use of SIMD instructions. Before GHC 7.6.1, I'd like to integrate all of this work back into the mainline branches of GHC, vector, and dph. Once that is done, all you will have to do to take advantage of SIMD versions of floating point operations in your DPH code is move to the new release.