
Annotating strictness and unpacking datatype fields
Recall that in the previous chapter, we used seq to force strict evaluation. With the BangPatterns extension, we can force function arguments. Strict arguments are evaluated to WHNF just before entering the function body:
{-# LANGUAGE BangPatterns #-}

f !s (x:xs) = f (s + 1) xs
f !s _      = s
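For example, with f in scope (say, loaded from a file), the strict accumulator counts the elements of the list:

> f 0 [1..10]
10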
Using bangs to annotate strictness in fact predates the BangPatterns extension (and the older compiler flag -fbang-patterns in GHC 6.x). In plain Haskell98, we are allowed to use bangs to make datatype fields strict:
> data T = T !Int
A bang in front of a field ensures that whenever the outer constructor (T above) is in WHNF, the inner field is in WHNF as well. We can check this:
> T undefined `seq` ()
*** Exception: Prelude.undefined
There are no restrictions on which fields can be strict, be they recursive or polymorphic, although it rarely makes sense to make recursive fields strict. Consider the fully strict linked list:
data List a = List !a !(List a) | ListEnd
With this much strictness, you cannot represent an infinite list in finite space. Moreover, before accessing the head of a finite strict list, you must evaluate the list all the way to the last element. Strict lists don't have the streaming property of lazy lists.
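We can check this spine strictness in GHCi just as before, assuming the List type above is in scope:

> List 1 (List undefined ListEnd) `seq` ()
*** Exception: Prelude.undefined

Forcing the outer constructor to WHNF forces both strict fields, which in turn forces the undefined element.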
By default, all data constructor fields are pointers to other data constructors or primitives, regardless of their strictness. This applies to the basic data types Int, Double, Char, and so on, which are not primitive in Haskell. They are data constructors over their primitive counterparts Int#, Double#, and Char#:

> :info Int
data Int = GHC.Types.I# GHC.Prim.Int#
There is a performance overhead, the cost of a pointer dereference, between a type such as Int and its primitive counterpart Int#; but an Int can represent lazy values, called thunks, whereas primitives cannot. Without thunks, we couldn't have lazy evaluation. Luckily, GHC is intelligent enough to unwrap wrapper types into their primitives in many situations, completely eliminating indirect references.
The hash suffix is specific to GHC and always denotes a primitive type. The GHC modules do expose the primitive interface. Programming with primitives, you can further micro-optimize code and get C-like performance. However, several limitations and drawbacks apply, which we shall consider in Chapter 4, The Devil's in the Detail.
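To give a taste of the primitive interface, here is a minimal sketch (GHC.Exts re-exports the primitives; plusOne is an illustrative name):

{-# LANGUAGE MagicHash #-}

import GHC.Exts

-- Unwrap the I# constructor to reach the raw Int#, apply the
-- primitive addition (+#), and box the result up again.
plusOne :: Int -> Int
plusOne (I# n) = I# (n +# 1#)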
Unbox with UNPACK
The most powerful trick available to make efficient datatypes in Haskell is to unpack their fields, also known as unboxing. The terms are almost synonymous: unboxing in general means peeling off layers of indirection, while unpacking refers to GHC's way of unboxing constructor fields. An unpacked field is no longer a pointer to a data constructor. Instead, the value is stored in memory next to the constructor, where the pointer to the value (or a thunk) is normally stored.
Use the {-# UNPACK #-} pragma before a field to unpack it. An unpacked field must also be strict, that is, prefixed with a bang; otherwise it could be a pointer to a thunk, and there would be no way to know whether the value had been evaluated.

The preceding diagram illustrates how a value such as T (1 + 2) would be represented in memory given different definitions for T. Without a strictness annotation, the field points to a thunk. When the field is strict, its contents will be evaluated, but the field is still a pointer to the value. In the unpacked version, the contents of the field are stored in place of the pointer.
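For reference, these are the three alternative definitions the diagram presumably compares:

data T = T Int                  -- lazy field: pointer, possibly to a thunk
data T = T !Int                 -- strict field: pointer to an evaluated value
data T = T {-# UNPACK #-} !Int  -- unpacked field: value stored inline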
Note that if compiled at optimization level -O or above, the image is no longer accurate: an optimization pass in GHC automatically unpacks strict fields when possible, so the strict version produces exactly the same code as the version with explicit unpacking. Even so, in performance-critical datatypes, and especially in library code, it is good practice to use explicit unpacking, because automatic unboxing isn't always a good idea.
There are some restrictions on which kinds of values can be declared unpacked. Most importantly, sum types, parametrically polymorphic values, and functions are ruled out. So these are all invalid data declarations:
data S = S {-# UNPACK #-} !(Maybe Int)  -- invalid!
data F = F {-# UNPACK #-} !(Int -> Int) -- invalid!
data P a = P {-# UNPACK #-} !a          -- invalid!
On the other hand, these are valid:
data T = T {-# UNPACK #-} !(Int, Int)

data R a = R
  { field_a :: a
  , field_t :: {-# UNPACK #-} !T
  }

data GADT a where
  Empty :: GADT ()
  Some  :: a -> {-# UNPACK #-} !Int -> GADT a
That last type requires enabling the GADTs extension, for generalized algebraic datatypes.
Then how about this one?
data W = W {-# UNPACK #-} !Int {-# UNPACK #-} !W
It compiles just fine: W is neither a sum type nor polymorphic, so it will be unpacked. But you cannot actually do anything with W; it's impossible to construct values of type W. W 1 undefined produces an error, while let w = W 1 w produces a loop! So, as a corollary of the other requirements of unpacking, inductive fields cannot be unpacked.
Now let's find out the effect of unpacking on performance in a tight loop with the following little program:
-- file: strict_and_unpacked.hs
{-# LANGUAGE BangPatterns #-}

data PairP = PairP Int Int deriving (Show)

data PairS = PairS !Int !Int deriving (Show)

data PairU = PairU {-# UNPACK #-} !Int
                   {-# UNPACK #-} !Int
             deriving (Show)

iter :: Int -> (a -> a) -> a -> a
iter end f x = go 0 x
  where
    go !n x | n < end   = go (n + 1) $! f x
            | otherwise = x
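The listing omits the driver. A minimal sketch of one (the arithmetic is a placeholder; swap PairU for PairS or PairP to compare the variants) could be:

main :: IO ()
main = print $ iter 1000 (\(PairU a b) -> PairU (a + b) (a - b)) (PairU 0 1)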
With 1000 iterations of a function that does simple arithmetic on the fields, we obtain the following heap usage for the different data types:

The difference is very big indeed! But do note that unboxing doesn't always increase performance. For example, consider a record with a lot of fields. If those fields contain large chunks of unboxed data, then making a copy of the record means duplicating all of that unboxed data too. If the fields were lazy, that is, represented by pointers, we would only need to copy the pointers.
Using anonymous tuples
Tuples may seem harmless at first; they just lump a bunch of values together. But note that the fields in a tuple aren't strict, so a two-tuple corresponds to the slowest PairP data type from our previous benchmark.

If you need a strict tuple type, you need to define one yourself. This is one more reason to prefer custom types over nameless tuples in many situations. These two structurally similar tuple types have widely different performance semantics:
data Tuple  = Tuple  {-# UNPACK #-} !Int {-# UNPACK #-} !Int
data Tuple2 = Tuple2 {-# UNPACK #-} !(Int, Int)

In Tuple2, only the tuple constructor itself is unpacked; the Int fields inside it remain lazy pointer fields, just like in PairP.
If you really want unboxed anonymous tuples, you can enable the UnboxedTuples extension and write types such as (# Int#, Char# #). But note that a number of restrictions apply to unboxed tuples, as to all primitives. The most important restriction is that unboxed types may not occur where polymorphic types or values are expected, because polymorphic values are always assumed to be pointers.
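To give the flavor, here is a minimal sketch (the function quotRem# and its name are made up for illustration):

{-# LANGUAGE MagicHash, UnboxedTuples #-}

import GHC.Exts

-- Return quotient and remainder at once; the unboxed pair never
-- exists as a heap object, so the result costs no allocation.
quotRem# :: Int# -> Int# -> (# Int#, Int# #)
quotRem# a b = (# a `quotInt#` b, a `remInt#` b #)

main :: IO ()
main = case quotRem# 7# 2# of
         (# q, r #) -> print (I# q, I# r)  -- prints (3,1)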
Performance of GADTs and branching
Generalized algebraic datatypes are great. Existential quantification more or less comes along with GADTs, but it's relatively easy to destroy performance with existential quantification.
Consider the following slightly contrived GADT, capable of representing all integral types and Chars:
-- file: gadts.hs
{-# LANGUAGE GADTs #-}

data Object a where
  Number    :: Integral a => a -> Object a
  Character :: Char -> Object Char
Turns out, this datatype is quite benign. The following two folds, when compiled with at least -O, have exactly the same performance:

foldl (+) 0 [1..1000000 :: Int]

foldl (\a (Number b) -> a + b) 0
      [ Number x | x <- [1..1000000 :: Int] ]
But this is an extremely simplified example, where GHC in fact discards our intermediate Number constructors altogether and just produces a tight loop over integers. And thanks to the extra type information present in the GADT, we can branch on the constructor in the function we fold with:
f :: a -> Object a -> a
f a x = case x of
  Character _ -> a
  Number n    -> a + n
GHC would inline f and specialize it at type Int -> Object Int -> Int, learn that the Character branch is never reached in the specialized version, discard it, and we'd end up with the same tight loop. Which is pretty nice!
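Conceptually, the specialized function reduces to something like the following (an illustrative sketch, not actual GHC Core; f_Int is a made-up name):

-- At type Object Int, the Character constructor is impossible,
-- so only the Number branch remains:
f_Int :: Int -> Object Int -> Int
f_Int a (Number n) = a + n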
But if we add an extra constructor to Object:
Number' :: Integral a => a -> Object a
And add an extra branch to f:

case x of
  …
  Number' n -> a - n
Then GHC will be forced to consider two branches, right? Well, in general, the answer would be yes. But in our simple example, GHC will still happily produce the same tight loop. What is happening is that GHC fuses away the list of Object values from the list comprehension, learns that no values are constructed with the Number' constructor, and infers that the new branch is still redundant.
But if we forced either the folding function or the object list and its elements not to inline (with NOINLINE, or by producing the objects elsewhere), or indeed constructed values with multiple constructors, then GHC would be forced to consider all type-correct branches.
So, in general, GADTs are optimized pretty well. But what about existentials? Consider using this ObjectE instead of Object:
data ObjectE where
  NumberE :: Integral a => a -> ObjectE
Here we're just saying that we don't care which number type a given object is, only that it has an Integral instance. But if we compare the performance of this fold over [ObjectE]:
foldl (\a (NumberE b) -> a + fromIntegral b) 0
      [ NumberE x | x <- [1..1000000 :: Int] ]
To the performance of a similar fold over [Object Int], the numbers are as follows:

[Object Int]: 51 KB allocated, execution time about 5 ms
[ObjectE]: 32,000 KB allocated, execution time about 30 ms
That is, because of an existential, our program got six times slower, and additionally started allocating space linear in the input size. What exactly is going on here?
The problem is that by wrapping our numbers inside an existential, we are deliberately forgetting the type. The type class constraint lets us retain some information, but at the extra cost of a layer of indirection. Existentials force this indirection to persist through all GHC optimizations, and that's why our code was so slow. Furthermore, with that added indirection, GHC can no longer unbox our numbers as efficiently, which explains the extra allocations we observed.
The lessons here are that existentials have an overhead, and that the extra type information available in GADTs helps not only the programmer but also the compiler, by opening up extra possibilities for optimization. GADTs are useful and fast, while existentials are just useful.