
Handling tabular data
If you need O(1) general indexing, a table-like data structure is virtually your only option. The Haskell report specifies the array package, which provides tables indexed by anything with an instance of the Ix typeclass.
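Any type with an Ix instance can serve as the index, for instance Char. A minimal sketch (charCodes is an illustrative name, not from the text):
-- Char has an Ix instance, so it can index an array directly.
import Data.Array

charCodes :: Array Char Int
charCodes = listArray ('a', 'z') (map fromEnum ['a'..'z'])
Here charCodes ! 'c' evaluates to 99.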
Immutable arrays come in two flavors (we'll discuss mutable arrays later):
Data.Array.Array: Immutable arrays of boxed values
Data.Array.Unboxed.UArray: Immutable arrays of unboxed values
A common use case for immutable arrays is memoization. For example, a table of Fibonacci numbers can be constructed as follows:
-- file: fib-array-mem.hs
import Data.Array

fib :: Int -> Array Int Integer
fib n = arr
  where arr = listArray (1,n) $
          1 : 1 : [ arr!(i-2) + arr!(i-1) | i <- [3..n] ]
We can also index by a tuple, which gives the array extra dimensions. The symmetric Pascal matrix will serve as an example:
pascal :: Int -> Array (Int, Int) Integer
pascal n = arr
  where arr = array ((1,1),(n,n)) $
          [ ((i,1), 1) | i <- [1..n] ] ++
          [ ((1,j), 1) | j <- [1..n] ] ++
          [ ((i,j), arr!(i-1,j) + arr!(i,j-1)) | i <- [2..n], j <- [2..n] ]
These self-referential definitions look very nice, and with arbitrary-sized integers the performance is pretty much as good as it could be.
But what if we only needed a table of the first 90 Fibonacci numbers? The 90th Fibonacci number fits into an Int64, so we could switch to that type and use UArray instead of Array. But then we could no longer use the nice self-referential definition, because constructing the unboxed array would block on indexing its own unfinished self. In this case, you should either build the list so that it doesn't reference the array (as sketched below), or convert the boxed array into an unboxed array via an intermediate list:
import Data.Array.Unboxed

toUArray :: (Ix i, IArray UArray e) => Array i e -> UArray i e
toUArray a = listArray (bounds a) (elems a)
This conversion comes at the cost of an intermediate list, though.
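The first alternative, building the list without referencing the array, could look like the following minimal sketch (fibsU is an illustrative name; we assume the 90-element Int64 table from above):
import Data.Array.Unboxed
import Data.Int (Int64)

-- The list is generated independently of the array, so an unboxed
-- array can be constructed without blocking on itself. listArray
-- consumes only as many elements as the bounds require, so the
-- infinite list is fine.
fibsU :: UArray Int Int64
fibsU = listArray (1, 90) (go 1 1)
  where go a b = a : go b (a + b)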
The array package is quite low-level; the speed is there, but at the cost of doing a lot of index fiddling yourself. For multi-dimensional arrays, much of that index fiddling is unfortunately unavoidable. But for single-dimensional arrays, there is a better choice.
Using the vector package
The Data.Vector modules in the vector package provide sweet and speedy high-level Int-indexed arrays, implemented on top of raw array primitives. They too come in boxed and unboxed flavors.
The sweetness of vector is in the API, which is loaded with higher-order functions, convenient helpers, monad-lifted operations, and of course all the common operations for list-like structures.
The speed comes once again from fusion: in terms of raw speed, operations on a vector have an upper bound set by the arrays that vectors use internally. However, a sufficiently large composition of vector operations will almost always outperform a similar naive array-based program.
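To get a feel for what fusion buys, consider a pipeline like the following sketch (sumSquares is an illustrative name). Compiled with -O2, the enumeration, mapping, and summing fuse together:
import qualified Data.Vector.Unboxed as U

-- With -O2 this compiles to a single loop over the range;
-- the intermediate vectors are fused away.
sumSquares :: Int -> Int
sumSquares n = U.sum (U.map (^ 2) (U.enumFromTo 1 n))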
Say we have a sensor from which we have stored numeric observations at different times, and now we want to analyze that data. For performance, we choose an unboxed vector for storing the data in memory. We also import randomIO for testing purposes:
-- file: vector-testing.hs
import qualified Data.Vector.Unboxed as U
import System.Random (randomIO)
A neat thing about unboxed vectors is that they support tuple-valued elements; internally, a vector of pairs is represented as two vectors. The following defines our types:
type Obs = U.Vector (TimeStamp, Double)
type TimeStamp = Int
We can extract the value vector from our observation vector using U.unzip in constant time and with no copying:
-- | O(1)
values :: Obs -> U.Vector Double
values obs = snd (U.unzip obs)
Note that U.map snd would be bad, because mapping constructs a new vector in general.
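For contrast, the mapping version would be the following sketch (valuesBad is an illustrative name):
-- | O(n): constructs a fresh vector instead of reusing the two
-- underlying ones.
valuesBad :: Obs -> U.Vector Double
valuesBad = U.map snd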
Now let's try something more interesting: a windowing function, which gives us the slice between two timestamps. The following implementation of a window function is linear in time and constant in space:
-- | O(n+m), no copying.
window :: TimeStamp -> TimeStamp -> Obs -> Obs
window from until v =
    let (_, start)   = U.span ((< from) . fst) v
        (between, _) = U.span ((<= until) . fst) start
     in between
We could improve on this with a binary search, for example: because the time step between two observations (TimeStamps) can be arbitrary, logarithmic time complexity is the best we could achieve. But for demonstration, plain U.span does fine, and it too does no copying, because the slices reuse the original vector.
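A binary-search variant could look like the following sketch, assuming the observations are sorted by timestamp (lowerBound and windowLog are illustrative names, not part of the vector API):
-- | Index of the first element satisfying the predicate, assuming the
-- predicate is False on a prefix and True on the rest of the vector.
lowerBound :: U.Unbox a => (a -> Bool) -> U.Vector a -> Int
lowerBound p v = go 0 (U.length v)
  where
    go lo hi
      | lo >= hi      = lo
      | p (v U.! mid) = go lo mid
      | otherwise     = go (mid + 1) hi
      where mid = (lo + hi) `div` 2

-- | O(log n) windowing via binary search. Like U.span, U.slice reuses
-- the original vector, so there is still no copying.
windowLog :: TimeStamp -> TimeStamp -> Obs -> Obs
windowLog from until v =
    let start = lowerBound ((>= from) . fst) v
        end   = lowerBound ((> until) . fst) v
     in U.slice start (end - start) v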
Implementing the average of the values is straightforward:
-- | O(n)
average :: Obs -> Double
average obs = U.sum (values obs) / fromIntegral (U.length (values obs))
Let's test out the performance by generating a data set of a million random observations and then calculating averages at different windows:
main = do
    obs <- U.generateM (1024 ^ 2) $ \i -> randomIO >>= \v -> return (i, v)
    print $ average $ window 1 (1024 ^ 2) obs
    print $ average $ window 2 (1023 ^ 2) obs
    print $ average $ window 3 (1022 ^ 2) obs
    print $ average $ window 4 (1021 ^ 2) obs
Compile and run with Runtime System statistics:
$ ghc -rtsopts -O vector-testing.hs && time ./vector-testing +RTS -s
[...]
   2,090,993,872 bytes allocated in the heap
     341,188,752 bytes copied during GC
      59,032,744 bytes maximum residency (7 sample(s))
       2,863,512 bytes maximum slop
             138 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      4003 colls,     0 par    0.372s   0.372s     0.0001s    0.0013s
  Gen  1         7 colls,     0 par    0.177s   0.177s     0.0253s    0.0447s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    1.426s  (  1.428s elapsed)
  GC      time    0.548s  (  0.548s elapsed)
  EXIT    time    0.006s  (  0.006s elapsed)
  Total   time    1.983s  (  1.983s elapsed)

  %GC     time      27.7%  (27.7% elapsed)

  Alloc rate    1,465,833,131 bytes per MUT second

  Productivity  72.3% of total user, 72.3% of total elapsed
Wow, this looks pretty bad: lots of GC, only 70% productivity, and a 60-megabyte memory footprint! The data itself is only 16 megabytes on a 64-bit machine, which implies that a lot of unnecessary things are going on.
Turns out, optimization level -O is insufficient for many important optimizations to kick in. Switching to -O2 gives significant improvements:
$ ghc -rtsopts -O2 vector-testing.hs && ./vector-testing +RTS -s
[..]
   1,862,402,704 bytes allocated in the heap
         818,920 bytes copied during GC
      16,779,344 bytes maximum residency (2 sample(s))
       2,070,704 bytes maximum slop
              19 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      3576 colls,     0 par    0.014s   0.014s     0.0000s    0.0000s
  Gen  1         2 colls,     0 par    0.001s   0.001s     0.0006s    0.0010s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    1.227s  (  1.229s elapsed)
  GC      time    0.016s  (  0.015s elapsed)
  EXIT    time    0.001s  (  0.001s elapsed)
  Total   time    1.247s  (  1.245s elapsed)

  %GC     time       1.3%  (1.2% elapsed)

  Alloc rate    1,517,508,130 bytes per MUT second

  Productivity  98.7% of total user, 98.9% of total elapsed
With -O2, we got down to 1.3% GC time and a 19-megabyte memory footprint. For even better results, we could combine this with optimizations for numerical code.
Note
It's certainly a good idea to always use -O2 for production builds and performance testing, especially for vector code. Compile times will be significantly longer, however, so in development using -O or even no optimization at all is advisable.