Handling sparse data_Haskell High Performance Programming-QQ阅读男生中文历史网

上QQ阅读APP看书，第一时间看更新

Handling sparse data

Vectors and arrays are excellent for dense data, that is, when you're not doing inserts in between elements, and the range of indices is reasonable. But if you need, for example, inserts and indexing in arbitrary indices, a tabular structure won't perform well. In such cases, you need some sort of a map or similar structure.

Haskell doesn't have any sparse structures built-in, nor does the Haskell report define any. This has some nice consequences:

Keeps the core language small
Gives Haskellers complete freedom over the implementation
Allows writing code that doesn't care much about the specific underlying data structure

There are many excellent libraries, implementing a wide range of sparse data structures, in Hackage, not to mention type classes capturing general properties of those structures. Unfortunately, the ecosystem is a bit too scattered, so it is sometimes hard to determine which library would give the best performance, cleanest code, and most flexibility, or whether the package is maintained.

Using the containers package

The go-to package for immutable sparse data structures is containers. It provides reasonably efficient implementations for maps and sets, very basic tree and graph structures, and the sequence mentioned previously. The structures are purely functional, which is a nice property in itself, encouraging us to write code in a functional style:

Ord a => Data.Map.{Lazy,Strict}.Map k v
Ord a => Data.Set.Set a

Data.IntMap.{Lazy,Strict}.IntMap v
Data.IntSet.IntSet a

Data.Tree.Tree a
Data.Graph.Graph = Array Int [Int]
Data.Sequence.Seq a

However, you shouldn't expect these structures, say Map, to perform equally with traditional imperative-style structures (hashmaps), especially when used more or less imperatively. The Map from containers is based on binary trees. IntMap, which constrains its keys to Ints, uses Patricia trees and is considerably more efficient than Map Int.

The imperative map implementation is usually a hash table because of its O(1) lookups compared to the O(log n) of tree-based maps. However, hash tables rely on mutable state, and so are not so convenient in functional settings.

Functional structures have their unique advantages:

Updates reuse parts of the previous structure, so keeping older versions lying around is cheap
Automatic thread-safety

Furthermore, with lazy evaluation, a map lazy in its values allows for an intuitive memoization pattern:

-- file: map-fib.hs

import Data.IntMap as IM

fib :: Int -> IntMap Int
fib n = m where
    m = IM.fromList $ (1,1) : (2,1) :
        [ (i, IM.findWithDefault 0 (i-1) m + IM.findWithDefault 0 (i-2) m)
        | i <- [3..n] ]

Note

Unless you need to have thunks as values, you should use the strict variants of Map and IntMap for efficiency.

Using the unordered-containers package

The containers package requires an Ord instance from the keys in a Map and from values in a Set. For string-valued data this is problematic, because a comparison might be too expensive. And for some data, an Ord instance is an impossibility:

Hashable k => Data.HashMap.Lazy.HashMap k v
Hashable k => Data.HashMap.Strict.HashMap k v
Hashable a => Data.HashSet.HashSet

The unordered-containers package provides pure functional hash maps and sets, requiring a Hashable instance. The same performance considerations apply to HashMap and HashSet as the data structures from containers: they are persistent in nature but won't beat mutable-state variants in terms of raw speed.