
Handling sparse data
Vectors and arrays are excellent for dense data, that is, when you don't insert between existing elements and the range of indices is reasonable. But if you need, for example, inserts and lookups at arbitrary indices, a tabular structure won't perform well. In such cases, you need a map or a similar structure.
Haskell doesn't have any sparse structures built-in, nor does the Haskell report define any. This has some nice consequences:
- Keeps the core language small
- Gives Haskellers complete freedom over the implementation
- Allows writing code that doesn't care much about the specific underlying data structure
There are many excellent libraries on Hackage implementing a wide range of sparse data structures, not to mention type classes capturing general properties of those structures. Unfortunately, the ecosystem is somewhat scattered, so it is sometimes hard to determine which library gives the best performance, the cleanest code, and the most flexibility, or whether a package is still maintained.
Using the containers package
The go-to package for immutable sparse data structures is containers. It provides reasonably efficient implementations of maps and sets, very basic tree and graph structures, and the sequence mentioned previously. The structures are purely functional, which is a nice property in itself and encourages writing code in a functional style:
Ord k => Data.Map.{Lazy,Strict}.Map k v
Ord a => Data.Set.Set a
Data.IntMap.{Lazy,Strict}.IntMap v
Data.IntSet.IntSet
Data.Tree.Tree a
Data.Graph.Graph = Array Int [Int]
Data.Sequence.Seq a
However, you shouldn't expect these structures, say Map, to perform on par with traditional imperative-style structures (hash maps), especially when used more or less imperatively. The Map from containers is based on balanced binary trees. IntMap, which constrains its keys to Ints, uses Patricia trees and is considerably more efficient than Map Int.
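A minimal sketch of the basic Map API from containers may help here; the map contents and names (phones, and so on) are illustrative, not from the text:

```haskell
-- Basic usage of the strict Map from containers.
import qualified Data.Map.Strict as Map

-- A small example map with String keys (requiring Ord).
phones :: Map.Map String Int
phones = Map.fromList [("alice", 123), ("bob", 456)]

main :: IO ()
main = do
  print (Map.lookup "alice" phones)   -- Just 123
  print (Map.lookup "carol" phones)   -- Nothing
  -- insert returns a new map; keys come back in ascending order
  print (Map.toAscList (Map.insert "carol" 789 phones))
```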
The imperative map implementation is usually a hash table because of its O(1) lookups, compared to the O(log n) of tree-based maps. However, hash tables rely on mutable state and so are less convenient in functional settings.
Functional structures have their unique advantages:
- Updates reuse parts of the previous structure, so keeping older versions lying around is cheap
- Automatic thread-safety
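The first advantage can be sketched concretely: updating a persistent map yields a new version while the old one remains valid and shares most of its structure (the values here are illustrative):

```haskell
-- Persistent updates: the old version stays intact and cheap to keep around.
import qualified Data.Map.Strict as Map

v1, v2 :: Map.Map Int String
v1 = Map.fromList [(1, "one"), (2, "two")]
v2 = Map.insert 3 "three" v1  -- shares most of v1's internal structure

main :: IO ()
main = do
  print (Map.size v1)      -- 2: v1 is untouched by the insert
  print (Map.size v2)      -- 3
  print (Map.member 3 v1)  -- False
```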
Furthermore, with lazy evaluation, a map lazy in its values allows for an intuitive memoization pattern:
-- file: map-fib.hs
import Data.IntMap as IM

fib :: Int -> IntMap Int
fib n = m where
  m = IM.fromList $ (1,1) : (2,1) :
        [ (i, IM.findWithDefault 0 (i-1) m
              + IM.findWithDefault 0 (i-2) m)
        | i <- [3..n] ]
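As a quick check of the memoization above, the tenth Fibonacci number can be looked up from the memo map (the definitions are repeated here so the sketch is self-contained):

```haskell
-- Querying the lazy memo map built by fib from map-fib.hs above.
import qualified Data.IntMap as IM

fib :: Int -> IM.IntMap Int
fib n = m where
  m = IM.fromList $ (1,1) : (2,1) :
        [ (i, IM.findWithDefault 0 (i-1) m
              + IM.findWithDefault 0 (i-2) m)
        | i <- [3..n] ]

main :: IO ()
main = print (IM.findWithDefault 0 10 (fib 10))  -- 55
```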
Note
Unless you need to have thunks as values, you should use the strict variants of Map and IntMap for efficiency.
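The difference the note refers to can be sketched as follows: the lazy API stores values as unevaluated thunks, even bottoms, whereas the strict API forces them on insertion (the bottom value here is purely for illustration):

```haskell
-- Lazy Map values are stored as thunks and only forced on demand.
import qualified Data.Map.Lazy as ML

m :: ML.Map Int Int
m = ML.insert 1 (error "never forced") (ML.fromList [(2, 42)])

main :: IO ()
main = do
  print (ML.size m)      -- 2: size never looks at the values
  print (ML.lookup 2 m)  -- Just 42
  -- With Data.Map.Strict, the insert above would raise the error
  -- immediately, because values are forced to WHNF on insertion.
```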
Using the unordered-containers package
The containers package requires an Ord instance for the keys of a Map and for the elements of a Set. For string-valued keys this is problematic, because comparisons might be too expensive. And for some data, an Ord instance is an impossibility:
Hashable k => Data.HashMap.Lazy.HashMap k v
Hashable k => Data.HashMap.Strict.HashMap k v
Hashable a => Data.HashSet.HashSet a
The unordered-containers package provides purely functional hash maps and sets, requiring a Hashable instance instead. The same performance considerations apply to HashMap and HashSet as to the data structures from containers: they are persistent in nature but won't beat mutable-state variants in terms of raw speed.
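A minimal sketch of the HashMap API, assuming the unordered-containers package is installed (it is not part of base); the map contents are illustrative:

```haskell
-- Basic usage of the strict HashMap from unordered-containers.
import qualified Data.HashMap.Strict as HM

-- String keys need only a Hashable instance, not Ord.
hm :: HM.HashMap String Int
hm = HM.fromList [("red", 1), ("green", 2)]

main :: IO ()
main = do
  print (HM.lookup "green" hm)  -- Just 2
  print (HM.member "blue" hm)   -- False
```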