When do we really need efficient immutable data structures?

There are two ways to manipulate immutable data:

  1. Representing data with efficient immutable data structures (like in Clojure).
  2. Representing data with regular immutable data structures and leveraging structural sharing (a.k.a. path copying) to create new versions of data, as illustrated in this article of mine.

The first approach is more efficient than the second, but it requires purpose-built data structures, while the second approach works with any data structure. For instance, Lodash.js and Ramda.js provide efficient functions for manipulating native JavaScript objects immutably.
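For example, in plain JavaScript (no library needed), a shallow copy with the spread operator already gives you structural sharing for the parts you don't touch:

```javascript
const user = { name: "Ada", address: { city: "London", zip: "N1" } };

// New top-level object; the untouched nested `address` is shared, not copied.
const renamed = { ...user, name: "Lovelace" };

renamed.address === user.address; // true — structural sharing
```

The original `user` is left as it was; only the one level that changed was copied.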

For typical nested maps, where we tend to have fewer than 10 fields per level, I am not sure the first approach is significantly more efficient than the second.

So here is my question: in what real life situations do we really need efficient immutable data structures?

1 Like

Your question doesn’t make much sense on the surface to me. If I infer, I think you’re asking: what’s better, using mutable data structures where we choose to copy them and modify them in an immutable fashion, or immutable data structures that handle structural sharing for us?

1 Like

The concerns are certainly a bit different in JavaScript.

On a Clojure forum, your sanity may be brought into question for the mere mention of going to any effort not to use the famous structures with automatic structural sharing. Right?

But the casual reference to “efficiency” summoned to mind the little-known fact that – to satisfy a quest for efficiency for various distinct use cases – Clojure has not only persistent maps that automatically leverage structural sharing, but also (at least!) 2½ kinds of map that make a complete* copy for every modification! Despite the complete* copy, for their intended purposes they outperform PersistentHashMap!

(I rate array-map only ½. It makes a complete copy, but it graduates to a PersistentHashMap after crossing a threshold size.)
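To illustrate the array-map idea in JavaScript terms (the names and details here are my own toy sketch, not Clojure's implementation): entries live in a flat array, lookup is a linear search, and every assoc makes a complete copy.

```javascript
// Toy immutable "array map": flat array of [key, value] pairs.
function arrayMapGet(entries, key) {
  for (const [k, v] of entries) {
    if (k === key) return v; // linear search — cache-friendly when small
  }
  return undefined;
}

function arrayMapAssoc(entries, key, value) {
  const copy = entries.filter(([k]) => k !== key); // complete copy
  copy.push([key, value]);
  return copy; // the original `entries` array is untouched
}

const m1 = arrayMapAssoc([], "a", 1);
const m2 = arrayMapAssoc(m1, "b", 2);
```

For a handful of keys the full copy is cheap and the linear scan stays within a cache line or two, which is exactly why this can outperform a fancier structure at small sizes.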

1 Like

I could reformulate the question in two ways:

  1. What would have been different in Clojure if instead of persistent data collections, we had immutable data collections with functions that implement structural sharing?
  2. In other languages (e.g. Java or JavaScript), when is it better to have persistent data collections instead of immutable data collections with functions that implement structural sharing?

This question doesn’t make sense to me because I do not understand the difference you mean between “persistent data collections” and “immutable data collections with functions that implement structural sharing.” Clojure’s data structures are immutable and provide methods that implement structural sharing. How is that different than what you’re contrasting them to?

2 Likes

Let me explain: it is possible to write a function like assoc that works with a regular hash map by implementing the structural-sharing algorithm illustrated in this article of mine.

In Clojure, by contrast, the underlying data structure is optimized to make the assoc implementation even more efficient. As far as I understand, this optimization only matters when we have many fields at the same level of a nested map (I'm not sure exactly what counts as "many").

You could use a mutable hash map and operate on it in an immutable fashion, but then you have no guarantees that some other code isn't going to mutate it. Data structures like the ones Clojure provides give stronger guarantees that outside code won't mutate the object you're using, because they don't expose public methods to mutate it.

This is much more important in a multi-threaded context where you are sharing data across threads; in Java, for instance, you typically need complex concurrency-safe patterns in order to ensure that you’re not mutating a data structure while some other thread is reading or writing it. Even if you write your program using immutable conventions, passing your data to any 3rd party code could potentially mutate your object, meaning that you need to defensively implement locks or single-thread writers regardless of whether you yourself write your code in an immutable fashion.

This is why it can be more useful to have data structures that encapsulate and enforce the structural sharing algorithm used to create new objects.

Now, there is the question of which algorithm to choose. Clojure uses a HAMT. I don't know how it compares to your algorithm (I haven't read your blog post fully), but my guess is there are reasons you would choose one over the other. Perhaps doing some benchmarks - or finding others who have undoubtedly already done them - would be illuminating?

I was not thinking of mutable hash maps but immutable hash maps + “immutable methods”.
I know that HAMTs are better when we have many items in the map.
My question is not about what algorithm is better.

My question is: when do we really need HAMTs?

To be clear, there are many different ways to implement an immutable hash map; HAMT is one way. Is there a particular kind of immutable hash map that you are asking us to compare?

The original HAMT paper might be illuminating.

Also, as @Phill said in the 2nd post on this topic, Clojure actually uses an immutable array map (i.e. not a HAMT) until the collection grows past 8 items, at which point it switches to a proper HAMT data structure.

After thinking about it a bit more, I understood that I mixed two related but distinct topics:

  1. How to update a field in an immutable map
  2. How to update a nested field in an immutable map

Updating a field in an immutable map could be done either:

  1. Regular map and shallow copying (like the spread operator does in JavaScript)
  2. Leveraging smart data structures like HAMTs

It seems that the benefits of HAMTs over regular maps appear when there are more than 8 fields in the map.

In either case, the best way to update a nested field in an immutable map is via path copying, like the code for assoc-in does:

(defn assoc-in
  [m [k & ks] v]
  (if ks
    (assoc m k (assoc-in (get m k) ks v))
    (assoc m k v)))
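A hypothetical JavaScript counterpart (the function name is invented here) shows the same path copying with plain objects and spread: only the objects along the updated path are copied, and sibling branches are shared.

```javascript
// Path copying à la assoc-in, sketched with object spread.
function assocIn(m, [k, ...ks], v) {
  return ks.length
    ? { ...m, [k]: assocIn(m[k] ?? {}, ks, v) } // copy this level, recurse
    : { ...m, [k]: v };                         // leaf: copy and set
}

const state = {
  user: { prefs: { theme: "dark" } },
  settings: { lang: "en" },
};
const next = assocIn(state, ["user", "prefs", "theme"], "light");

next.settings === state.settings; // true — untouched branch is shared
```

Three small objects are copied (the root, `user`, and `prefs`); everything else is shared with the old version.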

Thank you @lilactown and @Phill for helping me to clarify my thoughts.

It's just a matter of performance and memory use; their interface and semantics would otherwise be the same. So I guess you'd need to do some benchmarking.

It seems like you’re asking about the tradeoff between:

  1. Implementation complexity of HAMT (over shallow copy-on-write)
  2. Processing and memory savings when using HAMT (over shallow copy-on-write)

I’ve often thought about this question myself.

processing savings

I would guess that there were some early benchmarks of Clojure’s HAMT implementations that showed that 8 was the optimal number to cross-over at.

I am not an optimization expert. I only dabble. But I believe it has to do with memory caches. The HAMT is very cache-aware, but can’t really help at small sizes. 8 k/vs is about where the graphs intersect, and so Clojure will convert an array (with linear search of up to 8 keys) to a HAMT of 9 keys.

(Linear search works faster over a small array because of the cache. Essentially, a cache miss is more expensive than almost anything you can do within the cache, even a dumb linear search.)

However, once the array starts getting larger than the cache, you’re better off having a trie. A linear search over a large array requires linear cache misses. A trie requires logarithmic cache misses.

Here’s something to worry about now: if the 8-k-v mark was correct in 2007, is it still correct today, with our current hardware? Does it make sense on different hardware?

I think this same linearity applies to copying as well. Copying is a linear operation. You have to copy pointers to all keys and values to copy a hash map. Being able to copy a logarithmic subset is much faster for large collections.
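To make the linear-vs-logarithmic difference concrete, here's a rough back-of-the-envelope cost model (my own simplifying assumptions: branching factor of 32, full nodes, constant factors ignored):

```javascript
// Full copy touches every entry; a HAMT path copy touches roughly one
// 32-slot node per trie level.
function fullCopyCost(n) {
  return n;
}

function pathCopyCost(n, branching = 32) {
  const depth = Math.max(1, Math.ceil(Math.log(n) / Math.log(branching)));
  return depth * branching; // entries copied along one root-to-leaf path
}

pathCopyCost(1_000_000); // 128, vs. 1,000,000 for a full copy
pathCopyCost(8);         // 32 — worse than a full copy of 8 entries
```

Note that the model also reproduces the small-map crossover: below the threshold, copying the whole thing is cheaper than touching even one trie node.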

memory performance

Even if copying were fast, it still takes more memory to copy a larger data structure. The same linear/logarithmic tradeoff applies here.

size distribution

@Yehonathan_Sharvit mentions that most hash maps are less than 10 keys. I haven’t measured, but that seems reasonable. Maybe even 90% are below 10 keys, and 99% are below 20 keys. Surely copying a 20 key map is very fast. Is the added complexity necessary for the last 1%? That’s a good question and probably needs testing on real systems. One possibility to consider is that the big ones are so much bigger than the median that copying them starts to dominate the runtime.

reference distribution

Something not mentioned in the OP is keeping references to older copies. I guess that, given how we use Clojure today, 99%+ of “copies” of our hash maps do not have references to them. They are immediately available for garbage collection. That leaves 1% that will hang around. Does that 1% justify the complexity? Again, we must consider if the ones that do hang around a long time are also really big.

Evidence

I think it’s important that for most applications, Clojure is fast enough. That is, the choices made at the beginning seem to work. That’s a serious factor in considering whether Clojure made the right tradeoff.

What would we be thinking about if Clojure didn’t have HAMTs and it wasn’t sufficient? We’d probably have a lot of discussions about keeping collections smaller (nesting them more) and avoiding the immutable ones altogether for certain large cases. If we do talk about avoiding immutable data structures, it’s for all cases, not just the large ones. The fact that we don’t talk about those mitigations for performance reasons is another mark for Clojure’s choices (though inconclusive).

I think we should also consider ClojureScript’s implementation. It started out with basic copy-on-write versions, but they were replaced later when they implemented HAMTs. Presumably, they implemented them for a good reason.

Mitigations

Let’s imagine Clojure didn’t have HAMTs. A simpler implementation of immutable collections would be to wrap Java’s HashMap in ImmutableMap and make full copies on every change. What are some things we could do to improve the performance?

Batching

I wrote about one thing in Grokking Simplicity, page 270, called withArrayCopy(). It’s a higher-order function that implements a copy-on-write discipline for arrays. It lets you batch operations. Instead of doing 100 immutable modifications in a row, you would make one copy, modify it in a mutable way 100 times, then wrap it up in ImmutableMap at the end. You only wind up making 1 copy instead of 100. Here’s the JS version from Grokking Simplicity:

function withArrayCopy(array, modify) {
  var copy = array.slice(); // make a shallow copy
  modify(copy);
  return copy;
}

Immer does something much cooler, but with the same “batching” idea. It “proxies” the object you pass it and records all of the operations on it. Then it makes a copy of anything that had a method called on it and replays the operations on the copy. This lets it work on nested data where you need to copy everything on the path to the thing that changes. This is good for JS because it’s awkward to use immutable structures. The syntax makes mutation very convenient.
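As a rough illustration of that batching idea (this is not Immer's actual implementation - Immer copies lazily, handles nesting, and freezes its results; this sketch only handles flat objects):

```javascript
// Minimal produce(): proxy a shallow copy so "mutations" in the
// recipe all land on one copy, made once up front.
function produce(base, recipe) {
  const copy = { ...base };      // one shallow copy
  let mutated = false;
  const draft = new Proxy(copy, {
    set(target, key, value) {
      mutated = true;            // record that something changed
      target[key] = value;       // apply directly to the copy
      return true;
    },
  });
  recipe(draft);                 // run the whole batch of mutations
  return mutated ? copy : base;  // reuse base when nothing changed
}

const base = { count: 0 };
const next = produce(base, (d) => {
  d.count = d.count + 1;         // convenient mutation syntax...
  d.updated = true;
});
// ...but `base` is still { count: 0 }
```

However many operations the recipe performs, you pay for one copy, which is the same win as `withArrayCopy` above.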

More structure

In OO textbooks, they often talk about modeling sets using polymorphism. You have one class, IntegerSet, that contains all integers. And you have FilterSet, which can filter out certain elements. And you have IntersectionSet and UnionSet, which combine different sets. None of these sets store values. They logically represent a set using a predicate or by deferring to other sets.

In Clojure, we actually build the set. There’s no way to make an infinite set of all integers. A union of two sets makes a new set with the elements of the other two. What I’m suggesting is to perhaps be smarter about how collections are constructed, probably passing that responsibility onto the programmer. So, if you just need to check for containment, there’s no reason to realize a new union set. You could just make a small object that points to the two sets to be combined. Containment check is fast (worst case is you check both sets). This opens the door for lots of optimizations, at the cost of exploding choices for the programmer.

The example was about sets but it could be extended to other collections as well.
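That lazy-set idea could be sketched like this (class and variable names invented for illustration):

```javascript
// A "union" that never materializes a new set: containment just
// defers to the two underlying sets.
class UnionSet {
  constructor(a, b) {
    this.a = a;
    this.b = b;
  }
  has(x) {
    return this.a.has(x) || this.b.has(x); // worst case: check both
  }
}

const evens = new Set([0, 2, 4]);
const odds = new Set([1, 3, 5]);
const all = new UnionSet(evens, odds); // O(1) — no copying

all.has(3); // true
```

The trade-off is exactly the one described above: membership checks stay cheap, but anything that needs to enumerate or count elements becomes the programmer's problem.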

Conclusion

People used to think all the copying would be really expensive. Is it? People copy all the time now. Strings are immutable. They’re copied. And the stuff I see in JS with destructuring and immediately restructuring makes me mad. That’s just copying, too. Does waiting for network requests dominate the runtime anyway? Maybe it doesn’t even matter anymore. Until it does.

I can’t help but think that Clojure did get it right, at least for 2007 when it was first released. The computer I was running at that time at work may have had 512 MB of memory. People did fret about copying. And Clojure was fine.

1 Like

Wow. I like your detailed explanation @ericnormand .

In fact, as Rich mentions in A History of Clojure, there are two different use cases for maps:

  1. Information maps (a.k.a records)
  2. Collection maps (a.k.a associative arrays)

Information maps are heterogeneous and have a low number of fields, while collection maps are homogeneous and could have a large number of fields.

Rich decided to have a single data representation for both, with an internal implementation that depends on the number of fields: start with an array map, and when the number of fields grows above 8, switch to a HAMT.

In fact, it means that most of the time, we have:

  1. Information maps represented by array maps
  2. Collection maps represented by HAMTs

Great reference that I enjoyed re-reading:

I spent some time experimenting with various immutable record ideas but never ended up with something that added significant value to just using maps. Thus I decided I wanted programming with maps (and vectors and sets) in Clojure to be as first-class and native feeling as programming with lists had always been in Lisp.

A key question was: one map or two? There are two very distinct use cases for maps: potentially large, homogeneous, collection-like maps, e.g., mapping everyone’s username to their site visit count, or comparatively small heterogeneous information-maps, e.g., a bunch of things we know about a particular user in a particular context, like the email address, password, user id etc. I wanted only a single map literal syntax and library for maps. As with seqs, the maps were built behind several abstractions. Thus, maps can start with a representation optimized for small maps and switch to a representation optimized for large collections if and when they ‘grow’ beyond a certain size.

In addition to the map read/print support, Clojure has a large functional standard library for dealing with maps, with functions for creating maps (sorted and hashed), associating and dissociating keys with values, merging, lookup, key selection and enumeration, nested map selection and update, sequencing (a seq of a map yields a sequence of key-value tuples), as well as a library implementing relational algebra on sets of maps. Of course, any map library could have these things, but no language at the time had them as pure functions of high-performance persistent HAMTs. In addition, a large subset of the map functions also work on vectors, which are, of course, also associative.

from page 13 of the PDF at Clojure - History

Oh, one other cool thing:

This library implements immutable collections in Rust, using a cool model. In multi-threaded mode, a map is immutable, since it is shared between threads. However, in single-threaded mode, it is mutable (which is safe, since it's not shared between threads). Because of Rust's borrow checker, it is known at compile time which mode you're in at each point in the program. In single-threaded mode, the first assoc makes a copy, but subsequent assocs mutate the copy. Once a reference to the copy is shared, it becomes immutable again. It's like automatic transients.