Is `=` fast in Clojure?


#1

Heard my friend comparing two implementations of persistent data structure last night. I remembered an interesting difference. In Clojure, when dissoc is applied on a sub tree, Clojure will try to find out if we can reduce the depth of the tree. If it’s possible to reduce the depth, then just change the structure. While in Scala, it’s not detected and the depth would remain.

Then the result was interesting. In Clojure, a same piece of data always has same tree structure, while in Scala, there are some chances we got a tree with different structures, since it might be caused by creating deep trees but left there. Thus Clojure can be faster in some edge cases when we are comparing two pieces of data which are supposed to be equal, because the structure is the same.

This opinion reminds me of JavaScript. In JavaScript, we normally avoid comparing two deep objects. We have to walk the whole trees to make sure that they are the same in every node. When we are using immutable-js, we are trying to make sure that we got new a reference every time we update it, which is like identical? in Clojure.

But we use = frequently. And people seem to prefer using = to identical? in Clojure. So I want to know if I can trust = that it’s faster enough and I don’t need to use identical? when performance is cared about?


#2

I would suppose that this depends from case to case how fast = is. Here is some code that tests one particular case:

(defn make-deep-data [n]
  (reduce (fn [dst x]
            [x dst])
          nil
          (range n)))

(def depths (take 12 (iterate (partial * 2) 1)))

(doseq [d depths]
  (let [a (make-deep-data d)
        b (make-deep-data d)]
    (println "\nFor depth d" d)
    (time (= a b))
    (println "Comparing to itself")
    (time (= a a))))

and print outs this

For depth d 1
"Elapsed time: 0.038561 msecs"
Comparing to itself
"Elapsed time: 0.002133 msecs"

For depth d 2
"Elapsed time: 0.003985 msecs"
Comparing to itself
"Elapsed time: 0.001997 msecs"

For depth d 4
"Elapsed time: 0.004792 msecs"
Comparing to itself
"Elapsed time: 0.0012 msecs"

For depth d 8
"Elapsed time: 0.00445 msecs"
Comparing to itself
"Elapsed time: 0.001583 msecs"

For depth d 16
"Elapsed time: 0.009612 msecs"
Comparing to itself
"Elapsed time: 0.002007 msecs"

For depth d 32
"Elapsed time: 0.015544 msecs"
Comparing to itself
"Elapsed time: 0.001894 msecs"

For depth d 64
"Elapsed time: 0.029676 msecs"
Comparing to itself
"Elapsed time: 0.002145 msecs"

For depth d 128
"Elapsed time: 0.057724 msecs"
Comparing to itself
"Elapsed time: 0.001697 msecs"

For depth d 256
"Elapsed time: 0.130804 msecs"
Comparing to itself
"Elapsed time: 0.001664 msecs"

For depth d 512
"Elapsed time: 0.162453 msecs"
Comparing to itself
"Elapsed time: 0.001934 msecs"

For depth d 1024
"Elapsed time: 0.524108 msecs"
Comparing to itself
"Elapsed time: 0.001896 msecs"

For depth d 2048
"Elapsed time: 0.701173 msecs"
Comparing to itself
"Elapsed time: 0.002049 msecs"

So if they are identical (same references) there is no extra costs of them being deep. But if they are equal, but with different memory, there seems to be a cost that depends on the size of the data structure for this particular example. You can do a similar experiment for your use case and that might give you an idea, but I would suppose that = is just what you want in most cases.

It would be nice if someone could point to an article describing more in-depth how Clojure’s equality mechanism works.


#3

Got paper link from my friend, not read yet…

Optimizing Hash-Array Mapped Tries for Fast and Lean Immutable JVM Collections
https://michael.steindorfer.name/publications/oopsla15.pdf


#4

= and identical? are used for completely different purposes. If you have two immutable pieces of data, identical? just tells you an implementation detail about them – do they also happen to be exactly the same object stored in memory? If you want to know if two sets contain the same elements, or two maps contain the same key/value pairs, identical? cannot tell you that.

Now Clojure’s = does have an optimization that, because checking identical? is so fast, when you ask if two sets or maps are =, the implementation will first check if they are identical?, and if so it very quickly returns true. That is a very useful optimization if you try comparing two large nested immutable data structure with =, where one was created by making a few small updates to the other, and they ended up =.

Why? Suppose you take a map with 1000 key/value pairs, and from it you create two maps m1 and m2, where both m1 and m2 are the result of 'dissoc’ing the same key :x from the original map. m1 and m2 will not be identical?, because they were created from two independent ‘dissoc’ operations. However, all of the keys and values in m1 and m2 will be pairwise identical? to each other, even if they are deeply nested data structures themselves. So (= m1 m2) will only take the time required to do 999*2 identical? comparisons on the 999 keys and the 999 values of m1 and m2, and then return true.


#5

Clojure does not guarantee that equivalent maps have equivalent structure. The paper referenced elsewhere in the thread does have that invariant, and it does allow for faster negative equality checks. A more interesting consequence of that invariant, not explored in the paper, is that set operations can be significantly faster. An implementation, and benchmarks including Clojure’s and Steindorfer’s implementations, can be found at https://github.com/lacuna/bifurcan.