Tuple ordering

Hello!

I was recently surprised by Clojure’s sorting behavior when sorting tuples, which seems to depend on length:

(sort [[1 :a] [2] [3 :b] [0 :c :c]])
;; => ([2] [1 :a] [3 :b] [0 :c :c])

I was expecting what Python does:

>>> sorted([(1, 'a'), (2,), (3, 'b'), (0, 'c', 'c')])
[(0, 'c', 'c'), (1, 'a'), (2,), (3, 'b')]

Anyone know why this is the case? Is there a rationale I haven’t found?

Thanks!

Teodor


Edit: found https://clojure.org/guides/comparators (by @andy.fingerhut ). Still gladly welcoming comments.

Edit 2: Relevant excerpt from the docs:

  • vectors are sorted from fewest elements to most elements, with lexicographic ordering among equal length vectors.

Edit 3: I guess Edit 2 answers my question. New question: I’m now searching for a simple way to do lexical sorting by elements, ie. the Python behavior.

Edit 4: The docs contained the outline for lexical sorting. Here’s an adapted example:

(defn cmp-vec-lexi
  [x y]
  (let [x-len (count x)
        y-len (count y)
        len (min x-len y-len)]
    (loop [i 0]
      (if (== i len)
        ;; If all elements 0..(len-1) are same, shorter vector comes
        ;; first.
        (compare x-len y-len)
        (let [c (compare (x i) (y i))]
          (if (zero? c)
            (recur (inc i))
            c))))))

(sort-by identity cmp-vec-lexi [[1 :a] [2] [3 :b] [0 :c :c]])
;; => ([0 :c :c] [1 :a] [2] [3 :b])

2 Likes

Can’t you just do: (sort-by first [[1 :a] [2] [3 :b] [0 :c :c]])

Wouldn’t that ignore the second element? That might not have been clear from my example above.

There’s probably a more elegant/efficient solution, but I imagine the idea will be the same. To get the behaviour you want you need to make sure that when you are comparing the tuples they are the same length. You can do this by “growing” the shortest of the two collections before you pass them into the compare function.

(defn grow [coll length]
  (if (>= (count coll) length)
    coll
    (recur (conj coll nil) length)))

(defn tuple [a b]
  (let [longest (max (count a) (count b))
        a       (grow a longest)
        b       (grow b longest)]
    (compare a b)))

(sort tuple [[1 :a] [2] [3 :b]  [3 :a]  [0 :c :c] [2 :b] [0 :c :a]])

;; => ([0 :c :a] [0 :c :c] [1 :a] [2] [2 :b] [3 :a] [3 :b])

Hope that helps. :slight_smile:

Huh, that’s interesting.

Is it documented anywhere that nil is the “globally smallest element”?

I’m not sure if it’s mentioned in any docs. But the source seems to indicate that it is “the smallest element”.

2 Likes

Oh I see. You can use juxt for that:

(sort-by (juxt first second) [[1 :b] [1 :a] [1 :c] [0 :c :c]])

But it forces you to list out the columns or the data you want to sort by. Its not as dynamic in that way. It relies on the fact that vector of equal length sort lexicographically together. Juxt basically extract whatever you want from each vector, and re-creates an equal length vector with it.

You can kind of make it as dynamic or generic as you want, but you need to define your own logic for edge cases. I do not know the Python’s logic, but I’d guess there’s probably a way to create a juxt which encode a similar logic. For example:

(sort-by (juxt first second count) [[1 :b] [1 :a] [1 :c 0 0 0] [1 :c :c]])

This would sort by first column, then second, and if some vectors have more than that it would just sort them by count.

1 Like

This is a more general solution:


(defn lex-compare [a b]
  (if (sequential? a)
    (let [c (lex-compare (first a) (first b))]
      (if (zero? c)
       (recur (rest a) (rest b))
       c))
    (compare a b)))

(defn lex-sort [coll]
  (sort lex-compare coll))

(lex-sort [[1 :a] [2] [3 :b] [0 :c :c]])
;; => ([0 :c :c] [1 :a] [2] [3 :b])

If speed is a concern, you should use one of the three functions given in the docs. This version has slightly different behavior, as it will also compare sequentials inside the sequence lexically, whereas cmp-vec-lexi falls back on the default comparator.

2 Likes

Thanks for your reply. I like the implementation, it’s simpler than what I yanked in above.

Note that I modified the version from the docs a little; in reality it took a comparator function as input. The original took a comparator function cmpf as an argument, and used that recursively. I removed that piece of functionality when pasting in the code for simplicity.

Here’s the original function:

(defn cmp-vec-lexi
  [cmpf x y]
  (let [x-len (count x)
        y-len (count y)
        len (min x-len y-len)]
    (loop [i 0]
      (if (== i len)
        ;; If all elements 0..(len-1) are same, shorter vector comes
        ;; first.
        (compare x-len y-len)
        (let [c (cmpf (x i) (y i))]
          (if (zero? c)
            (recur (inc i))
            c))))))
1 Like

One thing to watch out for is this solution is non terminating for equal values.

(lex-sort [[1 :a] [2] [3 :b] [0 :c :c] [2])
=> ...

Adding the duplicate [2] at the end of the list makes it loop forever. Also you’re only checking if the first item is a sequence not the second so passing a sequence as A and a number as B makes it explode. But to be honest sorting a none homogeneous list seems like an edge case.

Something like this seems to work. Only handles a homogeneous list of tuples though.

(defn lex-compare [a b]
  (let [[x & xs] (seq a)
        [y & ys] (seq b)]
    (let [c (compare x y)]
      (cond
        (or (nil? x) (nil? y)) c
        (#{1 -1} c)            c
        :else                  (recur xs ys)))))

(sort lex-compare [[1 :a] [2] [3 :b] [0 :c :c] [2]])

;; => ([0 :c :c] [1 :a] [2] [2] [3 :b])
2 Likes

Adding the duplicate [2] at the end of the list makes it loop forever. Also you’re only checking if the first item is a sequence not the second so passing a sequence as A and a number as B makes it explode. But to be honest sorting a none homogeneous list seems like an edge case.

Oops. Looks like I went way too fast writing that. Your solution is really nice. I extended it slightly to cover the edge cases and handle nested sequences, because I deal with those constantly.

(defn lex-compare
  ([a b]
   (lex-compare compare a b))
  ([cmp-f a b]
   (if-not (and (sequential? a) (sequential? b))
     (cmp-f a b)
     (loop [[x & xs] (seq a)
            [y & ys] (seq b)]
       (let [c (lex-compare cmp-f x y)]
         (cond
           (nil? x)    c
           (#{1 -1} c) c
           :else       (recur xs ys)))))))

I’m pretty sure that the (or (nil? x) (nil? y)) can be reduced to just (nil? x) because of what you posted earlier. If y is nil and x is not, then (#{1 -1} c) will catch it.

1 Like

Okay, I thought I’d give it another shot.

First using a custom comparator:

(defn python-comparator
  [a b]
  (let [n (+ (count a) (count b))]
    (compare
      (vec (first (partition n n (repeat nil) a)))
      (vec (first (partition n n (repeat nil) b))))))
       
(sort python-comparator [[1 :a] [2] [3 :b] [0 :c :c] [2]])
;; => ([0 :c :c] [1 :a] [2] [2] [3 :b])

And now using sort-by:

(defn ps
  [n]
  #(vec (first (partition n n (repeat nil) %))))

(sort-by (ps 3) [[1 :a] [2] [3 :b] [0 :c :c] [2]])
;; => ([0 :c :c] [1 :a] [2] [2] [3 :b])

For that last one, you need to pass ps the max size of your elements or something bigger, like if you’re not sure but know it wouldn’t be beyond 10 you can just set it to (ps 10). You could also use (ps (apply max (map count coll))) to get the max size, in case you have no idea. That said, the bigger you set n, the slower the sort will be in this case. If you are able to tightly bound it, then from my tests it’s as fast as my previous solution using a comparator.

There is some similar code in Datascript, that combines comparators into a single one.

And then builds a comparator that can be used in sort or other similar functions.

1 Like