Tuple ordering

teodorlu · September 29, 2019, 11:32am

Hello!

I was recently surprised by Clojure’s sorting behavior when sorting tuples, which seems to depend on length:

(sort [[1 :a] [2] [3 :b] [0 :c :c]])
;; => ([2] [1 :a] [3 :b] [0 :c :c])

I was expecting what Python does:

>>> sorted([(1, 'a'), (2,), (3, 'b'), (0, 'c', 'c')])
[(0, 'c', 'c'), (1, 'a'), (2,), (3, 'b')]

Anyone know why this is the case? Is there a rationale I haven’t found?

Thanks!

Teodor

Edit: found Clojure - Comparators Guide (by @andy.fingerhut ). Still gladly welcoming comments.

Edit 2: Relevant excerpt from the docs:

vectors are sorted from fewest elements to most elements, with lexicographic ordering among equal length vectors.

Edit 3: I guess Edit 2 answers my question. New question: I’m now searching for a simple way to do lexical sorting by elements, ie. the Python behavior.

Edit 4: The docs contained the outline for lexical sorting. Here’s an adapted example:

(defn cmp-vec-lexi
  [x y]
  (let [x-len (count x)
        y-len (count y)
        len (min x-len y-len)]
    (loop [i 0]
      (if (== i len)
        ;; If all elements 0..(len-1) are same, shorter vector comes
        ;; first.
        (compare x-len y-len)
        (let [c (compare (x i) (y i))]
          (if (zero? c)
            (recur (inc i))
            c))))))

(sort-by identity cmp-vec-lexi [[1 :a] [2] [3 :b] [0 :c :c]])
;; => ([0 :c :c] [1 :a] [2] [3 :b])

didibus · September 30, 2019, 5:29am

Can’t you just do: (sort-by first [[1 :a] [2] [3 :b] [0 :c :c]])

teodorlu · September 30, 2019, 7:41am

Wouldn’t that ignore the second element? That might not have been clear from my example above.

andersmurphy · September 30, 2019, 11:52am

There’s probably a more elegant/efficient solution, but I imagine the idea will be the same. To get the behaviour you want you need to make sure that when you are comparing the tuples they are the same length. You can do this by “growing” the shortest of the two collections before you pass them into the compare function.

(defn grow [coll length]
  (if (>= (count coll) length)
    coll
    (recur (conj coll nil) length)))

(defn tuple [a b]
  (let [longest (max (count a) (count b))
        a       (grow a longest)
        b       (grow b longest)]
    (compare a b)))

(sort tuple [[1 :a] [2] [3 :b]  [3 :a]  [0 :c :c] [2 :b] [0 :c :a]])

;; => ([0 :c :a] [0 :c :c] [1 :a] [2] [2 :b] [3 :a] [3 :b])

Hope that helps.

teodorlu · September 30, 2019, 12:01pm

Huh, that’s interesting.

Is it documented anywhere that nil is the “globally smallest element”?

andersmurphy · September 30, 2019, 12:18pm

I’m not sure if it’s mentioned in any docs. But the source seems to indicate that it is “the smallest element”.

github.com

clojure/clojure/blob/master/src/jvm/clojure/lang/Util.java#L144


      
          static public boolean identical(Object k1, Object k2){
          	return k1 == k2;
          }
          
          static public Class classOf(Object x){
          	if(x != null)
          		return x.getClass();
          	return null;
          }
          
          static public int compare(Object k1, Object k2){
          	if(k1 == k2)
          		return 0;
          	if(k1 != null)
          		{
          		if(k2 == null)
          			return 1;
          		if(k1 instanceof Number)
          			return Numbers.compare((Number) k1, (Number) k2);
          		return ((Comparable) k1).compareTo(k2);
          		}

didibus · September 30, 2019, 7:17pm

Oh I see. You can use juxt for that:

(sort-by (juxt first second) [[1 :b] [1 :a] [1 :c] [0 :c :c]])

But it forces you to list out the columns or the data you want to sort by. Its not as dynamic in that way. It relies on the fact that vector of equal length sort lexicographically together. Juxt basically extract whatever you want from each vector, and re-creates an equal length vector with it.

You can kind of make it as dynamic or generic as you want, but you need to define your own logic for edge cases. I do not know the Python’s logic, but I’d guess there’s probably a way to create a juxt which encode a similar logic. For example:

(sort-by (juxt first second count) [[1 :b] [1 :a] [1 :c 0 0 0] [1 :c :c]])

This would sort by first column, then second, and if some vectors have more than that it would just sort them by count.

John_Shaffer · October 3, 2019, 3:59pm

This is a more general solution:


(defn lex-compare [a b]
  (if (sequential? a)
    (let [c (lex-compare (first a) (first b))]
      (if (zero? c)
       (recur (rest a) (rest b))
       c))
    (compare a b)))

(defn lex-sort [coll]
  (sort lex-compare coll))

(lex-sort [[1 :a] [2] [3 :b] [0 :c :c]])
;; => ([0 :c :c] [1 :a] [2] [3 :b])

If speed is a concern, you should use one of the three functions given in the docs. This version has slightly different behavior, as it will also compare sequentials inside the sequence lexically, whereas cmp-vec-lexi falls back on the default comparator.

teodorlu · October 4, 2019, 8:15am

Thanks for your reply. I like the implementation, it’s simpler than what I yanked in above.

Note that I modified the version from the docs a little; in reality it took a comparator function as input. The original took a comparator function cmpf as an argument, and used that recursively. I removed that piece of functionality when pasting in the code for simplicity.

Here’s the original function:

(defn cmp-vec-lexi
  [cmpf x y]
  (let [x-len (count x)
        y-len (count y)
        len (min x-len y-len)]
    (loop [i 0]
      (if (== i len)
        ;; If all elements 0..(len-1) are same, shorter vector comes
        ;; first.
        (compare x-len y-len)
        (let [c (cmpf (x i) (y i))]
          (if (zero? c)
            (recur (inc i))
            c))))))

andersmurphy · October 4, 2019, 9:10am

John_Shaffer:

This is a more general solution:
(defn lex-compare [a b]
  (if (sequential? a)
    (let [c (lex-compare (first a) (first b))]
      (if (zero? c)
       (recur (rest a) (rest b))
       c))
    (compare a b)))

(defn lex-sort [coll]
  (sort lex-compare coll))

(lex-sort [[1 :a] [2] [3 :b] [0 :c :c]])
;; => ([0 :c :c] [1 :a] [2] [3 :b])
If speed is a concern, you should use one of the three functions given in the docs. This version has slightly different behavior, as it will also compare sequentials inside the sequence lexically, whereas cmp-vec-lexi falls back on the default comparator.

One thing to watch out for is this solution is non terminating for equal values.

(lex-sort [[1 :a] [2] [3 :b] [0 :c :c] [2])
=> ...

Adding the duplicate [2] at the end of the list makes it loop forever. Also you’re only checking if the first item is a sequence not the second so passing a sequence as A and a number as B makes it explode. But to be honest sorting a none homogeneous list seems like an edge case.

Something like this seems to work. Only handles a homogeneous list of tuples though.

(defn lex-compare [a b]
  (let [[x & xs] (seq a)
        [y & ys] (seq b)]
    (let [c (compare x y)]
      (cond
        (or (nil? x) (nil? y)) c
        (#{1 -1} c)            c
        :else                  (recur xs ys)))))

(sort lex-compare [[1 :a] [2] [3 :b] [0 :c :c] [2]])

;; => ([0 :c :c] [1 :a] [2] [2] [3 :b])

John_Shaffer · October 4, 2019, 11:11pm

Adding the duplicate [2] at the end of the list makes it loop forever. Also you’re only checking if the first item is a sequence not the second so passing a sequence as A and a number as B makes it explode. But to be honest sorting a none homogeneous list seems like an edge case.

Oops. Looks like I went way too fast writing that. Your solution is really nice. I extended it slightly to cover the edge cases and handle nested sequences, because I deal with those constantly.

(defn lex-compare
  ([a b]
   (lex-compare compare a b))
  ([cmp-f a b]
   (if-not (and (sequential? a) (sequential? b))
     (cmp-f a b)
     (loop [[x & xs] (seq a)
            [y & ys] (seq b)]
       (let [c (lex-compare cmp-f x y)]
         (cond
           (nil? x)    c
           (#{1 -1} c) c
           :else       (recur xs ys)))))))

I’m pretty sure that the (or (nil? x) (nil? y)) can be reduced to just (nil? x) because of what you posted earlier. If y is nil and x is not, then (#{1 -1} c) will catch it.

didibus · October 5, 2019, 5:29am

Okay, I thought I’d give it another shot.

First using a custom comparator:

(defn python-comparator
  [a b]
  (let [n (+ (count a) (count b))]
    (compare
      (vec (first (partition n n (repeat nil) a)))
      (vec (first (partition n n (repeat nil) b))))))
       
(sort python-comparator [[1 :a] [2] [3 :b] [0 :c :c] [2]])
;; => ([0 :c :c] [1 :a] [2] [2] [3 :b])

And now using sort-by:

(defn ps
  [n]
  #(vec (first (partition n n (repeat nil) %))))

(sort-by (ps 3) [[1 :a] [2] [3 :b] [0 :c :c] [2]])
;; => ([0 :c :c] [1 :a] [2] [2] [3 :b])

For that last one, you need to pass ps the max size of your elements or something bigger, like if you’re not sure but know it wouldn’t be beyond 10 you can just set it to (ps 10). You could also use (ps (apply max (map count coll))) to get the max size, in case you have no idea. That said, the bigger you set n, the slower the sort will be in this case. If you are able to tightly bound it, then from my tests it’s as fast as my previous solution using a comparator.

dimovich · October 7, 2019, 9:30pm

There is some similar code in Datascript, that combines comparators into a single one.

And then builds a comparator that can be used in sort or other similar functions.

Andy_Wootton · January 2, 2020, 6:43pm

I’m new to Clojure but old to relational databases, so had to learn them theoretically:
Isn’t a relation an unsorted set of tuples? If you want a sorted set, can you use sorted-set or sorted-set-by?

teodorlu · January 3, 2020, 8:02am

Sorry, Andy, I’m not really following. Which post were you referring to?

Andy_Wootton · January 11, 2020, 9:52pm

Your original question called a vector of vectors “a tuple”. I don’t know Python but that’s not how I’ve seen what I call tuples (rows in tables/relations) represented in Clojure.

teodorlu · January 11, 2020, 10:15pm

Hello, Andy!

I agree that I’ve typed out a vector of vectors above
I consider tuple a more accurate description of vectors intended to be of the same length, where each element can be considered a “property”, potentially of unique type.

Example: I’d call [x y z] a tuple since it can be predictably pattern matched:

(for [[x y z] [[1 2 3] [4 5 6] [1 0 1] [1 0 0]]]
  (let [euclidian-distance (Math/sqrt (+ (* x x) (* y y) (* z z)))]
    euclidian-distance))
;; => (3.7416573867739413 8.774964387392123 1.4142135623730951 1.0)

Andy_Wootton · January 13, 2020, 5:13pm

I’ve learned some things as a result of our 2 different definitions:

Python tuples are an immutable version of Python lists. They can have different lengths.
Spec docs https://clojure.org/guides/spec mention tuples:

“Tuple - (s/tuple double? double? double?)
Designed for fixed size with known positional “fields”
Conforms to a vector of the values”

In my older, relational definition (from Chris Date), tuples you might want to sort would always have the same number of fields, because they’d be in the same relation. Missing values would be null. I didn’t understand that you wanted to leave out ‘trailing null fieds’ but still be able to sort. The suggestion above to ‘pad the right-hand fields’ puts us back in the same place.

system · July 14, 2020, 5:13am

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.