Sorting a rich sequence without (sort)

outrovurt · March 15, 2022, 7:33am

(Following is a reply to the OP, and is has pretty much nothing to do with the above entertainment.)

I’m going to try to summarize everything I know about transducers.

a large number of core sequence transformation functions work by taking a sequence and producing a lazy sequence (f.e. map, filter, mapcat, take*, drop*)
if you want to perform more than one of these operations on a single sequence without transducers, you would have to feed the result of the first into the subsequent, and so on. For example:

(->> coll
     (map map-fn)
     (filter filter-fn))

(NOTE: just to be absolutely clear, here I am using the ->> macro for convenience, so each call in the stack is an ordinary call and not a transducer.)

The above works absolutely fine functionally, however it also creates one intermediate sequence (albeit a lazy one) for each operation.
Another possibly unwanted consequence of using these lazy-sequence generating functions is that they only generate lazy-sequences. What if you wanted your output to be a vector, map, string and so on? You then have to coerce the result, usually like so:

(->> coll
     (map map-fn)
     (filter filter-fn)
     ,,,
     (into []))

So transducers solve these two problems: they bypass the generation of intermediate collections, and they allow you to choose any output type you feel appropriate. But what exactly is a transducer, and how do you go about creating one? I’ll answer these in reverse order.

Most sequence functions which return a lazy sequence will also return a transducer if you call that function without the final argument (usually the sequence). You can then use that transducer (itself a function) by passing it to one of a number of functions which accepts transducers. The most important functions which I use are:

transduce
into
eduction
sequence
and of course reduce (I’ll explain this below)

Which function you use depends on the result you are after. Let’s say for example that you want to increment a range of numbers and return the result as a vector. Without transducers, you would write:

(into []
      (map inc (range 10)))

which is just the standard 2-arity call to into which takes the lazy sequence generated by the map function, iterates over each item, and adds it to our empty vector.

But to turn the above into something which accepts a transducer, in this case all we need to do is to spit our sequence out of the call to map to the right:

(into []
      (map inc)
      (range 10))

We now invoke the 3-arity version of into, which accepts a transducer (returned by calling (map inc)) and the range. And what this does is iterates directly over range (rather than the lazy sequence which we used previously), calls inc on each item, and calls conj to add that item to the vector.

I could rewrite the above using transduce as follows:

(transduce (map inc) ;;transducer
           conj ;; reducing function
           [] ;; initial value
           (range 10) ;; sequence
)
;; [1 2 3 4 5 6 7 8 9 10]

It’s a little more verbose by does the same thing, and in fact is what the 3-arity form of into calls internally. The above generates a vector of 10 items. But what if I wanted to generate a string instead? I would use almost the same code, swapping out conj for str, and [] for "", like so:

(transduce (map inc) ;;transducer
           str ;; reducing function
           "" ;; initial value
           (range 10) ;; sequence
)
;; "12345678910"

Et voila!

So a transducer is something which performs some operation on an item of a sequence, and which then takes the result and “adds” it to an running total (which begins with an initial value, in the first case [], in the second case "") using a specified operation (in the first case conj, in the second case str.) I’ve quoted “adds” on purpose, in this case it happens to be correct, however it can actually do anything it wants with the result. This is actually a reducing function, which accepts the running total (often called accumulator) and an item, in this case the result of the operation on the original item.

While (map inc) generates a mapping transducer, there are several other built-in transducers available. For example (filter even?) will generate a filtering transducer, and this will function by calling the supplied predicate on each item of the sequence, and only if that predicate is successful will it then call the reducing function (conj, str) to add the original item to the resultant collection (initially [] or "").

[16/03/2022: EDIT]
Now to answer the question “What is a transducer?” This is a little trickier to wrap ones’ head around. A transducer is a function which accepts one reducing function (say conj, str, essentially a function we will use to combine the result of our operation with a collection) and returns another reducing function, one which calls our original reducing function and which performs our original intended operation, e.g. mapping inc, filtering even? or whatever. Why? This is best illustrated by revisiting our above example:

calling (map inc) creates a transducer which can be used to consume any seq and which will output every item in that seq incremented by 1
how it will output that item can be chosen at the time of transduction
in our example we have chosen to output it to a vector, so we call transduce passing in our transducer function (map inc), conj (our reducing function) and an empty vector (our initial value)

The real magic happens inside transduce itself (which by the way is a ridiculously simple function given what transducers are and do), with the call to (transducer reducing-function), which in our example becomes ((map inc) conj). (map inc) (our transducer) takes conj (our reducing function) and transforms it into a new reducer function, one which is then passed to reduce together with an initial value of [], and it is that call to reduce which actually performs a) the inc on each item of the sequence we pass to it, and b) the conjing of the resultant item onto our accumulator, which is initially [].

So any given transducer contains the essence of a given operation we want to do, such as mapping inc, filtering even? or whatever, and the ability to combine it with a value (often a collection) we supply to it during transduction via a combining operation (e.g. conj) which itself is a reducing function. If you look at the source for conj, str, even arithmetic operators such as +, - and so on, you will note that they all have 0-arity and 1-arity versions as well as the 2-arity versions we expect. That’s because they are all, wait for it, yes, reducing functions.

the 0-arity version generally returns an empty collection, nil-value or a general default value; this is the initial value of our reduction
the 1-arity version which is often the identity function, but this depends on the operation
the 2-arity version, which accepts a collection and an item, performs some operation involving the two, and generally returns the collection, perhaps with the item added to it somewhere

When writing reducing functions ourselves, most of us will just create a 2-arity version because that is all the reduce function requires. However if we want to call reduce with one of our own functions but also wanted to perform some operation on the sequence beforehand, we can’t just pass a transducer directly to reduce.

For example, before I learnt about transducers, I would write this:

(->> [:a :b :c :d :e]
     (map-indexed vector)
     (reduce (fn [result [idx item]]
               (conj result [idx item]))
             []))
;; [[0 :a] [1 :b] [2 :c] [3 :d] [4 :e]]

This works, but creates an intermediate sequence which is the result of (map-indexed vector coll).

Immediately after learning about transducers, I thought I could write something like this:

(transduce (map-indexed vector)
           (fn [result [idx item]]
             (conj result [idx item]))
           []
           [:a :b :c :d :e])
;; [[0 :a] [1 :b] [2 :c] [3 :d] [4 :e] [nil nil]]

After all, isn’t that just a reducing function after the transducer? Well, yes it sort of is, but it isn’t quite “complete”, at least not as far as how transduce works is concerned. While the above certainly runs, you can see the result is not quite correct, as it adds an extra [nil nil] entry at the end of our sequence. To fix this, we wrap the reducing function in completing like so:

(transduce (map-indexed vector)
           (completing
            (fn [result [idx item]]
              (conj result [idx item])))
           []
           [:a :b :c :d :e])
;; [[0 :a] [1 :b] [2 :c] [3 :d] [4 :e]]

Which gives us the correct result.

The whole point of completing is to take a 2-arity reducer function that we would normally pass to reduce, and add 0-, and 1-arity versions to it, thus making it fully compatible with transduce.

[END EDIT]

eduction does something similar, allowing us just to pass our sequence through any number of transducers first then two reduce:

(->> coll
     (eduction (map-indexed vector))
     (reduce (fn [result [idx item]]
               ,,,)
             []))

eduction basically works by creating an Eduction object which implements IReduce, so when the result of the eduction above is passed to the reduce, it basically turns the operation into a transduce and once again avoids the creation of an intermediate lazy sequence. So on each iteration of the reduce function, the original item in the collection is transformed into [idx item], and that is passed to the reduce function. joinr’s function does something which is functionally identical, he just uses completing to turn the reducing function into a transducer because it wouldn’t work otherwise.