x> & x>>: auto-transducifying thread macros (now with parallelizing => and =>>)

Hey folks, I just whipped up a little auto-transducifying thread macro last night. It basically searches for contiguous groups of transducers and comps them together into a function that sequences the values flowing in the thread through the transducers. First the code, then we’ll discuss:

(def transducables
  #{(symbol 'map)
    (symbol 'cat) 
    (symbol 'mapcat) 
    (symbol 'filter)
    (symbol 'remove) 
    (symbol 'take) 
    (symbol 'take-while) 
    (symbol 'take-nth) 
    (symbol 'drop) 
    (symbol 'drop-while) 
    (symbol 'replace) 
    (symbol 'partition-by) 
    (symbol 'partition-all) 
    (symbol 'keep) 
    (symbol 'keep-indexed) 
    (symbol 'map-indexed) 
    (symbol 'distinct) 
    (symbol 'interpose) 
    (symbol 'dedupe) 
    (symbol 'random-sample)})

(defn transducable? [form]
  (when (sequential? form)
    (contains? transducables (first form))))

(defn compose-transducer-group [xfs]
  (->> xfs
       (map #(apply (first %) (rest %)))
       (apply comp)))

(defn xfn [xf-group]
  (fn [args]
    (sequence
     (compose-transducer-group xf-group)
     args)))
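
To see what those helpers produce, here is a minimal sketch using only clojure.core: for the group [[map inc] [filter odd?]], compose-transducer-group builds the equivalent of comp over the transducer arities, and xfn just runs sequence with the result:

```clojure
;; (apply map [inc]) yields (map inc), the transducer arity; comp chains
;; the group, and sequence pipes the input collection through it.
(def xf (comp (map inc) (filter odd?)))

(sequence xf [1 2 3 4])
;; => (3 5)
```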

(defmacro x>>
  "Just like ->> but first composes consecutive transducing fns into a function
   that sequences the last argument through the transformers. Also, calls nth
   for ints. So:
   
   (x>> [1 2 3] 
        (map inc) 
        (map (partial + 2)))
   Becomes:
   
   ((xfn [[map inc] 
          [map (partial + 2)]]) 
    [1 2 3])"
  [x & threads]
  (let [forms (->> threads
                   (partition-by transducable?)
                   (mapv #(if-not (and (transducable? (first %))
                                       (second %))
                              %
                              (list (list `(xfn ~(mapv vec %))))))
                   (apply concat))]
    (loop [x x, forms forms]
      (if forms
        (let [form (first forms)
              threaded (cond (seq? form)
                             (with-meta `(~(first form) ~@(next form) ~x) (meta form))
                             (int? form)
                             (list `nth x form)
                             :else
                             (list form x))]
          (recur threaded (next forms)))
        x))))

(defmacro x>
  "Just like -> but first composes consecutive transducing fns into a function
   that sequences the second argument through the transformers. Also, calls nth
   for ints. So:
   
   (x> [1 2 3]
       (conj 4)
       (map inc)
       (map (partial + 2))
       2)
   Becomes like:
   
   (nth
    ((xfn [[map inc] [map (partial + 2)]]) 
     (conj [1 2 3] 
           4)) 
    2)"
  [x & threads]
  (let [forms (->> threads
                   (partition-by transducable?)
                   (mapv #(if-not (and (transducable? (first %))
                                       (second %))
                            %
                            (list (list `(xfn ~(mapv vec %))))))
                   (apply concat))]
    (loop [x x, forms forms]
      (if forms
        (let [form (first forms)
              threaded (cond (seq? form)
                             (with-meta `(~(first form) ~x ~@(next form)) (meta form))
                             (int? form)
                             (list `nth x form)
                             :else
                             (list form x))]
          (recur threaded (next forms)))
        x))))

Neat, right? Pretty simple. Why would you want to do this? Well, for one thing:

(->> (range 10000000)
     (map inc)
     (filter odd?)
     (mapcat #(do [% (dec %)]))
     (partition-by #(= 0 (mod % 5)))
     (map (partial apply +))
    ;;  (mapv dec)
     (map (partial + 10))
     (map #(do {:temp-value %}))
     (map :temp-value)
     (filter even?)
     (apply +)
     time)

Returns:

"Elapsed time: 5768.05707 msecs"
5000054999994

Whereas:

(x>> (range 10000000)
     (map inc)
     (filter odd?)
     (mapcat #(do [% (dec %)]))
     (partition-by #(= 0 (mod % 5)))
     (map (partial apply +))
    ;;  (mapv dec)
     (map (partial + 10))
     (map #(do {:temp-value %}))
     (map :temp-value)
     (filter even?)
     (apply +)
     time)

Returns:

"Elapsed time: 2102.793256 msecs"
5000054999994

More than twice the speed with basically the same code.

The more transducers you can get lined up contiguously, the less boxing you’ll have in your thread. Let’s uncomment the (mapv dec) that is present in both the threads above. Because mapv is not a transducer, items get boxed halfway through our thread. As a result our performance degrades slightly for x>>.

First ->>:

"Elapsed time: 4590.188052 msecs"
44999977000016

Hmm, ->> actually goes faster now, perhaps due to mapv removing some laziness.

But for x>>:

"Elapsed time: 2351.326542 msecs"
44999977000016

So we lost some speed due to the boxing, but we’re still doing way better than the default thread macro. Point is, if you want to maximize performance, try to align your transducers contiguously.
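
To illustrate the contiguity point with plain clojure.core (no x>> needed), here is a hedged sketch comparing one fused pass against the same pipeline split in two by an eager mapv:

```clojure
;; One contiguous group: a single fused pass, no intermediate collection.
(sequence (comp (map inc) (filter odd?) (map #(* 2 %)))
          (range 10))
;; => (2 6 10 14 18)

;; mapv in the middle splits the pipeline into two transducer groups
;; and materializes ("boxes") an intermediate vector between them.
(->> (range 10)
     (sequence (comp (map inc) (filter odd?)))
     (mapv dec)
     (sequence (map #(* 2 %))))
;; => (0 4 8 12 16)
```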

One more perk with x> and x>>: a naked integer in a thread becomes an nth on the value threading through, so you can use them as replacements for get-in in most cases with heterogeneous nestings:

(let [m {:a {:b [0 1 {:c :res}]}}]
  (x> m :a :b 2 :c))
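
As a sketch of what that thread does (a hand-written equivalent, not the literal macroexpansion), the keywords become lookups and the naked 2 becomes an nth:

```clojure
;; Hand-written equivalent of (x> m :a :b 2 :c), where the naked
;; int 2 turns into an nth on the value threading through:
(let [m {:a {:b [0 1 {:c :res}]}}]
  (:c (nth (:b (:a m)) 2)))
;; => :res
```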

That’s just a personal preference but some might not want the change in semantics. Also, x> is different in that if a transducer is passed in, it acts like thread-last. This too is a semantic divergence from -> but in my experience we rarely ever use functions that take a function as a first argument in thread-first threads, so there probably won’t be too much confusion - and this makes x> more useful.

Next steps: integrate Christophe Grand’s xforms transducers (maybe automatically, anaphorically, for the core fns, like reduce and into?)

Anyway, I just whipped this up, so there may be downsides to this formalism I haven’t noticed, but that’s what this post is for - to debate the merits or lack thereof! :slight_smile: If it’s not a bad idea, I’ll probably wrap it up in a lib for general usage - but you are hereby granted, by the power vested in me, the inalienable right to copy and paste any of the words or code in this post anywhere you want. :wink: (edit: oh, and the meat of the threading code is taken from clojure.core anyway) Any other features or constraints y’all’d like to see in this code pattern?

Happy hacking!

(edit: After some discussion below, I released the library version of this idea here: johnmn3/injest)

Combine it with a slightly faster comp :slight_smile:

Edit: Why (symbol 'foo) and not just 'foo?

Nice! I thought 'foo might expand to (quote foo) and not match by equality, though maybe it would have. I’ll bang on it and see, but it’s a compile-time thing anyway, so it shouldn’t affect the run time.

Fusing contiguous map operations into a single function, ditching partial, might be useful (takes maybe 150 ms off on my end):

 [map (fn [arg] (as-> arg x
                      (apply + x)
                      (+ 10 x)
                      {:temp-value x}
                      (x :temp-value)))]

Curious if more substantive fusing could be had, along with a better comp implementation like ben mentioned. Kind of opens the door toward a little analyzer and maybe more optimizations. Cool idea.
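
As a minimal sketch of that map-fusion idea, consecutive (map f) and (map g) stages can collapse into a single (map (comp g f)) transducer; fuse-maps below is a hypothetical helper, not part of injest:

```clojure
;; Fusing (map f) then (map g) into one stage: (comp g f) applies f first,
;; so each element is touched once instead of passing through two stages.
(defn fuse-maps [f g]
  (map (comp g f)))

(sequence (fuse-maps inc #(* 2 %)) [1 2 3])
;; => (4 6 8)
```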

True. Granted, the author probably should do that fusing themselves, in any threading scenarios, as a matter of refactoring. But sure, why not optimize all the things? Especially if the side-effect semantics remain the same. Hadn’t considered the map fusing idea - I’ll stew on it, thx!

In related news: https://www.discovermagazine.com/mind/your-brain-is-not-a-computer-it-is-a-transducer lol

The problem with all of this is that we’re turning ourselves into mini compilers, implementing heuristics we have empirically learned to transform our code.
What we’re really looking for is a compiler which can do inlining, abstraction, constant propagation and constant folding.
You could probably do something like that with core.logic, if you had enough time and funding. Looks like a good subject for a thesis or three.

The problem with human refactoring? Or do you mean the problem of trying to reason through these optimizations?

Well this particular piece of fruit was hanging lower than I’d realized. I’d been wanting to make it for a while but thought it would be way hairier. I’ve been working on an open source front end lib for higher order components, based on higher order functions, which can take transducers in various places, since it kinda fit as a “transducing context” and I sorta stumbled on the realization that this particular optimization could be simpler than I’d previously thought.

I guess another low hanging piece: If we lifted out cljr-thread-last-all from clj-refactor.el, then we could wind vanilla code into a thread-last and apply this auto-optimization to vanilla code as well :thinking: Essentially a code-walker that can auto-wind up any nested calls into a thread-last form and then apply this macro to fuse up the transducers. I’d probably add that to perc, so any forms within a perc fn get auto-optimized.

The map fusion may be low hanging as well, with simple code walking. An actual compiler that carries an environment around through the code… Maybe ClojuristsTogether could fund something like that one day. I’d throw down on it.

HotSpot gets a (big) vote too. The question then becomes whether the fused/optimized code actually improves upon what HotSpot is able to detect, e.g. via comp, and apply its optimizations to. I think in the case of the original macros here, the benefit is largely just moving away from seq generation and going into loops and composed functions.

I am uncertain (testing would be interesting), since HotSpot apparently likes small methods instead of big ones, so it can more aggressively inline. Maybe fusion isn’t all that great and comp already plays to the JIT’s strengths. Or maybe not.

In the broader context of optimizing “mini compilers” I would certainly welcome that over the burden of trying to write a general purpose optimizing compiler. E.g. optimizing a pipeline into something 2x or more performant seems plenty fine. We already do that idiomatically in Clojure anyway with certain abstractions (arraymap vs. persistentmap; chunked seqs; etc.). Why not have drop-in forms that can try to speed up the local context using limited heuristics and/or passes?

Note: a lot of this could be done on top of tools.analyzer (e.g. more aggressive optimizations) and/or core.typed even. I definitely think it would be a research project, but perhaps well scoped. Again, the benefits are uncertain (absent extreme fusion and automatic lifting e.g. of primitive operations… something like turning your ->> pipeline into an optimized primitive-friendly bytecode representation amenable to type hinting).

Yeah, it should be noted as well:

  1. Because transducers have different laziness semantics, you can’t just auto-optimize all vanilla code naively without some stuff blowing up, so it’s best that the caller is aware of the change of semantics at the call site, via a known macro or something
  2. Some stateful transducers may be optimized for single-threaded performance and may not produce expected results in some multi-threaded scenarios… not sure if that applies to the macros above, as some context outside the thread would be orchestrating multi-threading, which I don’t think would usually bang on those transducers’ internal state independently… But removing the stateful transducers from the above macros may lend the formalism to integrating more parallelism into the threading semantics, with maybe like a px>> that auto-parallelizes the stateless mapping transducer.
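
On point 1, here is a small sketch of the laziness difference using only clojure.core: creating a lazy map seq does no work at all, while sequence with a transducer may realize elements ahead of what the consumer has asked for (treat the exact eagerness at creation time as an implementation detail):

```clojure
;; Lazy seq: creating it performs no work.
(def lazy-calls (atom 0))
(def lazy-s (map (fn [x] (swap! lazy-calls inc) x) [1 2 3]))

;; sequence + transducer: creation may realize elements ahead of demand.
(def xf-calls (atom 0))
(def xf-s (sequence (map (fn [x] (swap! xf-calls inc) x)) [1 2 3]))

;; Once both are fully realized, each input element is transformed once:
(doall lazy-s)
(doall xf-s)
[@lazy-calls @xf-calls]
;; => [3 3]
```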

So small macros like this that provide explicit performance improvements (with associated caveats) may be ideal for exploring the space of possible future optimizations. There are so many different avenues of optimization and some of them are not friendly to one another, so empowering the caller to choose the correct optimization strategy for their needs requires less forethought from the tool developers. The problem is providing powerful tools that fit with developers’ existing UX ergonomics - these thread macros are a perfect fit for auto-transducification, where the optimization already fits with developers’ existing coding ergonomics.

So I agree, and I’ll probably forgo the map-fusion optimization, to try to deliver an optimization semantic that is more easily understood by the caller: this optimization x>> is about fusing transducers in a thread. px>> only works for stateless transducers, etc. (though it’d be interesting to explore if map fusion pays more dividends in multi-threaded mapping-transduction scenarios)

An actual optimizing compiler is outside of the scope of this idea, but this idea could def be used somewhere in an optimizing compiler, if the caller is aware of the changed laziness semantics, or if the compiler can know the call won’t blow up memory.

What about the two other semantic changes I made: nth on integers and allowing the thread-last semantic on transducers in a thread-first x> form?

I’d argue both are ergonomically additive changes, as, for integers, a naked int would have always thrown an exception anyway, and, for x>, without allowing for the apparent thread-last semantic on transducers, there’d be no point in having a thread-first version at all. And very rarely do we ever thread-first some produced function into the first param position of some collection operator in a thread-first macro, so callers will likely know, “Okay, this is a transducer-thread-first macro and the function at this point in the thread is a function that usually takes the collection last, so this is probably the transducer version, so I can imagine this as being like a thread-last semantic at this point in the thread.” And no other semantic could be expected for an x>-like form, because transducers are defined according to their step function, not their collection params. So it’s either that weirdness or no x> macro at all, and I figure it’s better off having it than not, so you can grab it for those thread-first scenarios.

Anyway, if there’s no objections to those two semantic differences, I’ll ship them as is.

I’ll give a simple example of what I mean
Let’s say you write

(get-in m [:a :b])

Since you know the keys at compile time, you could substitute with:

(-> m (get :a) (get :b))

which is faster
But it’s based on human knowledge. A compiler would have done something different:

(get-in m [:a :b])
((fn [m ks] (reduce get m ks)) m [:a :b])
(reduce get m [:a :b])
(reduce get (get m :a) [:b])
(reduce get (get (get m :a) :b) [])
(get (get m :a) :b)

Inline and partially apply as much as possible, without a priori knowledge.
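
That substitution is mechanical when the keys are compile-time literals. A hypothetical macro (call it fast-get-in; it is not an existing library fn) can do the unrolling:

```clojure
;; Unroll (get-in m [:a :b]) into (get (get m :a) :b) at macroexpansion
;; time, when the key vector is a literal.
(defmacro fast-get-in [m ks]
  (assert (vector? ks) "keys must be a literal vector")
  (reduce (fn [acc k] `(get ~acc ~k)) m ks))

(fast-get-in {:a {:b 1}} [:a :b])
;; => 1
```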

There’s now a repo up on Github where you can grab it using deps.edn. Consider it alpha for now. Once it has some usage and any issues are ironed out, I’ll push up an artifact to clojars.

(edit: oh, and there’s also a new extension feature reg-xf that allows the user to extend the macros by adding extra transducers that they’re able to recognize. An example of bringing in cgrand/xforms is in the readme)

Cool! I often end my ->> with (into []) or (into {}). Would it make sense to detect if a group of transducables is followed by an into and use that as the transducing function instead of sequence?

Not a bad idea, but honestly it’d probably only add a few percent increase, based on my testing. Also, using sequence allows us to know we’re getting laziness semantics a little bit closer to the laziness of the pipeline functions we’re replacing with transducers. And you can grab for transducers that are potentially more lazy, like, you could pass in cgrand’s x/into and follow it with a (take 5) and you’ll know the x/into can short-circuit after the first chunk, not having to realize the whole sequence (not tested yet).

In general, I’m pretty amazed at the performance of sequence compared to transduce (which into uses under the hood) given the extra caching and laziness semantics you get out of it - just from my minimal testing. But it would be useful to build a performance test suite to compare all these different scenarios. If transducing groups that end in an into (or any other transducer consumer) show significant dividends in various “general” situations, rather than sequencing them, then I’d try it.

FYI, the ClojureScript version is currently broken. Looks like I’m going to have to drop into the cljs.analyzer.api to resolve names :confused: Should be fixed up this weekend though.

Got a workaround for ClojureScript, listed in the readme. Spin up a repl with criterium and let me know if you find anything odd:

clj -Sdeps \
    '{:deps 
      {io.github.johnmn3/injest {:git/tag "v0.1-alpha.6" :git/sha "0a5f2df"}
       criterium/criterium {:mvn/version "0.4.6"}
       net.cgrand/xforms {:mvn/version "0.19.2"}}}'

Quick update on the API that is settling in and some extra commentary:

1. Extending injest with reg-xf!

You can now register one or more transducers like this (injest/reg-xf! x/reduce). Works in ClojureScript too, but you have to register the names from a Clojure namespace for now.

2. As a replacement for get-in

We’re now wrapping both integers and strings in a get, so that you can concisely thread into basically anything:

(let [m {1 {"b" [0 1 {:c :res}]}}]
  (x> m 1 "b" 2 :c name)) ;=> "res"

I can’t tell you how many times I’ve been using thread macros and I’ve wanted this feature. Why not make numbers and strings usable for lookups at those locations? It just makes sense.

On the subject of ClojureScript, for the above example threads, I’m seeing a 6 times (or more) improvement in speed! This is some very low hanging fruit, folks!

With an initial value of (range 1000000), ClojureScript’s ->> produces:

"Elapsed time: 3523.186678 msecs"
50005499994

While the x>> version produces:

"Elapsed time: 574.145888 msecs"
50005499994

That’s free money, y’all!

I guess the question is, why wouldn’t you just default to x> and x>> always, as a matter of convenience? You get better path navigation through data structures and even if you’re not using a transducer in it now, you might want to grow it that way in the future - so why not just opt for the transducing version now?

And your junior Clojure devs don’t even have to understand transducers. At least not initially. They just have to learn the semantics of the thread-last macro, then give them the x>> macro and then let them go to town. It’s like training-wheels for transducers :slight_smile:

In the rare case you really need to manage things in a threading pipeline as a lazily computed, infinite sequence, then you can just go back to ->>, no big deal, but that’s really the edge case.

Are you familiar with clj-fast?
It has a lot in common.
There is a balance to be struck, though, with code-rewriting macros, especially when they change the actual code being executed. I still need to clarify that in the documentation.

Aye, I’ve checked it out! Pretty awesome stuff. I’ll probably mention your work (and @joinr’s work) in a references section on the repo, as I build it out.

I think you do mention it somewhere in the docs - macro’ized versions of functions have different ergonomics, can’t be passed around, etc. Fortunately, ->> was born with that limitation, so those ergonomics remain consistent for x>>. The practical difference here pertains to eagerness on intermediate values, but I’m betting users won’t even need to be aware of that difference, from a UX perspective, for 99% of their code.

There’s still some area for improvement. I’d like to replace the get with a get-or-nth-for-lists so that the path navigation works for numbers on lists as well, but I’m reluctant to introduce a new runtime function in such a thin, abstract macro. I’d only add that if I can ensure it doesn’t introduce a significant performance hit vs using the plain get at runtime. Ideally, it’d outperform get. Might explore the strategy in clj-fast.core/val-at for that.

Could also wrap the thread-top-level keywords in a get form, but I recently read somewhere that (:x m) is faster than (get m :x), so I left them as is - worth investigating though.

Ultimately, though, I view the purpose of this macro as an ergonomic convenience over the already existing performance tools in Clojure (transducers), more so than inventing any new performance features. It just tries to help you introduce transducers to your codebase while maintaining your other existing expectations as much as possible.

The shortest code path is actually (m :x), including a nil-check.
Regarding eagerness, looking at the final element is sufficient.
If it is (into coll ,,,) you can turn it into transduce.
If not, you can turn it into a sequence.
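
A hedged sketch of that rewrite in plain clojure.core: a thread-last ending in (into []) can collapse the whole group into a single into call with the composed transducer (into runs transduce under the hood when given an xform):

```clojure
;; (->> (range 10) (map inc) (filter odd?) (into []))
;; can become a single eager pass with no intermediate seqs:
(into [] (comp (map inc) (filter odd?)) (range 10))
;; => [1 3 5 7 9]
```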

I’ve been considering writing a macro like that as well, so you’ll excuse me if I refrain from looking at your implementation until I do :slight_smile:

We’re now wrapping both integers and strings in a get, so that you can concisely thread into basically anything

It’s your macro, so you extend the language as you like. However, this is notably different from Clojure semantics: numbers, strings, etc. are not applicable (they can’t be invoked as functions). Maybe there is a reason for this at the language design level (buried in Google Groups discussions from a decade ago). I think the reason for strings is that they leave a broad ambiguity: since strings are indexed, should they act as associative containers like vectors, mapping int → val, or are they implicit keys into an associative container? Numbers have a similar ambiguity, since there are slightly different paths between get/nth depending on support for Indexed. The only types that are explicitly viewed as keys are keywords and symbols, given their IFn implementation.

Since you mentioned targeting newcomers, it probably bears mentioning that the semantics of the codegen in this macro differ from clojure.core since in your (now technically DSL), numbers and strings have implicit IFn implementations with a specific interpretation (where, in the case of strings, they could have multiple interpretations). In the language of the progenitor, this is perhaps “complecting” things. In theory I don’t think it’s a big deal, but in practice it means that

then you can just go back to ->>

does not hold if the details of your DSL are relied on (e.g. strings/numbers interpreted specifically as keyword accessors). This could be jarring to consumers, specifically newcomers, when they are faced with the sudden, perhaps inscrutable error of java.lang.Long cannot be cast to clojure.lang.IFn or java.lang.String cannot be cast to clojure.lang.IFn .
