X> & x>>: auto-transducifying thread macros (now with parallelizing |>> and =>>)

Yeah, I think it’s possible to do in ClojureScript. But it’s discouraged. And I wouldn’t recommend letting your lib leak those callable numbers and strings into users’ application code.

Yeah, I may submit a patch/proposal to ask.clojure.org one day, after these bits settle a little. And true, providing injest/-> and injest/->> would allow easy migration back to fully lazy semantics while preserving the path navigation features.

Before spinning up the gears of Rich and the Clojure Core team on a possible proposal, I’d like to have a thorough debate about the pros and cons, just for my own understanding - I think I’ve considered most possible ergonomics, but I could have missed something.

Objections so far have really boiled down to unfamiliar aesthetics. That’s a fair default objection to have in general, but I’m arguing that this addition brings both syntactic and semantic simplicity by extending existing idioms, so much so that it outweighs the aesthetic unfamiliarity. So if y’all have more objections outside of aesthetics, keep them coming!


I think at this point, perhaps you are asking in the wrong place, and your sample size will be limited. Core Dev / language design is a useful area to discuss these things. Many decisions w.r.t. language design do boil down to aesthetics, principle of least surprise, and other intangibles. A lot of this used to be discussed on the google group for Clojure; then it migrated to Jira patch notes; I am uncertain where the meatier discussions are today (maybe slack). Alex Miller is at least attentive to ask.clojure, clojureverse, and reddit.

Perhaps the implication of having any value (or simply “more” types of primitive values) be interpreted as an applicable function is that expressions like (1 {1 :hello}), i.e. (get {1 :hello} 1), work, while (1 1), i.e. (get 1 1), will just return nil under your implicit interpretation of get (get is somewhat liberal). Should that actually be an error? It’s not novel under existing semantics though: (:a :a) returns nil too (due to get), so there is at least symmetry with the existing treatment of keywords and symbols.
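For reference, the existing behavior being appealed to here can be checked at the REPL with plain clojure.core, no injest required:

```clojure
;; get is liberal: an unsupported subject returns nil rather than throwing
(get {1 :hello} 1)  ;; => :hello
(get 1 1)           ;; => nil

;; a keyword in function position does a get-style lookup, so invoking a
;; keyword on a keyword also quietly returns nil rather than erroring
(:a :a)             ;; => nil
```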

Does this convenience create problems for reasoning down the road? It is, by virtue of history, now idiomatic that numbers and strings do not have a function representation. If I see numbers or strings applied in the function position (or a tool like clj-kondo does), do we introduce a slew of false-positive errors when trying to reason about the code? Maybe this is irrelevant if you are the only one reading the code, or if readers are versed in the expanded idiom.

It would be interesting to see what people who have put much more thought into these questions would have to say.

Yeah, I’ll probably do that at some point soon.

Okay, so I want to conduct a survey. We have a few options with regard to handling numbers in threads.

Always producing an nth is great because it allows us to index into both vectors and lists, but then we’re not as ergonomic with maps with numbers as keys (which is rare, granted)

Always producing get works for both vectors and maps, but then we can’t index into sequences flowing down the thread, which would be awesome

The best of all worlds would be calling get for map values but nth for vectors or lists, but that would require introducing a new runtime function that doesn’t come with core.

Which would you prefer?

  • Numbers always produce an nth (works on vecs and lists)
  • Numbers always produce a get (works on maps and vecs)
  • Numbers should produce get when arg is map, otherwise nth (works on all three, but requires new runtime fn)

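To make the trade-offs concrete, here is how get and nth behave on the three collection types in plain Clojure, plus a sketch of the kind of new runtime function option 3 would require (get-or-nth is a hypothetical name, not anything injest provides):

```clojure
(get [10 20 30] 1)   ;; => 20   (vectors are associative)
(get {1 :a} 1)       ;; => :a
(get '(10 20 30) 1)  ;; => nil  (lists are not associative)

(nth [10 20 30] 1)   ;; => 20
(nth '(10 20 30) 1)  ;; => 20
;; (nth {1 :a} 1)    ;; throws UnsupportedOperationException

;; hypothetical helper for option 3:
;; use get for associative collections, nth for everything else
(defn get-or-nth [coll k]
  (if (associative? coll)
    (get coll k)
    (nth coll k nil)))
```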

Option 4: None of the above.

I’ve been watching this thread for a while without contributing because I think what you’re trying to do is just inherently a bad idea – but it seems common practice for folks who fall in love with macros.

Every macro introduced adds semantic complexity to the language of code that uses it. It’s something that has to be learned by each new person that encounters it and if it isn’t an official core macro, that person has to figure out where it’s coming from and then go read that library’s documentation (and hope it’s good enough).

Because ->> and transducers have different semantics, hiding that difference in a “very similar” x>> macro is kind of the worst of all worlds as far as macro usage goes: the “uncanny valley” where the surface similarity leads people to assume one behavior (because ->> is well-known and well-documented) when the actual behavior is different, and subtly so.
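For readers following along, the distinction can be sketched in plain core terms; the fused form shown here is an approximation of what a transducifying macro does, not injest’s actual expansion:

```clojure
;; classic thread-last: each stage builds its own (chunked) lazy seq
(->> (range 10) (map inc) (filter odd?) (into []))
;; => [1 3 5 7 9]

;; a transducifying macro presumably fuses contiguous transducer-producing
;; stages into a single eager pass (note that comp of transducers applies
;; left-to-right, matching the thread order):
(into [] (comp (map inc) (filter odd?)) (range 10))
;; => [1 3 5 7 9]
```

The results agree here, which is exactly the “surface similarity” at issue: the laziness, intermediate allocation, and side-effect timing underneath differ.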

And on top of that, you’re proposing making your x>> / x> semantics even more misleading by silently supporting constructs that can’t be changed back to ->> / -> (because you’re giving semantics in the x world to constructs that are errors in the core world).

Where you started off – with a very simple syntactic transform – wasn’t too bad (although I would never use it in my code and would never let it come in via a PR review either) but you’re way off the deep end at this point, creating a monstrous “kitchen sink” DSL-in-a-macro.


Thanks for the feedback, Sean!

Every macro introduced adds semantic complexity to the language of code that uses it. It’s something that has to be learned by each new person that encounters it and if it isn’t an official core macro, that person has to figure out where it’s coming from and then go read that library’s documentation (and hope it’s good enough).

Isn’t this always true though? For all new semantics?

Because ->> and transducers have different semantics, hiding that difference in a “very similar” x>> macro is kind of the worst of all worlds as far as macro usage goes: the “uncanny valley” where the surface similarity leads people to assume one behavior (because ->> is well-known and well-documented) when the actual behavior is different, and subtly so.

What then would differentiate an uncanny macro from a canny one? It’s not as if someone would be using x>> unintentionally, by accident, or without knowing what the purpose of x>> is. Its utility isn’t really ambiguous either. What subtle differences would we not know about when deciding to transducify a thread-last thread?

And on top of that, you’re proposing making your x>> / x> semantics even more misleading by silently supporting constructs that can’t be changed back to ->> / -> (because you’re giving semantics in the x world to constructs that are errors in the core world).

For it to be misleading, it would have to be conveying something not true. I think you’re thinking that people will have wrong expectations about how it will behave. I don’t understand why you think that though. The advertised behaviors of the new macros are not exaggerating or making things up. The eagerness semantics of transducers aren’t extremely mysterious. Regarding the new navigational capabilities, there’s not a lot of mystery there either.

Where you started off – with a very simple syntactic transform – wasn’t too bad (although I would never use it in my code and would never let it come in via a PR review either) but you’re way off the deep end at this point, creating a monstrous “kitchen sink” DSL-in-a-macro.

Kitchen-sink!?? It’s a two-line addition, you cantankerous troglodyte! :stuck_out_tongue_winking_eye:

Again, these are all aesthetic objections, unrelated to technical merits or the lack thereof. And I appreciate your aesthetic opinion on it too. But I wouldn’t be making the proposal if I didn’t already disagree with you on all those aesthetic judgements.

People coming fresh to a code base that already uses it – I’m coming at this from a maintenance p.o.v. Functions are far more obvious since they are part of the core semantics.

My purely technical criticism here was about complecting multiple semantic changes, hence “Option 4: none of the above,” by which I mean “if using value X in ->> is an error, using value X in x>> should be a similar error”.

That’s why I haven’t chipped in until this last step where you asked for feedback on how/whether to extend the basic transformation to add semantics that make thread-land → x-land essentially a one-way trip (because x-land → thread-land becomes multiple transformations and they are context-sensitive/value-sensitive).

Uy, y’all keep claiming that introducing both semantics at the same time is a technical factor and not an aesthetic one. I think they are both orthogonal to each other and to existing core behavior, and so do not ergonomically complect at all.

But I’ll make a compromise. The legacy behavior will be requireable like:

  (:require [injest.core :refer [x>>]]
    ...

Where as the new behaviors will be available like:

  (:require [injest.path :refer [x>>]]
    ...

Code shops would have to decide which they are going to be using in their code base. I’m going to recommend injest.path; y’all can recommend injest.core.

Is that a fair compromise?


Just my 2 cents - I like the idea of rewriting threading into transducers, but I would personally prefer if it was a refactoring instead of a macro. In other words, if I could run a function in my IDE to change the actual threading code to become a well formatted transducer flow, instead of it happening automatically at compile time.


cljr-unwind-all could be tweaked to do that pretty easily

That’s a good idea, even a linter that could tell you: this could be rewritten as a transducer, would be useful.

As for the discussion, I think maybe in ClojureScript you face that need more often? That your keys are strings or numbers? Due to interop maybe?

For me, I think it’s just confusing because numbers and strings are not valid functions, so how come they work in the threading macro? Now you’d have to learn about the fact that this is a “special” threading macro that doesn’t only thread things but also treats numbers and strings specially.

I wouldn’t say it’s an aesthetic thing, it’s more an expectation-from-the-reader thing: it’s just not what I expect, so it’d be surprising and confusing at first. And it has the problem that if I get used to it, and I’m suddenly in a context with only the core threading macros, my muscle memory is broken, and I might again be surprised…

If it saved me a lot of verbosity, I might still consider the trade off, but I don’t know, you can use get and get-in and it’s barely any lengthier.
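For comparison, here is the plain-core navigation next to the proposed path-thread form (the x> call is the proposal under discussion from injest.path, not clojure.core, so it is shown only as a comment):

```clojure
;; nested data with string and numeric keys, as from non-keywordized JSON
(def m {"a" [{"b" 1}]})

;; plain core:
(get-in m ["a" 0 "b"])  ;; => 1

;; proposed path-thread equivalent (hypothetical, per the injest proposal):
;; (x> m "a" 0 "b")     ;; => 1
```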

I think it’d be different if it was consistent across the language, but like joinr pointed out, it seems it might not be possible to implement IFn for numbers and strings.

Something else for me is more about the entire library. With the macros x>> and x>, it’s like: okay, this is a transducing threader, I get it, it is conceptually consistent within itself: use transducers like they were sequence functions you could thread together. But now suddenly it’s like… oh, and also this other unrelated feature… And now it makes me think: okay, so are you going to add more unrelated features as the library evolves? And is this library actually better thought of as a better-thread, which is more like: threading with a whole lot of added convenience.

So ya, my vote would be, have one macro for every logically consistent feature set, and if you personally want a macro that has all the features, well create a macro with all of them combined as a better-thread where you can say, this is a full featured threading macro with all the things I always wanted thread-first and thread-last to also support.


Okay, the new ns scheme is up on the repo, with a new release:

clj -Sdeps \
    '{:deps 
      {net.clojars.john/injest {:mvn/version "0.1.0-alpha.12"}
       criterium/criterium {:mvn/version "0.4.6"}
       net.cgrand/xforms {:mvn/version "0.19.2"}}}'

As described above, you can opt into the new path navigation with:

(ns ...
  (:require [injest.path :as injest :refer [x> x>> +>> =>> <>>]]
   ...

injest.path also provides non-transducifying ‘path threads’ +> and +>> so that you can restore laziness to a thread without having to remove any path navigation semantics you may have added.

@didibus Yeah, I’m planning on having a separate one for the parallelized semantic as well.

As for the discussion, I think maybe in ClojureScript you face that need more often? That your keys are strings or numbers? Due to interop maybe?

I’ve actually seen quite a bit of backend code as well, in the wild, having to deal with cheshirized json that, for whatever reason, couldn’t be keywordized. Super common on integrations. Data wrangling. I’d prefer all the keys be keywords but it’s just not like that out there for most dev shops, for a significant slice of their code. So this’ll come in super handy for threading into data coming from json that couldn’t be keywordized.

Thank you for saying more politely, more elaborately, and more convincingly what I was trying to say :slight_smile:

Just FYI, if you’re using a recent CLI version, you can do this instead:

clj -Sdeps \
    '{:deps 
      {io.github.johnmn3/injest 
       {:git/tag "v0.1-alpha.3" 
        :git/sha "71a03de"}}}'

It’s good to get into the habit of using VGN - Verified Group Names - in coordinates for libraries (instead of groups like johnmn3).

@seancorfield Nice, thanks! I’ll update the repo and above references.

You know me: on a mission to get everyone using the latest version of the official Clojure tools :slight_smile:


Yeah, prolly should’a spun the lib up with that new new goodness you put out recently :slight_smile: I’m still catching up with latest tools.

Added lambda wrapping, per Should the threading macros handle lambdas? - Clojure Q&A (updated release coordinates above)

Wrapping lambdas makes threads clearer and more concise, and it has the added benefit of conveying to the reader that the author intends for the anonymous function to take only one parameter. In the classical thread syntax, the reader would have to scan all the way to the end of the (#(... form in order to know whether an extra parameter is being passed in, so the author’s intention is now more explicit. It also prevents people from creating unmaintainable abstractions involving the threading of values into a literal lambda definition, which I would rather not have to maintain.
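The core-macro pitfall being addressed can be seen at the REPL; only the ->> expansion below is plain core, while the wrapping behavior itself is injest’s:

```clojure
;; ->> threads the value *into* the fn form, yielding a function
;; whose body ends in 5, rather than a call to the function:
(macroexpand-1 '(->> 5 (fn [x] (inc x))))
;; => (fn [x] (inc x) 5)

(fn? (->> 5 (fn [x] (inc x))))  ;; => true, probably not what was intended

;; with lambda wrapping, x>> presumably expands to ((fn [x] (inc x)) 5),
;; i.e. 6, which is what the author of the thread most likely meant
```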

With regard to proposals to clojure.core, I don’t think there’s any reason to rush. We could let folks kick the tires for a few months or years, just using the lib. Whether these semantics contribute to more or less code maintainability should start to become more obvious over time.

Personally, I’m a big fan of Clojure’s simplicity. If Rich and crew were not so disciplined about keeping the basic abstractions simple and non-complected, I would not have been able to build the x>> macros. Heck, they couldn’t have made transducers so ergonomic if the 1-arity collection functions were already squatted on. It’s that foresight in not complecting abstractions that prevents Clojure from becoming another JavaScript and has allowed for new, unforeseen capabilities. Sometimes adding fewer things now lets you add better things later. So I’m very sympathetic to knee-jerk aversions to new semantics.

However, with regard to these new semantics, if you analyze their impact, you can see that we’re not barring any potential directions of semantic growth and abstraction that we would want to entertain, nor are we introducing any new abstractions. We’re simply reclaiming unusable tokens for use in the already existing thread abstractions.

Oh, I also got rid of the :exclude [-> ->>] requirement and introduced +> and +>>, which have these path-thread semantics without transducifying their forms (i.e. they keep the lazier behavior). When x> or x>> are required from the injest.path namespace, they have the +>/+>> path-thread semantics.


Parallel => and =>>

Got a new update out last night. Try it out with criterium and net.cgrand/xforms:

clj -Sdeps \
    '{:deps 
      {net.clojars.john/injest {:mvn/version "0.1.0-alpha.12"}
       criterium/criterium {:mvn/version "0.4.6"}
       net.cgrand/xforms {:mvn/version "0.19.2"}}}'

This release comes with parallel versions of x> and x>> which use the equals sign’s two horizontal bars to denote parallelism: => and =>>

The improvements are interesting: instead of using sequence on the thread, => and =>> leverage core.async’s parallel pipeline in order to execute singular or consecutive stateless transducers over a pool of threads equal to (+ 2 your-number-of-cores). Remaining contiguous stateful transducers are dealt with in the same manner as in x> and x>>. It doesn’t work well for small data payloads though, so for demonstration purposes let’s augment our previous example threads:

(require '[clojure.edn :as edn])

(defn work-1000 [work-fn]
  (range (last (repeatedly 1000 work-fn))))

(defn ->>work [input]
  (work-1000
   (fn []
     (->> input
          (map inc)
          (filter odd?)
          (mapcat #(do [% (dec %)]))
          (partition-by #(= 0 (mod % 5)))
          (map (partial apply +))
          (map (partial + 10))
          (map #(do {:temp-value %}))
          (map :temp-value)
          (filter even?)
          (apply +)
          str
          (take 3)
          (apply str)
          edn/read-string))))  

(defn x>>work [input]
  (work-1000
   (fn []
     (x>> input
          (map inc)
          (filter odd?)
          (mapcat #(do [% (dec %)]))
          (partition-by #(= 0 (mod % 5)))
          (map (partial apply +))
          (map (partial + 10))
          (map #(do {:temp-value %}))
          (map :temp-value)
          (filter even?)
          (apply +)
          str
          (take 3)
          (apply str)
          edn/read-string))))

Same deal as before but we’re just doing a little extra work in our thread, repeating it a thousand times and then preparing the results for handoff to the next stage of execution.
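Before the timings, here is a rough sketch of the pipeline strategy described above. The names and structure are illustrative assumptions, not injest’s actual implementation; pipeline preserves input order and fans the transducer work out over n threads:

```clojure
(require '[clojure.core.async :as a])

;; run a stateless transducer over a pool of (+ 2 n-cores) threads,
;; preserving input order, via core.async's pipeline
(defn parallel-sequence [xf coll]
  (let [n   (+ 2 (.availableProcessors (Runtime/getRuntime)))
        in  (a/to-chan! coll)   ;; core.async >= 1.2; older versions: a/to-chan
        out (a/chan n)]
    (a/pipeline n out xf in)
    (a/<!! (a/into [] out))))

(parallel-sequence (comp (map inc) (filter odd?)) (range 10))
;; => [1 3 5 7 9]
```

Note that pipeline applies the transducer to each input independently, which is why stateful transducers (partition-by, take, etc.) cannot go through it and are handled separately.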

Now let’s run the classical ->> macro:

(->> (range 100)
     (repeat 10)
     (map ->>work)
     (map ->>work)
     (map ->>work)
     (map ->>work)
     (map ->>work)
     (map ->>work)
     last
     count
     time)
; "Elapsed time: 18309.397391 msecs"
;=> 234

Just over 18 seconds. Now let’s try the x>> version:

(x>> (range 100)
     (repeat 10)
     (map x>>work)
     (map x>>work)
     (map x>>work)
     (map x>>work)
     (map x>>work)
     (map x>>work)
     last
     count
     time)
; "Elapsed time: 6252.224178 msecs"
;=> 234

Just over 6 seconds. Much better. Now let’s try the parallel =>> version:

(=>> (range 100)
     (repeat 10)
     (map x>>work)
     (map x>>work)
     (map x>>work)
     (map x>>work)
     (map x>>work)
     (map x>>work)
     last
     count
     time)
; "Elapsed time: 2862.172838 msecs"
;=> 234

Under 3 seconds. Much, much better!

All those times come from GitHub’s browser-based VS Code. When running in a local VS Code instance (or in a bare repl), the above times look more like 11812.604504, 5096.267348 and 933.940569 msecs: roughly a 2-fold speedup for the x>> version and over a 10-fold speedup for the =>> version, compared to ->>.

In the future I’d like to explore using a parallel fold instead of core.async, but this works pretty well.

After a few days or weeks - after folks have had a bit to kick the tires - I’ll release a beta version on Clojars and put out a more formal release announcement in a separate set of posts. In the meantime, please give it a whirl and let me know if you find any issues. BTW, there was a bug in the last release that made it impossible to define a thread within a function with bindings - that’s been fixed, but sorry if anyone got bit by it; it would have been pretty confusing. Anyway, enjoy!


So I’ve got another alpha out, this time with parallel r/fold’s Fork/Join under the hood. It’s pretty fantastic. It’s more robust than the pipeline version and much less of a foot-gun when working with smaller workloads.

Bottom line: when trying to parallelize work, if the work is too small, parallelization can actually make the whole job take longer. This is especially true of pipeline: when used on large sequences with small workloads, the problem compounds and it becomes unusable. r/fold is a little more forgiving in this regard, dividing sequences into more manageable partitions. I’m exploring doing automatic partitioning of sequences being passed into the pipeline, but I haven’t come up with anything satisfying yet.
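A minimal illustration of why r/fold is more forgiving: it splits a foldable collection (such as a vector) into partitions of a given size (512 by default), reduces each partition in parallel on the Fork/Join pool, and merges the partial results with a combine fn:

```clojure
(require '[clojure.core.reducers :as r])

;; partition size 512 (the default), + as both combine and reduce fn;
;; r/map returns a foldable "reducer" over the vector, so the whole
;; map-then-sum runs in parallel per partition
(r/fold 512 + + (r/map inc (vec (range 1000))))
;; => 500500
```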

This pretty much sums up the features I wanted on the roadmap, so I’m very close to releasing a beta. My only issue left is naming…

Initially, I named the pipeline-thread-last operator =>>

Then I named the fold-thread-last operator =>> and renamed the pipeline-thread-last to |>>, since I wanted fold-thread-last to be the more-used operator and I thought =>> denotes parallelism better (and |>> is a little ugly).

Then I figured fold-thread-last might be better represented as <>>, where < denotes a fold. So I renamed the pipeline one back to =>>, since I thought |>> was kinda ugly.

It’s nice though that |>> starts with a “pipe” character, which might be better from a mnemonic perspective. OTOH, = looks like a pipe or a parallel set of pipes.

So what do y’all think? Have a preference over names? Answer below or just respond to this poll:

  • <>> for fold and =>> for pipeline
  • =>> for fold and |>> for pipeline
  • =>> for fold and o>> or *>> or anything (answer below)


Anyway, the alphas are now available on clojars as well:

clj -Sdeps \
    '{:deps 
      {net.clojars.john/injest {:mvn/version "0.1.0-alpha.12"}
       criterium/criterium {:mvn/version "0.4.6"}
       net.cgrand/xforms {:mvn/version "0.19.2"}}}'

Once we settle on good names I’ll probably move it into beta and make a more formal announcement on the proper channels.

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.