Sorting a rich sequence without (sort)

@Webdev_Tory - You are correct, the answers I gave assumed 6 items.

Here is an answer that does not, and that has the same or better runtime than any of the transducer-based pathways - and for longer sequences that includes the JavaScript pathway, although for short sequences a concrete construction that just sets values into arrays will dominate, as Chris noted.

user> (def src '(A D B E D F))
#'user/src
user> (def srcv (vec src))
#'user/srcv
user> (require '[tech.v3.tensor :as dtt])
nil
user> (-> (dtt/reshape srcv [(quot (count srcv) 2) 2])
          (dtt/transpose [1 0]))
#tech.v3.tensor<object>[2 3]
[[A B D]
 [D E F]]

Note the transducer pathways are conj'ing results into vectors.
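
For comparison, a sketch of one transducer-based de-interleave (a guess at what those pathways might look like, not the original code): take-nth and drop are both available as transducers, and into conj's the results into vectors as noted above.

(into [] (take-nth 2) src)
;; => [A B D]
(into [] (comp (drop 1) (take-nth 2)) src)
;; => [D E F]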

@zcaudate - iterators do not do 150% of what transducers do, although I think you are trolling us here ;-). There are libraries that make Iterables do some of what transducers do, but transducers allow you to create a pathway that works both with sequences in a pull model and with async pathways in a push model - something that an iterable (or dtype-next/tensor) pathway does not. So by that definition I would say that any aggregation/conversion pathway of iterables or iterators is capable of at most 50% of what transducers do.

Additionally, if you pass an iterator to a function, the state of your original iterator is now in question - but that isn't true of any sequence-based pathway, regardless of whether it is transducer-based or not. That exact property, immutability, is also an important piece to understand. So your iterator approach is now less than 50% of transducers, as it involves making sure someone doesn't mutate your nugget of truth in the middle of your iteration, making your entire construction or program less likely to be correct.


@cnuernber: you're right in one sense about the push/pull marketing trick. Iterators don't need that to get the job done. I'm just glad dtype-next uses the tried and true iterator pathway. Otherwise I'd be avoiding it as I avoid the transducer cult.

I'd like an actual concrete example of something transducers can do but iterators cannot. Mind you, I have studied transducers too.

I'd also like an example of someone dumb enough to write a multithreaded program the way you have described.

I don't think it is a marketing trick. Many people like Clojure due in no small part to the core.async facility, which I believe is what actually prompted the development of transducers. Clojure has decent popularity for processing streams of data such as Kafka queues, where you may choose either push or pull depending on the situation - but I am speaking of other people here, not myself.

The issues with iterators being mutating state aren't specific to multithreading. Any time you have a reference to an iterator as part of an object or closure, there is some chance of someone calling next on that iterator - and the issue is that it isn't specific to the person who first wrote the code. It has to do with the exponential increase in the potential issue manifold that comes with processing data in extremely stateful ways.
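
A minimal sketch of that hazard (the names here are mine, not from any of the libraries discussed): two consumers closing over the same iterator silently steal elements from each other, while a seq over the same data gives every consumer the same view.

(let [it (.iterator ^Iterable [1 2 3 4])
      first-two (fn [] [(.next it) (.next it)])]
  [(first-two) (first-two)])
;; => [[1 2] [3 4]] - the second call sees different data

(let [s (seq [1 2 3 4])
      first-two (fn [] [(first s) (second s)])]
  [(first-two) (first-two)])
;; => [[1 2] [1 2]] - an immutable view gives the same answer every time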

Speaking of which, I think Java streams are a closer match to transducers than iterators. The source code for the Iterable version of the Guava library module you point to states that you should use streams instead. The reason I use iterators in dtype-next or in dataset is purely for performance - to avoid creating many small objects - or because the Java collections already work efficiently in an iterator pathway.

In any case, I think arguments similar to yours about iterator pipelines versus transducer pipelines could also be made between Java 8+ streams and sequence-based transducer pathways.


It's an adapter to conform to the convention established for transducing functions (derived from the earlier reducers lib): Clojure - Transducers

original talk "Transducers" by Rich Hickey - YouTube

The gist of it is that the API expects to find a zero-arg version for initialization/seeding, a 2-arg version (what we are used to in reduce) for stepping or accumulating, and a 1-arg version for completion. If you are used to writing reduce, then you either need to supply these arities yourself, or wrap the reducing function in completing, which will supply the arities for you and uses identity for the completing function (making it equivalent to what reduce is doing).
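
A minimal sketch of those three arities (my own example): the 2-arity step builds up a transient vector, the 1-arity completion makes it persistent, and completing supplies the missing arities when a plain reduce fn is all you have.

(defn rf-conj!
  ([] (transient []))        ; init: seed the accumulation
  ([acc] (persistent! acc))  ; completion: finalize the result
  ([acc x] (conj! acc x)))   ; step: accumulate one input

(transduce (map inc) rf-conj! (range 5))
;; => [1 2 3 4 5]

(transduce (map inc) (completing conj) [] (range 5))
;; => [1 2 3 4 5]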


https://nvd.nist.gov/vuln/detail/CVE-2018-6088#vulnCurrentDescriptionTitle
https://nvd.nist.gov/vuln/detail/CVE-2020-15678

@joinr so you're saying that these issues will be solved if transducers are used?

  1. I'm not involved in these projects but I'd take a guess that the code in question is related to caching (which is definitely hard to do and can lead to errors)

  2. You do not want to use transducers and lazy seqs in these scenarios because they add more complexity, not less.

Streams are repackaged iterators, seqs are cached iterators, Kafka queues are iterators.

They are pretty much all iterators. Things don't have to be marked Iterable to be considered an iterator.

I didn't actually say anything. You asked for examples and these were readily available.

The class of errors here seems preventable with some of the properties @cnuernber mentioned (immutability prevents iterator invalidation - an entire class of error - from the start). Other languages are not immune, although library design (Java got there eventually) can try to detect these things and fail fast (or promise weak consistency guarantees for concurrent collections). Or you resort to static analysis and hope they got that right.
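
The fail-fast behavior looks like this (a contrived sketch, not code from any of the linked CVEs): structurally modifying the backing collection mid-iteration invalidates the iterator, and the JDK turns that into an exception instead of silent corruption.

(let [l  (java.util.ArrayList. [1 2 3])
      it (.iterator l)]
  (.next it)
  (.add l 4)    ; structural modification while iterating
  (.next it))   ; throws java.util.ConcurrentModificationException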

If this can happen to the "dumbies" managing some of the highest visibility code bases in use, I would gather there are far more examples in the wild written by geniuses like us.

Unless I want the elegance of iterators, the composition of functions, the ability to reuse the same iterative processes across different mediums, and to prevent myself from establishing a critical vulnerability or similarly nasty bug that I waste time tracing down. Then I would; I definitely would. Transducers are loops and function composition. Not that complicated IMO.
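
To illustrate the reuse-across-mediums point (my own sketch, not anyone's production code): one xform, several execution contexts, and no rewriting of the processing logic.

(def xform (comp (filter odd?) (map #(* % %))))

(into [] xform (range 10))        ; eager reduction into a vector
(sequence xform (range 10))       ; lazy pull-model seq
(transduce xform + 0 (range 10))  ; aggregation
;; with core.async on the classpath, the same xform also works push-model:
;; (clojure.core.async/chan 10 xform)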

Noted. I should have asked for examples that were fixed when geniuses like us evangelised these codebases with transducers.

Again, that's pretty good marketing. Iterators compose, they also work on immutable collections. And have you ever tried tracking down a bug in a deeply pipelined transducer code base?

Not really.

JS for loops are iterators, Python list comprehensions are iterators.

I believe I grossly underestimated the previous figure of 150%.

Yes. Internet snark generates piss poor requirements. The good news is, you can always cherry pick after the fact, especially when you have reached your conclusions a priori.

And have you ever tried tracking down a bug in a deeply pipelined transducer code base?

Yes. Less difficult than iterator invalidation errors by far.

Not really.

Nah, really.

Look, this was my first snarky requirement and the most important one. Everything else is pretty much irrelevant.

There are no loops. Seqs provide the looping mechanism. Transducers provide the wiring. I'm critical of the wiring.

Can you give an example?

You have conflated seqs and transducers; transducers may operate on seqs, but they also operate on other things, including - tada - lower-level interface implementations that loop over the contents of a collection (or implement the concept of reduction), including iterators. The end result is a composed function invoked inside one or more loops, with no intermediate collections or cached seqs involved at all. You need to study more.
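
A small sketch of that point (my example): the same transduce call runs as a tight internal reduce over a vector (via IReduceInit) and as an iterator loop over a plain java.util.ArrayList (via CollReduce) - no seqs or intermediate collections either way.

(def xf (comp (map inc) (filter even?)))

(transduce xf + 0 [1 2 3 4 5])
;; => 12  ; the vector's internal reduce

(transduce xf + 0 (java.util.ArrayList. [1 2 3 4 5]))
;; => 12  ; an iterator loop over the list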

I already gave several. Can you?

That's what I said - transducers provide the wiring. Well, technically it's the monad encapsulating the transducer interface that provides the wiring. You can write whatever you want in your specific transducer.

I'm actually interested in what you mean by invalidation errors - why do they occur, and why are they hard to track down? I'm using iterators a lot at the moment and I haven't encountered any (maybe because I'm doing caching OK, but maybe I'm missing something). It'd be good to know for future reference.

And sorry to seem snarky, but I get absolutely triggered by transducer code for the sake of transducer code. I've had PTSD over a CRDT codebase used by Atlassian in its Confluence product.

  • Basically, it changed my whole opinion of the whole reactive mantra (including the async/await pattern).
  • Any code that was added to the codebase had to be rewritten in the async style.
  • Because it was written in an async style, every downstream module had to know what the upstream module was going to do, so it was impossible to modularise.
  • Having transducers in there - specifically the 1-arity function, which has the effect of turning the entire pipeline inside out (through transducer function composition) - did not help at all with debugging, especially when one is trying to understand where and when in the pipeline a piece of logic is running.
  • When you add in lazy seqs and the potential for the head to be held, it's a crazy mix of unpredictability and memory leaks.
  • The project was dropped later in favour of an open source CRDT library.

For clarification, what I mean by monadic encapsulation is just this particular pattern of wiring the result of the computation to rf. It's a very clever (hacky) way of reusing the 1-arity case of the function to compose a pipeline. But that's all it is.

(fn rf:map
  ([] (rf))                 ; init: delegate seeding to the wrapped rf
  ([result] (rf result))    ; completion: flush the wrapped rf
  ([result input]           ; step: transform the input, pass it downstream
   (rf result (f input)))
  ([result input & inputs]  ; variadic step, for mapping over multiple colls
   (rf result (apply f input inputs))))
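
To see that wiring in action (my own usage sketch, not from the original post): each transducer takes the downstream rf and returns a new rf, so comp builds the pipeline from the outside in.

(def rf ((comp (map inc) (filter even?)) conj))
(reduce rf [] [1 2 3 4])
;; => [2 4]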

I know that the term transducer cult isn't going to change anybody's mind if their mind is already made up, but maybe there are others that haven't made up their minds and can do the work themselves to figure out what they really prefer.


This is very old news; Rich even brought up the monoidal nature of its origin when the precursor - reducers - was introduced (in reference to Guy Steele's talk on concatenative parallel computations for Fortress, which inspired reducers and then transducers). The design conforms to monoid (or monad plus). I remember the Haskell type astronauts (God love them) spending several months and blog posts elaborating on this at varying levels of abstraction.

transducer cult

This sounds like something the goth kids from South Park would say.
