I think with a lot of these, optimizing the Clojure compiler for them often doesn’t matter much, because the JIT may already optimize them.
I think the best improvement to the compiler (but one that would be a huge undertaking), would be to group functions and vars under the same class file, because class loading really slows down startup time, if an entire namespace could somehow compile to a single or only a few classes it could improve startup considerably.
Similar to that, I’d love to see dead code elimination, so if I only use 5 functions from core, I should only pay the cost of initializing those 5 core Vars, their metadata and their functions. That would also reduce startup time a lot.
GitHub - bsless/clj-fast: Unpredictably faster Clojure is basically doing this stuff at the library level. I think there are opportunities for an optimizing compiler built on top of core.typed and tools.analyzer (look at the stuff that Ramsey did with MAGIC compiler and building optimizing passes for .net/CLR stuff). Definitely some interesting untapped potential.
To add a bit regarding clj-fast: the biggest benefits I found were from loop unrolling, with speedups from 2x up to an order of magnitude, depending on the function and the size of the collection iterated over.
Generally, it can be done by a combination of two (and a half) passes: constant propagation, function call inlining, and partial application. By partial application I mean that even if you have a vector [x y z] where these symbols are arguments to a function (so their values aren’t known), the nth of the vector at each index is already fixed at compile time.
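As a sketch of what that kind of unrolling looks like, here is a minimal hypothetical macro (not clj-fast’s actual API) that expands a get-in over a literal key path into nested get calls at compile time:

```clojure
;; `fast-get-in` is a hypothetical illustration, not clj-fast's API.
;; When the key path is a compile-time literal vector, the loop over
;; keys is unrolled into nested get calls; otherwise it falls back
;; to the normal polymorphic get-in.
(defmacro fast-get-in
  [m ks]
  (if (vector? ks)
    (reduce (fn [acc k] `(get ~acc ~k)) m ks)
    `(get-in ~m ~ks)))

;; (fast-get-in m [:a :b]) expands to (get (get m :a) :b),
;; so no seq over the keys is walked at runtime.
```

The same idea applies to the [x y z] case: since the index of each element in the literal is fixed, nth on it can be resolved during macroexpansion.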
In terms of dispatching to the specific implementation instead of working through clojure.lang.RT I did see speedups, but with the exception of clojure.core/find they were not dramatic.
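To make that dispatch difference concrete, here is an illustrative snippet comparing the RT-mediated path with a direct call to the underlying Clojure interfaces:

```clojure
;; get and find route through clojure.lang.RT, which branches on the
;; concrete type of the collection at every call:
(get {:a 1} :a)    ;; => 1
(find {:a 1} :a)   ;; => [:a 1]

;; Dispatching straight to the interface method skips that branching:
(.valAt ^clojure.lang.ILookup {:a 1} :a)        ;; => 1
(.entryAt ^clojure.lang.Associative {:a 1} :a)  ;; => [:a 1]
```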
The Clojure compiler is (intentionally) pretty basic, more of a translator from Clojure source to Java bytecode, than an optimizer. The bet there is that the JIT (with 100s of person-years of engineering in it) can do more and do better than the compiler with dynamic information. That was a great bet when it was made, and is still pretty good.
Direct linking using static calls makes a lot of the call paths easier for the JIT to analyze and optimize (not needing to go through the synchronized Var loads). Transducers tend to build stacks of mostly small-ish non-synchronized functions always called with the same types so are also pretty amenable to JIT optimization.
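For instance, a typical transducer stack looks like this:

```clojure
;; The pipeline composes small, non-synchronized fns once; the
;; composed stack is then invoked repeatedly with the same types,
;; which makes it easy for the JIT to inline and optimize.
(def xf (comp (map inc) (filter even?)))

;; Sums the even numbers among 1..1000.
(transduce xf + 0 (range 1000)) ;; => 250500
```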
There is a branch with lazy var loading from way back (it takes some effort to merge it due to drift over the years). That branch, especially when combined with direct linking, means many fewer vars need to be loaded at startup/load time, and can reduce startup times significantly. The reason it’s not been pulled in is that the delayed var loading required a conditional check (for whether it’s loaded) that makes every var invocation slower. Ghadi has done some work replacing that part with dynamic guards that seems like it has both fast load and fast invocation. Maybe we’ll get back to that some day; it could be a nice reduction in load/start times (25-30% maybe?).
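A userland sketch of that tradeoff (this is not the actual compiler mechanism, just an illustration using delay, and it assumes Clojure 1.10+ for requiring-resolve):

```clojure
;; Lazy loading: the var isn't resolved until first use, but every
;; call now pays a realized? check inside force before invoking --
;; the per-invocation cost described above.
(def lazy-upper (delay (requiring-resolve 'clojure.string/upper-case)))

(defn shout [s]
  ;; force checks whether the delay is realized on every invocation
  ((force lazy-upper) s))
```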
There are a few other places where dynamic stuff could help as well, but I’m not sure those are life changing. Some of the value-oriented features coming to the JVM are things that would greatly benefit Clojure as more direct translations for what we do now and those might be very useful (could make real tuple support make sense for example).
I think a lot of the clj-fast stuff is generally a bad idea - it’s avoiding abstractions that make Clojure what Clojure is, potentially cuts you off from optimizations that could be made inside Clojure in the future, in some cases is less portable to other Clojure dialects, and does not make any difference to your overall program performance unless it’s in very hot code paths. I wish the guidance around it was a lot better, to make these tradeoffs and the good applications clear.
I know many things would be easier to optimize at runtime by the magic of the JVM, but some things are only possible to optimize at compile time, e.g. the use of persistent data structures. With careful static analysis, we could safely replace them with a faster implementation, or one that generates fewer objects.
Most of the performance issues I’ve faced in production are from the unreasonably large amount of object creation, mostly contributed by manipulating persistent data structures / destructuring. I know how to avoid it, but the resulting code just looks awful and not like Clojure.
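As an illustration of the kind of rewrite involved (the function names here are hypothetical), compare a destructuring version of a distance function with a record-based one using type hints and direct field access:

```clojure
;; Destructuring expands into polymorphic get calls and boxed
;; intermediates on every invocation:
(defn dist-destructured [{:keys [x y]}]
  (Math/sqrt (+ (* x x) (* y y))))

;; A record with primitive fields plus a type hint allows direct,
;; unboxed field access -- faster, but noticeably less Clojure-y:
(defrecord Point [^double x ^double y])

(defn dist-direct [^Point p]
  (Math/sqrt (+ (* (.x p) (.x p)) (* (.y p) (.y p)))))
```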
Which part, in particular?
Regarding dispatching to concrete methods over going through RT I’d even agree. As I move the library out of alpha in the future I will make it clearer, maybe even move it to a different namespace.
However, regarding loop unrolling, besides the reliance on :inline (which is still experimental), I don’t really see the objection. While the JIT is incredible, I’ve yet to see it manage to optimize away the use of reduce1 in get-in.
Extra arities are even considered for some core functions, from what I understand (assoc, for example).
It is true I did not consider other dialects. It was born out of my needs and profiling results of backend applications which churn hundreds of billions of messages per day.
As always regarding performance optimization, we can pull out Knuth’s old adage: “Premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”
The programmer has some responsibility to know what they’re doing. I wouldn’t bother optimizing loading up the configuration file. If someone uses clj-fast for that, it’s their mistake.
Interestingly, making loading vars lazy could have negative consequences for GraalVM native-images. Since var loading involves class loading, this work must be done at image build time, hence the option --initialize-at-build-time is needed when producing native images. Luckily vars are initialized in static initializer blocks, so this work can in fact be done at image build time. Delaying this work to runtime will probably render Clojure native images completely unusable. I hope that any changes done to var loading will consider this scenario. Perhaps the old behavior can be preserved using a system property.
Yesterday I was discussing with @Chris_Nuernberger that the Clojure compiler (or some extra tool) could solve this “AOT” scenario in a different way. Perhaps it could emit some class loading code that could be run “at build time” only for the vars needed in a final program, or the lazy loading (if that implementation is chosen) could be forced ahead of time for only the relevant vars, at build time.
For scenarios where startup time is important (short running scripts, AWS lambda) one could currently consider a GraalVM native-image based solution or babashka.
I agree, that’s why I don’t suggest lazy loading Vars, I think dead code elimination would be much better, and possibly batching functions into the same generated class files as well (though I think that would be a huge change).
That’s true, but it would be nice if Clojure did it as well when running under the normal JVM.
I’d imagine AOT compilation would offer it as an option (since DCE would prevent REPL use in production).
GraalVM is nice, but it’s cumbersome for things like Cloud Functions and Lambdas. Being able to get Clojure start time down and bundle size down while running in the normal Java container would still be good here.
I guess lazy loading Vars would have the benefit of making REPL and script start times faster as well, since DCE doesn’t make sense for those (there’s no pre-compile pass). Like you said, as long as there’s a flag to choose between lazy loading and pre-loading Vars, it could be a nice option.
I was speaking with @jackrusher the other day about ClojureScript and he made the observation that JVM-targeted Clojure, ClojureScript, and the new compiler targeting Dart don’t really share any common code / abstractions.
Perhaps there is an opportunity to create some core compiler passes that can be shared and then folks can implement emitters for the different targets?
Being not so familiar with the implications of this, I’d love to hear other’s thoughts on this idea.
My thoughts were pretty abstract on this. Graal native currently only works with Clojure with a flag, --initialize-at-build-time. My thought was: if whatever that flag does could be done during the AOT step, it would create a hard-linked set of classes, along with data loaded from a sidecar file of some sort, since you can’t put pure data in bytecode files and lots of vars are just persistent data structures of one form or another.
A related observation is that tech.ml.dataset, even when pre-compiled with AOT, takes about 1 second to be usable. Perhaps this is partially due to the number of classes produced or something along those lines, but some of it is due to RT.var(x,y) being called in lots of static initializers. One concrete idea would be to hard-link those to the actual static instances, so for example RT.var(“clojure.core”, “println”) would get hard-linked to whatever static instance represents the println function.
With new JDK implementations I think there is also an opportunity to cut down on the number of classes produced, since you have method handles to call a function generically and somewhat efficiently, so lots of bespoke AFn implementations could be replaced with a specific-arity method handle overload. This might allow multiple functions to be created in one class, along with some collection of method-handle-based AFn instances. That is of course a massive change, and it would be JDK-11+ specific, so the time for that is not anywhere near now.
So, there are really two thoughts. First, could we do whatever is necessary to remove the --initialize-at-build-time flag from Graal native compilations, which involves more compile-time static initialization of member variables and data structures. Second, as new tech comes out it is always interesting to reconsider architectural choices to see if there is any advantage there. I have had the same thoughts as @didibus w/r/t generating fewer bespoke classes, but I can’t see a way past it without Java supporting IFn at a lower level, which it really does not until you have method handles.
I feel in my bones this getting shot down, but as Clojure is more or less stable, maybe it’s time to spend effort on documenting the compiler and doing general source code cleanup. Building out a larger suite of regression tests, basic stuff like standardizing formatting and documenting methods, documenting overarching design of features like the STM, etc.
You know, all the stuff Rich Hickey thinks is pointless.
Then maybe start working on prototypes of what can be done with MethodHandles and other JDK-11+ specific features, or unifying the different compilers like @philomates suggested.
When I run this locally I get less dramatic differences, but still significant (~15% vs. the posted ~56% slowdown). If these numbers are at all accurate, that suggests there could be a lot to be gained from some simple optimizations in the compiler.
The function invocation path (map-as-function) typically has the lowest overhead. So going with (the-map :some-key) will be fast and portable. I changed my idioms to start using this when I realized that (I used to use keyword-as-function and clojure.core/get quite a bit before). It’s about 5ns faster on my machine than the keyword version (if you believe the nanos…). The tradeoff is that you lose the support for the broader java.util.Map, java.util.List, and array cases that nth/get basically paper over for you; and non-Clojure types aren’t IFns, so they will throw an exception. Still, if you’re slamming Clojure maps, sets, and vectors, it’s a clean and fast way to get lower-level speed.
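To make the three call paths (and the portability caveat) concrete:

```clojure
(def the-map {:some-key 42})

(the-map :some-key)      ;; map-as-function: direct IFn invoke => 42
(:some-key the-map)      ;; keyword-as-function => 42
(get the-map :some-key)  ;; goes through clojure.core/get / RT.get => 42

;; Caveat: non-Clojure collections aren't IFns, so map-as-function
;; throws on a java.util.Map, while get still works via RT:
(def jmap (doto (java.util.HashMap.) (.put "k" 1)))
(get jmap "k")  ;; => 1
;; (jmap "k")   ;; => ClassCastException
```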
Destructuring will still cost, since it assumes maximum polymorphism and doesn’t leverage any type hint information. So sequential things always go through clojure.core/nth, and map-like things always use clojure.core/get, both of which do checks and branching. You get substantially faster if you avoid these, namely by doing what I mentioned above, or (in some cases, like with direct field access) using type hints and direct method/field invocation.

The library @didibus mentioned provides some tooling to let you leverage types a bit, which is handy with things like records or deftypes or JVM objects that can be really efficiently accessed, or even instances of Indexed or ILookup. Savings can be substantial for hot loops (which was the origin of the lib). I think the clj-fast lib is incorporating its own variant of these ideas.

It would ultimately be “nice” to have some optimizing compiler passes akin to what SBCL does, where you can leverage the type system / inference engine to generate even better code, with user-defined safety/debug/speed levels (e.g. if you want to force type specialization akin to what structural does, the compiler could just unpack all that stuff for you).
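Roughly, sequential destructuring expands like this, and a type hint lets you bypass the polymorphic nth entirely (illustrative snippet):

```clojure
(def v [10 20 30])

;; (let [[a b] v] ...) expands, roughly, to calls through the
;; polymorphic clojure.core/nth with a not-found default:
(let [a (clojure.core/nth v 0 nil)
      b (clojure.core/nth v 1 nil)]
  (+ a b)) ;; => 30

;; Hinting the collection as clojure.lang.Indexed allows a direct
;; interface call, skipping nth's checks and branching:
(let [^clojure.lang.Indexed iv v]
  (+ (.nth iv 0 nil) (.nth iv 1 nil))) ;; => 30
```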
The Clojure compiler is old and pre-dates the “AST as data” approach implemented in ClojureScript. I’m not sure the claim about Clojure Dart is true, since I believe they started from ClojureScript, so “AST as data” is probably true there as well. ClojureScript, and probably Clojure Dart, should conform to the same base AST representation defined by tools.analyzer. Given that, I definitely don’t see why transformations could not be shared.
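For reference, the “AST as data” representation looks like this (assuming org.clojure/tools.analyzer.jvm is on the classpath):

```clojure
;; tools.analyzer.jvm analyzes a form into a plain data AST that
;; passes can walk and transform like any other Clojure data.
(require '[clojure.tools.analyzer.jvm :as ana.jvm])

(def ast (ana.jvm/analyze '(if true :then :else)))

(:op ast)   ;; => :if -- each node carries an :op key identifying it
(:form ast) ;; the original form is kept on the node as well
```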