Does Clojure still have room to improve at the compiler level?

zerg000000 · June 29, 2021, 1:08am

I have been watching Rust's development over the past few years. Its zero-cost abstractions in particular impressed me a lot.

I know we have already done some great work, like direct linking, but what could we do next?

seancorfield · June 29, 2021, 1:41am

I’ll be interested to see what response you get to this.

The evolution of the Clojure compiler is extremely conservative because a) backward compatibility is incredibly important, both to the Clojure core team and to many of Clojure’s commercial users and b) any changes need to be very carefully analyzed to ensure they do not cause any performance regressions.

It’s only relatively “recently” that Clojure abandoned support for pre-Java-8 versions.

jiyinyiyong · June 29, 2021, 4:38am

Implementing another Clojure in Rust?

zerg000000 · June 29, 2021, 6:15am

I can imagine a few:

Direct calls to the actual method when type info can be obtained? The current core functions do a lot of type checks before calling the actual implementation.

Many function calls could be replaced at compile time, e.g. assoc/get-in with static keys.

Replace local maps/vectors with mutable variants?

Automatically convert threaded map/filter pipelines to transducers?

Aggressive function inlining?
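For the "mutable variants" idea, Clojure already offers transients, which a programmer can apply by hand to a collection that never escapes a function. A minimal sketch of what such a rewrite might look like follows. The names `squares-persistent` and `squares-transient` are made up for illustration, and an automatic version would need an analysis proving that the collection stays local, which nothing in this thread suggests the compiler does today.

```clojure
;; Baseline: builds the vector with persistent `conj` at every step.
(defn squares-persistent [n]
  (reduce (fn [acc i] (conj acc (* i i))) [] (range n)))

;; Hand-applied "mutable variant": the vector is transient only inside
;; this function and is frozen with `persistent!` before it is returned.
(defn squares-transient [n]
  (persistent!
   (reduce (fn [acc i] (conj! acc (* i i))) (transient []) (range n))))
```

Both functions return equal vectors; only the intermediate representation differs.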

didibus · June 29, 2021, 7:04am

I think with a lot of these, what happens is often that optimizing the Clojure compiler for those doesn’t matter necessarily, because the JIT might already optimize those.

I think the best improvement to the compiler (but one that would be a huge undertaking) would be to group functions and vars under the same class file, because class loading really slows down startup. If an entire namespace could somehow compile to a single class, or only a few, startup could improve considerably.
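As a small illustration of why class count matters here: on the JVM each `fn` form compiles to its own class, so even two textually identical functions produce two classes. The names `f`, `g` and `distinct-classes?` below are just for the example.

```clojure
;; Two identical function bodies, defined separately.
(def f (fn [x] x))
(def g (fn [x] x))

;; Each fn form gets its own generated class, so these differ.
(def distinct-classes? (not= (class f) (class g)))
```

Printing `(class f)` and `(class g)` at a REPL shows the two generated class names.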

Similar to that, I’d love to see dead code elimination, so if I only use 5 functions from core, I should only pay the cost of initializing those 5 core Vars, their metadata and their functions. That would also reduce startup time a lot.

joinr · June 29, 2021, 7:26am

bsless/clj-fast ("Unpredictably faster Clojure", on GitHub) is basically doing this stuff at the library level. I think there are opportunities for an optimizing compiler built on top of core.typed and tools.analyzer (look at the stuff that Ramsey did with the MAGIC compiler and building optimizing passes for .NET/CLR stuff). Definitely some interesting untapped potential.

bsless · June 29, 2021, 8:51am

To add just a bit with regard to clj-fast, the biggest benefits I found were from loop unrolling, with speedups of 2x up to an order of magnitude, depending on the function and the size of the collection iterated on.
Generally, it can be done by a combination of two (and a half) passes - constant propagation, function call inlining, and partial application. By partial application I mean even if you have a vector of [x y z] where these symbols are arguments to a function, the nth of the vector will not change.
In terms of dispatching to the specific implementation instead of working through clojure.lang.RT I did see speedups, but with the exception of clojure.core/find they were not dramatic.
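To make the unrolling idea concrete, here is a minimal sketch of a macro that expands a `get-in` with a literal key vector into nested `get` calls at macroexpansion time. It is not clj-fast's actual implementation; the macro name `get-in-unrolled` and the `config` map are hypothetical, and the sketch ignores the not-found arity of `get-in`.

```clojure
(defmacro get-in-unrolled
  "Hypothetical helper: expands into nested `get` calls.
  Only works when `ks` is a literal vector known at compile time."
  [m ks]
  (when-not (vector? ks)
    (throw (IllegalArgumentException. "ks must be a literal vector")))
  ;; Fold the keys into nested forms: (get (get m k1) k2) ...
  (reduce (fn [form k] `(get ~form ~k)) m ks))

(def config {:db {:pool {:size 10}}})

;; Expands to (get (get (get config :db) :pool) :size); no loop at runtime.
(def pool-size (get-in-unrolled config [:db :pool :size]))
```

Because the key vector must be a literal, the macro cannot replace `get-in` calls whose path is computed at runtime.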

alexmiller · June 29, 2021, 1:37pm

The Clojure compiler is (intentionally) pretty basic, more of a translator from Clojure source to Java bytecode, than an optimizer. The bet there is that the JIT (with 100s of person-years of engineering in it) can do more and do better than the compiler with dynamic information. That was a great bet when it was made, and is still pretty good.

Direct linking using static calls makes a lot of the call paths easier for the JIT to analyze and optimize (not needing to go through the synchronized Var loads). Transducers tend to build stacks of mostly small-ish non-synchronized functions always called with the same types so are also pretty amenable to JIT optimization.
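A small sketch of the Var indirection that direct linking removes: when a call goes through a Var, rebinding the Var's root changes what the caller sees. The function names below are made up, and `^:redef` is used so that the call stays indirect even in a build where direct linking is enabled.

```clojure
;; ^:redef opts this var out of direct linking, so callers always
;; look the function up through the Var.
(defn ^:redef greet [] "hello")

(defn caller [] (greet))

;; Temporarily swapping the Var's root is visible to `caller`.
(def redefined-result
  (with-redefs [greet (fn [] "redefined")]
    (caller)))

;; Outside with-redefs the original root value is back.
(def restored-result (caller))
```

With direct linking and no `^:redef`, the call in `caller` would be compiled as a static invocation, which is the easier-to-analyze path described above but one that no longer observes such redefinitions.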

There is a branch with lazy var loading from way back (it takes some effort to merge it due to drift over the years). That branch, especially when combined with direct loading, means many fewer vars need to be loaded at startup/load time and can reduce startup times significantly. The reason it’s not been pulled in is that the delayed var loading required a conditional check (for whether it’s loaded) that makes every var invocation slower. Ghadi has done some work replacing that part with dynamic guards that seems like it has both fast load and fast invocation. Maybe we’ll get back to that some day, could be a nice reduction in load/start times (25-30% maybe?).
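As a loose analogy only (this is not how the lazy var branch is implemented), the tradeoff can be sketched with a `delay`: nothing is initialized until first use, but every call has to dereference the delay, which checks whether the value has been realized yet. All names here are hypothetical.

```clojure
(def load-count (atom 0))

;; The "expensive" initialization runs only on first deref.
(def lazy-impl
  (delay
    (swap! load-count inc)
    (fn [x] (* 2 x))))

;; Every invocation pays for the deref (an "is it loaded?" check)
;; before it reaches the actual function.
(defn call-lazy [x]
  (@lazy-impl x))
```

Startup is cheaper because `lazy-impl` is not realized at definition time, while each `call-lazy` carries the extra check.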

There are a few other places where dynamic stuff could help as well, but I’m not sure those are life changing. Some of the value-oriented features coming to the JVM are things that would greatly benefit Clojure as more direct translations for what we do now and those might be very useful (could make real tuple support make sense for example).

I think a lot of the clj-fast stuff is generally a bad idea - it’s avoiding abstractions that make Clojure what Clojure is, potentially cuts you off from optimizations that could be made inside Clojure in the future, in some cases is less portable to other Clojure dialects, and does not make any difference to your overall program performance unless it’s in very hot code paths. I wish the guidance around it was a lot better to make these tradeoffs and good application clear.

zerg000000 · June 29, 2021, 3:47pm

I know many things are easier to optimize at runtime by the magic JVM, but some things can only be optimized at compile time, e.g. the use of persistent data structures. With careful static analysis, we could safely replace them with a faster implementation, or one that generates fewer objects.

Most of the performance issues I have faced in production come from an unreasonably large amount of object creation, mostly caused by manipulating persistent data structures and destructuring. I know how to avoid it, but the resulting code just looks awful and not like Clojure.

bsless · June 29, 2021, 6:26pm

Which part, in particular?
Regarding dispatching to concrete methods over going through RT I’d even agree. As I move the library out of alpha in the future I will make it clearer, maybe even move it to a different namespace.
However, regarding loop unrolling, besides the reliance on :inline which is still experimental, I don’t really see it. While the JIT is incredible, I’ve yet to see it manage to optimize away using reduce1 in get-in.
Extra arities are even considered for some core functions, from what I understand (assoc, for example).
It is true I did not consider other dialects. It was born out of my needs and profiling results of backend applications which churn hundreds of billions of messages per day.
As always regarding performance optimization, we can pull out Knuth’s old adage:

premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

The programmer has some responsibility to know what they’re doing. I wouldn’t bother optimizing loading up the configuration file. If someone uses clj-fast for that, it’s their mistake.

seancorfield · June 29, 2021, 6:58pm

On automatically converting threaded map/filter to transducers: these aren’t semantically equivalent at all, so this certainly isn’t something that should be done “automatically”. map/filter are lazy, transducers are not.
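The difference is easy to observe with a side-effecting function. In this sketch (all names are illustrative), building the lazy pipeline does no work at all, while the transducer version with `into` runs the whole pipeline immediately.

```clojure
(def calls (atom 0))

(defn traced-inc [x]
  (swap! calls inc)   ; count how often the function actually runs
  (inc x))

;; Lazy: this only builds the sequence; traced-inc has not run yet.
(def lazy-evens (->> (range 10) (map traced-inc) (filter even?)))
(def calls-after-lazy-build @calls)

;; Eager: `into` with a transducer processes all 10 elements right away.
(def eager-evens (into [] (comp (map traced-inc) (filter even?)) (range 10)))
(def calls-after-eager-build @calls)
```

Both pipelines produce the same elements once the lazy one is consumed, but code that relies on deferred or partial evaluation would behave differently after an automatic rewrite.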

On aggressive inlining: this can also be dangerous, since inlining too much can change what the JIT can do and can make performance worse.

borkdude · June 30, 2021, 9:48am

Interesting article (from 2014) about loading vars and Clojure boot time:

https://blog.ndk.io/solving-clojure-boot-time.html

Interestingly, making loading vars lazy could have negative consequences for GraalVM native-images. Since var loading involves class loading, this work must be done at image build time, hence the option --initialize-at-build-time is needed when producing native images. Luckily vars are initialized in static initializer blocks, so this work can in fact be done at image build time. Delaying this work to runtime will probably render Clojure native images completely unusable. I hope that any changes done to var loading will consider this scenario. Perhaps the old behavior can be preserved using a system property.

Yesterday I was discussing with @Chris_Nuernberger that the Clojure compiler (or some extra tool) could solve this “AOT” scenario in a different way. Perhaps it could emit some class loading code that could be run “at build time” only for the vars needed in a final program, or the lazy loading (if that implementation is chosen) could be forced ahead of time for only the relevant vars, at build time.

For scenarios where startup time is important (short-running scripts, AWS Lambda) one could currently consider a GraalVM native-image based solution or babashka.

didibus · July 1, 2021, 12:13am

I agree, that’s why I don’t suggest lazy loading Vars, I think dead code elimination would be much better, and possibly batching functions into the same generated class files as well (though I think that would be a huge change).

borkdude · July 1, 2021, 10:34am

This is already what GraalVM native-image is doing.

didibus · July 1, 2021, 6:47pm

That’s true, but it would be nice if Clojure did it as well when running under the normal JVM.

I’d imagine an AOT compilation would do it as an option (cause DCE would prevent production REPL use)

GraalVM is nice, but it’s cumbersome for things like Cloud Functions and Lambdas. Being able to get Clojure start time down and bundle size down while running in the normal Java container would still be good here.

I guess lazy loading Vars would have the benefit of making REPL start time and scripts start time and such faster as well, since DCE doesn’t make sense for those, cause there’s no pre-compile pass. I think like you said, as long as maybe there’s a flag where you can choose between lazy loading and pre-loading Vars, it could also be a nice option.

philomates · July 2, 2021, 12:26pm

I was speaking with @jackrusher the other day about ClojureScript and he made the observation that JVM-targeted Clojure, ClojureScript, and the new compiler targeting Dart don’t really share any common code / abstractions.

Perhaps there is an opportunity to create some core compiler passes that can be shared and then folks can implement emitters for the different targets?

Not being so familiar with the implications of this, I’d love to hear others’ thoughts on this idea.

Chris_Nuernberger · July 2, 2021, 12:55pm

My thoughts were pretty abstract on this. Graal native currently only works with Clojure with a flag, --initialize-at-build-time. My thought was that whatever that flag does could perhaps be done during the AOT step, creating a hard-linked set of classes along with data that would be loaded from a sidecar file of some sort, since you can’t put pure data in bytecode files and lots of vars are just persistent data structures of some form or another.

A related observation is that tech.ml.dataset, even when pre-compiled with AOT, takes about 1 second to be usable. Perhaps this is partially due to the number of classes produced or something along those lines, but some of it is due to RT.var(x,y) being called in lots of static initializers. One concrete idea would be to hard-link those to the actual static instances, so for example RT.var("clojure.core", "println") would get hard-linked to whatever static instance represents the println function.
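For readers who have not seen it, the lookup in question can be tried from Clojure itself. This sketch only shows that the string-based call resolves to the same Var object as the `#'` reader form; the names `looked-up` and `same-var?` are just for the example.

```clojure
;; Look the Var up by namespace and name strings, as generated
;; static initializers do.
(def looked-up (clojure.lang.RT/var "clojure.core" "println"))

;; The lookup returns the very same Var object as #'clojure.core/println.
(def same-var? (identical? looked-up #'clojure.core/println))
```

Each such call goes through namespace and symbol lookup at class initialization time, which is the work the hard-linking idea would avoid.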

With new JDK implementations I think there is also an opportunity to cut down on the number of classes produced, as you have method handles to somewhat efficiently and generically call a function, so lots of bespoke AFn implementations can be replaced with a specific-arity method handle overload. This may allow for multiple functions to be created in one class along with some collection of method-handle AFn instances. That is of course a massive change and it would be JDK-11+ specific, so the time for that is not anywhere near now.

So, there are really two thoughts. First, could we do whatever is necessary to remove the --initialize-at-build-time flag from Graal native compilations, which would involve more compile-time static initialization of member variables and data structures? Second, as new tech comes out it is always interesting to reconsider architectural choices to see if there is any advantage there. I have had the same thoughts as @didibus w/r/t generating fewer bespoke classes, but I can’t see a way past it without Java supporting IFn at a lower level, which it really does not until you have method handles.

bowbahdoe · July 11, 2021, 3:18am

I feel in my bones this is getting shot down, but as Clojure is more or less stable, maybe it’s time to spend effort on documenting the compiler and doing general source code cleanup: building out a larger suite of regression tests, basic stuff like standardizing formatting and documenting methods, documenting the overarching design of features like the STM, etc.

You know, all the stuff Rich Hickey thinks is pointless.

Then maybe start working on prototypes of what can be done with MethodHandles and other JDK-11+ specific features, or on unifying the different compilers like @philomates suggested.

zerg000000 · July 11, 2021, 4:37am

More documentation of the language standard would be a great help for various Clojure implementations in different host languages.

On the JVM side, host language changes are interesting, e.g. MethodHandles, value types, Loom, etc.

On the JS side, there are ES modules, classes, and template literals. We need at least some way to interop with them, so that we can live happily with the host language…

Josh_Lemer · August 1, 2021, 5:12pm

Someone at work pointed this out recently:

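;; `c` is assumed to be an alias for criterium.core; the posted snippet omitted the require.
(require '[criterium.core :as c])

(let [f-keys (fn [{:keys [foo bar] :as m}]      ; :keys destructuring
               (str foo bar))
      f-kws (fn [{foo :foo bar :bar :as m}]     ; explicit key destructuring
              (str foo bar))
      f-gets (fn [m]                            ; plain keyword lookups
               (str (:foo m) (:bar m)))
      m {:foo "a" :bar "b"}]
  (c/quick-bench (f-keys m))   ; 111.027050 ns
  (c/quick-bench (f-kws m))    ; 106.243598 ns
  (c/quick-bench (f-gets m)))  ;  71.026220 ns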

When I run this locally I get less dramatic differences, but they are still significant (~15% vs. the posted ~56% slowdown). If these numbers are at all accurate, that suggests there could be a lot to be gained by doing some simple optimizations in the compiler.
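One way to see where the extra cost may come from is to ask `clojure.core/destructure` what bindings a destructuring form turns into. This is only a sketch for inspection; the exact expansion differs between Clojure versions, so no particular output is assumed here, and the names `keys-bindings` and `manual-bindings` are made up.

```clojure
;; Bindings the compiler sees for {:keys [foo bar]} destructuring of m.
(def keys-bindings (destructure '[{:keys [foo bar]} m]))

;; Bindings for the hand-written equivalent using keyword lookups.
;; With only plain symbols on the left, destructure returns them unchanged.
(def manual-bindings (destructure '[foo (:foo m) bar (:bar m)]))
```

Printing `keys-bindings` at a REPL shows extra intermediate bindings (a generated map symbol and a check on the argument's shape) that the manual version does not have.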