Performance issues with cross-correlation

geokon-gh · August 13, 2020, 12:40pm

I’m having some trouble with performance and I’m not really sure what tools I should be reaching for. So I’m hoping some people with more experience could give me some suggestions/feedback

Clojure has been really great for massaging maps and moving data around and refactoring has never been easier. Everything is functional and decoupled and I love it… but when it comes to dealing with vectors, time series and this kind of “linear” data I’m finding it really hard to reason about performance, and the functions that operate on collections always feel inadequate

I implemented a straight crosscorrelation. Basically it takes two “signals” (series of points [[position value] [position value] …]) and moves them past each other. In the overlap it interpolates one vector on the other’s positions (b/c their data points’ positions don’t necessarily match) and calculates a correlation coefficient

This code is very very very different from efficient C++ code I’d normally write. I understand that there are probably libraries I could be reaching for and I could whip out the big guns with a linear algebra library like Neanderthal - but my vectors aren’t that crazy long and I wanted to keep things simple. So I’d like to get a working purely Clojure solution working for the time being

However my final solution is terrifyingly slow. A 1000 by 1000 cross correlation is locking up my whimpy machine. Without getting fancy it really shouldn’t be that bad

I don’t really want a code review of my embarrassment (unless there are some masochists), but I’d like to know what tools should I be looking at for profiling my functions? What are some gotchas and things to avoid?

Is there just some point where you gotta drop back down to an imperative style? (I hope not!)

github.com

geokon-gh/corascope/blob/master/src/corascope/correlation.clj

(ns corascope.correlation)

(defn interpolate-point
  "Given a vector V1 of [X1 Y1] pairs and a vector of X2 values
  this returns an [X2 Y2} values, Where the Y2's are interpolated linearly
  TODO: Figure out how to implement quickly.
  Right now just gets the first number that is past the X1"
  [x2 v1]
  (let [[preceeding-points
         following-points] (split-with #(<= (first %)
                                           x2)
                                       v1)
        preceeding-point (last preceeding-points)
        following-point (first following-points)]
    (if (or (nil? preceeding-point)
            (nil? following-point))
      nil
      (/ (+ (second preceeding-point)
            (second following-point))
         2.0))))

This file has been truncated. show original

jan · August 13, 2020, 2:35pm

Give clj-async-profiler a try. It does a good job at identifying non-obvious bottlenecks.

joinr · August 13, 2020, 2:40pm

I’d rather not give a brief response, and provide some dissection of where specifically your performance woes lie. That will take a tad longer, but I’m working on it. Your implementation uses vectors and sequences quite a bit, so there will be significant overhead. You’re also using (I think) boxed math. When compared with typical numeric code, you are taking a substantial performance hit by not using primitive arrays and primitive math. You get orders of magnitude faster access just due to mechanical sympathy, then the unboxed math ops get you further.

Before shifting the implementation to numeric-friendly code (e.g. primitive arrays and primitive math), it might be useful to look at where operations are crushing performance. e.g. could there be some uninentional O(N) operations being used a lot that combine to kill your performance? Possible.

joinr · August 13, 2020, 3:36pm

There are some low hanging fruit:

split-with has to do 2 passes through the data, applying its predicate to each entry to produce the sequence. That’s two O(N) passes that could be done in one.

If you have indexed structures, nth will typically be faster than first or second since there’s some coercion to seq that happens. A little overhead.

Since interpolate-points is the focal point of work (as indicated by profiling), you get some gains by changing it to use nth. On mine (for two 1000 element vectors) that drops runtime from 60s to 42s (meh). The better algorithmic gains come from eliminating the duplicate scans using a helper function:

(defn split-adjacent [f xs]
  (reduce (fn [acc x]
            (if (f x)
              x
              (reduced [acc x]))) nil xs))

(defn interpolate-point
  [x2 v1]
  (let [[preceeding-point following-point]
            (split-adjacent #(<= (nth % 0) x2) v1)]
    (if (or (nil? preceeding-point)
            (nil? following-point))
      nil
      (/ (+ (nth preceeding-point 1)
            (nth following-point 1))
         2.0))))

Which gets runtime down to 5s or so. Still very slow, but showing improvement. May need to add some corner case coverage to split-adjacent but it appears to produce the same results as intended so far. I get the sense that interpolate-points is begging to be more efficient algorithimically.

joinr · August 13, 2020, 5:29pm

Pretty sure, since we know X is ascending (or can be made so), we can do interpolation in one pass. As it stands, you’re calling the interpolation scheme every time for every input X2, so you’re going into quadratic runtime, since you have to find the split every time to determine which values to interpolate. This can be made substantially more efficient algorithmically (e.g. interpolate all points in 1 interleaved pass over both inputs) and by way of caching results.

@geokon-gh

I put up a revised version with mods here. The goal is to exploit more efficient algorithmic paths from the high level code, and then trend a little towards mechanical sympathy with some type hints for the numerics. The next step beyond this would be to redesign everything in primitive double arrays (e.g. using aset aget and friends), which would likely be substantially faster yet (probably in the ms range I’m guessing). If you went with a matrix representation and an efficient library like neanderthal; even faster. With this implementation, I got the original runtime down from 60s for 1k x 1k with size 2 shift, to 5s with the intermediate revision using split-adjacent, to ~1.6s with more substantive modifications (still high level).

We can compute a piecewise linear approximation of the underlying function indicated by v1 (in your parlance) lazily. If the domain for x2 (the inputs to project/interpolate onto v1) are monotonic, then we can just walk through x and the lazily-generated linear functions defining v1, and project x onto these lines to give us a linear interpolation. It’s not exactly the same as the original implementation (where for every x, regardless of order, you scan the input v1 until you find a pair of adjacent points to average “every time”) in that I’m a little more precise about the piecewise linear function that’s defined (e.g. the interpolation can handle arbitrary values at various intervals). The idea is the same, and you should be able to replicate the original scheme if you want to.
We can do better when computing the correlation coefficient. The legacy implementation relied on multiple passes through the inputs (I think 5 in total), and used generic math. If we realize that most of the measures necessary to compute the correlation are running sums based off the input vectors (aside from the means which we need ahead of time), we can compute these totals in one pass. We can also type hint to help keep things specific and efficient, since we’re messing with doubles. So, reducing from like 5 passes to 3, and hinting, helps knock down the cost of the correlation coeff (about 0.5 s for me).
The mean function leverages count, except you pass in sequences, for which count is O(n). So this will slow down as the input size grows. Instead, there’s a more general version that just accumulates a sum and a count, then computes the mean afterward. It allows us to traverse the input 1x instead of 2. Minor savings.

geokon-gh · August 16, 2020, 3:41am

Thank you so much for this really detailed break down @joinr. I also really appreciate how you’ve kept a lot of the original code so it’s much easier for me to figure it out This invaluable for me - I’m learning a lot here

Sorry, it’s been a couple of days since you posted. I’ve just needed to take the time to digest the things you’ve written and to grok the code. I will need to write more later once i play around with some benchmarks

Some of the issues are due to my own sloppiness, but you’ve also shown me a lot of tricks I didn’t know about before. The split-adjacent reduce with the closure that captures an atom is a very nice and simple solution and I’d ofcourse missed that split-with does atake-while and a drop-with pass. However I’m surprised that this and the nth changes lead to such a massive speed up

In the more complete fix-up/rewrite I like how you’ve leveraged lazyness for the lerp. It’s something I need to get more in the habit of using. However in interpolate-points you immediately thread it into a vec

  (->> v1
       sort
       linear-ranges
       (lerp-points (sort x2))
       vec)

Doesn’t this “realize” the lazy sequence immediately? Are you just making the function lazy for general reusability and flexability? Or is that playing some performance role here?

One thing that you’ve done in the lerp and in correlation-coefficient reduction is pack intermediary values in to maps. You seem to worry about the number of passes on the data vectors (to avoid cache misses I guess) - to great results - but are you not worried about all the map look ups you are doing at every iteration? Especially in the correlation-coefficient you are unpacking values, adjusting them and then packing pack into the map.

Are these just optimization still left on the table, or are they simply thing that shouldn’t make a performance difference for whatever reason?

joinr · August 16, 2020, 6:39pm

In my analysis, the reasoning ended up being that nth gives you some small speed ups since you’re not coercing to a seq all the time (you can in fact get even faster using direct method invocation via (.nth ^clojure.lang.Indexed some-vector n) since clojure.core/nth still has some overhead that will drag you down in hot loops). In this case, since you’re traversing the input over and over and over, there are savings to be had just by making access less costly. This is about a 30% gain if I recall, but that gain compounds if we have a better algorithm.

The real savings came from being smarter about what you’re computing (typical computer science stuff), which eventually led to later algorithm. In your original implementation, in order to interpolate an an input X, you leveraged the simpler function interpolate-point, which takes an X a vector of sampled points. interpolate-point simply traverses the sampled points to find an adjacent pair of points (or none) to average for the interpolation. You did this by two separate O(N), early-terminating scans via split-with, e.g. take-while and drop-while. You do this for every input x in X. So I think complexity of the original was O(N) + O(N + 1) + O(N) [since you invoked last on the preceding values, which is O(N)], still O(N) in general though.

Note that invoking clojure.core/last will always yield an O(N) scan of the sequence (or seqable thing); so for cases like getting the last element of an indexed structure (like a vector), where you want O(1) lookup time, avoid last and use peek (for vectors) or a combination of nth and count e.g. (nth the-vector (dec (count the-vector))) to preserve efficiency.

In relative terms, eliminating the need for the second scan and eliminating the invocation of last (another scan), should leave us doing around <= 1/3 of the work (I figured 27% in one concrete case of a vector of 0 , 8). Then if we get savings from using nth, say a 30% reduction, we get that down to about 19% of the original runtime. I think the type hinted function supplied to split-adjacent gets us the final gains for the intermediate step.

When we shift towards the lerp-points implementation, we’re just building on the gains we made before. This time, instead of multiple scans for every point, we only traverse the input and the interpolation samples 1x. So, we similarly cut our algorithm down to (still technically O(N)), doing about 1/3rd of the work (even though we’re doing additional work like creating the linear ranges and more sophisticated sampling, it’s trivially affordable).

It does. I’m trying to recall; there was a specific reason I did that (the original version was a lazy seq). I recall doing it for profiling clarity; dumping everything into a vector (via vec which uses a transient) ends up being pretty harmless to overall perf, but substantially aided profiling further down the chain in other functions. When you leave it up to the lazy sequence generation bit, it gets hard to separate the generation of interpolated points (which technically happens very fast when producing an unrealized lazy seq) vs the processing of those points e.g. in the correlation function, since the realization is deferred. You get a big call stack where the deferred sequence is being realized far from where it was defined (as expected), so it became hard to separate into easily profiled “blocks.” By producing a vector, I was able to get a direct handle on how much faster interpolate-points was in a profiler-friendly way (I was using visualvm).

I am less worried about the map lookups; they are actually very quick (I think for small arraymaps, doing a linear scan over a small array is about as fast as a direct field lookup in a type/record/object, at least for keyword keys). Similarly unconcerned about constructing new arraymaps every iteration (small array clone, plus allocation). You could certainly explore optimizing that further though by using mutable containers to hold the state (or even just a double array), which would avoid any lookups and map creation cost. Notice that I “did” resort to manually unpacking the values, as opposed to naive destructuring, since this avoids intermediate calls to clojure.core/get, and instead defers to the IFn implementation of key lookup for maps, which does not incur the type dispatch overhead of the more general get, which is what the destructuring would emit (structural looks into these sort of optimizations while preserving the syntax sugar)).

I definitely think there are further optimizations to consider; the most drastic (and fruitful) of which would be injecting more primitive array stuff in. There are still intermediate steps to explore (

geokon-gh · August 17, 2020, 10:20am

This is fantastic @joinr . Some of these are real non-obvious tricks.

As you point out, adding up a lot of O(N) traversals should still be O(N) in the end. To my mind that wasn’t a large concern b/c running across a memory-sequential vector seemed quick with the prefetcher. So my focus was on the “inner loops” that I’m doing on each pass and less on decreasing the number of passes. Your analysis seems to really show that my assumptions are wrong, It may be b/c Clojure vectors are not like C++ vectors (as my lizard brain keeps thinking) or java arrays and they’re chunky and each element is (potentially) variable sized (a strange “flexibility” in clojure that I’ve gotta say … I’ve never run into a situation where variable type vectors were a good idea). This is probably what’s leading to very slow traversals. So as you’ve pointed out, I should look into using something like a double-array. I just forget why it’s discouraged: https://clojure.org/reference/java_interop#_arrays

And you’re also absolutely right that some of the algorithms could be cleaned up. The big low hanging fruit is that since the points are sorted some of this could be restructured to be O(logn). However I just felt that a 1000x1000 cross correlation should be quicker than what I was getting even with a bit of a naiive algo

I have a general problem that I feel the functional style ends up encouraging doing multiple simpler passes instead of one “involved” one where you’re tracking indices and intermediary values. It’s easier to do a pass to get a mean and then a pass to add up variances than to clunk in a lambda that does both at the same time. It’s just harder to visually parse. And for instance searching the full vector for the adjacent points each time is very straightforward in a functional style, but cutting off points to decrease the search-space starts to get a bit ickier when things are immutable. That said, none of this is insurmountable with recursion and lazyness

And yeah, lazy stack traces are a mess b/c they like to vomit up a mess far away from the call site

Thanks for teaching me about the O(n) look ups with last and the overhead of destructuring. It’s a bit problematic these aren’t immediately algorithmicaly apparent - but I guess you just need to know your language

What are you using for your analysis btw? The previously mentioned clj-async-profiler ?

joinr · August 17, 2020, 3:47pm

Yeah, these vectors are Hash Array Mapped Tries, implemented specifically as object arrays of size 32, with a careful ordering to enable O(1) access to the last 32 elements. Traversals are still relatively quick, since you’re walking a sequence of 32-element arrays (e.g. chunks). The cache friendliness here relies on how well the actual object layout on the heap is currently; so in the worst case you may have references all over the place.

On my machine, traversing a 10^6 element vector of boxed Longs via reduce (which uses an iterator to walk the object arrays efficiently, as opposed to naive nth lookups) is about 2-3x slower than using areduce (making sure to invoke areduce inside a function that can be jit’d btw) on a primitive array of longs. Summing via reduce and areduce is about 4.5x slower for the vector as well (even with hints). The dense representation of the array enables the clean memory layout we’d like, so there are performance benefits. That’s even with bounds checking forced by default.

There’s a similar tradeoff you probably want to be careful of between implicitly converting between vectors and sequences (e.g. by using the core sequence functions on a vector input). You will get a sequence back (e.g. from map), not a vector. This can cause problems down stream e.g. if you expect the result to be a vector and have O(1) count and ~O(1) nth, when in fact the seq result will be O(N) for both.

One of the nice benefits of the functional approach was being able to delve into the higher-level structures to optimize after-the-fact, without having to dissect a bunch of gnarly loops. Walter Bright (of D) refers to the C/C++ style of loop-centric programming as “toilet bowl programming” since you have to follow the swirl of loop noise down the drain to eventually find out what the hell is going on in the center. The functional style enabled an IMO clearer communication of the algorithm you’d implemented, and thus easier fusion of the different passes. I’ve found that to occur in my projects as well; rapidly prototype the naive approach, then do optimization passes where necessary, possibly fusing or unrolling the pretty high-level code into (slightly) less prettier code or a different algorithm. Having the different passes helps to highlight (once you know the costs) where you can easily trim fat or refactor into equivalent things (like a reduction that collects multiple things, or a transduce or eduction that provides similar benefits in a composable manner without intermediate sequences).

There are always loop and mutation when you decide that imperative is the way to go. Clojure is really a multiparadigm language of an FP default with plenty of escape hatches for consenting adults.

I honestly forgot this myself after messing with this stuff for close to a decade. I was doing some advent of code stuff and implemented an algorithm using vectors and subvectors to parse some stuff, implicitly expecting last to be O(1) for vectors like nth. I quickly found out that my assumption was false (I’d actually found this out before and forgotten), so wrote a little wrapper around peek to get the behavior I wanted and zoom my solution performance sang. So even if you know your language, there are still periodic surprises…

I typically use visualvm, which either ships with your jdk or can be downloaded separately (my case). It attaches to any jvm and allows various profiling and sampling; it’s free. I typically just load up code in the REPL, attach visual VM to the instance, turn on the cpu sampler (not the profiler), run the code in the repl, collect a snapshot in visualvm, turn off the sampler, interrupt the (likely) long running code in the REPL, then examine the snapshot. I like to turn on the hotspot selection which will sift out the method calls where most time is spent, and I filter out the threads that are just infrastructure noise from NREPL. From there I can walk the call tree along the hotspots and find out which functions are causing it, typically conforming to some typical rogues gallery of performance impediments.

Criterium is another really useful tool for benchmarking and getting fairly honest results (as opposed to time, which isn’t terrible in a pinch). I’d use that when making changes and testing the result, instead of having to hook up visualvm at every turn.

didibus · August 17, 2020, 7:24pm

The Clojure data-structures are just not designed for numeric intensive logic or algorithmic efficient containers. They’re meant for domain modeling and designed as such.

Still, you can optimize their usage to surprising levels, but if you’re really looking for performance and efficiency, it’s better to use more appropriate data-structures.

You can do that by using Java’s data-structures like array or mutable collections, with appropriate type annotations and use with loop/recur to avoid boxing on numbers. Or you can do it with using specialized libraries that target that problem, like fastmath or neanderthal.

In my opinion, sticking to Clojure’s standard data-structures for such problem is fighting against the grain, you’re not using the right tool for the job. They weren’t designed for this purpose, the fact they can often get you good enough performance is a cool bonus, but that works when N is small, which is often small in practice, but not always, and that’s when I’d just switch to some better data-structures.

geokon-gh · August 18, 2020, 5:52am

I really appreciate your thorough responses and it’s nice to hear your development experience mirrors my own. It was just that sometimes during the refactor from multi-pass to a more optimized version I get the feeling I’m doing something wrong… or fighting the language bit? So the point of this post I guess was just to bounce it off the community as a sanity check. From what you and @didibus have said there is a bit of friction there - though not worse than the toilet bowl programming style I did however overlook Clojure array functions like amap and areduce. So I maybe just need to adjust what tools I reach for (I need to try out transduce and educate and be more cognizant of the type conversions happening between steps

PS: in fairness to C++, you can actually avoid the toilet bowl and write more elegant code that looks quite functional if you leverage STL data structures. You will also have reduce and map etc… It just ends up not as flexible as Clojure

joinr · August 18, 2020, 3:49pm

It was just that sometimes during the refactor from multi-pass to a more optimized version I get the feeling I’m doing something wrong… or fighting the language bit?

I wouldn’t say you are “fighting the language” at all. Rather, you are exploring how to represent the problem in an idiomatic fashion, using the constructs the language provided. This high-level description is usually easy to arrive at quickly. If that proves to be unsatisfactory performance-wise, then you look at the high level implementation and start down a path of incremental transforms to get toward more performance (speed/compactness). In those cases where you know the end-goal is going to be dense numerics, then it makes sense to plan early on to either start from a primitive array-based implementation (perhaps with a pure functional layer around it), or to design your implementation to quickly pivot towards one. If you design your implementation as a composed set of small functions or protocols, it’s not too hard to swap out say a vector-based implementation for an array-based one.

in fairness to C++, you can actually avoid the toilet bowl and write more elegant code that looks quite functional if you leverage STL data structures. You will also have reduce and map etc… It just ends up not as flexible as Clojure

I think C++ also has the value type advantage, as well as more static types + some type checking, so that you get a lot of help from the compiler as well as control over memory layout, etc. D tries to go that route while providing zero cost abstractions as well (they ended up with a very similar feel to clojure pipelines and similar functions in the standard library as well and the unified function call syntax also normalizes things). I’m curious to see how Project Valhalla will work out on the JVM with value types; in theory that should open up some avenues for substantial mechanical sympathy (although it’s not entirely clear how to reconcile that with the existing persistent structures, which by nature rely on reference types to allow extreme sharing of substructures).

The hiphip and fluokitten libs also provide some convenience functions for faster primitive operations. There’s a post on Dragan’s blog that demos some of this. There’s also fastmath which includes primitive-math if you want to use just the primitive math stuff. I find the primitive math stuff helps quite a bit when working with these things.

geokon-gh · August 19, 2020, 2:42am

Your answer anticipated my next questions
As I started to play with amap and areduce I noticed it’s interface is a bit bizarre/clunky and not really consistent with the rest of the seq abstraction. So I did do a bit of searching and came across fluokitten which seems to clean up the interface and make it more consistent with the rest of Clojure

And you’re right, right now things are a bit tied to the underlying vector implementation. I’m first going to do a rewrite with the Clojure array interface, and then if I don’t lose steam I will think of how to make a wrapper to abstract the implementation away and make it more generic

didibus · August 19, 2020, 3:23am

I believe fastmath has interpolation functions built in as well FYI.

joinr · August 19, 2020, 4:27am

I’m not sure if there is, but something like fluokitten with primitive-typed operators, combined with composable transducers would be pretty useful (shouldn’t be hard since transducers are literally ALL functions). You could just define these things on top of primitive arrays without too much deviation (if any) from your sequence implementation and just swap out the storage device. I think F# does a good job with that, since it provides a nice typed Array module with basically comparable operations from the Sequence and List modules.

joinr · August 19, 2020, 4:28am

There’s a ton of good stuff in there; IMO possibly too much of a dependency for simple focused tasks, but likely to be useful if you’re doing general numerics. I like that primitive-math is a stand-alone option as well.

geokon-gh · August 19, 2020, 6:54am

Unless I’m not catching some subtly, but from looking at the docs I think these are addressing separate issues. primitive-math is only providing more Clojure-y arithmetic operations (so something to use instead of calling Java’s Math/whatever like I do now), while fast-math is giving me containers like Vec2 for holding/traversing my data - so this is more in-line with my issue with crunching through arrays quickly

This is really getting into stuff I haven’t tried out yet… Something you’d mentioned before but I overlooked was eduction. I think I’m starting to understand what you’re suggesting. On a high level my multi pass goes:

shift
trim
interpolate
correlate

I can pretty easily massage these steps into transducers (only the shift transducer will need to change at each iteration) But then, if I’m reading the docs right, with eduction I can glue these into a single pass.

The catch then is that amap and areduce won’t directly give transducers like map and reduce - and that’s where something like fluokitten could help. But I think I could also roll my own transducers … I just haven’t looked into how to do that yet

I’ll pick away at it and post a follow up once I understand what I’m talking about better haha

joinr · August 19, 2020, 7:08am

Yeah, that’s the line I was thinking. You just need a primitive reducing function over the typed array that uses areduce internally (I usually implement clojure.core.protocols/CollReduce or reify one). Then you can fuse those passes together using transducers / eduction/ comp, and you should be able to implement a similar pipeline with minimal passes and retain primitive math everywhere. I haven’t done this, but it seems plausible (a minor extension of existing functionality). Unsure how to tie fluokitten stuff into transducers though (different protocols maybe).

didibus · August 20, 2020, 3:53am

Just saw from Reddit: https://github.com/techascent/tech.datatype

That seems perfect, bunch of optimized numeric data-structures with primitive operations over them.

geokon-gh · September 5, 2020, 2:19pm

I thought I’d give an update:

I first did a rewrite kinda bringing together some of the ideas and suggestions here: https://gist.github.com/geokon-gh/8208c3d9849d22968ff72eccfbe108ca

Things were much faster, but I was looking for an order of magnitude greater performance so I could see this would start hitting some walls before I got the speeds I want. The final code really needed to be using native arrays.

I still found this step really useful b/c it helped me reason about the algorithm. I tried to massage things into transducers but I kept seeing that the algorithm needed to do multiple stuff on each pass and I had some optimizations available if I narrowed my search windows - which proved to be challening to implement with core data structures

The end result using native arrays is pretty close to optimal and a bit ugly, but not too bad I guess:
https://gist.github.com/geokon-gh/342083acc137c602a3f64f9d4694afd4

Some observations - I’m likely not correct about all of this stuff - these are just what I’m inferring from benchmarking:

destructing inside functions that get called a lot really kills performance
b/c you can’t destructure you no longer can easily return multiple values from a function call - so you end up having to manually inline code
You are also limited to type hinting on functions that have 4 primitive type arguments. If you have more arguments then the compiler will throw an error. So … more inlining and spaghettify… (for instance I have a add-measurement-to-correlation function that I can’t use b/c it requires 5 double values. Maybe this could be curried, but I expect that would also give a performance hit)
Even though Clojure is the immutable functional bee’s knees, the compiler is limited in it’s ability to leverage immutability by say automatically memoize simple function calls. For instance I have a function call alast to get the last element in an array. But if the call site and the function use alength it just ends up getting called multiple times unfortunately, so I end up inlining this function manually and “saving” the result of alength to a let temporary. It makes the performance more in-your-face and less compiler magic but it does end up cluttering the code further.
areduce and amap are quite finnicky. My first drafts used them, and you can kinda massage your algorithms to use them - but it ends up being much more practical and clean to just use loop (unless you need to really create a 1-to-1 map). As far as I can tell areduce and amap don’t have an secret sauce inside that makes them faster than a loop
loop seems to have better support for type hinting (or more likely… I was doing something wrong with amap areduce). However it often freaks out and does weird stuff: See this discussion : https://ask.clojure.org/index.php/9499/inconsistent-unmatching-primitive-recur-arg?show=9528 (thanks for your input there @joinr)
In effect the initial loop variable values seed the recursion “argument” types. This process is completely opaque from the programmer’s perspective - so you kinda need to massage and benchmark the code to ensure you are getting primitive types there. Sometimes it wouldn’t and I would have to futz with the code to make the compiler happy. This workflow can probably be improved if I knew the right tools here
The compiler is often really not clever. If you do an (aget some-double-array 0) it seems to mostly not realize that it will return a double even if the some-double-array has been marked as ^doubles
though it’s discouraged… peppering everything with type hints helps a lot. In some places type hints have zero effect - but the compiler will not tell you. After doing a lot of type hinting I still don’t quite have a handle on where it’s strictly necessary. But more type hints helps when something goes wrong and you accidentally generate an intermediary Object. If you have the code slathered in extra type hints then the compiler will scream at you when an Object appears some place.
As far as I could tell VisualVM doesn’t actually display anything to tell you that “hey I’m spending XX% of the time unboxing stuff here”. Just some function calls will hog an unreasonable amount of CPU time given the operations they’re doing. Again, there may be some better tooling here that I just don’t know about.

(set! *warn-on-reflection* true)
(set! *unchecked-math* :warn-on-boxed)

these flags are invaluable but you can get still get boxing in non-math locations (like in calls to aget) - so it’s not a panacea. You still need to benchmark and check

At this point the code is pretty good. It goes three passes (with % run time)

shift (20%)
calculate an average for the overlapped region and the region array-index bounds (45%)
“accumulate” a correlation coefficient (35%)

The reason I’m quite confident I’m close to optimal is that the shift pass is by far the simplest pass b/c it simply goes through and increments valus by a shift value (so I probably didn’t mess that up!) If everything else’s speed is of the same order then I think we’re okay. Running across arrays seems to be the slow point

Things left to do:
The lowest hanging fruit would be smashing the first two passes into one and get back ~20%. The only reason I haven’t yet is because things are already spaghettified enough and having a pass do 3 things at once is a bit messy - but I might add this later.

Another potential optimization is to combine position and value arrays into one interleaved array … [pos|val|pos|val|pos|val|...] but that’ll make the code even uglier. Since running across arrays seems to be the slow point - maybe this would help? But maybe not b/c you have to futz with indeces

At this point I don’t see any real 2x-type of speedups available (other than multithreading). Maybe there is some way to run faster across an array. If things are memory-sequentially and prefetching properly… I’m surprised this is really the bottleneck right now - but this is JVM magic territory and I don’t know how to proceed there

Anyways, thanks for the input guys. The end result is really not clojure-y at all unfortunately (it reminds me more of the ELisp code I’ve written) - but I still enjoyed the challenging of getting this all work. Hope this is useful for the next person that comes along and sees it. I got run times to within an acceptable limit. It’s about 1-2 sec now to crosscorrelate 5000 points… which is a bit clunky for a GUI app but I think it’s okay (I think with some caching and maybe some background threads it shouldn’t be perceptible). ~5000 points is near the upper limit for a data set in my usecase - and my laptop is running a 5 year old Core m3. So if it works for me it should work for everyone else