Pretty sure, since we know X is ascending (or can be made so), we can do the interpolation in one pass. As it stands, you’re invoking the interpolation scheme from scratch for every input x2, which gives you quadratic runtime, since you have to rescan the input to find the bracketing pair every time before you can interpolate. This can be made substantially more efficient algorithmically (e.g. interpolate all points in one interleaved pass over both inputs) and by way of caching results.
I put up a revised version with mods here. The goal is to exploit more efficient algorithmic paths from the high-level code, and then trend a little towards mechanical sympathy with some type hints for the numerics. The next step beyond this would be to redesign everything around primitive double arrays (e.g. using aset, aget, and friends), which would likely be substantially faster yet (probably in the ms range, I’m guessing). If you went with a matrix representation and an efficient library like neanderthal, it would be faster still. With this implementation, I got the runtime for 1k x 1k with a size-2 shift down from the original 60s, to 5s with the intermediate revision using split-adjacent, to ~1.6s with more substantive (still high-level) modifications.
-
We can compute a piecewise linear approximation of the underlying function indicated by v1 (in your parlance) lazily. If the domain for x2 (the inputs to project/interpolate onto v1) is monotonic, then we can just walk through x2 and the lazily generated linear segments defining v1, and project each x onto these lines to give us a linear interpolation. It’s not exactly the same as the original implementation (where for every x, regardless of order, you scan the input v1 until you find a pair of adjacent points to average, every time), in that I’m a little more precise about the piecewise linear function that’s defined (e.g. the interpolation can handle arbitrary values at various intervals). The idea is the same, though, and you should be able to replicate the original scheme if you want to.
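The one-pass walk can be sketched roughly like this. Names here (interp-1pass, project) are illustrative, not from the revision; it assumes v1 is a sorted seq of [x y] knot pairs (at least two) and xs is a sorted seq of query points:

```clojure
(defn- project
  "Linearly project x onto the line through [x0 y0] and [x1 y1]."
  [x x0 y0 x1 y1]
  (+ y0 (* (- y1 y0) (/ (- x x0) (- x1 x0)))))

(defn interp-1pass
  "Walk the sorted query points xs and the sorted knots of v1 together,
   advancing to the next segment only when a query point passes its
   right endpoint. One pass over both inputs, O(n + m) total."
  [v1 xs]
  (loop [segs (partition 2 1 v1)   ;; lazily paired adjacent knots
         xs   xs
         acc  []]
    (if (empty? xs)
      acc
      (let [[[x0 y0] [x1 y1]] (first segs)
            x (double (first xs))]
        (if (and (next segs) (> x (double x1)))
          (recur (rest segs) xs acc)              ;; advance the segment
          (recur segs (rest xs)
                 (conj acc (project x x0 y0 x1 y1))))))))
```

Because both inputs are sorted, a segment is never revisited once a query point has moved past it, which is where the quadratic rescanning goes away.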
-
We can do better when computing the correlation coefficient. The legacy implementation relied on multiple passes through the inputs (I think 5 in total) and used generic math. If we realize that most of the measures needed to compute the correlation are running sums over the input vectors (aside from the means, which we need ahead of time), we can compute those totals in one pass. We can also type hint to keep things specific and efficient, since we’re working with doubles. Reducing from roughly 5 passes to 3, plus the hinting, knocks down the cost of the correlation coefficient (about 0.5s for me).
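A hedged sketch of the running-sum idea (the function name corr-coeff is mine, not from the revision). The means come from preliminary passes; the three remaining sums are then accumulated together in a single combined loop. It assumes xs and ys are equal-length vectors of numbers, so count is O(1):

```clojure
(defn corr-coeff
  "Pearson correlation: mean-center both inputs, then accumulate
   sum(dx*dy), sum(dx^2), sum(dy^2) in one combined pass."
  ^double [xs ys]
  (let [n  (count xs)
        mx (/ (double (reduce + 0.0 xs)) n)   ;; pass 1
        my (/ (double (reduce + 0.0 ys)) n)]  ;; pass 2
    ;; pass 3: all three running sums at once
    (loop [xs xs ys ys sxy 0.0 sxx 0.0 syy 0.0]
      (if (empty? xs)
        (/ sxy (Math/sqrt (* sxx syy)))
        (let [dx (- (double (first xs)) mx)
              dy (- (double (first ys)) my)]
          (recur (rest xs) (rest ys)
                 (+ sxy (* dx dy))
                 (+ sxx (* dx dx))
                 (+ syy (* dy dy))))))))
```

The loop-local accumulators stay primitive doubles, so there’s no boxed generic math on the hot path.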
-
The mean function leverages count, except you pass in sequences, for which count is O(n). So this will slow down as the input size grows. Instead, there’s a more general version that just accumulates a sum and a count, then computes the mean afterward. It allows us to traverse the input once instead of twice. Minor savings.
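A minimal sketch of that single-pass mean, accumulating the running sum and the element count together so count is never called on a lazy seq (assumes a non-empty input):

```clojure
(defn mean
  "Mean of a seq of numbers in one traversal: carry the running
   sum and the element count through the same loop."
  ^double [xs]
  (loop [xs xs sum 0.0 n 0]
    (if (empty? xs)
      (/ sum n)
      (recur (rest xs) (+ sum (double (first xs))) (inc n)))))
```

This works the same on vectors and on lazy sequences, since it never asks the collection for its size up front.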