Performance issues with cross-correlation

I’m not sure which exact kind of cross-correlation you need, but if it’s ordinary cross-correlation, Deep Diamond’s convolution forward layer computes this. If you want to use the high-level API, just create a convolution layer, set the activation to :linear, and provide the data and the kernel. The forward pass computes the cross-correlation. You can also find the low-level function in the internals.

Thanks Dragan. I was mostly just looking into how to implement this in the Clojure way “manually” - had I known it’d be this involved I probably would have just reached for a library.

Your new library looks interesting :slight_smile: I’m trying to move into machine learning so I’ll be following it for sure - but there are no docs or intros yet. I also see it’s using neanderthal - which I’ve used and really enjoyed (it’s much better/more sane than the ND-array solutions that are popular nowadays and goofy broken things like core.matrix). But while the API is great, it does require the MKL, which makes it a bit of a no-go. I’m releasing my GUI application as a JAR, and asking users to install the MKL is just too much because they’re all nontechnical (I’m having trouble just getting people to consistently install Java 11…). And bundling the MKL for all platforms in the JAR will balloon the application size.

I know you’ve explained to me that it’s not too much work to put in a different backend, but I’m just a bit scared to jump down that rabbit hole right now. Since you have an OpenCL implementation, it could possibly be bundled with POCL to generate native CPU matrix-math kernels and maybe make a non-MKL CPU implementation without too much extra work… but even then you need to have compilation targets. I’d like to stick to my JVM bubble as long as I can! Once I leave it I have other options as well - like just writing C++ and using JNI. Hence this thread :slight_smile:

In any case, I’m sure I’ll be using your awesome libraries in subsequent projects. I’ll keep you in the loop if I make anything noteworthy. Thanks as always for all your open source work

Any reason you work almost exclusively with doubles? I profiled your code and there’s some overhead on casts which could probably be cut down.

Thanks for taking a look @bsless :slight_smile:

Could you share with me how you profiled it? I’m very curious - b/c I feel a bit like I’m poking around in the dark with VisualVM

I kinda just tried to keep everything as double for simplicity (and because at first I was not inlining and was using helper functions - which I type-hinted as ^double). I likely messed something up though :stuck_out_tongue:

Well, everything which is an array index is clearly a long (which is what returns from all the unchecked-* functions anyway).
How I profiled:

  • Add clj-async-profiler as a dependency: https://github.com/clojure-goes-fast/clj-async-profiler
  • Run something like:

(prof/profile (dotimes [_ 100] (test-cross)))

  • Look at the resulting flame graph.
  • Make sure you’re configured correctly according to the instructions in the repo.
  • Optionally, add the following JVM flags:

    :jvm-opts ["-XX:+PreserveFramePointer"
               "-XX:+UnlockDiagnosticVMOptions"
               "-XX:+DebugNonSafepoints"]

It looks like there might be a few casts and unboxings which can be avoided.

but there are no docs or intros yet.

I have to correct you here. There are plenty of docs and intros. There’s even a full fledged book :wink:

Just saw this. I might take another pass at the vector impl just for kicks, and maybe try to shorten and speed up the array implementation. I’ll also take a look at what the upper bound on perf would be with a neanderthal-based comparison. I think your observations are on track (e.g. destructuring costs and the like). The unboxing operations will show up, but they’re not obvious…

One way to ensure you’re using primitive math everywhere is to use the primitive-math lib (either the stand-alone version or the one bundled in @generateme’s fastmath library), import the ns, and turn on primitive-math ops. Then you will get reflection warnings where it cannot infer the type, as well as guaranteed unboxed math. It’s a very clean way to track down the boxing/unboxing issues, and it typically yields a very nice performance boost (you have to go hint everything, but it’s fairly painless). I will probably have some more observations after digging through your revisions and profiling a bit.
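A minimal sketch of that setup, assuming the stand-alone primitive-math library is on the classpath (dot is a made-up example function, not from the code under discussion):

```clojure
;; Sketch assuming the ztellman/primitive-math library is available.
(require '[primitive-math :as p])

;; Shadows +, -, *, inc, <, etc. in this namespace with primitive-only
;; versions: boxed arguments become compile-time warnings, never silent boxing.
(p/use-primitive-operators)

;; Fully hinted, so every operation below stays unboxed.
(defn dot ^double [^doubles xs ^doubles ys]
  (let [n (alength xs)]
    (loop [i 0
           acc 0.0]
      (if (< i n)
        (recur (inc i) (+ acc (* (aget xs i) (aget ys i))))
        acc))))
```

If a boxed value sneaks in anywhere, the shadowed operators refuse to compile it silently, which is exactly the property you want when hunting these issues down.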


A note on destructuring - if it’s unavoidable, you can gain some speedup by defining type-hinted container types, e.g.

(deftype Interval [^long start ^long end])

Then pass it as a type hinted argument and get the fields by interop.
It cuts down slightly on allocation and gives a slight speedup.
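A minimal sketch of the pattern (interval-length is a hypothetical helper, not from the actual code):

```clojure
;; Hypothetical example of the type-hinted container pattern.
(deftype Interval [^long start ^long end])

;; The argument is type hinted, so field access compiles to a direct
;; primitive field read - no vector allocation, no nth calls.
(defn interval-length ^long [^Interval iv]
  (- (.end iv) (.start iv)))

(interval-length (Interval. 3 10))
;; => 7
```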

Ooof, thanks for the tip @bsless. It was a bit challenging to get working

So yeah, you need to make sure you have debugging symbols installed and that perf-profiling security restrictions are disabled - this part is all explained in detail in the clj-async-profiler docs.

I’d never had to add JVM flags before - this was all new to me and not really explained anywhere (everything just assumes you know how to do it haha). Just notes for anyone that comes across this later:

  • You need to create an “alias” in your deps.edn that has the options enabled
 :aliases {:dev {:jvm-opts ["-Djdk.attach.allowAttachSelf"
                            "-XX:+PreserveFramePointer"
                            "-XX:+UnlockDiagnosticVMOptions"
                            "-XX:+DebugNonSafepoints"]}}
  • Here I chose the name dev, but the key name is arbitrary. Then you need to force CIDER to use this alias (this part is undocumented) by adding this to your init.el in Emacs (or running it from the *scratch* buffer):
(setq cider-clojure-cli-global-options "-Adev")

Then if you do everything correctly it will run the profile and dump the result into /tmp/

I’m now seeing all the calls to valueOf methods, which must be the type conversions. The profile doesn’t give call sites, so I’ve poked around and found where all the issues were :slight_smile: This is a great tool to have! Thanks

There is some funny stuff where in Clojure you can only type hint ^long, but for instance alength returns an int. However, I seem to have managed to keep all the indices as int without needing to explicitly type hint anything.
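An illustrative snippet of that int/long mixing (names here are made up):

```clojure
;; alength returns a primitive int, while dec/unchecked-dec return long;
;; for aget the compiler then inserts an (int ...) cast on the index.
(let [xs (long-array [10 20 30])
      n  (alength xs)]   ; n is a primitive int
  (aget xs (dec n)))     ; (dec n) is a long; aget casts it back to int
;; => 30
```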

Now my flame graph is dominated by clojure.lang.RT.nth - but I’m not explicitly using any Clojure PersistentVectors (other than as the final return value) so I’m not quite sure where this is coming from. Is this an implicit data structure used in the loop recursion? Or are they generated by let blocks?

@joinr do let me know if you make anything of it. The vector implementation was more of a brainstorming session… it’s a bit half-baked. The tricky part that I couldn’t resolve is that ideally the first pass would give you overlap start/end indices and an overlap average, and then the second pass would use those indices to avoid running across the whole vector. Now that I’m thinking about it again, it seems you could filter down the arrays on the average-counting pass - I’m forgetting why I abandoned that idea.

Oh and the type hinted container trick is great! I might revisit that if it doesn’t hurt performance much b/c it would really help clean up the code

You get nth calls due to destructuring of the vectors you return in places like shift-range.
Would you like to share the flame graph?
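You can see where those nth calls come from by macroexpanding a destructuring form (the names below are illustrative, not from the actual code):

```clojure
;; Vector destructuring compiles down to clojure.lang.RT nth calls:
(macroexpand '(let [[a b] pair] (+ a b)))
;; => (let* [vec__1 pair                        ; gensym name varies
;;           a (clojure.core/nth vec__1 0 nil)
;;           b (clojure.core/nth vec__1 1 nil)]
;;      (+ a b))
```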

Sorry, I meant to send this yesterday: 04-cpu-flamegraph-2020-09-09-12-02-06

You’re right that there is some destructuring in the “outer loop” - which I left for code clarity and because I figured it wasn’t getting called too many times. (Though looking at the numbers, it’s a not-too-small percentage here.) However, I’m seeing these nth calls in all functions.

EDIT: Oh, I guess it could be the destructuring of the function arguments (which I’m guessing counts as in-function time - and not caller time). It would be crazy if getting the function arguments is swamping the actual function body. I will need to play around with this

I fear it might be destructuring of the function arguments. That’s why I suggested creating Interval containers with deftype

I managed to remove the destructuring in the arguments. From the previous flame graph it looked like it would give some huge 3x improvement - but I guess there is something funny going on (maybe CPU pipeline stuff… not quite sure), and I got a modest 10% speedup (I’m just running time, so it’s not a great benchmark): 02-cpu-flamegraph-2020-09-09-16-52-22

There is still destructuring in the top-level array-cross-correlation, which I’ll try to ameliorate with deftype. I thought I could remove it entirely by inlining the code (since the function is getting called in just one place anyway), but it’s still a return value of a loop, so I inevitably need some intermediary container. I’ll give deftype a try and post back with results.

One thing I’m a bit unclear on is long vs int. You have things like alength that give you ints, and then you have things like unchecked-inc and unchecked-dec that return long values. In some places they inevitably get mixed up… though seemingly there is a big performance penalty there. I’m not sure what the best solution is.

Well, after an hour I seemingly can’t figure out the right syntax with deftype (and I can’t find anything on the web).

I restructured the code to return a deftype:

(deftype averaging-result [^double average ^long start-index ^long end-index])

(defn count-average-over-range ^averaging-result
  [^double start-position
   ^double end-position
   ^doubles positions
   ^doubles counts]
  ;; ... stuff
  ;; return site:
  (averaging-result. (/ total-counts
                        number-of-positions)
                     start-index
                     end-index)
  ;;
  )

and then later at the call site I use it:

;; ...
            (let [[overlap-start ;; this part still needs to be rewritten to use a deftype :)
                   overlap-end] (overlap-area overlay-positions shifted-positions)
                  overlay-averaging-result (count-average-over-range overlap-start
                                                                     overlap-end
                                                                     overlay-positions
                                                                     overlay-counts)
                  ^double average-overlay (.average overlay-averaging-result)
                  ^long overlay-start-index (.start-index overlay-averaging-result)
                  ^long overlay-end-index (.end-index overlay-averaging-result)
;; ...

But this leads to reflection warnings, because when I use interop to get at the values it seemingly can’t determine that the return type is an instance of the deftype.

Reflection warning, /home/geokon/Projects/correlation/src/correlation/arrays.clj:389:43 - reference to field average on java.lang.Object can't be resolved.

When I try to annotate it, it gets angry

Can't type hint a local with a primitive initializer

I suspected it may be my type hint syntax… so I’ve tried many alternatives ^averaging-result, ^"averaging-result", ^correlation.arrays.averaging-result and ^"correlation.arrays.averaging-result" but nothing seems to stick… :slight_smile:
That said, even with the reflection warning, I am seeing a speedup


Ah, you got bitten by how the compiler munges names. Try averaging_result - that should work.
The convention for def-ed types is usually CamelCase, in which case this wouldn’t even have happened. The type hint refers to the emitted class, so you need the (munged) class name.
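You can see the munged class name directly with clojure.core/munge:

```clojure
;; munge shows how the compiler rewrites a Clojure name into a valid JVM name:
(munge "averaging-result")
;; => "averaging_result"
```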

Good catch :stuck_out_tongue:
But I still get the same error

I’ve updated the Gist to the latest code: https://gist.github.com/geokon-gh/342083acc137c602a3f64f9d4694afd4#file-array-style-clj-L395

If you change the line to look like

overlay-averaging-result ^averagingResult (count-average-over-range overlap-start
                                                                    overlap-end
                                                                    overlay-positions
                                                                    overlay-counts)

Then you get

Can't type hint a local with a primitive initializer

I believe the class is detected, because if you put in some gibberish like ^NotARealType then you get a different error.

For some reason it thinks the return type of count-average-over-range is a primitive.
The function signature reads:


(defn count-average-over-range ^correlation.arrays.averagingResult
  [^double start-position
   ^double end-position
   ^doubles positions
   ^doubles counts]

But the return type hint can be removed or changed to just ^averagingResult to no effect.

This is all for fun at this point. If you have no other ideas off the top of your head - then it’s no big deal :slight_smile:

Try this:

                  ^averagingResult overlay-averaging-result (count-average-over-range overlap-start
                                                                                      overlap-end
                                                                                      overlay-positions
                                                                                      overlay-counts)
                  average-overlay (.average overlay-averaging-result)
                  overlay-start-index (.start-index overlay-averaging-result)
                  overlay-end-index (.end-index overlay-averaging-result)
                  ^averagingResult shifted-averaging-result (count-average-over-range overlap-start
                                                                                      overlap-end
                                                                                      shifted-positions
                                                                                      shifted-counts)
                  average-shifted (.average shifted-averaging-result)
                  shifted-start-index (.start-index shifted-averaging-result)
                  shifted-end-index (.end-index shifted-averaging-result)

The type hinting got wacky when you type hinted the primitives after destructuring.

I think you can replace all the unchecked-... with regular (or primitive-math) operators when you have (set! *unchecked-math* :warn-on-boxed) set.
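For example, a sketch of that (sum-array is a made-up function; the flag needs to be set before the function is compiled):

```clojure
(set! *unchecked-math* :warn-on-boxed)

;; With the flag set, plain inc/+/< on primitives compile to unchecked ops,
;; and any accidental boxing produces a compile-time warning instead.
(defn sum-array ^double [^doubles xs]
  (let [n (alength xs)]
    (loop [i 0
           acc 0.0]
      (if (< i n)
        (recur (inc i) (+ acc (aget xs i)))
        acc))))

(sum-array (double-array [1.0 2.0 3.0]))
;; => 6.0
```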

Two more corrections:

  • array-average doesn’t actually calculate an average (check the parentheses)
  • all type hints like new-floor-index ^long (unchecked-dec new-ceiling-index) should be replaced (where necessary at all) with ^long new-floor-index (unchecked-dec new-ceiling-index)
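A small sketch of hint placement with illustrative names (nothing here is from the actual code):

```clojure
;; A primitive local infers its type from the initializer, so no hint is
;; needed - in fact, tagging a local whose initializer is already primitive
;; is what raises "Can't type hint a local with a primitive initializer":
(let [new-floor-index (unchecked-dec 5)] ; already a primitive long
  new-floor-index)
;; => 4

;; Object-typed locals *can* be hinted, and the hint goes before the name:
(let [^String s (identity "x")]
  (.length s))
;; => 1
```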

I’m playing with this a little bit, so maybe I’ll come up with some more hints.

Yep, I went a bit too crazy with the type hints, but the final version worked and I more or less got rid of all the reflection.

One thing that’s been very useful has been

The resulting Java is quite clear and easy to understand - and you see immediately what types it’s inferring and where it’s messing up.

The only outstanding things I’m seeing now are:

  • Because array-correlation takes too many arguments, I can’t type hint them (Clojure’s primitive function hints only support up to four arguments). So the resulting code needs casts (on the input of the loop - type hints there don’t seem to do anything). The only solution is more inlining (it’s only called in one spot - so it’s not horrible… I guess)
  • There are a ton of int casts injected (basically everywhere you use aget), and it’s seemingly unavoidable because I guess you can’t increment/decrement bare ints through Clojure

But these two things are barely registering on the flamegraph

@generateme

  • Sorry, I should have removed that function. array-average is not being used - it ended up being inlined. Instead I have count-average-over-range, which finds the array index range where the arrays overlap while simultaneously calculating the average.
  • Yeah, I noticed later in the docstring for *unchecked-math* that I don’t need all the unchecked- stuff. I should go around and remove them :slight_smile: Thanks for the reminder
  • And thanks for telling me about the proper type hint syntax. It wasn’t really clear to me which one I should be using haha