Reducing function call latency

Daniel_Fitzpatrick · August 13, 2019, 3:44pm

I started work on dwimorberg to scratch an itch I had with basic math in Clojure. I wanted to have a ‘pluggable’ system so that numeric types eg apfloat, jscience could be added with minimal ceremony by the user. It would work similarly to the auto-promoting functions in c.c, transparently choosing appropriate types for any given function call.

The implementation so far is just a start: half-baked and not meeting any of the stated goals.

However, once the project was at a point where a benchmark could be performed, I was shocked to discover that perf was roughly 170% c.c’s +' function, for a small benchmark which added some longs:

I have a run recorded in dwimorberg.core

;; (crit/with-progress-reporting (crit/bench (reduce add (range 1 10000)) :verbose))
;; Execution time sample mean : 176.989471 µs

;; (crit/with-progress-reporting (crit/bench (reduce +' (range 1 10000)) :verbose))
;; Execution time sample mean : 104.753399 µs

profiling (with jvisualvm & tufte) seemed to indicate that the latency was mostly due to dispatch. So I sprinkled type hints everywhere (which javap tells me are being ignored) and rewrote a few different ways without any performance improvement.

A performance penalty is expected, but this gap is too large. I wonder if the idea is dead-in-the-water, or if there is an approach which would yield better performance.

I don’t know what performance would be acceptable, but with this benchmark I would be happy with around 120% speed of +'

joinr · August 13, 2019, 4:45pm

Just looking at your impls file, using dynvars is going to incur a hit on dereferencing (worse if they’re bound to non-default values). That’s a speedbump. Doing multiple instance checks via cond, also a hit (and ends up ignoring type hints) - although I don’t see a way around this. You’re better off doing hinted protocol method dispatch or inlined method calls (Typically) vs function invocation, although the inliner can sometimes work magic with small, decoupled functions. It looks like you do this in the impls.

fastmath may serve as a guide book, although its scope is intentionally narrower (targeting primitive numeric types) than providing an extensible, protocol-based mechanism for arithmetic operations.

Without diving in, I think the main difference is the *ops* dynvar and associated dynvars elsewhere. That’d be my first guess…and it may be a cost of doing business with your design.

p.s. you also have a lot of defrecords without any fields. Seems like you’re just using them for protocol implementations. deftype/reify looks better here IMO, unless you have data to store in a map-like container with efficient static field access.

Looking at dwimorberg.Common/add, it appears to be on par with clojure.core/+, since you’re doing checked adds and promoting.

Cool project

p.p.s

Since you’re not using any dynamic binding, I just elided the ^:dynamic meta and accepted the warnings, then re-ran your tests…here’s what I get:

dwimorberg.core> (c/quick-bench (reduce add (range 1 10000)))
Evaluation count : 12 in 6 samples of 2 calls.
Execution time mean : 51.536315 ms
Execution time std-deviation : 644.631891 µs
Execution time lower quantile : 50.735624 ms ( 2.5%)
Execution time upper quantile : 52.349056 ms (97.5%)
Overhead used : 1.924049 ns
nil
dwimorberg.core> (c/quick-bench (reduce +’ (range 1 10000)))
Evaluation count : 6684 in 6 samples of 1114 calls.
Execution time mean : 92.394386 µs
Execution time std-deviation : 3.378785 µs
Execution time lower quantile : 89.732801 µs ( 2.5%)
Execution time upper quantile : 98.031407 µs (97.5%)
Overhead used : 1.924049 ns
Found 1 outliers in 6 samples (16.6667 %)
low-severe 1 (16.6667 %)
Variance from outliers : 13.8889 % Variance is moderately inflated by outliers
nil

Looks like eliminating global var dereferencing buys you ~1.8x speedup over the baseline, which makes sense since you actually perform 3 dereferences along the way to get to a long-long operation. I’m guessing these add up.

I think this would be cool when coupled with some macros to detect type hinting and allow hinted math throughout. You can resolve the arithmetic operators at compile time and fallback to the runtime lookups if nothing is provided (or a generic Number is), which is what fastmath appears to do.

Daniel_Fitzpatrick · August 13, 2019, 5:30pm

Good call on using deftype.

I was about to say I’d perform some benchmarks with atoms and regular vars versus dynvars but it looks like you’ve already done some of that for me!

I’ll check out fastmath to see how this macro idea works there.

Daniel_Fitzpatrick · August 13, 2019, 8:02pm

Changed every defrecord to deftype.

Benchmarked without dynvars. Surprisingly, this didn’t affect the performance.

Execution time sample mean : 179.054983 µs

Then with atoms. Atoms proved to be significantly slower than dynvars.

Execution time sample mean : 317.834214 µs

joinr · August 13, 2019, 8:10pm

Very surprising that changing to dynvars didn’t affect performance on your machine when it did on mine…

I’m in the same ballpark still after reloading.

Did you push new source?

Daniel_Fitzpatrick · August 13, 2019, 8:14pm

I was surprised by that too.

My guess is criterium.core/bench optimized the dynvar lookup better than criterium.core/quick-bench. Or the last time I benchmarked with dynvars was an outlier.

joinr · August 13, 2019, 8:15pm

neither should be optimizing dynvar lookup; I think quick-bench just does a shorter, time-constrained set of runs vs. bench being more general (also provides default options like verbose).

Daniel_Fitzpatrick · August 13, 2019, 8:24pm

I pushed the changes to a new branch: regular-vars-to-test-perf.

I also ran the benchmark again with a noticeable performance gain this time.

Execution time sample mean : 158.819374 µs

joinr · August 13, 2019, 8:40pm

Ah, I see what I did. I also replaced the definition of add with

  (-> *ops*
      (proto/singleOps x)
      (proto/withTwo y)
      (proto/add x y)))

(defn add ^Number [^Number x ^Number y]
  (-> *ops*
      (.singleOps x)
      (.withTwo y)
      (.add x y)))

(defn fadd ^Number [^Number x ^Number y]
  (let [^dwimorberg.proto.IOps ops *ops*
        ^dwimorberg.proto.ISingleArityOps ops (.singleOps ops x)
        ^dwimorberg.proto.IWithTwo ops (.withTwo ops y)]
      (.add ops x y)))

dwimorberg.core> (c/quick-bench (reduce slow-add (range 1 10000)))
Evaluation count : 6426 in 6 samples of 1071 calls.
Execution time mean : 93.188050 µs
Execution time std-deviation : 1.265524 µs
Execution time lower quantile : 91.596183 µs ( 2.5%)
Execution time upper quantile : 94.719376 µs (97.5%)
Overhead used : 2.474407 ns

dwimorberg.core> (c/quick-bench (reduce add (range 1 10000)))
Evaluation count : 18 in 6 samples of 3 calls.
Execution time mean : 46.065552 ms
Execution time std-deviation : 903.623761 µs
Execution time lower quantile : 45.017299 ms ( 2.5%)
Execution time upper quantile : 47.044230 ms (97.5%)
Overhead used : 2.474407 ns

dwimorberg.core> (c/quick-bench (reduce fadd (range 1 10000)))
Evaluation count : 24 in 6 samples of 4 calls.
Execution time mean : 30.573606 ms
Execution time std-deviation : 563.679956 µs
Execution time lower quantile : 29.923466 ms ( 2.5%)
Execution time upper quantile : 31.445874 ms (97.5%)
Overhead used : 2.474407 ns

Found 1 outliers in 6 samples (16.6667 %)
low-severe 1 (16.6667 %)
Variance from outliers : 13.8889 % Variance is moderately inflated by outliers

dwimorberg.core> (c/quick-bench (reduce +’ (range 1 10000)))
Evaluation count : 5214 in 6 samples of 869 calls.
Execution time mean : 117.264531 µs
Execution time std-deviation : 1.792870 µs
Execution time lower quantile : 113.923099 µs ( 2.5%)
Execution time upper quantile : 118.825913 µs (97.5%)
Overhead used : 2.474407 ns

Found 1 outliers in 6 samples (16.6667 %)
low-severe 1 (16.6667 %)
Variance from outliers : 13.8889 % Variance is moderately inflated by outliers
nil

The really strange thing is that I get improvement despite reflection warnings on the initial #'add rewrite, very strange.

Daniel_Fitzpatrick · August 13, 2019, 8:56pm

Your benchmark of slow-add looks faster than all your other benchmarks, including +'.

Maybe I’m not reading criterium’s output correctly.

joinr · August 13, 2019, 9:11pm

You’re right I missed the microsecond vs. ms. I was banging a previous project optimizing and was always in the “ms” range, so my eyes got used to it …
That makes a bit more sense; in practice protocol function dispatch is efficient (I “think” there’s compiler support for some cases).

So your implementation, here, on my machine, appears on par with the clojure.core implementation,
noting that “slow-add” is actually the current fastest of the bunch (still surprised by this, direct method dispatch ought to be faster). This is without dynamic vars.
:

dwimorberg.core> (c/quick-bench (reduce slow-add (range 1 10000)))
Evaluation count : 4332 in 6 samples of 722 calls.
Execution time mean : 138.456503 µs
Execution time std-deviation : 1.975058 µs
Execution time lower quantile : 135.918512 µs ( 2.5%)
Execution time upper quantile : 140.586139 µs (97.5%)
Overhead used : 2.474407 ns
nil
dwimorberg.core> (c/quick-bench (reduce +’ (range 1 10000)))
Evaluation count : 4362 in 6 samples of 727 calls.
Execution time mean : 137.133814 µs
Execution time std-deviation : 3.385023 µs
Execution time lower quantile : 133.407806 µs ( 2.5%)
Execution time upper quantile : 141.517850 µs (97.5%)
Overhead used : 2.474407 ns
nil

Daniel_Fitzpatrick · August 13, 2019, 9:16pm

Good to know I’m not nuts.

Still surprised that you’re benchmarks are roughly on par with c.c/+', while mine still run at about 150% speed.

Could be down to jvm version/processor differences.

Best not to question it and move forward by looking into fastmath I guess.

joinr · August 13, 2019, 9:27pm

With your test branch, I get:

dwimorberg.core> (c/quick-bench (reduce add (range 1 10000)))
Evaluation count : 6768 in 6 samples of 1128 calls.
Execution time mean : 88.962156 µs
Execution time std-deviation : 2.169067 µs
Execution time lower quantile : 86.856853 µs ( 2.5%)
Execution time upper quantile : 92.280416 µs (97.5%)
Overhead used : 1.776282 ns

Found 1 outliers in 6 samples (16.6667 %)
low-severe 1 (16.6667 %)
Variance from outliers : 13.8889 % Variance is moderately inflated by outliers
nil

dwimorberg.core> (c/quick-bench (reduce +’ (range 1 10000)))
Evaluation count : 8238 in 6 samples of 1373 calls.
Execution time mean : 74.363421 µs
Execution time std-deviation : 2.636942 µs
Execution time lower quantile : 71.363147 µs ( 2.5%)
Execution time upper quantile : 78.400636 µs (97.5%)
Overhead used : 1.776282 ns

Found 1 outliers in 6 samples (16.6667 %)
low-severe 1 (16.6667 %)
Variance from outliers : 13.8889 % Variance is moderately inflated by outliers

About a 18% relative perf difference in favor of core’s +'. Curious to see if that can be ironed out any more, but it’s close. Could be getting down to some jit tricks (inlining parameters, inlining depth, etc.).

p.s.

After a restart, a clean, and another round, it looks like direct method invocation does help…

(defn addi [^Number x ^Number y]
  (.add ^dwimorberg.proto.ITwoArityOps
        (.withTwo ^dwimorberg.proto.IWithTwo
                  (.singleOps ^dwimorberg.proto.IOps dwimorberg.impls/ops x) y)
        x y))

dwimorberg.core> (c/quick-bench (reduce add (range 1 10000)))
Evaluation count : 5466 in 6 samples of 911 calls.
Execution time mean : 113.683740 µs
Execution time std-deviation : 1.059559 µs
Execution time lower quantile : 112.085641 µs ( 2.5%)
Execution time upper quantile : 114.832589 µs (97.5%)
Overhead used : 1.786501 ns

dwimorberg.core> (c/quick-bench (reduce addi (range 1 10000)))
Evaluation count : 10128 in 6 samples of 1688 calls.
Execution time mean : 59.235068 µs
Execution time std-deviation : 880.897615 ns
Execution time lower quantile : 58.368607 µs ( 2.5%)
Execution time upper quantile : 60.557801 µs (97.5%)
Overhead used : 1.786501 ns
nil

dwimorberg.core> (c/quick-bench (reduce +’ (range 1 10000)))
Evaluation count : 8922 in 6 samples of 1487 calls.
Execution time mean : 68.016916 µs
Execution time std-deviation : 1.381526 µs
Execution time lower quantile : 66.289970 µs ( 2.5%)
Execution time upper quantile : 69.645472 µs (97.5%)
Overhead used : 1.786501 ns
nil

That fits my mental model better; providing an inlined version of addi may provide some marginal relief. So here we kinda swing the other direction. If I run the benchmarks a couple more times, mean evaluation time doubles, but the relative differences are the same. Microbenching is peculiar.

system · February 12, 2020, 9:27am

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.