My recent Clojure vs C++ experience

magnus0re · December 11, 2020, 1:44am

Sorry if this post is in the wrong category, I’m not that good with types.
What follows is a small write-up of my past week working on a small tool for work.

Using Clojure feels faster now than using C++.

Some background:
I’ve been programming in Clojure in my free time for about 18 months or so, for learning mostly.
My work projects are mostly in Python & C++. Currently I’m trying to analyze faults in a live data stream on a linux server. (Custom binary protocol, 100MB/s ish)

I have the need for a small program that let me see the output of this linux fifo (named pipe), but sampling it almost like an oscilloscope. I want the concept of automatically finding packet sizes, offsets, what parts of the message are stable and what parts of the message are always changing, and whatever in between. The Idea was originally a TUI with a few colors, arrow keys for manually adjusting the trigger settings. The more I got to thinking about the problem space, the more I saw the potential for super bad algorithms, in a Big O sense.

The final version of this program shall be able to take a snapshot of a ‘good’ packet, or set of good packets, and then count and store any ‘bad’ packets, because I have a ‘chaos-monkey’ on the producing side of this fifo that needs to run for weeks to get the data statistics I need.

The ‘chaos-monkey’ is actually hardware emulation of radiation-induced single event upsets running on some readout electronics. (yay, work), but the data stream ends up on this linux server, so I can run whatever I need there.

I spent ~2 days hacking away in C++ to try to make a nice program come out to do this.
Super painful, not fast, and most of the time not even working. Turns out I’m probably a poor C++ programmer, somehow I could not read the fifo faster than every 500 ms. Tried fstream, with and without buffering, binary and text mode, then two other attempts low-level C/POSIX system calls. No speed gains. I don’t understand why there are so many, many ways to shoot yourself in the foot with that language… I don’t even know what is wrong. So today I needed something fresh, so I wanted just to see if it went any better in Clojure.

So today I spent 1/2 a day in Clojure, tackling the same problem. And I’ve gotten farther than all my previous work combined. Still work in progress though.

Opening the stream, (first time for me) where I decided not to use with-open, is simply

(require '[clojure.java.io :as io])
(def fifo (io/input-stream "/tmp/myfifo"))

I just want the fifo to be open, always. So I close it in the REPL if I want.
Reading the stream is

(defn read-bytes! [len]
  (let [buf (byte-array len)]
    (.read fifo buf)
    buf))

I’ve got the offset detection and stable part detection working. Took me about 100 lines of code. I’m thinking that it probably could be smaller, but I still lack some cleverness to make that happen.
The code takes just under 300 ms for packet sizes in the range 50 - 8192 bytes. Good enough for me.

The following is the functions that I am both the most and the least happy with. I feel that the reducing function in lists-equal could be better.
The overlap function does the most work in the code I have now.

(defn lists-equal [len]
  (fn ([a] a) 
    ([a b]
     (let [_ nil]
       (if (every? nil? a)
         (reduced a)
         (mapv (fn [[x0 x1]] (if (= x0 x1) x0 nil)) (partition 2 (interleave a b))))))))

(defn overlap 
  ; overlap lengths of buffer with equality check. keeps value if equal, else nil.
  [len buf]
  (reduce (lists-equal len) (partition len buf)))
; (overlap 4 [-1 1 2 3,-1 4 5 6]) => [-1 nil nil nil]

When those pieces were done, writing the rest was basically just program flow, and a bit of counting.

So I somehow became better at solving my problem with Clojure than C++. I am both happy and bitter at the same time.
Anyways, thanks for being a great community! And thanks for reading, if you got this far.

seancorfield · December 11, 2020, 3:45am

Welcome! I have a background in C++ so I sympathize

mapv (and map) can take multiple sequence arguments so you could just pass a and b as separate arguments without the partition/interleave stuff:

(mapv #(when (= %1 %2) %1) a b)

len doesn’t appear to be used in lists-equal and reduce doesn’t attempt to call the reducing function with one argument (it would call it with no arguments if the collection has no elements), so you could write:

(defn lists-equal [a b]
  (if (every? nil? a) 
   (reduced a)
   (mapv #(when (= %1 %2) %1) a b)))

(defn overlap 
  "overlap lengths of buffer with equality check. keeps value if equal, else nil."
  [len buf] 
  (reduce lists-equal (partition len buf)))
; (overlap 4 [-1 1 2 3 -1 4 5 6]) => [-1 nil nil nil]

This won’t work if the buf is empty (but nor will your version, so I think this is equivalent).

For the comment you have describing overlap, it’s better to have an actual docstring (as above) so that it shows up in the REPL:

user=> (doc overlap)
-------------------------
user/overlap
([len buf])
  overlap lengths of buffer with equality check. keeps value if equal, else nil.
nil
user=>

magnus0re · December 11, 2020, 1:33pm

Sweet!
you just made the code work 3x faster. Thanks very much. Down to 100 ms now.

I did remember that map can take two values, but I mis-remembered and was thinking that it worked like the bindings in the for statement, somehow. Thanks for reminding me. But I had totally forgotten about when.

Richard_Heller · December 11, 2020, 1:38pm

Without seeing your C++ code, I’d assume the issues are string related. Since the garbage collector isn’t an issue, though, if you need better raw performance you could use Java instead of C++. It’s basically the same language without the memory corruption issues. Java’s performance is on par with C++ for most things and can be faster in some cases.

If Clojure is fast enough, though, stick with that. It’s a nicer language to work with.

Kah0ona · December 11, 2020, 3:30pm

lol put a smile on my face

joinr · December 11, 2020, 11:38pm

Building off earlier replies, I noticed that your are working with primitive byte arrays. That opens up additional opportunity for optimization if you are willing to deviate from the path a bit (depending on how far you’d like to deviate, you could get even more performance…).

Looking primarily at overlap, the variant that Sean proposed, we see this with the sample data under criterium:

(require '[criterium.core :as c])
(def b (byte-array [-1 1 2 3 -1 4 5 6]))

;; user> (c/quick-bench (overlap 4 b))
;; Evaluation count : 212262 in 6 samples of 35377 calls.
;; Execution time mean : 2.836395 µs

So that’s our baseline. Assuming scanning for overlaps is the bottleneck, we can leverage primitive math and contiguous arrays to avoid the penalties of the more general boxed operations.

(set! *unchecked-math* :warn-on-boxed)

Try to work with primitive unboxed values…since we have byte arrays, we can play with longs and bytes primitively…

We’d like to encode “new/not initialized” and “nil” as long values that are outside the byte range.

(def  bytenil (long -129))
(def  bytenew (long -130))

We’ll develop a function that performs a similar check to lists-equal, except we scan a primitive byte array and compare it to contents in a primitive long array with encoded values for bytenil and bytenew. use unchecked math where possible.

(defn scan-bytes
  [^long len ^longs res ^bytes b]
  (if (== (aget res 0) ^long bytenew)
    ;;uninitialized array, e.g. first reduction, just copy.
    (do (dotimes [i len]
          (aset res i (aget b i)))
        res)
    ;;otherwise do the scan like before
    (loop [idx 0]
      (if (== idx len)
        res
        (let [l (aget res idx)]
          ;;skip nil values since they already mismatch
          (when (> l  ^long bytenil)
            (let [r (aget b idx)]
              (aset res idx  ;;store match or nil
                    (if (== l r)
                      l
                      ^long bytenil))))
          (recur (inc idx)))))))

partition returns sequences of sequences, which are boxed. Our function scan-bytes wants to use a primitive byte array (a logical partition of the original buffer). So we need to define a way to generate these chunks. The naive implementation is to just return a seq of byte arrays based on indexing.

(defn byte-partitions [len ^bytes buf]
  (for [i (range 0 (alength buf) len)]
    (java.util.Arrays/copyOfRange buf  ^long i ^long (unchecked-add i len))))

This is faster than original implemention, about 2x, but still suboptimal due to needless lazy seq overhead

We have the promise of clojure.core/eduction, which provides a recipe for a transducing operation on a collection that can be invoked via reduce with no intermediate seq. However, when we use it, we still incur some overhead due to its generality (many xforms can be supplied, and they are sorted out with butlast/last seq calls that cost us in the small)

It’s still roughly 2x slower in this case, as it creates sequences when computing the eduction transformer through comp, which is not necessary for our binary arity:

(defn byte-partitions [len ^bytes buf]
  (eduction
   (map (fn [i]
          (java.util.Arrays/copyOfRange buf  ^long i ^long (unchecked-add i len))))
   (range 0 (alength buf) len)))

Fortunately we can just create the eduction manually since we have 2 arguments.
There is no need to the comp stuff. This is faster (although technically it depends on the internals of the Eduction constructor, which may get you scolded by clojure.core maintainers, but for let’s live dangerously for now).

(defn byte-partitions [len ^bytes buf]
  (clojure.core.Eduction.
   (map (fn [i]
          (java.util.Arrays/copyOfRange buf  ^long i ^long (unchecked-add ^long i ^long len))))
   (range 0 (alength buf) len)))

Our byte array overlap definition uses scan-bytes and byte-partitions to scan the buffer and compute long array of encoded values for nil or the original numeric byte value (as a long). We then coerce that into a boxed vector as with overlap, replacing bytenil values with nil.

(defn boverlap
  "overlap lengths of buffer with equality check. keeps value if equal, else nil."
  [len ^bytes buf]
  (->> (reduce  (fn [known bs]
                  (scan-bytes len known bs)) (long-array len bytenew) (byte-partitions len buf))
       (mapv (fn [^long x] (if (== x ^long bytenil) nil x)))))

;; user> (c/quick-bench (boverlap 4 b))
;; Evaluation count : 1033590 in 6 samples of 172265 calls.
;; Execution time mean : 562.870507 ns
;;~5x faster

Profiling with visualvm indicates we’re paying some costs even with reduce, so we can unroll into a simple loop without too much trouble: unroll the kraken

(defn boverlap2
  "overlap lengths of buffer with equality check. keeps value if equal, else nil."
  [len ^bytes buf]
  (let [bound (alength buf)
        len   (long len)]
    (loop [idx   0
           known (long-array len bytenew)]
      (if (== idx bound)
        (mapv (fn [^long x] (if (== x ^long bytenil) nil x)) known)
        (let [chunk (java.util.Arrays/copyOfRange buf  idx ^long (unchecked-add idx len))]
            (recur (unchecked-add idx len)
                   (scan-bytes len known chunk)))))))

;;user> (c/quick-bench (boverlap2 4 b))
;;Evaluation count : 1508508 in 6 samples of 251418 calls.
;;Execution time mean : 397.060787 ns
;;~7.14x faster

We can reuse the same buffer during reduction, maybe array allocation is costing:

(defn boverlap3
  "overlap lengths of buffer with equality check. keeps value if equal, else nil."
  [len ^bytes buf]
  (let [bound (alength buf)
        len   (long len)
        chunk (byte-array len)]
    (loop [idx   0
           known (long-array len bytenew)]
      (if (== idx bound)
        (mapv (fn [^long x] (if (== x ^long bytenil) nil x)) known)
        (do (System/arraycopy  buf  idx chunk 0  len)
            (recur (unchecked-add idx len)
                   (scan-bytes len known chunk)))))))

That turns out to be about the same speed with with multiple copies (maybe it matters if we have many chunks or partitions, where allocation could creep up).

Another relatively “expensive” area is the creation of the vector at the end. One option is to not create a vector and just make an object array. It’s compatible with seq stuff and indexed access, so maybe it’s passable. Will this speed up anything?

(defn box-array ^objects [^longs from]
  (let [bound  (alength from)
        res    (object-array bound)
        nilify (fn [^long x]
                  (if (== x ^long bytenil) nil x))]
    (dotimes [i bound]
      (aset res i (nilify (aget from i)) ))
    res))

(defn boverlap4
  "overlap lengths of buffer with equality check. keeps value if equal, else nil."
  [len ^bytes buf]
  (let [bound (alength buf)
        len   (long len)
        chunk (byte-array len)]
    (loop [idx   0
           known (long-array len bytenew)]
      (if (== idx bound)
        (box-array known)
        (do (System/arraycopy  buf  idx chunk 0  len)
            (recur (unchecked-add idx len)
                   (scan-bytes len known chunk)))))))

;; user> (c/quick-bench (boverlap4 4 b))
;; Evaluation count : 6375900 in 6 samples of 1062650 calls.
;; Execution time mean : 93.614657 ns
;;~30x faster

What if we coerce the boxed array to a vector? Is that less costly than mapv in this case, and faster than our original method?

(defn boverlap5
  "overlap lengths of buffer with equality check. keeps value if equal, else nil."
  [len ^bytes buf]
  (let [bound (alength buf)
        len   (long len)
        chunk (byte-array len)]
    (loop [idx   0
           known (long-array len bytenew)]
      (if (== idx bound)
        (vec (box-array known))
        (do (System/arraycopy  buf  idx chunk 0  len)
            (recur (unchecked-add idx len)
                   (scan-bytes len known chunk)))))))

;;user> (c/quick-bench (boverlap5 4 b))
;;Evaluation count : 3203394 in 6 samples of 533899 calls.
;;Execution time mean : 185.718450 ns

~15x faster, so creating the vector costs us. If we can live with an object array, that’s better.

We can define a wrapper type around the primitive long array, and eliminate the need for any intermediate byte array if we just index into the buffer.

(defn scan-bytes2
  [^long bidx ^long len ^longs res ^bytes b]
  (if (== (aget res 0) ^long bytenew)
    ;;uninitialized array, e.g. first reduction, just copy.
    (do (dotimes [i len]
          (aset res i (aget b i)))
        res)
    ;;otherwise do the scan like before
    (loop [idx  0]
      (if (== idx len)
        res
        (let [l (aget res idx)]
          ;;skip nil values since they already mismatch
          (when (> l  ^long bytenil)
            (let [r (aget b (unchecked-add bidx idx))]
              (aset res idx  ;;store match or nil
                    (if (== l r)
                      l
                      ^long bytenil))))
          (recur (inc idx)))))))

Defines a wrapper around our array with long-encoded nil values, which is extremely quick to create, but provides enough vector-like access to be used as a vector (not all operations are implemented like hashing and friends, but it’s not hard to do so if necessary). This isn’t exhaustively tested either, but works for the immediate task at hand:

(deftype longvec [^longs coll ]
  clojure.lang.IPersistentVector
  clojure.lang.Seqable
  (seq [this] (map (fn [^long x]
                     (if (== x ^long bytenil) nil x)) coll))
  clojure.lang.Indexed
  (nth [this idx]
    (let [n (aget coll idx)]
      (if (== n ^long bytenil) nil n)))
  (nth [this idx not-found]
    (if (and (> idx -1)
             (< idx (alength coll)))
      (.nth this idx)
      not-found))
  (cons [this item] (conj (vec this) item))
  clojure.lang.Counted
  (count [this] (alength coll))
  java.util.List
  (add [this item] (throw (ex-info "not-implemented" {:op :add})))
  (get [this idx]  (.nth this idx))
  java.lang.Iterable
  (iterator [this]
    (.iterator ^java.lang.Iterable (.seq this)))
  (toArray [this] (box-array coll)))

We use the new stuff:

(defn boverlap6
  "overlap lengths of buffer with equality check. keeps value if equal, else nil."
  [len ^bytes buf]
  (let [bound (alength buf)
        len   (long len)]
    (loop [idx   0
           known (long-array len bytenew)]
      (if (== idx bound)
        (longvec. known)
        (recur (unchecked-add idx len)
               (scan-bytes2 idx len known buf))))))

;;user> (c/quick-bench (boverlap6 4 b))
;;Evaluation count : 12316662 in 6 samples of 2052777 calls.
;;Execution time mean : 47.068487 ns

Eliminating the vector/array creation by using a wrapper type netted us 60x faster than the baseline without much sacrifice (result is a custom vector).

In summary, this probably doesn’t matter since it’s only a piece of your program, but if you are working in areas where primitives can be leveraged, definitely consider it. Particularly for performance oriented bottlenecks (e.g. numerics, working with bytes). If you can coerce your problem into primitive collections and operations, then the gains may be substantial enough to warrant chunking a bit of elegance (not all elegance though). There are additional means to get more performance beyond this kind of interop as well.

mdiin · December 12, 2020, 9:08am

You are really good at explaining this kind of optimisation approach and the reasoning behind each consecutive step. I have mostly never had the need for this kind of performance, but I always enjoy reading your explanations.

magnus0re · December 12, 2020, 9:11pm

Wow!
First of all, this is way out of my comfort zone in Clojure. Thank you for introducing it to me.

These are some great optimizations. Seeing this code makes me realize that I have lots of places where I can optimize quite a lot.

I have to learn to use visualvm now.

I’ve been studying the code today, and it is a great learning experience!

joinr · December 12, 2020, 10:08pm

Very glad the poorly spell-checked prose was informative

joinr · December 12, 2020, 10:13pm

Fyi, if we leverage bytenil and bytenew as constants:

(def ^:const bytenil -129)
(def ^:const bytenew -130)

then the clojure compiler will just inline the longs for us. That eliminates some overhead and gets the runtime to

user> (c/quick-bench (boverlap6 4 b))
Evaluation count : 15237528 in 6 samples of 2539588 calls.
             Execution time mean : 37.492619 ns
    Execution time std-deviation : 0.501656 ns
   Execution time lower quantile : 36.988994 ns ( 2.5%)
   Execution time upper quantile : 38.242332 ns (97.5%)
                   Overhead used : 1.939722 ns

Which is ~76.6x faster than baseline. I think additional inlining can be done to various effect. There’s some minor cost on the Clojure side since clojure prefers (and in a sense requires) longs as the default primitive integer type, where java array access actually expects int. This optimally compiles down to an uncheckedIntCast instruction to coerce the long to int e.g. when performing all the array-based indexing operations on the byte arrays, but there’s still some overhead. There are ways to get closer to java and avoid that (if it matters), via the JISE library or writing stuff in java and calling it via interop (less desirable for me personally).

Neo · December 13, 2020, 3:49pm

Do you have any free time to write a book on optimizing (Clojure) code? This is real mine of gold!

taf · December 13, 2020, 5:08pm

one more preorder from me!

joinr · December 14, 2020, 3:09am

There’s actually a nascent performance guide in development by @bsless as time permits. It should include some of this stuff. To be honest, most of the ideas are documented in existing texts. clojure-goes-fast has a lot of focused information as well.

Leveraging primitives is not a new idea. Moving from the general polymorphic boxed functions and representations into something specialized and primitive is a common theme. How far some texts go varies though. Are you willing to push all of your conditional logic into mathematical operations or bitwise ops? Can you represent your elegant sparse map lookup as an efficient array lookup in a primitive array? There are some interesting trade offs to weigh as you examine these, since even with macros and helper functions, you start to run into some encumbrance.

I think what’s not documented is the actual iterative process of taking a function or a program and honing in on the problem spots. Taking a profiler like VisualVM, running a benchmark, collecting the performance profile, snapshotting it, digging through the salient information in the performance analysis, interpreting the information from the statistics, hypothesizing plausible changes to change the profile (eliminate a bottleneck), then testing the hypothesis.

Maybe there’s a useful guide or blog post that describes this process, or at least what I have converged on for improving performance. I’ll ponder that.

winsome · December 15, 2020, 3:08pm

There was a Strange Loop talk from a few years ago about a “causal profiler”, coz that really opened my eyes into how useful a profiler can be in identifying bottlenecks. There is a version for the jvm, jcoz, but it’s not very well documented. If I were a rich person I would hire someone to make that an easy-to-use tool for clojure code.

mvarela · December 15, 2020, 6:37pm

There was actually some discussion about this last year, but I don’t know if the project ever took off… I’ve had the idea in the back-burner since I saw a talk about coz, which was utterly amazing.

This is the topic: Avoiding recursive function replacement with-redefs-fn

vvvvalvalval · December 15, 2020, 6:45pm

Come on Rich Hickey, stop pretending, we know it’s you.

taf · December 16, 2020, 1:30am

let’s hope you winsome money in the lottery then.

also, welcome to the community!

Victor_Gil · December 18, 2020, 11:11am

that answer was pure awesomeness.

system · June 18, 2021, 11:12pm

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.