Building off earlier replies, I noticed that your are working with primitive byte arrays. That opens up additional opportunity for optimization if you are willing to deviate from the path a bit (depending on how far you’d like to deviate, you could get even more performance…).
Looking primarily at overlap
, the variant that Sean proposed, we see this with the sample data under criterium:
(require '[criterium.core :as c])
(def b (byte-array [-1 1 2 3 -1 4 5 6]))
;; user> (c/quick-bench (overlap 4 b))
;; Evaluation count : 212262 in 6 samples of 35377 calls.
;; Execution time mean : 2.836395 µs
So that’s our baseline. Assuming scanning for overlaps is the bottleneck, we can leverage primitive math and contiguous arrays to avoid the penalties of the more general boxed operations.
(set! *unchecked-math* :warn-on-boxed)
Try to work with primitive unboxed values…since we have byte arrays, we can play with longs and bytes primitively…
We’d like to encode “new/not initialized” and “nil” as long values that are outside the byte range.
(def bytenil (long -129))
(def bytenew (long -130))
We’ll develop a function that performs a similar check to lists-equal, except we scan a primitive byte array and compare it to contents in a primitive long array with encoded values for bytenil and bytenew. use unchecked math where possible.
(defn scan-bytes
[^long len ^longs res ^bytes b]
(if (== (aget res 0) ^long bytenew)
;;uninitialized array, e.g. first reduction, just copy.
(do (dotimes [i len]
(aset res i (aget b i)))
res)
;;otherwise do the scan like before
(loop [idx 0]
(if (== idx len)
res
(let [l (aget res idx)]
;;skip nil values since they already mismatch
(when (> l ^long bytenil)
(let [r (aget b idx)]
(aset res idx ;;store match or nil
(if (== l r)
l
^long bytenil))))
(recur (inc idx)))))))
partition
returns sequences of sequences, which are boxed. Our function scan-bytes
wants to use a primitive byte array (a logical partition of the original buffer). So we need to define a way to generate these chunks. The naive implementation is to just return a seq of byte arrays based on indexing.
(defn byte-partitions [len ^bytes buf]
(for [i (range 0 (alength buf) len)]
(java.util.Arrays/copyOfRange buf ^long i ^long (unchecked-add i len))))
This is faster than original implemention, about 2x, but still suboptimal due to needless lazy seq overhead
We have the promise of clojure.core/eduction
, which provides a recipe for a transducing operation on a collection that can be invoked via reduce with no intermediate seq. However, when we use it, we still incur some overhead due to its generality (many xforms can be supplied, and they are sorted out with butlast/last seq calls that cost us in the small)
It’s still roughly 2x slower in this case, as it creates sequences when computing the eduction transformer through comp
, which is not necessary for our binary arity:
(defn byte-partitions [len ^bytes buf]
(eduction
(map (fn [i]
(java.util.Arrays/copyOfRange buf ^long i ^long (unchecked-add i len))))
(range 0 (alength buf) len)))
Fortunately we can just create the eduction manually since we have 2 arguments.
There is no need to the comp stuff. This is faster (although technically it depends on the internals of the Eduction constructor, which may get you scolded by clojure.core maintainers, but for let’s live dangerously for now).
(defn byte-partitions [len ^bytes buf]
(clojure.core.Eduction.
(map (fn [i]
(java.util.Arrays/copyOfRange buf ^long i ^long (unchecked-add ^long i ^long len))))
(range 0 (alength buf) len)))
Our byte array overlap definition uses scan-bytes
and byte-partitions
to scan the buffer and compute long array of encoded values for nil or the original numeric byte value (as a long). We then coerce that into a boxed vector as with overlap, replacing bytenil values with nil.
(defn boverlap
"overlap lengths of buffer with equality check. keeps value if equal, else nil."
[len ^bytes buf]
(->> (reduce (fn [known bs]
(scan-bytes len known bs)) (long-array len bytenew) (byte-partitions len buf))
(mapv (fn [^long x] (if (== x ^long bytenil) nil x)))))
;; user> (c/quick-bench (boverlap 4 b))
;; Evaluation count : 1033590 in 6 samples of 172265 calls.
;; Execution time mean : 562.870507 ns
;;~5x faster
Profiling with visualvm indicates we’re paying some costs even with reduce, so we can unroll into a simple loop without too much trouble: unroll the kraken
(defn boverlap2
"overlap lengths of buffer with equality check. keeps value if equal, else nil."
[len ^bytes buf]
(let [bound (alength buf)
len (long len)]
(loop [idx 0
known (long-array len bytenew)]
(if (== idx bound)
(mapv (fn [^long x] (if (== x ^long bytenil) nil x)) known)
(let [chunk (java.util.Arrays/copyOfRange buf idx ^long (unchecked-add idx len))]
(recur (unchecked-add idx len)
(scan-bytes len known chunk)))))))
;;user> (c/quick-bench (boverlap2 4 b))
;;Evaluation count : 1508508 in 6 samples of 251418 calls.
;;Execution time mean : 397.060787 ns
;;~7.14x faster
We can reuse the same buffer during reduction, maybe array allocation is costing:
(defn boverlap3
"overlap lengths of buffer with equality check. keeps value if equal, else nil."
[len ^bytes buf]
(let [bound (alength buf)
len (long len)
chunk (byte-array len)]
(loop [idx 0
known (long-array len bytenew)]
(if (== idx bound)
(mapv (fn [^long x] (if (== x ^long bytenil) nil x)) known)
(do (System/arraycopy buf idx chunk 0 len)
(recur (unchecked-add idx len)
(scan-bytes len known chunk)))))))
That turns out to be about the same speed with with multiple copies (maybe it matters if we have many chunks or partitions, where allocation could creep up).
Another relatively “expensive” area is the creation of the vector at the end. One option is to not create a vector and just make an object array. It’s compatible with seq stuff and indexed access, so maybe it’s passable. Will this speed up anything?
(defn box-array ^objects [^longs from]
(let [bound (alength from)
res (object-array bound)
nilify (fn [^long x]
(if (== x ^long bytenil) nil x))]
(dotimes [i bound]
(aset res i (nilify (aget from i)) ))
res))
(defn boverlap4
"overlap lengths of buffer with equality check. keeps value if equal, else nil."
[len ^bytes buf]
(let [bound (alength buf)
len (long len)
chunk (byte-array len)]
(loop [idx 0
known (long-array len bytenew)]
(if (== idx bound)
(box-array known)
(do (System/arraycopy buf idx chunk 0 len)
(recur (unchecked-add idx len)
(scan-bytes len known chunk)))))))
;; user> (c/quick-bench (boverlap4 4 b))
;; Evaluation count : 6375900 in 6 samples of 1062650 calls.
;; Execution time mean : 93.614657 ns
;;~30x faster
What if we coerce the boxed array to a vector? Is that less costly than mapv
in this case, and faster than our original method?
(defn boverlap5
"overlap lengths of buffer with equality check. keeps value if equal, else nil."
[len ^bytes buf]
(let [bound (alength buf)
len (long len)
chunk (byte-array len)]
(loop [idx 0
known (long-array len bytenew)]
(if (== idx bound)
(vec (box-array known))
(do (System/arraycopy buf idx chunk 0 len)
(recur (unchecked-add idx len)
(scan-bytes len known chunk)))))))
;;user> (c/quick-bench (boverlap5 4 b))
;;Evaluation count : 3203394 in 6 samples of 533899 calls.
;;Execution time mean : 185.718450 ns
~15x faster, so creating the vector costs us. If we can live with an object array, that’s better.
We can define a wrapper type around the primitive long array, and eliminate the need for any intermediate byte array if we just index into the buffer.
(defn scan-bytes2
[^long bidx ^long len ^longs res ^bytes b]
(if (== (aget res 0) ^long bytenew)
;;uninitialized array, e.g. first reduction, just copy.
(do (dotimes [i len]
(aset res i (aget b i)))
res)
;;otherwise do the scan like before
(loop [idx 0]
(if (== idx len)
res
(let [l (aget res idx)]
;;skip nil values since they already mismatch
(when (> l ^long bytenil)
(let [r (aget b (unchecked-add bidx idx))]
(aset res idx ;;store match or nil
(if (== l r)
l
^long bytenil))))
(recur (inc idx)))))))
Defines a wrapper around our array with long-encoded nil values, which is extremely quick to create, but provides enough vector-like access to be used as a vector (not all operations are implemented like hashing and friends, but it’s not hard to do so if necessary). This isn’t exhaustively tested either, but works for the immediate task at hand:
(deftype longvec [^longs coll ]
clojure.lang.IPersistentVector
clojure.lang.Seqable
(seq [this] (map (fn [^long x]
(if (== x ^long bytenil) nil x)) coll))
clojure.lang.Indexed
(nth [this idx]
(let [n (aget coll idx)]
(if (== n ^long bytenil) nil n)))
(nth [this idx not-found]
(if (and (> idx -1)
(< idx (alength coll)))
(.nth this idx)
not-found))
(cons [this item] (conj (vec this) item))
clojure.lang.Counted
(count [this] (alength coll))
java.util.List
(add [this item] (throw (ex-info "not-implemented" {:op :add})))
(get [this idx] (.nth this idx))
java.lang.Iterable
(iterator [this]
(.iterator ^java.lang.Iterable (.seq this)))
(toArray [this] (box-array coll)))
We use the new stuff:
(defn boverlap6
"overlap lengths of buffer with equality check. keeps value if equal, else nil."
[len ^bytes buf]
(let [bound (alength buf)
len (long len)]
(loop [idx 0
known (long-array len bytenew)]
(if (== idx bound)
(longvec. known)
(recur (unchecked-add idx len)
(scan-bytes2 idx len known buf))))))
;;user> (c/quick-bench (boverlap6 4 b))
;;Evaluation count : 12316662 in 6 samples of 2052777 calls.
;;Execution time mean : 47.068487 ns
Eliminating the vector/array creation by using a wrapper type netted us 60x faster than the baseline without much sacrifice (result is a custom vector).
In summary, this probably doesn’t matter since it’s only a piece of your program, but if you are working in areas where primitives can be leveraged, definitely consider it. Particularly for performance oriented bottlenecks (e.g. numerics, working with bytes). If you can coerce your problem into primitive collections and operations, then the gains may be substantial enough to warrant chunking a bit of elegance (not all elegance though). There are additional means to get more performance beyond this kind of interop as well.