TMD / Tablecloth Differentiation? How to use the libraries properly?

Introduction

I was trying to take the derivative of some unevenly sampled data while (slowly) learning tmd and the related data-science libraries. So I explored a bit, wondered where the performance hogs are, and whether there are good ways of using the libraries for better performance.

Basically, the goal of the function is to take two columns, each representing a variable in the dataset (say, data over time), take the diff of each of those columns, and divide the results elementwise.

First attempts

I began with what felt like it should be an ‘okay’ implementation (keep in mind I’m a n00b with these libraries):

(ns dev.differentiation
  (:require
   [tech.v3.dataset :as ds]
   [tech.v3.dataset.reductions :as ds-reduce]
   [tech.v3.datatype :as dtype]
   [tech.v3.datatype.gradient :as dt-grad]
   [tech.v3.datatype.functional :as dfn]
   [tablecloth.api :as api]))
(set! *warn-on-reflection* true)
(set! *unchecked-math* :warn-on-boxed)

(def example-data
  (api/dataset {:time [1 2 3 6 8 11 203 211] :x (take 8 (repeatedly #(rand 100)))}))

(defn d-dt1 [ds time-col data-col result-col]
  (let [grad-t (cons 1 (dt-grad/diff1d (time-col ds)))
        grad-x (cons 0 (dt-grad/diff1d (data-col ds))) ; pad the first element so lengths match (diff1d shortens by 1)
        compute-col (map #(/ ^Double %1 ^Double %2) grad-x grad-t)]
    (api/add-column ds result-col compute-col)))

Some demo data

(defn random-example-data [n]
  (api/dataset {:time (take n (iterate #(+ % (rand)) 1)) :x (take n (repeatedly #(rand 100)))}))

(def medium-example-data
  (random-example-data 10000))

(def big-example-data
  (random-example-data 10000000))

And a benchmark

(require '[criterium.core :as c])

(c/quick-bench (-> example-data
                     (d-dt1 :time :x :dt-dx))) ; 110 us


(c/quick-bench (-> medium-example-data
                     (d-dt1 :time :x :dt-dx))) ; 4 ms

(c/quick-bench (-> big-example-data
                     (d-dt1 :time :x :dt-dx))) ; 5 s

Okay, so 10e6 data points × 2 columns × (2 ops for the diff + 1 op for the division) → 60e6 operations minimum, and 60e6 / 5 s gives a rough estimate of about 12e6 operations per second. Slow. I would like it a factor of 1000 faster, please :angel:. The hardware should be able to do more than one math op per cycle per thread, which is on the order of 1e9 operations per second, times however many cores you have.

Also, I’m not sure whether the big example data is realized in memory before the benchmark is run, so there might be some measurement error in there.

This clearly illustrates that I’m using the library inefficiently. So, for a second try, I read some more API docs.

(defn d-dt2 [ds time-col data-col result-col]
  (let [grad-t (cons 1 (dt-grad/diff1d (time-col ds)))
        grad-x (cons 0 (dt-grad/diff1d (data-col ds)))
        ds-intermediate (into ds [[:dt grad-t] [:dx grad-x]])
        ds-result (assoc ds
                         result-col (dfn// (ds-intermediate :dx) (ds-intermediate :dt)))]
    ds-result))

I found the docs for tech.v3.datatype.functional and saw that you can call its functions directly on columns. No performance increase yet, though.

For the third attempt, I got a factor-of-3 speed improvement on the big example data. Not bad, I was thinking at the time!

(defn d-dt3 [ds time-col data-col result-col]
  (let [ds (assoc ds
                  :dx (dtype/->array :float64 (cons 0 (dt-grad/diff1d (ds data-col))))
                  :dt (dtype/->array :float64 (cons 1.0 (dt-grad/diff1d (ds time-col)))))]
    (assoc (dissoc ds :dx :dt)
           result-col (dfn// (ds :dx) (ds :dt)))))

(c/quick-bench (-> example-data
                    (d-dt3 :time :x :dt-dx))) ; 500 us

(c/quick-bench (-> big-example-data
                   (d-dt3 :time :x :dt-dx))) ; 2 s !

Just one more minute…

While writing this post, I thought, “just one more experiment…”

(defn d-dt-values [time x]
  (let [dt (dt-grad/diff1d time)
        dx (dt-grad/diff1d x)]
    ;; dx/dt, matching d-dt1; no padding here, so the result is one element shorter
    (dfn// dx dt)))

(defn d-dt4 [ds time-col data-col result-col]
  (assoc ds result-col (d-dt-values (ds time-col) (ds data-col))))

And to my big surprise, this was way, way faster. I got really excited seeing the benchmarks.


(c/quick-bench (-> example-data
                   (d-dt4 :time :x :dt-dx))) ; 99 us an improvement!

(c/quick-bench (-> medium-example-data
                   (d-dt4 :time :x :dt-dx))) ; 290 us even better!!

(c/quick-bench (-> big-example-data
                   (d-dt4 :time :x :dt-dx))) ; 34 ms !!!!

From 6 seconds down to 0.034 seconds: a factor of almost 200.
60e6 / 0.034 ≈ 1.7e9 operations per second. Hey, this is getting good! (Unless I missed something and only the first few values are lazily evaluated in the benchmark…)

Last thoughts…

And that’s it for my experimentation. I hope someone can improve on the performance of this code a bit
and show me what speedy data-processing code looks like using these libraries!

As always, any input is greatly appreciated :smiley:

All benchmarks together:

(comment
   (c/quick-bench (-> example-data
                     (d-dt1 :time :x :dt-dx))) ; 110 us

  (c/quick-bench (-> example-data
                     (d-dt2 :time :x :dt-dx))) ; 313 us

  (c/quick-bench (-> example-data
                     (d-dt3 :time :x :dt-dx))) ; 500 us

  (c/quick-bench (-> example-data
                     (d-dt4 :time :x :dt-dx))) ; 99 us


  (c/quick-bench (-> medium-example-data
                     (d-dt1 :time :x :dt-dx))) ; 4 ms

  (c/quick-bench (-> medium-example-data
                     (d-dt2 :time :x :dt-dx))) ; 4 ms

  (c/quick-bench (-> medium-example-data
                     (d-dt3 :time :x :dt-dx))) ; 4 ms

  (c/quick-bench (-> medium-example-data
                     (d-dt4 :time :x :dt-dx))) ; 290 us

  (c/quick-bench (-> big-example-data
                     (d-dt1 :time :x :dt-dx))) ; 5 s

  (c/quick-bench (-> big-example-data
                     (d-dt2 :time :x :dt-dx))) ; 6 s 

  (c/quick-bench (-> big-example-data
                     (d-dt3 :time :x :dt-dx))) ; 2 s, some improvement

  (c/quick-bench (-> big-example-data
                     (d-dt4 :time :x :dt-dx))) ; 34 ms !
  )

Your first three examples are probably throwing away efficiency here:

(cons 0 (dt-grad/diff1d (ds data-col)))

Every time you do this, you are constructing a (boxed) seq from the dtype-next reader (an unboxed, efficient representation of contiguous memory where possible) that dt-grad/diff1d is returning. So you inject relatively massive overhead. You then invoke map over two boxed seqs, which kills iteration performance compared to the unboxed contiguous-memory case.
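A quick REPL check makes the representation change visible (a hypothetical session; the exact reader class name varies by version):

```clojure
(require '[tech.v3.datatype :as dtype]
         '[tech.v3.datatype.gradient :as dt-grad])

;; diff1d returns a dtype-next buffer/reader backed by contiguous memory
(type (dt-grad/diff1d (dtype/->reader [1.0 2.0 3.0])))
;; => some tech.v3.datatype buffer type

;; consing onto it produces an ordinary boxed Clojure seq, and every
;; downstream operation falls back to seq-at-a-time iteration
(type (cons 0 (dt-grad/diff1d (dtype/->reader [1.0 2.0 3.0]))))
;; => clojure.lang.Cons
```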

d-dt2 doesn’t fare better, since you’re still messing with seqs, coercing them into a dataset, and only then using the efficient dfn// on the tail end (after paying the upfront iteration costs).

d-dt3 does a bit better by forcing everything back into an unboxed reader representation via a typed array, although you’re constructing the array by iterating over seqs again, which costs you, so it only yields a 3x performance improvement (probably because you’re specifying the type of the dataset columns up front and t.m.d doesn’t have to check as it goes).

d-dt4 is fastest because it’s operating the way t.m.d./TMD was meant to: leaving everything in an efficient packed representation and letting dtype-next do the lifting on top of indexed readers, primitive containers, and unboxed contiguous buffers where possible.

@Chris_Nuernberger and @generateme (authors of tech.ml.dataset and tablecloth, respectively) have better insight, and probably even more efficient ways to do this. I had not thought about the case where you want to do something like (cons 0 some-reader) and stay in reader space. I am sure there is an equivalent dtype-next reader that does that (basically just concatenating two collections).

I think something like buffer concatenation works:

(require '[tech.v3.datatype.io-concat-buffer :as cb])
(cb/concat-buffers [[1 2 3] [4 5 6]])

(defn d-dt5 [ds time-col data-col result-col]
  (let [dx (cb/concat-buffers [(dtype/->array :float64 [0.0]) (dt-grad/diff1d (ds data-col))])
        dt (cb/concat-buffers [(dtype/->array :float64 [1.0]) (dt-grad/diff1d (ds time-col))])]
    (assoc ds result-col (dfn// dx dt))))

(defn test-it [data f]
  (-> data
      (f :time :x :dt-dx)))

(defn bench-it [data f]
  (c/quick-bench (test-it data f)))

(bench-it big-example-data d-dt1)
;;Execution time mean : 2.645554 sec
(bench-it big-example-data d-dt4)
;;Execution time mean : 39.803576 ms
(bench-it big-example-data d-dt5)
;;Execution time mean : 48.576457 ms

There might even be faster ways to achieve that kind of simple concatenation/cons-like behavior for prepending a single element (e.g. reifying an indexed reader). At this point, though, the overhead is mostly coming from constructing the new dataset via assoc, which has some overhead because it tries to preserve map-like semantics, so it copies. There are in-place methods that do this mutably, which would reduce the time further. Curious to hear what the authors think.
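As a sketch of that reifying idea (prepend-reader is a hypothetical helper; it assumes tech.v3.datatype/make-reader, which binds idx inside its body):

```clojure
(require '[tech.v3.datatype :as dtype])

;; hedged sketch: a lazy reader that yields `pad` at index 0 and shifts the
;; wrapped data right by one, never leaving reader space
(defn prepend-reader [pad data]
  (let [r (dtype/->reader data)
        n (dtype/ecount r)]
    (dtype/make-reader :float64 (inc n)
                       (if (== idx 0)
                         (double pad)
                         (.readDouble r (unchecked-dec idx))))))
```

This would replace both cons calls in d-dt1 while keeping everything in unboxed reader space.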

Also, the random data generation took like 34s on mine…

(defn random-example-data2 [n]
  (ds/->dataset
   {:time (dtype/->array :float64 (take n (iterate #(+ ^double % ^double (rand)) 1.0)))
    :x    (dtype/->array :float64 (take n (repeatedly #(rand 100))))}))

This takes 3s (could probably still be faster).


I don’t think I can add anything more to @joinr’s answer. Maybe some general advice: if you work with readers (the TMD data-storage abstraction), stick to them as much as possible. Jumping back and forth between dtype-next structures and Clojure functions can create performance issues.

Regarding this particular case: maybe it would be good to add a feature to diff1d allowing a 0 value to be prepended, to keep the sequence the same length?

Might make sense to have some higher-level warnings akin to warn-on-reflection, although that would mean compiler integration (at the deepest level). That, or a simple flag in the dtype-next conversion/coercion options (and subsequent TMD stuff) that would pick up on this option and warn the user that they are moving data between efficient and possibly inefficient representations.

Those kinds of error messages would be really nice!

Nice! What about pmap? Maybe parallel processing would be more performant for even larger datasets. I’m thinking 34 ms is so fast that spawning threads and dividing the work might not be worth it, but maybe for even larger data?

TMD is already using parallel index operations under the hood where possible (maybe not “everywhere” but in many places). It uses a custom indexed parallel map I believe.

edit: indexed-map-reduce
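For reference, a minimal sketch of what indexed-map-reduce looks like, assuming the tech.v3.parallel.for API where the map fn receives a chunk’s start index and length:

```clojure
(require '[tech.v3.parallel.for :as pfor])

;; hedged sketch: each worker thread receives a [start-idx group-len] chunk
;; to process with a tight primitive loop; the per-chunk partial results are
;; then combined by the reduce fn. Here: sum the indices 0..999999.
(pfor/indexed-map-reduce
 1000000
 (fn [^long start-idx ^long group-len]
   (loop [i 0 acc 0.0]
     (if (< i group-len)
       (recur (inc i) (+ acc (double (+ start-idx i))))
       acc)))
 (partial reduce +))
```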

Agreed with @joinr - not a lot to add to that answer. Instead of map, consider emap.
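A minimal sketch of what that might look like here, assuming tech.v3.datatype/emap’s (emap map-fn res-dtype & args) signature:

```clojure
(require '[tech.v3.datatype :as dtype]
         '[tech.v3.datatype.gradient :as dt-grad])

;; sketch: elementwise dx/dt via emap. Unlike clojure.core/map, emap returns
;; a typed lazy reader rather than a boxed seq. Without padding, the result
;; is one element shorter than the input columns (as in d-dt4).
(defn d-dt-emap [ds time-col data-col result-col]
  (assoc ds result-col
         (dtype/emap / :float64
                     (dt-grad/diff1d (ds data-col))
                     (dt-grad/diff1d (ds time-col)))))
```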

I added an issue for diff1d based on this: Allow prepend/append constant with diff1d to keep sequence same length · Issue #24 · cnuernber/dtype-next · GitHub

gradient specifically keeps sequences the same length while diff1d shortens them by 1.
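So a padding-free alternative is to divide the two gradients directly (a sketch; gradient1d uses central differences, so the values will differ somewhat from the forward-difference versions above):

```clojure
(require '[tech.v3.datatype.functional :as dfn]
         '[tech.v3.datatype.gradient :as dt-grad])

;; gradient1d returns a sequence the same length as its input,
;; so no cons/prepend is needed before dividing
(defn d-dt-gradient [ds time-col data-col result-col]
  (assoc ds result-col
         (dfn// (dt-grad/gradient1d (ds data-col))
                (dt-grad/gradient1d (ds time-col)))))
```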

If you absolutely had to have extreme performance it would look like:

(defn d-dt1 [ds time-col data-col result-col]
  (let [grad-t (dtype/->reader (ds time-col))
        grad-x (dtype/->reader (ds data-col))]
    ;;Explicitly specify missing and provide a reader of a specific datatype
    ;;This avoids any scanning and the dataset system will just 'trust' that you
    ;;did it right.
    (assoc ds result-col
           #:tech.v3.dataset
           {:missing []
            :data (dtype/make-reader :float64 (dtype/ecount grad-t)
                                     (if (== idx 0)
                                       0
                                       (/ (- (.readDouble grad-x idx)
                                             (.readDouble grad-x (unchecked-dec idx)))
                                          (- (.readDouble grad-t idx)
                                             (.readDouble grad-t (unchecked-dec idx))))))})))


deriv> (c/quick-bench (d-dt1 example-data :time :x :res))
Evaluation count : 18954 in 6 samples of 3159 calls.
             Execution time mean : 33.328320 µs
    Execution time std-deviation : 1.762522 µs
   Execution time lower quantile : 30.658127 µs ( 2.5%)
   Execution time upper quantile : 35.019373 µs (97.5%)
                   Overhead used : 12.885640 ns
nil
deriv> (c/quick-bench (d-dt1 medium-example-data :time :x :res))
Evaluation count : 18456 in 6 samples of 3076 calls.
             Execution time mean : 34.394285 µs
    Execution time std-deviation : 1.142906 µs
   Execution time lower quantile : 33.397362 µs ( 2.5%)
   Execution time upper quantile : 35.966656 µs (97.5%)
                   Overhead used : 12.885640 ns
nil
deriv> (c/quick-bench (d-dt1 big-example-data :time :x :res))
Evaluation count : 18078 in 6 samples of 3013 calls.
             Execution time mean : 34.129411 µs
    Execution time std-deviation : 2.519198 µs
   Execution time lower quantile : 31.100122 µs ( 2.5%)
   Execution time upper quantile : 36.379883 µs (97.5%)
                   Overhead used : 12.885640 ns
nil

Perhaps that is cheating :slight_smile: as the result is completely lazy, and the 34 µs is just the time to construct the dataset itself.


Would it? For Clojure functions to work on dtype-next structures, doesn’t that mean that somewhere some of the Clojure interfaces were implemented for them? Couldn’t those then warn?

You could detect some of this at the protocol/interface implementation level, yes. Probably target the dtype-next layer and wire in at the lower levels.
Probably the absolute simplest change, with the most bang for the buck, would be something like a warn-on-seq flag, with a check injected inside any ISeqable implementations. A little wrapper macro for certain TMD-specific functions could then detect undesirable seq coercion at runtime. Some of the collections/buffers just implement iterators and java.util.List, though, which have seq coercions in clojure.lang.RT, so it doesn’t solve everything. Could be food for thought towards simple approaches to help debug performance.