Help with transducers performance

Hi everyone!

I have difficulty wrapping my head around this issue: I think I’m starting to understand transducers, but when I try to use one in a pipeline I’m developing I get a weird result concerning performance.

I have a folder of JSON files (about 50 MB in total, and growing over time) that I want to process:

(defn read-all-files
  "Read all JSON files in the given directory"
  [dir]
  (->> (all-files-in-dir dir)
       (pmap parse-with-city)
       (pmap get-all-maps)
       (apply concat)))

(quick-bench 
  (->> (read-all-files dir)
       (map #(join-strs :Description " + " %))
       (map #(apply dissoc % bad-fields))
       (map create-id)))

=> Evaluation count : 12 in 6 samples of 2 calls.
             Execution time mean : 77.889158 ms
    Execution time std-deviation : 13.784175 ms
   Execution time lower quantile : 51.979992 ms ( 2.5%)
   Execution time upper quantile : 90.811054 ms (97.5%)
                   Overhead used : 8.837352 ns

Found 1 outliers in 6 samples (16.6667 %)
	low-severe	 1 (16.6667 %)
 Variance from outliers : 47.9213 % Variance is moderately inflated by outliers

So I was thinking: what if I build a transducer that makes only one pass over the data?

(def test-transducer
  (comp
    (map #(join-strs :Description " + " %))
    (map #(apply dissoc % bad-fields))
    (map create-id)))

(quick-bench
  (into [] test-transducer (read-all-files dir)))

=> Evaluation count : 6 in 6 samples of 1 calls.
             Execution time mean : 565.977659 ms
    Execution time std-deviation : 19.585085 ms
   Execution time lower quantile : 542.982992 ms ( 2.5%)
   Execution time upper quantile : 589.564617 ms (97.5%)
                   Overhead used : 8.837352 ns

This is not the result I would expect, so either I totally misunderstood transducers or I made a stupid mistake I can’t see anywhere :sweat_smile:

Can anyone help me understand if there is a mistake or if this is an expected behaviour?

Lazy sequences. You never realize any of your values in the first bench.
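
To see the effect in isolation, here is a minimal sketch (the `calls` counter and `counted-inc` are made up for illustration, they are not from the code above): a lazy `map` does no work until something consumes it, so a benchmark over the unrealized pipeline mostly measures how long it takes to build the lazy seq.

```clojure
;; Hypothetical counter, only here to observe when the mapping fn runs.
(def calls (atom 0))

(defn counted-inc
  "Stand-in for a processing step; records every invocation."
  [x]
  (swap! calls inc)
  (inc x))

;; Building the lazy pipeline runs nothing yet.
(def pending (map counted-inc (range 100)))
(def calls-before @calls)

;; doall walks the whole seq and forces every element.
(def realized (doall pending))
(def calls-after @calls)
```

Here `calls-before` is still 0, and only after `doall` has the function run once per element.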

Also:

  • Better to use a single (pmap (comp get-all-maps parse-with-city)) instead of two pmap passes
  • Drop the (apply concat ...) and use the cat transducer instead
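
A rough sketch of what those two suggestions could look like together. `parse-with-city` and `get-all-maps` below are hypothetical stand-ins (the real ones are not shown in this thread), `read-file-groups` is an assumed name, and the final `map` step is only a placeholder for join-strs / dissoc / create-id:

```clojure
;; Hypothetical stand-ins so the sketch runs on its own.
(defn parse-with-city [file] {:city file :items [1 2 3]})
(defn get-all-maps [parsed]
  (mapv (fn [n] {:city (:city parsed) :n n}) (:items parsed)))

(defn read-file-groups
  "One pmap pass per file; returns a seq of per-file vectors, not yet flattened."
  [files]
  (pmap (comp get-all-maps parse-with-city) files))

(def xf
  (comp
    cat                         ; flattens the per-file vectors, replacing (apply concat)
    (map #(update % :n inc))))  ; placeholder for the real per-record steps

(def result (into [] xf (read-file-groups ["a" "b"])))
```

Because `cat` sits inside the transducer, the flattening happens in the same pass as the per-record steps and no intermediate concatenated seq is built.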

You’re right about the lazy sequences. Sorry, I was in a hurry. Now I’m doing:

(quick-bench 
  (->> (read-all-files dir)
       (map #(join-strs :Description " + " %))
       (map #(apply dissoc % bad-fields))
       (map create-id)
       doall))

=> Execution time mean : 603.928660 ms

(quick-bench
  (into [] test-transducer (read-all-files dir)))

=> Execution time mean : 595.415493 ms

I’ll try right away with your advice!

After applying your advice, the sequence version actually got worse, while the transducer version stayed more or less the same:

  • Result with sequence -> Execution time mean : 807.861160 ms
  • Result with transducer -> Execution time mean : 571.074660 ms

I think I’m going to get rid of concurrency for benchmarking, because I’m starting to think there are some issues on that side

Without knowing what your data looks like, it’s tough to know how to speed this up. E.g. the apply isn’t too fast, but it could also be your join-strs function or create-id. You could likely use one pmap and do everything you need in that one function? Just a guess…

I have about 110 JSON files with different sizes and they contain structures like this one below (sorry about the Italian, but it doesn’t make much difference):

{:Vani ["2,5"],
  :Genere ["Immobiliare"],
  :Tipologia ["Appartamento"],
  :Nome "Appartamento Asta giudiziaria Via II Cortina S. Anna, 73 Portici",
  :Descrizione ["Appartamento ubicato al piano primo."],
  :Stato "A244902\n Da bandire tra più di 15 giorni",
  :Offerta_minima ["€ 11.270,40"],
  :Pubblicata_il ["24/01/2018"],
  :Esecuzione_immobiliare ["Nº 1818/2010"],
  :Deposito_in_conto_spese [""],
  :Numero_beni ["1"],
  :Disponibilità ["Occupato senza titolo opponibile."]}

The join-strs function updates the map like this and the create-id does something similar:

(defn join-strs
  "Join all the strs in a collection with 
  the given separator for the provided key"
  [k sep coll]
  (update coll k #(s/join sep %)))

My aim was to do everything in one pass at the end of the pipeline, since I still have some other work to do. Anyway, after replacing pmap with map I get even weirder results:

  • Sequence -> Execution time mean : 586.153660 ms
  • Transducer -> Execution time mean : 728.618326 ms

I also understand the JVM might be at play here, and the results I get after processing are correct (I checked). What I don’t understand is how the result above is even possible, given that the sequence version passes over the data several times and the transducer only once… :confounded:

Is pmap actually reading the files? It might not be giving you much parallelism for IO. It’s mostly meant for computation.

Note that pmap’s parallelism is not configurable: it is derived from the number of available processors (plus two). If you want more threads for IO, use your own executor.
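
A minimal sketch of the own-executor route, assuming a plain fixed-size thread pool from `java.util.concurrent` is good enough. `pool-map` is a made-up helper name, not something from Clojure core:

```clojure
(import '(java.util.concurrent Executors ExecutorService Future))

(defn pool-map
  "Apply f to every item of coll on a fixed pool of n threads.
  Returns a vector of results in input order."
  [n f coll]
  (let [^ExecutorService pool (Executors/newFixedThreadPool n)]
    (try
      (let [tasks   (mapv (fn [x] (fn [] (f x))) coll) ; Clojure fns implement Callable
            futures (.invokeAll pool tasks)]           ; blocks until every task is done
        (mapv (fn [^Future fut] (.get fut)) futures))
      (finally
        (.shutdown pool)))))
```

Something like `(pool-map 16 slurp files)` would then read with 16 threads regardless of the core count. Whether that helps depends on the disk, so it is worth measuring.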

I will try that thanks!

Anyway I had an issue with the running JVM instance, and after some profiling it became clear the overhead coming from the reading function is too high to include it in the evaluation.
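
For anyone wanting to reproduce this, a sketch of keeping the reading out of the measurement: load (or fake) the records once, then benchmark only the transformation. The `records` and `bad-fields` values below are invented placeholders, and create-id is left out because its code isn’t shown here:

```clojure
(require '[clojure.string :as s])

;; Hypothetical in-memory records standing in for the parsed files.
(def records
  (vec (repeat 1000 {:Descrizione ["a" "b"] :Vani ["2,5"] :Stato "x"})))

;; Assumed value; the real list of bad fields is not shown in this thread.
(def bad-fields [:Stato])

(defn join-strs [k sep coll]
  (update coll k #(s/join sep %)))

(def xf
  (comp (map #(join-strs :Descrizione " + " %))
        (map #(apply dissoc % bad-fields))))

(defn seq-version
  "Lazy-seq pipeline, fully realized with doall."
  [rs]
  (->> rs
       (map #(join-strs :Descrizione " + " %))
       (map #(apply dissoc % bad-fields))
       doall))

(defn xf-version
  "Same steps as a single transducer pass into a vector."
  [rs]
  (into [] xf rs))

;; With criterium on the classpath, compare:
;;   (quick-bench (seq-version records))
;;   (quick-bench (xf-version records))
```

Both versions produce equal results, so any difference between the two benchmarks comes from the pipeline itself rather than from file IO or JSON parsing.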

I was wondering whether iota could help in this case, but I guess I’ll only know by trying!