I have difficulty wrapping my head around this issue: I think I’m starting to understand transducers, but when I try to use one in a pipeline I’m developing I get a weird result concerning performance.
I have a folder of JSON files (about 50 MB in total, and growing over time) that I want to process:
(defn read-all-files
  "Read all JSON files in the given directory"
  [dir]
  (->> (all-files-in-dir dir)
       (pmap parse-with-city)
       (pmap get-all-maps)
       (apply concat)))
(quick-bench
  (->> (read-all-files dir)
       (map #(join-strs :Description " + " %))
       (map #(apply dissoc % bad-fields))
       (map create-id)))
=> Evaluation count : 12 in 6 samples of 2 calls.
   Execution time mean : 77.889158 ms
   Execution time std-deviation : 13.784175 ms
   Execution time lower quantile : 51.979992 ms ( 2.5%)
   Execution time upper quantile : 90.811054 ms (97.5%)
   Overhead used : 8.837352 ns
   Found 1 outliers in 6 samples (16.6667 %)
   low-severe 1 (16.6667 %)
   Variance from outliers : 47.9213 % Variance is moderately inflated by outliers
So I was thinking: what if I build a transducer doing only one pass over data?
(def test-transducer
  (comp
    (map #(join-strs :Description " + " %))
    (map #(apply dissoc % bad-fields))
    (map create-id)))
(quick-bench
  (into [] test-transducer (read-all-files dir)))
=> Evaluation count : 6 in 6 samples of 1 calls.
   Execution time mean : 565.977659 ms
   Execution time std-deviation : 19.585085 ms
   Execution time lower quantile : 542.982992 ms ( 2.5%)
   Execution time upper quantile : 589.564617 ms (97.5%)
   Overhead used : 8.837352 ns
This is not the result I would expect, so either I have totally misunderstood transducers or I have made a stupid mistake that I can’t spot.
Can anyone help me understand whether there is a mistake or whether this is expected behaviour?
Without knowing what your data looks like, it’s tough to know how to speed this up. For example, the apply isn’t particularly fast, but the cost could also be in your join-strs or create-id function. You could probably use a single pmap and do everything you need in that one function? Just a guess…
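The single-pmap suggestion could look roughly like the sketch below. The bodies of parse-with-city and get-all-maps are hypothetical stand-ins (their real definitions are not shown in the thread), and per-record is a placeholder for the real per-record steps; only the shape of the pipeline matters here.

```clojure
;; Hypothetical stand-ins for the real functions, which are not shown in the thread.
(defn parse-with-city [file]
  {:city file :records [{:n 1} {:n 2}]})

(defn get-all-maps [parsed]
  (map #(assoc % :city (:city parsed)) (:records parsed)))

;; Placeholder for the per-record steps, expressed as a transducer.
(def per-record
  (map #(update % :n inc)))

(defn process-file
  "Parse one file and run every per-record step eagerly,
   so that each pmap task does all the work for its file."
  [file]
  (into [] per-record (get-all-maps (parse-with-city file))))

(defn read-and-process
  "One pmap over the files; `cat` flattens the per-file vectors."
  [files]
  (into [] cat (pmap process-file files)))

(def result (read-and-process ["a.json" "b.json"]))
```

Because process-file returns a fully realized vector, the parallel tasks do the actual work instead of handing back unrealized lazy sequences.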
I have about 110 JSON files with different sizes and they contain structures like this one below (sorry about the Italian, but it doesn’t make much difference):
{:Vani ["2,5"],
 :Genere ["Immobiliare"],
 :Tipologia ["Appartamento"],
 :Nome "Appartamento Asta giudiziaria Via II Cortina S. Anna, 73 Portici",
 :Descrizione ["Appartamento ubicato al piano primo."],
 :Stato "A244902\n Da bandire tra più di 15 giorni",
 :Offerta_minima ["€ 11.270,40"],
 :Pubblicata_il ["24/01/2018"],
 :Esecuzione_immobiliare ["Nº 1818/2010"],
 :Deposito_in_conto_spese [""],
 :Numero_beni ["1"],
 :Disponibilità ["Occupato senza titolo opponibile."]}
The join-strs function updates the map like this and the create-id does something similar:
(defn join-strs
  "Join all the strs in a collection with
   the given separator for the provided key"
  [k sep coll]
  (update coll k #(s/join sep %)))
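As a quick illustration of the first two per-record steps, here is a minimal sketch on a cut-down record. It assumes s is an alias for clojure.string; the bad-fields value and the sample record are hypothetical, since the real ones are not shown in full.

```clojure
(require '[clojure.string :as s])

(defn join-strs
  "Join all the strs in a collection with
   the given separator for the provided key"
  [k sep coll]
  (update coll k #(s/join sep %)))

;; Hypothetical list of keys to drop; the real bad-fields is not shown.
(def bad-fields [:Deposito_in_conto_spese :Numero_beni])

;; Cut-down, made-up record in the same shape as the data above.
(def sample
  {:Descrizione ["Appartamento" "piano primo"]
   :Numero_beni ["1"]
   :Deposito_in_conto_spese [""]})

;; Join the description strings, then drop the unwanted keys.
(def cleaned
  (apply dissoc (join-strs :Descrizione " + " sample) bad-fields))
;; cleaned => {:Descrizione "Appartamento + piano primo"}

;; If the key is missing, update adds it with an empty string.
(def with-missing-key
  (join-strs :Description " + " sample))
;; (:Description with-missing-key) => ""
```

The last form may be worth a glance: the benchmarked pipeline passes :Description, while the sample record uses :Descrizione.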
My aim was to do everything in one pass at the end of the pipeline; I still have some other work to do. Anyway, after replacing pmap with map I get even weirder results:
Sequence -> Execution time mean : 586.153660 ms
Transducer -> Execution time mean : 728.618326 ms
I also understand that the JVM might be at play here, and the results I get after processing are correct (I checked). What I don’t understand is how the timings above are even possible if the sequence version passes over the data many times and the transducer does so just once…
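One possible explanation, offered as a sketch rather than a diagnosis of the code above: map returns a lazy sequence, so a benchmark that only builds the ->> pipeline without consuming it may not run the mapped functions for every element, whereas into [] with a transducer always processes the whole input. The call counter below is a hypothetical device to make this visible; wrapping the lazy pipeline in doall forces the same amount of work on both sides.

```clojure
;; Counts how many times the mapped function actually runs.
(def calls (atom 0))

(defn counted-inc [x]
  (swap! calls inc)
  (inc x))

(def data (vec (range 1000)))

;; Building a lazy pipeline does not run the function at all.
(reset! calls 0)
(def lazy-result (->> data (map counted-inc) (map counted-inc)))
(def calls-after-building @calls)   ; 0

;; doall forces the sequence: both steps run for every element.
(reset! calls 0)
(def forced-result (doall (->> data (map counted-inc) (map counted-inc))))
(def calls-after-doall @calls)      ; 2000

;; into with a transducer is eager and does the same number of calls.
(reset! calls 0)
(def xf-result (into [] (comp (map counted-inc) (map counted-inc)) data))
(def calls-after-into @calls)       ; 2000
```

Note also that chained lazy map calls do not walk the data once per step; each element flows through every step as it is consumed. What the transducer saves is the intermediate lazy sequences, not whole passes.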
Anyway I had an issue with the running JVM instance, and after some profiling it became clear the overhead coming from the reading function is too high to include it in the evaluation.
I was wondering whether iota could help in this case, but I guess I’ll only know by trying!