Group-by and aggregate using core functions

Caveat: I am not a professional programmer and also new to Clojure/FP. I like to do some analysis by myself to keep my team honest :slight_smile:

I am trying to do all of the questions in the R data.table introduction using Clojure over here (R data.table introduction).

My goal is to use core functions as much as possible (and avoid using undocumented abandonware). So far I’ve done most of section 1 (selection and filtering), and need some help with section 2 which deals with grouping and aggregations.

To make this easier, consider this simplified example. Dataframe is a vector of maps.

(def dt [{:foo 10 :bar 20 :cat "C"}
         {:foo 15 :bar 10 :cat "C"}
         {:foo 11 :bar 9  :cat "C"}
         {:foo 17 :bar 6  :cat "B"}
         {:foo 2  :bar 14 :cat "B"}
         {:foo 15 :bar 4  :cat "B"}
         {:foo 9  :bar 6  :cat "A"}
         {:foo 7  :bar 4  :cat "A"}
         {:foo 3  :bar 7  :cat "A"}])

What I’d like to do is groupby :cat and then compute aggregations on :foo and :bar, e.g sum of all :foo and the max of all :bar grouped by :cat

So the answer should look like:

[{:cat "A" :sum-foo 19 :max-bar 7}
 {:cat "B" :sum-foo 34 :max-bar 14}
 {:cat "C" :sum-foo 36 :max-bar 20}]

Ideally, this function can be used to create arbitrary groupings and be passed arbitrary aggregation functions.

Again, I am new to this, so any and all advise is welcome. Many thanks!

Clojure has a group-by function:
https://clojuredocs.org/clojure.core/group-by

(map 
  (fn [[grp-key values]] 
    {:group grp-key
     :sum (reduce + (map :foo values))
     :max (reduce max (map :bar values))})
  (group-by :cat dt))
2 Likes

This helps, thanks! Any idea where I can find some of the more complicated aggregations like mean/median/percentile/std deviation etc?

I would prefer not to implement those.

Incanter should have everything you need:

In particular the incanter.stats namespace
http://incanter.github.io/incanter/stats-api.html

There’s also kixi.stats (https://github.com/MastodonC/kixi.stats) but it forces you to use transducers throughout which is a bit more of an advanced Clojure concept. I’d probably recommend sticking with incanter while you’re starting out.

In this case, how do I keep the original grouping columns as-is (not in a seperate :group)?

I’m not 100% positive what you’re asking but if you just want the :group key in my answer to be named :cat you just change that keyword inside the hash-map

(map 
  (fn [[grp-key values]]
    ;;you can name the keywords :cat/:sum/:max anything you'd like 
    {:cat   grp-key 
     :sum (reduce + (map :foo values))
     :max (reduce max (map :bar values))})
  (group-by :cat dt))

Apologies, I was not clear – the issue is if I pass more than one field to the group-by function like so:

(map 
  (fn [[grp-key values]]
    {:group   grp-key 
     :sum (reduce + (map :foo values))
     :max (reduce max (map :bar values))})
  (group-by (juxt :cat1 :cat2) dt))

In this case, the :group column has grouping variables in a vector

({:group ["A" "X"] :sum-foo ...

This is not conducive to chaining to other operations, so I prefer to have a regular vector of maps.

[{:cat1 "A" :cat2 "X" :sum-foo ...}
 {:cat1 "A" :cat2 "Y" :sum-foo ...}

You could always just reconstruct it from the vector key

(map 
  (fn [[grp-key values]]
    {:cat1 (first grp-key)
     :cat2 (second grp-key)
     :sum (reduce + (map :foo values))
     :max (reduce max (map :bar values))})
  (group-by (juxt :cat1 :cat2) dt))
1 Like

So I wrote a small function to do this “ungrouping” to add back the original columns

(defn ungroup [ks coll]
  (map conj coll
       (mapv zipmap (repeat ks)  (map :group coll))))

So a sample query looks like this:

;; How can we get the average arrival and departure delay for each orig, dest pair
;; for each month for carrier code "AA"?
(->> flights
     (filter #(= (:carrier %) "AA"))
     (group-by (juxt :origin :dest :month))
     (map (fn [[grp-key values]]
            {:group grp-key
             :avg-arr-delay (mean (map :arr_delay values))
             :avg-dep-delay (mean (map :dep_delay values))}))
     (ungroup [:origin :dest :month])

Please let me know if you see any obvious problems or optimisations.

One thing you could do is use instead of using juxt for your grouping function, use select-keys to use a subset of the map which just contains your “index keys” for the index instead. Then just merge that map with the new keys you want:

(def dt [{:foo 10 :bar 20 :cat1 "C" :cat2 "X"}
         {:foo 15 :bar 10 :cat1 "C" :cat2 "X"}
         {:foo 11 :bar 9 :cat1 "C" :cat2 "X"}
         {:foo 17 :bar 6 :cat1 "B" :cat2 "Y"}
         {:foo 2 :bar 14 :cat1 "B" :cat2 "Y"}
         {:foo 15 :bar 4 :cat1 "B" :cat2 "Y"}
         {:foo 9 :bar 6 :cat1 "A" :cat2 "Z"}
         {:foo 7 :bar 4 :cat1 "A" :cat2 "Z"}
         {:foo 3 :bar 7 :cat1 "A" :cat2 "Z"}])

(->> dt
     (group-by #(select-keys % [:cat1 :cat2]))
     (map 
      (fn [[grp-map values]]
        (assoc grp-map
               :sum (reduce + (map :foo values))
               :max (reduce max (map :bar values))))))
;;=>
({:cat1 "C", :cat2 "X", :sum 36, :max 20}
 {:cat1 "B", :cat2 "Y", :sum 34, :max 14}
 {:cat1 "A", :cat2 "Z", :sum 19, :max 7})

Perhaps it’s worth noting that clojure.set/index does pretty much this exact thing as well:

(require '[clojure.set :as set])
(->> (set/index dt [:cat1 :cat2])
     (map 
      (fn [[grp-map values]]
        (assoc grp-map
               :sum (reduce + (map :foo values))
               :max (reduce max (map :bar values))))))

I’d probably stick to group-by as a default though.

Edit: Also, the “values” returned in set/index would be a set rather than the vector group-by returns, and you will lose data if there are repeated values in DT.

select-keys is a bit slower than juxt + ungroup, but has the advantage of using entirely core functions. thanks for the tip.

My initial thought was to use Clojure.set, but since this is a learning excercise, I want to go it the hard way.

We can prevent the loss of data by creating a row-number column as we read the dataset. A big advantage of Clojure.set is that a lot of what a Dataframe needs is already built – select, project, join etc cover a lot of ground.