My goal is to use core functions as much as possible (and avoid using undocumented abandonware). So far I’ve done most of section 1 (selection and filtering), and need some help with section 2, which deals with grouping and aggregations.
To make this easier, consider this simplified example. The dataframe is a vector of maps.
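For anyone following along, here is a minimal sketch of what such a dataframe might look like. The keys (:cat, :foo, :bar) match the aggregation snippet in this thread, but the values are made up for illustration.

```clojure
;; Hypothetical sample data: a "dataframe" as a vector of maps, one map per row.
(def dt
  [{:cat "a" :foo 1 :bar 10}
   {:cat "b" :foo 2 :bar 20}
   {:cat "a" :foo 3 :bar 30}])

;; group-by returns a map from each group key to a vector of the matching rows.
(def grouped (group-by :cat dt))

(get grouped "a")
;; => [{:cat "a", :foo 1, :bar 10} {:cat "a", :foo 3, :bar 30}]
```

Every aggregation is then just a function applied to each of those vectors.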
There’s also kixi.stats (https://github.com/MastodonC/kixi.stats), but it forces you to use transducers throughout, which is a bit more of an advanced Clojure concept. I’d probably recommend sticking with Incanter while you’re starting out.
I’m not 100% sure what you’re asking, but if you just want the :group key in my answer to be named :cat, you just change that keyword inside the hash-map:
(map
  (fn [[grp-key values]]
    ;; you can name the keywords :cat/:sum/:max anything you'd like
    {:cat grp-key
     :sum (reduce + (map :foo values))
     :max (reduce max (map :bar values))})
  (group-by :cat dt))
;; How can we get the average arrival and departure delay for each orig, dest pair
;; for each month for carrier code "AA"?
;; Note: mean and ungroup are helper functions, not part of clojure.core.
(->> flights
     (filter #(= (:carrier %) "AA"))
     (group-by (juxt :origin :dest :month))
     (map (fn [[grp-key values]]
            {:group grp-key
             :avg-arr-delay (mean (map :arr_delay values))
             :avg-dep-delay (mean (map :dep_delay values))}))
     (ungroup [:origin :dest :month]))
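The definitions of mean and ungroup aren't shown in the thread, so here is a rough sketch of what they might look like. It assumes ungroup takes the key names plus the rows and spreads the :group vector back into one column per key. The flights rows are made-up sample data.

```clojure
(defn mean
  "Arithmetic mean of a collection of numbers; nil for an empty collection."
  [xs]
  (when (seq xs)
    (/ (reduce + xs) (count xs))))

(defn ungroup
  "Assumed behaviour: replaces the :group vector in each row with one
  column per key name in ks."
  [ks rows]
  (map (fn [row]
         (merge (zipmap ks (:group row))
                (dissoc row :group)))
       rows))

;; Hypothetical sample rows, only for illustration.
(def flights
  [{:carrier "AA" :origin "JFK" :dest "LAX" :month 1 :arr_delay 10 :dep_delay 4}
   {:carrier "AA" :origin "JFK" :dest "LAX" :month 1 :arr_delay 20 :dep_delay 6}
   {:carrier "UA" :origin "JFK" :dest "LAX" :month 1 :arr_delay 99 :dep_delay 99}])

(def result
  (->> flights
       (filter #(= (:carrier %) "AA"))
       (group-by (juxt :origin :dest :month))
       (map (fn [[grp-key values]]
              {:group grp-key
               :avg-arr-delay (mean (map :arr_delay values))
               :avg-dep-delay (mean (map :dep_delay values))}))
       (ungroup [:origin :dest :month])))
;; => ({:origin "JFK", :dest "LAX", :month 1, :avg-arr-delay 15, :avg-dep-delay 5})
```

Because ungroup takes the rows as its last argument, it slots into the ->> pipeline without any wrapping. Note that mean returns a ratio when the sum doesn't divide evenly; wrap it in double if you want decimals.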
Please let me know if you see any obvious problems or optimisations.
One thing you could do, instead of using juxt for your grouping function, is use select-keys so that each group key is a small map containing just your “index keys”. Then just merge that map with the new keys you want.
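Applied to the simplified :cat/:foo/:bar example, the select-keys idea might look like the sketch below. The sample rows are made up; everything else is clojure.core.

```clojure
;; Hypothetical sample data with the same keys as the simplified example.
(def dt
  [{:cat "a" :foo 1 :bar 10}
   {:cat "b" :foo 2 :bar 20}
   {:cat "a" :foo 3 :bar 30}])

(def summary
  (->> dt
       ;; The group key is now a map such as {:cat "a"} rather than a bare value.
       (group-by #(select-keys % [:cat]))
       (map (fn [[grp-map values]]
              ;; Merging puts the index keys straight back into the result row,
              ;; so no separate ungroup step is needed.
              (merge grp-map
                     {:sum (reduce + (map :foo values))
                      :max (reduce max (map :bar values))})))))
;; rows produced: {:cat "a", :sum 4, :max 30} and {:cat "b", :sum 2, :max 20}
```

For several grouping columns, pass more keys to select-keys, e.g. [:origin :dest :month].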
I’d probably stick to group-by as a default though.
Edit: Also, the “values” returned in set/index would be a set rather than the vector group-by returns, and you will lose data if there are repeated values in DT.
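To make the data-loss point concrete, here is a small sketch comparing group-by with clojure.set/index on rows that contain an exact duplicate. The data is made up for illustration.

```clojure
(require '[clojure.set :as set])

;; Hypothetical rows; the first two are identical.
(def dt-with-dupes
  [{:cat "a" :foo 1}
   {:cat "a" :foo 1}
   {:cat "b" :foo 2}])

;; group-by keeps every row in a vector.
(def by-group (group-by :cat dt-with-dupes))

;; set/index keys each group by a map and stores the rows in a set,
;; so identical rows collapse into one.
(def by-index (set/index dt-with-dupes [:cat]))

(count (get by-group "a"))         ; => 2
(count (get by-index {:cat "a"}))  ; => 1
```

Any sum or count computed from the indexed version would silently come out too low for group "a".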
select-keys is a bit slower than juxt + ungroup, but has the advantage of using entirely core functions. Thanks for the tip.
My initial thought was to use clojure.set, but since this is a learning exercise, I want to do it the hard way.
We can prevent the loss of data by creating a row-number column as we read the dataset. A big advantage of clojure.set is that a lot of what a dataframe needs is already built: select, project, join, etc. cover a lot of ground.
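A rough sketch of the row-number idea, assuming the rows are already in memory as a sequence of maps. The :row-num column name and the sample rows are made up for illustration.

```clojure
(require '[clojure.set :as set])

;; Hypothetical raw rows; the first two are identical.
(def raw-rows
  [{:cat "a" :foo 1}
   {:cat "a" :foo 1}
   {:cat "b" :foo 2}])

;; Tag each row with its position so that no two rows are equal,
;; then store the dataframe as a set (a "relation" in clojure.set terms).
(def dt
  (set (map-indexed (fn [i row] (assoc row :row-num i)) raw-rows)))

(count dt)  ; => 3, the duplicate survives

;; Grouping via set/index now keeps both "a" rows.
(def by-index (set/index dt [:cat]))
(reduce + (map :foo (get by-index {:cat "a"})))  ; => 2

;; The built-in relational operations work directly on the set.
(set/project dt [:cat])            ; => #{{:cat "a"} {:cat "b"}}
(set/select #(= "b" (:cat %)) dt)  ; => #{{:cat "b", :foo 2, :row-num 2}}
```

The trade-off is that a set has no row order, so :row-num is also what you would sort by to get the original order back.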