I’ve implemented a solution to a problem I’m calling “partition by capacity” (described below), which I believe is correct but not particularly idiomatic. I’m wondering if others have suggestions for a cleaner implementation.
Partition by capacity
I have a sequence of objects (specifically maps), where certain keys will be vectors. I want to partition these sequences such that the total size of all these vectors within the partition does not exceed a pre-specified capacity. There will in general be multiple keys with different capacities. In essence this is related to a Multi-dimensional knapsack problem (I think), while hopefully avoiding the NP-hardness by only requiring partitioning in order.
An example (the first argument is the capacities):
because the first map completely fills the first partition (just through :a in this case)
My solution
Below is my current implementation, which I believe works as required but presumably could be improved for readability (and quite likely efficiency)
(defn partition-by-capacity
[capacities coll]
(let [part (loop [ret []
cur 0
coll coll
;; Before adding anything, initial size is zero
current-size (zipmap (keys capacities) (repeat 0))]
(if (seq coll)
(let [[c & coll] coll
;; calculate the updated sizes if we add this newest
;; collection
new-size (reduce-kv (fn [m k v] (update m k + (count v)))
current-size
(select-keys c (keys capacities)))]
;; Check if any new size is over capacity
(if (seq (filter identity
(vals (reduce-kv (fn [m k v] (update m k > v))
new-size
capacities))))
;; If it is, start a new partition
(recur (conj ret (inc cur))
(inc cur)
coll
(reduce-kv (fn [m k v] (assoc m k (count v))) {} c))
;; Otherwise continue
(recur (conj ret cur) cur coll new-size)))
ret))]
;; partition by the generated assignments, making sure to drop the
;; annotations before returning
(->> (map assoc coll (repeat ::partition) part)
(partition-by ::partition)
(map #(map dissoc % (repeat ::partition))))))
for now I’m allowing cases that exceed capacity to just create their own partition. so the solution returned by my current implementation, which I’m happy with, would be:
(({:a [1 2 3 4], :b []}) ({:a [1], :b []}))
I’m not sure I understand your second case about granularity, could you give an example?
As a funny addendum, I looked at this and said “why not just leverage partition-by and get a transducer variant for free…just supply a custom key function…?”
So I think the subtle issue I forgot about (and an undocumented one) is that the seq version of partition-by will invoke f up to 2x for an entry (at the partition boundary). The transducer variant does not. So we get an oddity where the legacy seq variant ends up giving unexpected behavior when we use a key-fn that has internal state and is expected to be called 1x per entry, while the transducer version is fine. I think there was a semi-recent thread where this popped up. In any case, here’s another way to do it - with caveats. The seq implementation could probably be made to conform though.
oh yea that’s great, I had been trying to think up a good way to make a key function for this! I definitely would have had a hard time debugging that issue though.
I also found a strange behavior from your first implementation, which for the life of me I can’t track down the source of: