Partition by capacity

I’ve implemented a solution to a problem I’m calling “partition by capacity” (described below), which I believe is correct but not particularly idiomatic. I’m wondering if others have suggestions for a cleaner implementation.

Partition by capacity
I have a sequence of objects (specifically maps), where certain keys will be vectors. I want to partition these sequences such that the total size of all these vectors within the partition does not exceed a pre-specified capacity. There will in general be multiple keys with different capacities. In essence this is related to a Multi-dimensional knapsack problem (I think), while hopefully avoiding the NP-hardness by only requiring partitioning in order.

An example (the first argument is the capacities):

(partition-by-capacity {:a 4 :b 3} [{:a [1 2 3 4] :b []} {:a [1] :b []}])

returns (e.g. two partitions)

(({:a [1 2 3 4], :b []}) ({:a [1], :b []}))

because the first map completely fills the first partition (just through :a in this case)

My solution
Below is my current implementation, which I believe works as required but presumably could be improved for readability (and quite likely efficiency)

(defn partition-by-capacity
  [capacities coll]
  (let [part (loop [ret []
                    cur 0
                    coll coll
                    ;; Before adding anything, initial size is zero
                    current-size (zipmap (keys capacities) (repeat 0))]
               (if (seq coll)
                 (let [[c & coll] coll
                       ;; calculate the updated sizes if we add this newest
                       ;; collection
                       new-size (reduce-kv (fn [m k v] (update m k + (count v)))
                                           current-size
                                           (select-keys c (keys capacities)))]
                   ;; Check if any new size is over capacity
                   (if (seq (filter identity
                              (vals (reduce-kv (fn [m k v] (update m k > v))
                                               new-size
                                               capacities))))
                     ;; If it is, start a new partition
                     (recur (conj ret (inc cur))
                            (inc cur)
                            coll
                            (reduce-kv (fn [m k v] (assoc m k (count v))) {} c))
                     ;; Otherwise continue
                     (recur (conj ret cur) cur coll new-size)))
                 ret))]
    ;; partition by the generated assignments, making sure to drop the
    ;; annotations before returning
    (->> (map assoc coll (repeat ::partition) part)
         (partition-by ::partition)
         (map #(map dissoc % (repeat ::partition))))))
1 Like

Semantics aren’t clear to me for cases where partition size is exceeded or the entries are not granular enough to pack properly.

What is the result for ?

(partition-by-capacity {:a 3 :b 3} [{:a [1 2 3 4] :b []} {:a [1] :b []}])

for now I’m allowing cases that exceed capacity to just create their own partition. so the solution returned by my current implementation, which I’m happy with, would be:

(({:a [1 2 3 4], :b []}) ({:a [1], :b []}))

I’m not sure I understand your second case about granularity, could you give an example?

my loosely verified take:

(defn partition-by-capacity [capacities coll]
  (let [aux  (fn aux [caps part xs]
               (if-let [x (first xs)]
                 (let [new-caps (reduce-kv
                                 (fn [acc k v]
                                   (let [remaining (- (acc k) (count v))]
                                     (if (pos? remaining)
                                       (assoc acc k remaining)
                                       (reduced :empty))))
                                 caps x)]
                   (if (= new-caps :empty)
                     (lazy-seq (cons (conj part x) (aux capacities [] (rest xs))))
                     (aux new-caps (conj part x) (rest xs))))
                 [part]))]
    (aux capacities [] coll)))
1 Like

this is great, thank you!

As a funny addendum, I looked at this and said “why not just leverage partition-by and get a transducer variant for free…just supply a custom key function…?”

So

(defn capacity-key [capacities]
  (let [caps      (atom capacities)
        n         (atom 0)]
    (fn key-fn [x]
      (let [p @n
            res (reduce-kv
                 (fn [acc k v]
                   (let [remaining (- (acc k) (count v))]
                     (if (pos? remaining)
                       (assoc acc k remaining)
                       (reduced nil))))
                 @caps x)]
        (if res
          (reset! caps res)
          (do (reset! caps capacities)
              (swap! n inc)))
        p))))

;;invokes the seq path of partition-by...this will matter later
(defn partition-by-capacity2 [capacities xs]
  (partition-by (capacity-key capacities) xs))

;;some testing utils.
(def raw (->> (fn []
                   {:a (vec (range 1 (inc (rand-int 4))))
                    :b (vec (range 1 (inc (rand-int 3))))})
                 (repeatedly 10)))

(defn check [xs]
  (for [part xs]
    (->> part
         (reduce (fn [acc {:keys [a b]}]
                   (-> acc
                       (update :a + (count a))
                       (update :b + (count b)))) {:a 0 :b 0}))))

(check (partition-by-capacity {:a 4 :b 3} raw))
;;({:a 5, :b 2} {:a 5, :b 4} {:a 2, :b 3} {:a 2, :b 3})
(check (partition-by-capacity2 {:a 4 :b 3} raw))
;;({:a 5, :b 2} {:a 2, :b 0} {:a 1, :b 2} {:a 2, :b 2} {:a 2, :b 1} {:a 0, :b 2} {:a 2, :b 2} {:a 0, :b 1})
(check (into [] (partition-by (capacity-key {:a 4 :b 3})) raw))
;;({:a 5, :b 2} {:a 5, :b 4} {:a 2, :b 3} {:a 2, :b 3})

So I think the subtle issue I forgot about (and an undocumented one) is that the seq version of partition-by will invoke f up to 2x for an entry (at the partition boundary). The transducer variant does not. So we get an oddity where the legacy seq variant ends up giving unexpected behavior when we use a key-fn that has internal state and is expected to be called 1x per entry, while the transducer version is fine. I think there was a semi-recent thread where this popped up. In any case, here’s another way to do it - with caveats. The seq implementation could probably be made to conform though.

oh yea that’s great, I had been trying to think up a good way to make a key function for this! I definitely would have had a hard time debugging that issue though.

I also found a strange behavior from your first implementation, which for the life of me I can’t track down the source of:

(partition-by-capacity {:a 4, :b 3}
                       [{:a [], :b [1 2]} {:a [1 2 3], :b []} {:a [1], :b [1]}
                        {:a [1 2 3], :b []}])

will return (the last element is not in a vector)

([{:a [], :b [1 2]} {:a [1 2 3], :b []} {:a [1], :b [1]}]
 {:a [1 2 3], :b []})

it seems to be a problem with ending the on a partition with only one element, as this is fine:

(partition-by-capacity {:a 4, :b 3}
                       [{:a [], :b [1 2]} {:a [1 2 3], :b []} {:a [1], :b [1]}
                        {:a [1 2 3], :b []} {:a [1 2 3], :b []}])

returns

([{:a [], :b [1 2]} {:a [1 2 3], :b []} {:a [1], :b [1]}]
 [{:a [1 2 3], :b []} {:a [1 2 3], :b []}])

Yeah it was a screwup, should have returned [part] in that case. I updated the original answer.