How to transform nested map into flat sequence of [path, value] pairs

teodorlu · April 4, 2022, 4:58pm

Hi!

Some conversation on the Clojurians Slack got me curious about normalizing maps into a vector of [path value] pairs.

(defn todo [m])

;; such that

(= (todo
    {:a 99
     :x {:b 999
         :y {:z 123
             :w 456}}})
   [[:a] 99
    [:x :b] 999
    [:x :y :z] 123
    [:x :y :w] 456])

How would you write todo?

On the Clojurians Slack, this started in a thread, then I asked a separate question.

Teodor

joinr · April 4, 2022, 5:30pm

I resorted to dumb recursion:

ed. recovery implementation was incorrect per sample output, now is correct.

(defn recover [xs]
  (reduce (fn [acc x]
            (if  (or (vector? x) (not (seq x)))
              (into acc x)
              (into acc (recover x))))
          [] xs))

(defn normalize
  ([path m]
   (if (map? m)
     (for [[k v] m]
       (normalize (conj path k) v))
     [path m]))
  ([m] (-> (normalize [] m)
           recover)))

(def m {:a 99, :x {:b 999, :y {:z 123, :w 456}}})
;;user=> (normalize m)
;;[[:a] 99 [:x :b] 999 [:x :y :z] 123 [:x :y :w] 456]

seems like this is probably abstract-able into some kind of general reducible path walk thing.

zcaudate · April 5, 2022, 3:29am

where’s the faster, asynchronous (and obviously more superior) transducer implementation?

teodorlu · April 5, 2022, 6:08am

@zcaudate I would love to see you give it a shot

Bonus points if arbitrary deep maps don’t blow up the stack.

joinr · April 5, 2022, 6:28am

next to your humility, tool.

zcaudate · April 5, 2022, 7:04am

Have you tried already? Awesome. How deep do the maps have to get to blow up?

I’m not feeling masochistic enough to try but here’s a reduce implementation that may come in handy.

github.com

cgrand/xforms/blob/master/src/net/cgrand/xforms.cljc#L141-L161


      
          (macros/usetime
          
          ;; Workaround clojure.core/satisfies? being slow in Clojure
          ;; see https://ask.clojure.org/index.php/3304/make-satisfies-as-fast-as-a-protocol-method-call
          #?(:bb nil
             :cljd nil
             :clj
             (defn fast-satisfies?-fn
               "Ported from https://github.com/clj-commons/manifold/blob/37658e91f836047a630586a909a2e22debfbbfc6/src/manifold/utils.clj#L77-L89"
               [protocol-var]
               (let [^java.util.concurrent.ConcurrentHashMap classes
                     (java.util.concurrent.ConcurrentHashMap.)]
                 (add-watch protocol-var ::memoization (fn [& _] (.clear classes)))
                 (fn [x]
                   (let [cls (class x)
                         val (.get classes cls)]
                     (if (nil? val)
                       (let [val (core/satisfies? @protocol-var x)]
                         (.put classes cls val)
                         val)
                       val))))))

zcaudate · April 5, 2022, 7:07am

here’s a js version that’s faster:

function nested_keys(m,path){
  let out = [];
  for(let [k,v] of Object.entries(m)){
    if((null != v) && ("object" == (typeof v)) && !Array.isArray(v)){
      for(let e of nested_keys(v,[...path,k])){
        out.push(e);
      };
    }
    else{
      out.push([[...path,k],v]);
    }
  };
  return out;
}

nested_keys({"a":{"b":{"c":1,"d":2},"e":{"f":4,"g":5}}},[]);
=> [[["a", "b", "c"], 1]
    [["a", "b", "d"], 2]
    [["a", "e", "f"], 4]
    [["a", "e", "g"], 5]]

joinr · April 5, 2022, 8:27am

I didn’t catch the arbitrary nesting requirement. I think 0 seqs, with the ability to handle arbitrary nesting (I also think the docstring for clj jvm iterate is outdated, as it returns an internally reducible object that can optionally be seq coerced) [happy to be shown otherwise]:

(require '[net.cgrand.xforms :as x])

(defn compute-paths [{:keys [paths]}]
  (->> (x/for [[p c] paths]
         (if (map? c)
           (eduction cat
                     (x/for [[chld xs] c]
                       [(conj p chld) xs]))
           [p c]))
       (eduction cat (partition-all 2))))

(defn update-paths [{:keys [acc paths]} new-paths]
  (let [{:keys [mods adds]}
           (group-by (fn [[_ v]]
                       (if (map? v) :mods :adds)) new-paths)
        res  {:acc   (into acc adds)
              :paths mods}]
    (if (empty? mods)
      (reduced res)
      res)))

;;swap (into [] (comp x/last (map :acc) cat cat)))) to eduction.
(defn normalize [m]
  (->> {:acc [] :paths (into {} (map (fn [[k v]] [[k] v])) m)}
      (iterate (fn [seed]
                  (->> seed
                       compute-paths
                       (update-paths seed))))
      (eduction x/last (map :acc) cat cat)
      (into [])))

;;user=> (normalize m)
;;[[:a] 99 [:x :b] 999 [:x :y :z] 123 [:x :y :w] 456]

joinr · April 5, 2022, 2:20pm

shorter, faster recursive impl with mutation (unsafe results), possible blown stack.

(require '[net.cgrand.xforms :as x])
(import '[java.util List ArrayList])

(defn normalize [m]
  (let [aux (fn normalize! [^List acc ^List p m]
              (if (map? m)
                (->> (x/for [[k v] m]
                       (normalize! acc (doto p (.add k)) v))
                     (reduce (fn [_ _] acc)))
                (doto acc (.add (.clone p))  (.add m))))]
    (aux (ArrayList.) (ArrayList.) m)))

joinr · April 5, 2022, 4:32pm

I think this one will leverage some parallelism (although retaining order) while allowing arbitrary nesting depth. An unordered variant (pipeline preserves order) would get more gainz. Could be a fold variant around here somewhere.

(require '[net.cgrand.xforms :as x])
(require '[clojure.core.async :as a])

(defn normalize [m]
  (let [out        (a/chan 10)
        pending    (a/chan 10)
        remaining  (atom 0)
        push-work! (fn [n xs]
                     (do (swap! remaining + n)
                         (a/onto-chan! pending xs false)))
        _          (push-work! (count m) (x/for [[k v] m] [[k] v]))
        walk    (fn [[p xs]]
                  (let [res (if (map? xs)
                              (do (push-work! (count xs)
                                              (x/for [[k v] xs]
                                                [(conj p k) v]))
                                  :ignore)
                              [p xs])
                        _   (when (zero? (swap! remaining dec))
                              (a/close! pending))]
                    res))]
    (a/pipeline-blocking
     (+ 2 (.availableProcessors (Runtime/getRuntime)))
     out
     (comp (map walk)
           (filter #(not= % :ignore))
           cat)
     pending)
    (->> out
         (a/into [])
         a/<!!)))

teodorlu · April 5, 2022, 10:05pm

Using core.async internally for this is really interesting - I would never have thought of that.

joinr · April 6, 2022, 2:18am

Fun toy problem.

zcaudate · April 7, 2022, 2:46am

@joinr, it’s really interesting what you’ve done.

I’ll put what I’ve extracted out from your example as a pseudo lisp targeting js (hopefully it’s pretty readable). Because JS is single threaded, I’m using a standard array for accumulation but on the JVM, either an atom or a concurrent deque would work.

The most important bit is the step function for nested-keys-step-fn.
It takes three arguments → args, queue and accumulated output

It linearises a nested loop by taking an entry from a queue, doing an operation and putting more tasks on the queue if a child is an object:

(defn.js nested-keys-step-fn
  [args queue out]
  (var [path m] args)
  (k/for:object [[k v] m]
    (cond (k/obj? v)
          ;; add to queue if object
          (x:arr-push queue [[(:.. path) k] v])

          :else
          ;; add to accumulated output otherwise
          (x:arr-push out [[(:.. path) k] v])))
  (return [queue out]))

Examples:

(walk-step-fn [[] {"a" {"b" 2 "c" 3}}]
              []
              [])
=> [[[["a"] {"b" 2, "c" 3}]]  ;; queue
    []] ;; output

(walk-step-fn [["a"] {"b" 2, "c" 3}]
              []
              [])
=> [[] ;; queue
    [[["a" "b"] 2] [["a" "c"] 3]]] ;;output

Because the step function turns a nested loop into a linear one, we can use a generic walk implementation (using iterators, transducers, parallell threads, whatever).

The simplest is a while loop:

(defn.js walk [step-fn queue]
  (var out [])
  (while (<  0 (k/len queue))
    (var args (x:arr-pop queue))
    (step-fn args queue out))
  (return out))

and nested keys can be defined in terms of the walk function:

(defn.js nested-keys
  [m]
  (return (-/walk -/walk-step-fn [[[] m]])))

(nested-keys {:a {:b 1 :c 2}
              :d {:e 1 :f 2}})
=> [[["d" "e"] 1] 
    [["d" "f"] 2] 
    [["a" "b"] 1] 
    [["a" "c"] 2]]

The simplest async version of walk on JS looks like this:

(defn.js walk-async [step-fn queue out]
  (return (j/future
            (var args (x:arr-pop queue))
            (step-fn args queue out)
            (if (< 0 (k/len queue))
              (return (-/walk-async step-fn queue out))
              (return out)))))

(defn.js nested-keys-async
  [m]
  (return (-/walk-async -/walk-step-fn [[[] m]] [])))

and output:

(j/<! (-/nested-keys-async {:a {:b 1 :c 2}
                            :d {:e 1 :f 2}}))
[[["d" "e"] 1] [["d" "f"] 2] [["a" "b"] 1] [["a" "c"] 2]]

I don’t think it blows the stack (because of the micro-tasklet implementation) but am not completely sure.

zcaudate · April 7, 2022, 2:46am

The runnable javascript is here:

function walk_step_fn(args,queue,out){
  let [path,m] = args;
  for(let [k,v] of Object.entries(m)){
    if((null != v) && ("object" == (typeof v)) && !Array.isArray(v)){
      queue.push([[...path,k],v]);
    }
    else{
      out.push([[...path,k],v]);
    }
  };
  return [queue,out];
}

function walk(step_fn,queue){
  let out = [];
  while(0 < (queue).length){
    let args = queue.pop();
    step_fn(args,queue,out);
  }
  return out;
}

function walk_async(step_fn,queue,out){
  return new Promise(function (resolve,reject){
    try{
      resolve(      (function (){
              let args = queue.pop(); 
              step_fn(args,queue,out);
              if(0 < (queue).length){ 
                return walk_async(step_fn,queue,out);
              }
              else{
                return out;
              }
            })());
    }
    catch(e){
      reject(e);
    }
  });
}

function nested_keys(m){
  return walk(walk_step_fn,[[[],m]]); 
}

function nested_keys_async(m){
  return walk_async(walk_step_fn,[[[],m]],[]);
}

kennytilton · April 10, 2022, 10:20pm

What part of “tree” do you not respect? Resistance is futile!

Josh_Lemer · April 14, 2022, 5:30am

a simple and stack-safe solution

(defn todo [m]
  (let [out (transient [])]
    (loop [path []
           stack (list m)]
      (if (not-empty stack)
        (let [[x & xs] stack]
          (if (empty? x)
            (if (not-empty path)
              (recur (pop path) xs))
            (let [[k v] (first x)
                  kvs (rest x)
                  next-path (conj path k)]
              (if (map? v)
                (do
                  (recur next-path (conj xs kvs v)))
                (do
                  (conj! out [next-path v])
                  (recur path (conj xs kvs)))))))))
    (persistent! out)))

joinr · April 14, 2022, 7:34am

I think (in general) you need to put the out transient into an atom and swap! via conj! to mutate it (and capture the result), or include it in the loop bindings and explicitly pass the result of conj! to maintain a reference to the returned result. Depending on undefined behaviors, if you treat it as a “bash in place” mutable container and apply updates via a side-effecting model without capturing the result of conj! (as in the current do form), you can end up with incorrect data since the original reference may be modified under the hood (the result of conj! - the current state of the transient - may be a different object than the original bound value - which is not reflected in the binding). I had this error affect me years ago; uncertain what the exact failure semantics are, but the docs deter the bash-in-place pattern: “Note in particular that transients are not designed to be bashed in-place. You must capture and use the return value in the next call. In this way, they support the same code structure as the functional persistent code they replace.”

incorrect example reference

teodorlu · April 14, 2022, 8:31am

Ah, so each time you conj! onto a mutable container, you get the “new” container back (but the last one can’t be used), and have to use that?

Thanks for explaining.

Shouldn’t this be possible to fix in @Josh_Lemer’s answer by adding the transient as a loop/recur argument?

joinr · April 14, 2022, 1:22pm

Shouldn’t this be possible to fix in @Josh_Lemer’s answer by adding the transient as a loop/recur argument?

Yes, that was an alternative in the original suggestion.

Using swap! and an atom probably requires slightly less transformation of the original solution though.

Josh_Lemer · April 14, 2022, 2:49pm

Sorry I always forget about that!

(defn todo [m]
  (loop [path []
         stack (list m)
         out (transient [])]
    (if (not-empty stack)
      (let [[x & xs] stack]
        (if (empty? x)
          (if (empty? path)
            (persistent! out)
            (recur (pop path) xs out))
          (let [[[k v] & kvs] x
                next-path (conj path k)]
            (if (map? v)
              (recur next-path (conj xs kvs v) out)
              (recur path (conj xs kvs) (conj! out [next-path v])))))))))