How to transform nested map into flat sequence of [path, value] pairs

joinr · April 6, 2022, 2:18am

Fun toy problem.

zcaudate · April 7, 2022, 2:46am

@joinr, it’s really interesting what you’ve done.

I’ll put what I’ve extracted out from your example as a pseudo lisp targeting js (hopefully it’s pretty readable). Because JS is single threaded, I’m using a standard array for accumulation but on the JVM, either an atom or a concurrent deque would work.

The most important bit is the step function for nested-keys-step-fn.
It takes three arguments → args, queue and accumulated output

It linearises a nested loop by taking an entry from a queue, doing an operation and putting more tasks on the queue if a child is an object:

(defn.js nested-keys-step-fn
  [args queue out]
  (var [path m] args)
  (k/for:object [[k v] m]
    (cond (k/obj? v)
          ;; add to queue if object
          (x:arr-push queue [[(:.. path) k] v])

          :else
          ;; add to accumulated output otherwise
          (x:arr-push out [[(:.. path) k] v])))
  (return [queue out]))

Examples:

(walk-step-fn [[] {"a" {"b" 2 "c" 3}}]
              []
              [])
=> [[[["a"] {"b" 2, "c" 3}]]  ;; queue
    []] ;; output

(walk-step-fn [["a"] {"b" 2, "c" 3}]
              []
              [])
=> [[] ;; queue
    [[["a" "b"] 2] [["a" "c"] 3]]] ;;output

Because the step function turns a nested loop into a linear one, we can use a generic walk implementation (using iterators, transducers, parallell threads, whatever).

The simplest is a while loop:

(defn.js walk [step-fn queue]
  (var out [])
  (while (<  0 (k/len queue))
    (var args (x:arr-pop queue))
    (step-fn args queue out))
  (return out))

and nested keys can be defined in terms of the walk function:

(defn.js nested-keys
  [m]
  (return (-/walk -/walk-step-fn [[[] m]])))

(nested-keys {:a {:b 1 :c 2}
              :d {:e 1 :f 2}})
=> [[["d" "e"] 1] 
    [["d" "f"] 2] 
    [["a" "b"] 1] 
    [["a" "c"] 2]]

The simplest async version of walk on JS looks like this:

(defn.js walk-async [step-fn queue out]
  (return (j/future
            (var args (x:arr-pop queue))
            (step-fn args queue out)
            (if (< 0 (k/len queue))
              (return (-/walk-async step-fn queue out))
              (return out)))))

(defn.js nested-keys-async
  [m]
  (return (-/walk-async -/walk-step-fn [[[] m]] [])))

and output:

(j/<! (-/nested-keys-async {:a {:b 1 :c 2}
                            :d {:e 1 :f 2}}))
[[["d" "e"] 1] [["d" "f"] 2] [["a" "b"] 1] [["a" "c"] 2]]

I don’t think it blows the stack (because of the micro-tasklet implementation) but am not completely sure.

zcaudate · April 7, 2022, 2:46am

The runnable javascript is here:

function walk_step_fn(args,queue,out){
  let [path,m] = args;
  for(let [k,v] of Object.entries(m)){
    if((null != v) && ("object" == (typeof v)) && !Array.isArray(v)){
      queue.push([[...path,k],v]);
    }
    else{
      out.push([[...path,k],v]);
    }
  };
  return [queue,out];
}

function walk(step_fn,queue){
  let out = [];
  while(0 < (queue).length){
    let args = queue.pop();
    step_fn(args,queue,out);
  }
  return out;
}

function walk_async(step_fn,queue,out){
  return new Promise(function (resolve,reject){
    try{
      resolve(      (function (){
              let args = queue.pop(); 
              step_fn(args,queue,out);
              if(0 < (queue).length){ 
                return walk_async(step_fn,queue,out);
              }
              else{
                return out;
              }
            })());
    }
    catch(e){
      reject(e);
    }
  });
}

function nested_keys(m){
  return walk(walk_step_fn,[[[],m]]); 
}

function nested_keys_async(m){
  return walk_async(walk_step_fn,[[[],m]],[]);
}

kennytilton · April 10, 2022, 10:20pm

What part of “tree” do you not respect? Resistance is futile!

Josh_Lemer · April 14, 2022, 5:30am

a simple and stack-safe solution

(defn todo [m]
  (let [out (transient [])]
    (loop [path []
           stack (list m)]
      (if (not-empty stack)
        (let [[x & xs] stack]
          (if (empty? x)
            (if (not-empty path)
              (recur (pop path) xs))
            (let [[k v] (first x)
                  kvs (rest x)
                  next-path (conj path k)]
              (if (map? v)
                (do
                  (recur next-path (conj xs kvs v)))
                (do
                  (conj! out [next-path v])
                  (recur path (conj xs kvs)))))))))
    (persistent! out)))

joinr · April 14, 2022, 7:34am

I think (in general) you need to put the out transient into an atom and swap! via conj! to mutate it (and capture the result), or include it in the loop bindings and explicitly pass the result of conj! to maintain a reference to the returned result. Depending on undefined behaviors, if you treat it as a “bash in place” mutable container and apply updates via a side-effecting model without capturing the result of conj! (as in the current do form), you can end up with incorrect data since the original reference may be modified under the hood (the result of conj! - the current state of the transient - may be a different object than the original bound value - which is not reflected in the binding). I had this error affect me years ago; uncertain what the exact failure semantics are, but the docs deter the bash-in-place pattern: “Note in particular that transients are not designed to be bashed in-place. You must capture and use the return value in the next call. In this way, they support the same code structure as the functional persistent code they replace.”

incorrect example reference

teodorlu · April 14, 2022, 8:31am

Ah, so each time you conj! onto a mutable container, you get the “new” container back (but the last one can’t be used), and have to use that?

Thanks for explaining.

Shouldn’t this be possible to fix in @Josh_Lemer’s answer by adding the transient as a loop/recur argument?

joinr · April 14, 2022, 1:22pm

Shouldn’t this be possible to fix in @Josh_Lemer’s answer by adding the transient as a loop/recur argument?

Yes, that was an alternative in the original suggestion.

Using swap! and an atom probably requires slightly less transformation of the original solution though.

Josh_Lemer · April 14, 2022, 2:49pm

Sorry I always forget about that!

(defn todo [m]
  (loop [path []
         stack (list m)
         out (transient [])]
    (if (not-empty stack)
      (let [[x & xs] stack]
        (if (empty? x)
          (if (empty? path)
            (persistent! out)
            (recur (pop path) xs out))
          (let [[[k v] & kvs] x
                next-path (conj path k)]
            (if (map? v)
              (recur next-path (conj xs kvs v) out)
              (recur path (conj xs kvs) (conj! out [next-path v])))))))))

Josh_Lemer · April 14, 2022, 2:55pm

And a stack-safe, lazy implementation using generators from Cloroutine (cloroutine/01-generators.md at master · leonoel/cloroutine · GitHub)!

(defn todo [m]
  (generator
   (loop [path []
          stack (list m)]
     (if (not-empty stack)
       (let [[x & xs] stack]
         (if (empty? x)
           (if (empty? path)
             (recur (pop path) xs))
           (let [[[k v] & kvs] x
                 next-path (conj path k)]
             (if (map? v)
               (recur next-path (conj xs kvs v))
               (do
                 (yield [next-path v])
                 (recur path (conj xs kvs)))))))))))

I really love this, this is a great example of how generators can really come in handy and express an algorithm much more cleanly than in other ways. Why aren’t we all using these already??

joinr · April 14, 2022, 4:52pm

I really like these (originally used them in F# quite a bit to define sequences and even computation expressions (monads), as well as python stuff). Curious what the limitations there are - in the naive implementation for generator, there is at least a consistent dynamic var deref that has to occur (that could likely be optimized out though, I think this is for demonstrative purposes). So naive performance appears impacted, likely only in this implementation:

(defn my-repeat [x]
  (generator
   (loop []
     (yield x)
     (recur))))

cloro.core> (time (dotimes [i 1] (last (take 1000000 (my-repeat :a)))))
"Elapsed time: 681.395 msecs"
nil

;;naive pure lazy seq
(defn repeat-l [x]
  (lazy-seq
   (cons x (repeat-l x))))

cloro.core> (time (dotimes [i 1] (last (take 1000000 (repeat-l :a)))))
"Elapsed time: 156.2908 msecs"

cloro.core> (time (dotimes [i 1] (last (take 1000000 (repeat :a)))))
"Elapsed time: 89.4764 msecs"
nil

I assume one value prospect would be consistent costs with lazy sequences, while allowing imperative definition of the sequence, or even better performance (e.g. defining a generator that returns something iterable/reducible that can provide an efficient baseline for other stuff like transduce). Definitely a cool tool (I need some time to go through implementation though, hah).

lilactown · April 14, 2022, 7:01pm

I decided to take a crack at this.

My first intuition was that this seems like a map over each entry in a map, where if the child is a map it will recurse. So i started naively with

(defn todo
  ([path m]
  (map
   (fn [[key x]]
     (if (map? x)
       (todo (conj path key) x)
       [(conj path key) x]))
   m))
  ([m] (todo [] m)))

this doesn’t quite work how the initial question asked though

user=> (todo {:a 99
              :x {:b 999
                  :y {:z 123
                      :w 456}}})
([[:a] 99] ([[:x :b] 999] ([[:x :y :z] 123] [[:x :y :w] 456])))

Changing to mapcat solved this.

Once I had identified how to do this using core sequence operations, I then used cascade to make it eager and stackless via continuation passing

(require '[cascade.core :as c])

(defn todo-k
  ([k path m]
   (c/mapcat
    k
    (fn [k [key x]]
      (if (map? x)
        (todo-k k (conj path key) x)
        #(k [(conj path key) x])))
    m))
  ([m]
   (trampoline (todo-k identity [] m))))

(todo-k {:a 99
         :x {:b 999
             :y {:z 123
                 :w 456}}})
;; => ([:a] 99 [:x :b] 999 [:x :y :z] 123 [:x :y :w] 456)

lilactown · April 14, 2022, 7:09pm

you could make this async by using a special trampoline that understood how to await an async value and then used some async framework to fork or defer the computation when calling k.

leonoel · April 16, 2022, 7:52pm

I’ve made additional measurements. The results confirm your insight, the performance hit is largely due to the dynamic var access. java.lang.ThreadLocal is much faster.

Josh_Lemer · April 16, 2022, 9:22pm

I also have created an implementation which eliminates almost all overhead of using the coroutines library gen.clj · GitHub

didibus · April 16, 2022, 10:53pm

All these “generator/coroutine” solution people are bringing up, may I ask what’s the point of them? Just so you can implement a recursive solution that doesn’t consume stack?

Josh_Lemer · April 16, 2022, 11:29pm

Do you mean, what is the point of using generators at all? I think the Motivation section of the Python Generators PEP articulates it quite well PEP 255 – Simple Generators | peps.python.org

To paraphrase, the point is that as the complexity of a lazy “data producer-like thing” (iterator/sequence/whatever) increases, it becomes more and more difficult to manage its state by hand. Every time a .next() is called, the producer has to read from manually encoded state, continue processing to the next element, and ensure the state is stored correctly for the next .next() call. This is much more difficult to do in a correct and performant way than it is to use the language’s built in control structures, scoped variables, etc to manage its state.

This pushes developers to make a choice between giving up and producing the data strictly all at once in memory, and doing the much more difficult and error-prone task of manually constructing a lazy data producer.

This is where generators save us because they allow us to lazily produce data, while enjoying the benefits of all the language’s natural control structures. We get to have our cake and eat it too.

didibus · April 17, 2022, 2:48am

No, I mean the point of generators in this specific example? But I guess you kind of answered that as well, it is just to try for a lazy implementation? Are there other benefits though, if I don’t care about it being lazy, does generator enable anything else?

P.S: Really like that PEP, hadn’t look at a Python PEP before, well written, nice format and colors, a table of content navigation, I like the BDFL section at the end, haha, in contrast Java’s JEPs are so bare bone. Would be nice to one day have something similar for Clojure too, even if it was closed to the community, they are a nice way to see the thought process of the creators.

didibus · April 17, 2022, 5:38am

Also thought I give the original question a go.

Eager, and won’t blow the call-stack since I use a vector as my stack instead:

(defn todo
  [m]
  (loop [[k v] (first m) m (next m) path [] stack [] acc []]
    (if k
      (if (map? v)
        (recur (first v) (next v) (conj path k) (conj stack {:kv (first m) :m (next m) :path path}) acc)
        (recur (first m) (next m) path stack (conj acc (conj path k) v)))
      (if (empty? stack)
        acc
        (let [{:keys [kv m path]} (peek stack)]
          (recur kv m path (pop stack) acc))))))

(todo
 {:a 99
  :x {:b 999
      :y {:z 123
          :w 456}}})
;; => [[:a] 99 [:x :b] 999 [:x :y :z] 123 [:x :y :w] 456]

A lazy version, for extra bonus points?

(defn todo
  [m]
  (->>
   (iterate
    (fn [[_ kv m path stack]]
      (loop [[k v] kv m m path path stack stack]
        (if k
          (if (map? v)
            (recur (first v) (next v) (conj path k) (conj stack {:kv (first m) :m (next m) :path path}))
            (list [(conj path k) v] (first m) (next m) path stack))
          (if (empty? stack)
            nil
            (let [{:keys [kv m path]} (peek stack)]
              (recur kv m path (pop stack)))))))
    [nil (first m) (next m) [] []])
   (rest)
   (take-while seq)
   (mapcat first)))

(def t (todo {:a 99
              :b {:c {:d {:e {:f 100
                              :g 200}
                          :d 50}}}}))
(take 4 t)
;; => ([:a] 99 [:b :c :d :e :f] 100)

The async version is quite trivial, just swap loop for core.async go-loop. Or you could put the whole thing in a future and make it async just with future:

(defn todo
  [m]
  (a/go-loop [[k v] (first m) m (next m) path [] stack [] acc []]
    (if k
      (if (map? v)
        (recur (first v) (next v) (conj path k)
               (if-let [fm (first m)]
                 (conj stack {:kv fm :m (next m) :path path})
                 stack)
               acc)
        (recur (first m) (next m) path stack
               (conj acc (conj path k) v)))
      (if (empty? stack)
        acc
        (let [{:keys [kv m path]} (peek stack)]
          (recur kv m path (pop stack) acc))))))

(def t (todo {:a 99
              :b {:c {:d {:e {:f 100
                              :g 200}
                          :d 50}}}}))
(a/poll! t)
;; => [[:a] 99 [:b :c :d :e :f] 100 [:b :c :d :e :g] 200 [:b :c :d :d] 50]

I also investigated what it would take to make a concurrent asynchronous version. The above version in a go-loop will run synchronously and won’t yield the execution thread until it is done, so if running in Clojurescript for example, it will block any other go-block from making progress until it is done fully computing the result. In Clojure, you could just make it a a/thread (loop ...)) or use future or something else that’s OS thread based, but what about Clojurescript?

Well, you could in theory try to use Web Workers or in Node use Worker Threads, that said, it seems you can introduce a simple <! on some channel that resolves almost instantly and that causes the go block to yield to others:

(defn yielder
  "Taking from (yielder) with <! will give go blocks a chance
   to yield themselves to other go blocks."
  []
  (a/go))

(defn todo
  [m]
  (a/go-loop [[k v] (first m) m (next m) path [] stack [] acc []]
    (println "Boy")
    (a/<! (yielder))
    (if k
      (if (map? v)
        (recur (first v) (next v) (conj path k)
               (if-let [fm (first m)]
                 (conj stack {:kv fm :m (next m) :path path})
                 stack)
               acc)
        (recur (first m) (next m) path stack
               (conj acc (conj path k) v)))
      (if (empty? stack)
        acc
        (let [{:keys [kv m path]} (peek stack)]
          (recur kv m path (pop stack) acc))))))

(todo {:a 99
       :b {:c {:d {:e {:f 100
                       :g 200}
                   :d 50}}}})
(a/go-loop [a 10]
  (a/<! (yielder))
  (when (pos? a)
    (println "Ya")
    (recur (dec a))))

You’ll see that in ClojureScript it will print Ya Boy in alternate-ish fashion, where-as without the yielder, it prints all the Boy and then all the Ya, meaning the Go blocks are not concurrent. This also works in Clojure to give the go block a chance to yield, though in Clojure since you have 8 threads normally you might not notice that each go block is holding a thread till completion.

Generally, the rule with go blocks is to do very little between <! and >! operations so that they can yield often and you don’t block the thread pool.

system · October 16, 2022, 5:39pm

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.