How to transform nested map into flat sequence of [path, value] pairs

Josh_Lemer · April 14, 2022, 2:55pm

And a stack-safe, lazy implementation using generators from Cloroutine (cloroutine/01-generators.md at master · leonoel/cloroutine · GitHub)!

(defn todo [m]
  (generator
   (loop [path []
          stack (list m)]
     (if (not-empty stack)
       (let [[x & xs] stack]
         (if (empty? x)
           (if (empty? path)
             (recur (pop path) xs))
           (let [[[k v] & kvs] x
                 next-path (conj path k)]
             (if (map? v)
               (recur next-path (conj xs kvs v))
               (do
                 (yield [next-path v])
                 (recur path (conj xs kvs)))))))))))

I really love this, this is a great example of how generators can really come in handy and express an algorithm much more cleanly than in other ways. Why aren’t we all using these already??

joinr · April 14, 2022, 4:52pm

I really like these (originally used them in F# quite a bit to define sequences and even computation expressions (monads), as well as python stuff). Curious what the limitations there are - in the naive implementation for generator, there is at least a consistent dynamic var deref that has to occur (that could likely be optimized out though, I think this is for demonstrative purposes). So naive performance appears impacted, likely only in this implementation:

(defn my-repeat [x]
  (generator
   (loop []
     (yield x)
     (recur))))

cloro.core> (time (dotimes [i 1] (last (take 1000000 (my-repeat :a)))))
"Elapsed time: 681.395 msecs"
nil

;;naive pure lazy seq
(defn repeat-l [x]
  (lazy-seq
   (cons x (repeat-l x))))

cloro.core> (time (dotimes [i 1] (last (take 1000000 (repeat-l :a)))))
"Elapsed time: 156.2908 msecs"

cloro.core> (time (dotimes [i 1] (last (take 1000000 (repeat :a)))))
"Elapsed time: 89.4764 msecs"
nil

I assume one value prospect would be consistent costs with lazy sequences, while allowing imperative definition of the sequence, or even better performance (e.g. defining a generator that returns something iterable/reducible that can provide an efficient baseline for other stuff like transduce). Definitely a cool tool (I need some time to go through implementation though, hah).

lilactown · April 14, 2022, 7:01pm

I decided to take a crack at this.

My first intuition was that this seems like a map over each entry in a map, where if the child is a map it will recurse. So i started naively with

(defn todo
  ([path m]
  (map
   (fn [[key x]]
     (if (map? x)
       (todo (conj path key) x)
       [(conj path key) x]))
   m))
  ([m] (todo [] m)))

this doesn’t quite work how the initial question asked though

user=> (todo {:a 99
              :x {:b 999
                  :y {:z 123
                      :w 456}}})
([[:a] 99] ([[:x :b] 999] ([[:x :y :z] 123] [[:x :y :w] 456])))

Changing to mapcat solved this.

Once I had identified how to do this using core sequence operations, I then used cascade to make it eager and stackless via continuation passing

(require '[cascade.core :as c])

(defn todo-k
  ([k path m]
   (c/mapcat
    k
    (fn [k [key x]]
      (if (map? x)
        (todo-k k (conj path key) x)
        #(k [(conj path key) x])))
    m))
  ([m]
   (trampoline (todo-k identity [] m))))

(todo-k {:a 99
         :x {:b 999
             :y {:z 123
                 :w 456}}})
;; => ([:a] 99 [:x :b] 999 [:x :y :z] 123 [:x :y :w] 456)

lilactown · April 14, 2022, 7:09pm

you could make this async by using a special trampoline that understood how to await an async value and then used some async framework to fork or defer the computation when calling k.

leonoel · April 16, 2022, 7:52pm

I’ve made additional measurements. The results confirm your insight, the performance hit is largely due to the dynamic var access. java.lang.ThreadLocal is much faster.

Josh_Lemer · April 16, 2022, 9:22pm

I also have created an implementation which eliminates almost all overhead of using the coroutines library gen.clj · GitHub

didibus · April 16, 2022, 10:53pm

All these “generator/coroutine” solution people are bringing up, may I ask what’s the point of them? Just so you can implement a recursive solution that doesn’t consume stack?

Josh_Lemer · April 16, 2022, 11:29pm

Do you mean, what is the point of using generators at all? I think the Motivation section of the Python Generators PEP articulates it quite well PEP 255 – Simple Generators | peps.python.org

To paraphrase, the point is that as the complexity of a lazy “data producer-like thing” (iterator/sequence/whatever) increases, it becomes more and more difficult to manage its state by hand. Every time a .next() is called, the producer has to read from manually encoded state, continue processing to the next element, and ensure the state is stored correctly for the next .next() call. This is much more difficult to do in a correct and performant way than it is to use the language’s built in control structures, scoped variables, etc to manage its state.

This pushes developers to make a choice between giving up and producing the data strictly all at once in memory, and doing the much more difficult and error-prone task of manually constructing a lazy data producer.

This is where generators save us because they allow us to lazily produce data, while enjoying the benefits of all the language’s natural control structures. We get to have our cake and eat it too.

didibus · April 17, 2022, 2:48am

No, I mean the point of generators in this specific example? But I guess you kind of answered that as well, it is just to try for a lazy implementation? Are there other benefits though, if I don’t care about it being lazy, does generator enable anything else?

P.S: Really like that PEP, hadn’t look at a Python PEP before, well written, nice format and colors, a table of content navigation, I like the BDFL section at the end, haha, in contrast Java’s JEPs are so bare bone. Would be nice to one day have something similar for Clojure too, even if it was closed to the community, they are a nice way to see the thought process of the creators.

didibus · April 17, 2022, 5:38am

Also thought I give the original question a go.

Eager, and won’t blow the call-stack since I use a vector as my stack instead:

(defn todo
  [m]
  (loop [[k v] (first m) m (next m) path [] stack [] acc []]
    (if k
      (if (map? v)
        (recur (first v) (next v) (conj path k) (conj stack {:kv (first m) :m (next m) :path path}) acc)
        (recur (first m) (next m) path stack (conj acc (conj path k) v)))
      (if (empty? stack)
        acc
        (let [{:keys [kv m path]} (peek stack)]
          (recur kv m path (pop stack) acc))))))

(todo
 {:a 99
  :x {:b 999
      :y {:z 123
          :w 456}}})
;; => [[:a] 99 [:x :b] 999 [:x :y :z] 123 [:x :y :w] 456]

A lazy version, for extra bonus points?

(defn todo
  [m]
  (->>
   (iterate
    (fn [[_ kv m path stack]]
      (loop [[k v] kv m m path path stack stack]
        (if k
          (if (map? v)
            (recur (first v) (next v) (conj path k) (conj stack {:kv (first m) :m (next m) :path path}))
            (list [(conj path k) v] (first m) (next m) path stack))
          (if (empty? stack)
            nil
            (let [{:keys [kv m path]} (peek stack)]
              (recur kv m path (pop stack)))))))
    [nil (first m) (next m) [] []])
   (rest)
   (take-while seq)
   (mapcat first)))

(def t (todo {:a 99
              :b {:c {:d {:e {:f 100
                              :g 200}
                          :d 50}}}}))
(take 4 t)
;; => ([:a] 99 [:b :c :d :e :f] 100)

The async version is quite trivial, just swap loop for core.async go-loop. Or you could put the whole thing in a future and make it async just with future:

(defn todo
  [m]
  (a/go-loop [[k v] (first m) m (next m) path [] stack [] acc []]
    (if k
      (if (map? v)
        (recur (first v) (next v) (conj path k)
               (if-let [fm (first m)]
                 (conj stack {:kv fm :m (next m) :path path})
                 stack)
               acc)
        (recur (first m) (next m) path stack
               (conj acc (conj path k) v)))
      (if (empty? stack)
        acc
        (let [{:keys [kv m path]} (peek stack)]
          (recur kv m path (pop stack) acc))))))

(def t (todo {:a 99
              :b {:c {:d {:e {:f 100
                              :g 200}
                          :d 50}}}}))
(a/poll! t)
;; => [[:a] 99 [:b :c :d :e :f] 100 [:b :c :d :e :g] 200 [:b :c :d :d] 50]

I also investigated what it would take to make a concurrent asynchronous version. The above version in a go-loop will run synchronously and won’t yield the execution thread until it is done, so if running in Clojurescript for example, it will block any other go-block from making progress until it is done fully computing the result. In Clojure, you could just make it a a/thread (loop ...)) or use future or something else that’s OS thread based, but what about Clojurescript?

Well, you could in theory try to use Web Workers or in Node use Worker Threads, that said, it seems you can introduce a simple <! on some channel that resolves almost instantly and that causes the go block to yield to others:

(defn yielder
  "Taking from (yielder) with <! will give go blocks a chance
   to yield themselves to other go blocks."
  []
  (a/go))

(defn todo
  [m]
  (a/go-loop [[k v] (first m) m (next m) path [] stack [] acc []]
    (println "Boy")
    (a/<! (yielder))
    (if k
      (if (map? v)
        (recur (first v) (next v) (conj path k)
               (if-let [fm (first m)]
                 (conj stack {:kv fm :m (next m) :path path})
                 stack)
               acc)
        (recur (first m) (next m) path stack
               (conj acc (conj path k) v)))
      (if (empty? stack)
        acc
        (let [{:keys [kv m path]} (peek stack)]
          (recur kv m path (pop stack) acc))))))

(todo {:a 99
       :b {:c {:d {:e {:f 100
                       :g 200}
                   :d 50}}}})
(a/go-loop [a 10]
  (a/<! (yielder))
  (when (pos? a)
    (println "Ya")
    (recur (dec a))))

You’ll see that in ClojureScript it will print Ya Boy in alternate-ish fashion, where-as without the yielder, it prints all the Boy and then all the Ya, meaning the Go blocks are not concurrent. This also works in Clojure to give the go block a chance to yield, though in Clojure since you have 8 threads normally you might not notice that each go block is holding a thread till completion.

Generally, the rule with go blocks is to do very little between <! and >! operations so that they can yield often and you don’t block the thread pool.

system · October 16, 2022, 5:39pm

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.