Optimize the AST from Instaparse

Henrik_Larsson · February 21, 2024, 8:34pm

After creating an operator-precedence parser using Instapare I have arrived at an AST that has alot of potential for optimization. Below are 3 examples.

Given "((a|b)|(c&d))"
AST is produced

[:or_expr
 [:and_expr
  [:or_expr
   [:or_expr
    [:and_expr
     [:or_expr [:or_expr [:and_expr [:term "a"]]] [:and_expr [:term "b"]]]]]
   [:and_expr [:or_expr [:and_expr [:and_expr [:term "c"]] [:term "d"]]]]]]]

Could be optimized to the following

[:or_expr
 [:or_expr [:term "a"] [:term "b"]]
 [:and_expr [:term "c"] [:term "d"]]]

Given "((a|(b&c))|d)"
AST is produced

[:or_expr
 [:and_expr
  [:or_expr
   [:or_expr
    [:and_expr
     [:or_expr
      [:or_expr [:and_expr [:term "a"]]]
      [:and_expr
       [:or_expr [:and_expr [:and_expr [:term "b"]] [:term "c"]]]]]]]
   [:and_expr [:term "d"]]]]]

Could be optimized to the following

[:or_expr
 [:or_expr
  [:term "a"]
  [:and_expr [:term "b"] [:term "c"]]]
 [:term "d"]]

Given "((a&b)|c)|d)"
AST is produced

[:or_expr
 [:and_expr
  [:or_expr
   [:or_expr
    [:and_expr
     [:or_expr
      [:or_expr
       [:and_expr [:or_expr [:and_expr [:and_expr [:term "a"]] [:term "b"]]]]]
      [:and_expr [:term "c"]]]]]
   [:and_expr [:term "d"]]]]]

Could be optimized to the following

[:or_expr
 [:or_expr
  [:and_expr [:term "a"]] [:term "b"]
  [:term "c"]]
 [:term "d"]]

How would one think when producing an optimization algorithm in for an AST in Clojure and what would be an idiomatic implementation for the optimization above?

seancorfield · February 21, 2024, 8:44pm

This sort of structure crops up a lot in automated parsing and the parser itself has no notion of semantics so it can’t do the optimization directly. The code using the parser knows about semantics and can do that optimization.

At work, we use Instaparse to support a business DSL where we get a similar level of nesting and single-argument expressions – and we deal with it in the code that processes the AST, based on the semantics we’ve assigned to various operators: binary (or variadic) operators that are degenerate for single arguments all have their arguments “passed through” which recursively optimizes such trees to the minimum valid expression.

Henrik_Larsson · February 21, 2024, 9:06pm

This is my plan also, I will compile this AST to a malli spec, my plan was to first optimize the AST to get rid of lots of nesting, then use insta/transfrom which will then be a 1 to 1 transformation from the AST to malli spec.

Could you give some example how an algorithm would look like to un-nest the following structure. It feels like there should be a simple and elegant way of doing it but the only thing I have working is cumbersome and imperative destructuring and if statements.

joinr · February 21, 2024, 9:28pm

I used postwalk:

(require '[clojure.walk :as w])
(defn simplify [coll]
  (if (coll? coll)
         (case (first coll)
           :term coll
           (:and_expr :or_expr)
             (if (= (count coll) 2)
               (second coll)
               coll))
         coll))

(def expr
  [:or_expr
   [:and_expr
    [:or_expr
     [:or_expr
      [:and_expr
       [:or_expr [:or_expr [:and_expr [:term "a"]]] [:and_expr [:term "b"]]]]]
     [:and_expr [:or_expr [:and_expr [:and_expr [:term "c"]] [:term "d"]]]]]]])

user> (w/postwalk simplify expr)
[:or_expr
 [:or_expr [:term "a"] [:term "b"]]
 [:and_expr [:term "c"] [:term "d"]]]

I imagine you can get more exotic transforms if you leverage meander and rewrite rules, although in this case maybe the simple single-term and/or collapse is sufficient.

(require '[meander.epsilon :as m])
(require '[meander.strategy.epsilon :as m*])

(def simplify-conjunctions
  (m*/rewrite
   [:and_expr ?x] ?x
   [:or_expr ?x] ?x))

(def simplify-all-conjs
  (m*/bottom-up
   (m*/attempt simplify-conjunctions)))

user> (simplify-all-conjs expr)
[:or_expr
 [:or_expr [:term "a"] [:term "b"]]
 [:and_expr [:term "c"] [:term "d"]]]

for completeness, less-magic naive recursive solution:

(defn simplify-rec [coll]
  (if (coll? coll)
    (case (first coll)
      :term coll
      (:and_expr :or_expr)
      (if (= (count coll) 2)
        (simplify-rec (second coll))
        ;`[~(first coll) ~@(mapv simplify-rec (rest coll))] ;;equivalent..
        (into [(first coll)] (mapv simplify-rec (rest coll)))))
    coll))

Henrik_Larsson · February 22, 2024, 3:38pm

@joinr I do like the postwalk example the best, its a nice tradeoff between magic. To understand what you where doing I had to draw out the tree on paper and then trim it. After doing that I wanted to extend my grammar to include negation, both for terms and subexpressions so I ended up rewriting the parser for the 10000 time.
However after drawing the resulting tree I see no clear way to actually trim this tree.
Am I missing something or how would a solution for the new tree look?

This is my current parser:

((insta/parser
  "
S               = exp
<exp>       = or
or              = and {<'|'> and}
and           = not {<'&'> not}
not            = [<'~'>] primary | [<'~'>] <'('> exp <')'>
<primary> = term | <'('> exp <')'>
term          = #'[a-zA-Z][a-zA-Z0-9]*'
") "(~a|(~(b|c)&~d))")

Henrik_Larsson · February 22, 2024, 6:39pm

I solved it, not sure there is a better way but by changing the following line in the parser:

not = ['~'] primary | ['~'] <'('> exp <')'>

I got enough information to update simplify to:

(defn simplify [coll]
  (if (coll? coll)
    (case (first coll)
      :term coll
      :S coll
      :not
      (if (= (count coll) 3)
        [(first coll) (last coll)]
        (second coll))
      (:and :or)
      (if (= (count coll) 2)
        (second coll)
        coll))
    coll))

joinr · February 22, 2024, 10:09pm

Another option I avoided (for simplicity) would be to use multimethods to define simplification rules for ast nodes. That way you could expand alongside the grammar. However, if it’s limited then just maintaining single fn is sufficient.

Regarding instaparse…I was doing some stuff with it over the years, primarily parsing some older lua variant. It’s extremely slow even with unambiguous grammar. I stumbled across clj-antlr last night and migrated to it for my use case. Easily 100x faster out of the box.

system · August 23, 2024, 10:09am

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.