Optimize the AST from Instaparse

After creating an operator-precedence parser using Instapare I have arrived at an AST that has alot of potential for optimization. Below are 3 examples.

Given "((a|b)|(c&d))"
AST is produced

[:or_expr
 [:and_expr
  [:or_expr
   [:or_expr
    [:and_expr
     [:or_expr [:or_expr [:and_expr [:term "a"]]] [:and_expr [:term "b"]]]]]
   [:and_expr [:or_expr [:and_expr [:and_expr [:term "c"]] [:term "d"]]]]]]]

Could be optimized to the following

[:or_expr
 [:or_expr [:term "a"] [:term "b"]]
 [:and_expr [:term "c"] [:term "d"]]]

Given "((a|(b&c))|d)"
AST is produced

[:or_expr
 [:and_expr
  [:or_expr
   [:or_expr
    [:and_expr
     [:or_expr
      [:or_expr [:and_expr [:term "a"]]]
      [:and_expr
       [:or_expr [:and_expr [:and_expr [:term "b"]] [:term "c"]]]]]]]
   [:and_expr [:term "d"]]]]]

Could be optimized to the following

[:or_expr
 [:or_expr
  [:term "a"]
  [:and_expr [:term "b"] [:term "c"]]]
 [:term "d"]]

Given "((a&b)|c)|d)"
AST is produced

[:or_expr
 [:and_expr
  [:or_expr
   [:or_expr
    [:and_expr
     [:or_expr
      [:or_expr
       [:and_expr [:or_expr [:and_expr [:and_expr [:term "a"]] [:term "b"]]]]]
      [:and_expr [:term "c"]]]]]
   [:and_expr [:term "d"]]]]]

Could be optimized to the following

[:or_expr
 [:or_expr
  [:and_expr [:term "a"]] [:term "b"]
  [:term "c"]]
 [:term "d"]]

How would one think when producing an optimization algorithm in for an AST in Clojure and what would be an idiomatic implementation for the optimization above?

1 Like

This sort of structure crops up a lot in automated parsing and the parser itself has no notion of semantics so it can’t do the optimization directly. The code using the parser knows about semantics and can do that optimization.

At work, we use Instaparse to support a business DSL where we get a similar level of nesting and single-argument expressions – and we deal with it in the code that processes the AST, based on the semantics we’ve assigned to various operators: binary (or variadic) operators that are degenerate for single arguments all have their arguments “passed through” which recursively optimizes such trees to the minimum valid expression.

2 Likes

This is my plan also, I will compile this AST to a malli spec, my plan was to first optimize the AST to get rid of lots of nesting, then use insta/transfrom which will then be a 1 to 1 transformation from the AST to malli spec.

Could you give some example how an algorithm would look like to un-nest the following structure. It feels like there should be a simple and elegant way of doing it but the only thing I have working is cumbersome and imperative destructuring and if statements.

I used postwalk:

(require '[clojure.walk :as w])
(defn simplify [coll]
  (if (coll? coll)
         (case (first coll)
           :term coll
           (:and_expr :or_expr)
             (if (= (count coll) 2)
               (second coll)
               coll))
         coll))

(def expr
  [:or_expr
   [:and_expr
    [:or_expr
     [:or_expr
      [:and_expr
       [:or_expr [:or_expr [:and_expr [:term "a"]]] [:and_expr [:term "b"]]]]]
     [:and_expr [:or_expr [:and_expr [:and_expr [:term "c"]] [:term "d"]]]]]]])

user> (w/postwalk simplify expr)
[:or_expr
 [:or_expr [:term "a"] [:term "b"]]
 [:and_expr [:term "c"] [:term "d"]]]

I imagine you can get more exotic transforms if you leverage meander and rewrite rules, although in this case maybe the simple single-term and/or collapse is sufficient.

(require '[meander.epsilon :as m])
(require '[meander.strategy.epsilon :as m*])

(def simplify-conjunctions
  (m*/rewrite
   [:and_expr ?x] ?x
   [:or_expr ?x] ?x))

(def simplify-all-conjs
  (m*/bottom-up
   (m*/attempt simplify-conjunctions)))

user> (simplify-all-conjs expr)
[:or_expr
 [:or_expr [:term "a"] [:term "b"]]
 [:and_expr [:term "c"] [:term "d"]]]

for completeness, less-magic naive recursive solution:

(defn simplify-rec [coll]
  (if (coll? coll)
    (case (first coll)
      :term coll
      (:and_expr :or_expr)
      (if (= (count coll) 2)
        (simplify-rec (second coll))
        ;`[~(first coll) ~@(mapv simplify-rec (rest coll))] ;;equivalent..
        (into [(first coll)] (mapv simplify-rec (rest coll)))))
    coll))
4 Likes

@joinr I do like the postwalk example the best, its a nice tradeoff between magic. To understand what you where doing I had to draw out the tree on paper and then trim it. After doing that I wanted to extend my grammar to include negation, both for terms and subexpressions so I ended up rewriting the parser for the 10000 time.
However after drawing the resulting tree I see no clear way to actually trim this tree.
Am I missing something or how would a solution for the new tree look?

This is my current parser:

((insta/parser
  "
S               = exp
<exp>       = or
or              = and {<'|'> and}
and           = not {<'&'> not}
not            = [<'~'>] primary | [<'~'>] <'('> exp <')'>
<primary> = term | <'('> exp <')'>
term          = #'[a-zA-Z][a-zA-Z0-9]*'
") "(~a|(~(b|c)&~d))")

I solved it, not sure there is a better way but by changing the following line in the parser:

not = ['~'] primary | ['~'] <'('> exp <')'>

I got enough information to update simplify to:

(defn simplify [coll]
  (if (coll? coll)
    (case (first coll)
      :term coll
      :S coll
      :not
      (if (= (count coll) 3)
        [(first coll) (last coll)]
        (second coll))
      (:and :or)
      (if (= (count coll) 2)
        (second coll)
        coll))
    coll))

Another option I avoided (for simplicity) would be to use multimethods to define simplification rules for ast nodes. That way you could expand alongside the grammar. However, if it’s limited then just maintaining single fn is sufficient.

Regarding instaparse…I was doing some stuff with it over the years, primarily parsing some older lua variant. It’s extremely slow even with unambiguous grammar. I stumbled across clj-antlr last night and migrated to it for my use case. Easily 100x faster out of the box.

1 Like