An acronym capitalization checker with regex

I’m currently writing a major technical work, and the field is full of acronyms. I mean lots. The work is spread across many .lyx files; LyX is a markup format that transpiles to LaTeX but is still text based.

So I wrote a small program to catch common mistakes, like my own personalized spell checker, but with Clojure and regexes.

The basic clause is a map with a :regex key that matches the error, and a :problem key that describes the issue more clearly.

(def illegals
  (concat
   [{:regex #"(?i)hardon" ; seriously there are multiple articles on arXiv with this problem
     :problem "should be hadron"}
    {:regex #"\s\s"
     :problem "double space"}]
   (capitalization-rule "CERN" "FoCal" "SystemC" "ASIC" "FPGA" "CMS" "ATLAS" "LHCb")))

Now, the interesting part here, for me at least, was the regex. I wanted a regex that finds every occurrence of the acronym but matches only if it has a typo. This required negative lookbehind.

The basic expression is #"((?i)propercase(?-i)(?<!ProperCase))", where regex101 helped a lot. I embedded it in a function:

(defn capitalization-regex [proper-case]
  (re-pattern (str "((?i)"
                   (str/lower-case proper-case)
                   "(?-i)\\b(?<!" proper-case "))")))

(defn capitalization-rule
  ([item]
   {:regex (capitalization-regex item)
    :problem (str "wrong capitalizations for " item)})
  ([item & items]
   (cons (capitalization-rule item)
         (map capitalization-rule items))))

There is also some false-positive filtering, for example for URLs, but the rest is very basic Clojure code.
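The URL filtering isn’t shown here, but a minimal sketch could look like this (`url-line?` and `real-hits` are hypothetical names, not the actual code):

```clojure
;; Hypothetical false-positive filter: skip lines containing a URL,
;; since acronyms inside links are often legitimately lower-case.
(defn url-line? [line]
  (boolean (re-find #"https?://\S+" line)))

(defn real-hits
  "Matches of a rule's :regex, ignoring lines that look like URLs."
  [rule lines]
  (->> lines
       (remove url-line?)
       (filter #(re-find (:regex rule) %))))
```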

I had some fun making that regex, and I thought I would share it.


Very cool! Nice and clean!

This is rather pedantic, but some people enjoy learning these things… technically these are all initialisms, not acronyms. Acronyms are only the ones we pronounce like words (e.g. “NASA”, “LASER”, but not “USA” or “FBI”). #funfact

I don’t know how much text you’re checking / if speed is a concern, but you could massively improve the speed through some small tweaks. Right now you’re repeatedly scanning all of the text for each pattern, but you could do one overall pass of the text for potentially problematic words—the case-insensitive union of all of your patterns—and then scan that (much smaller) set for actual problems.

It shouldn’t be that hard to implement. You could start by mashing all of your patterns into a big regex alternation (something like (re-pattern (str "(" (str/join "|" options) ")"))). And, if you wanted to, you could take that a step further by unifying the common prefixes of the regexes. Could be a fun bit of code to write!
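That first pass might look roughly like this (a sketch; `union-pattern` and `candidate-lines` are made-up names):

```clojure
(require '[clojure.string :as str])

(defn union-pattern
  "One big case-insensitive alternation over all the acronyms."
  [words]
  (re-pattern (str "(?i)(" (str/join "|" words) ")")))

(defn candidate-lines
  "Cheap first pass: keep only lines that mention any acronym at all.
  Only these survivors need the full per-rule scan."
  [pattern lines]
  (filter #(re-find pattern %) lines))

(candidate-lines (union-pattern ["CERN" "FPGA" "ASIC"])
                 ["the fpga firmware" "nothing to see here" "at CERN"])
;; => ("the fpga firmware" "at CERN")
```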

Emacs has a built-in function that does this and could show you the way:

(regexp-opt '("APPLE" "ABC" "ABOUT"))
;=> "\\(?:A\\(?:B\\(?:C\\|OUT\\)\\|PPLE\\)\\)"

(regexp-opt '("CERN" "FoCal" "SystemC" "ASIC" "FPGA" "CMS" "ATLAS" "LHCb"))
;=> "\\(?:A\\(?:SIC\\|TLAS\\)\\|C\\(?:ERN\\|MS\\)\\|F\\(?:PGA\\|oCal\\)\\|LHCb\\|SystemC\\)"

(FWIW, I’d unify the case of these string literals before building the optimized regex in this case.)

Anyway, just spitballing. Nice work!

Also, from a style perspective, I would probably skip the higher-arity version of capitalization-rule and just do the mapping inside of illegals. This keeps the focus of capitalization-rule on what it does best, and lets map do what it does best. Simplify + reuse general tools across various and sundry data—The Clojure Way™.
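For instance, something like this (a sketch reusing the regex shape from the post; the rule list is shortened for illustration):

```clojure
(require '[clojure.string :as str])

;; Single-arity rule builder; the collection handling moves to the
;; call site, where plain map does the work.
(defn capitalization-rule [item]
  {:regex (re-pattern (str "((?i)" (str/lower-case item)
                           "(?-i)\\b(?<!" item "))"))
   :problem (str "wrong capitalizations for " item)})

(def illegals
  (into [{:regex #"\s\s" :problem "double space"}]
        (map capitalization-rule ["CERN" "FoCal" "FPGA"])))
```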

I decided to take a stab at implementing this since it seemed like a fun little challenge:

(ns re-opt
  (:require [clojure.walk :as walk]
            [clojure.string :as str]))

;; Convert strings to a graph (nested maps) representing common prefix
;; strings:
;; A -> B -> C
;;   -> O -> U -> T
;;   -> P -> P -> L -> E
;; Then collapse tails into strings and branches into regex
;; alternations.
(defn- re-literal-opts-str [opts]
  (->> opts
       (reduce (fn [acc s]
                 (assoc-in acc (butlast s) {(last s) nil}))
               {})
       (walk/postwalk (fn [x]
                        (cond
                          (not (map? x))  x
                          (= 1 (count x)) (apply str (first x))
                          :else           (str "(?:" (str/join "|" (map (fn [[k v]] (str k v)) x)) ")"))))))

(defn re-literal-opts
  "Builds an optimized regular expression for matching any of the string
  literals in `opts`."
  [opts]
  (re-pattern (re-literal-opts-str opts)))

(defn re-literal-word-opts
  "Builds an optimized regular expression for matching any of the string
  literals in `opts` at word boundaries."
  [opts]
  (re-pattern (str "\\b(?:" (re-literal-opts-str opts) ")\\b")))

;;; Examples

(def sample-opts ["ABC" "ABOUT" "APPLE" "BOTTLE"])

(re-literal-opts-str sample-opts)
;; => "(?:A(?:B(?:C|OUT)|PPLE)|BOTTLE)"

(def p1 (re-literal-opts sample-opts))

(def p2 (re-literal-word-opts sample-opts))

;; example input (the original input string was not shown)
(re-seq p2 "Forget ABOUT the APPLEs in the BOTTLE")
;; => ("ABOUT" "BOTTLE")

Pretty happy with how it turned out. Only about 20 minutes of work given the magic of Clojure. :mage:


Very cool. I’m stealing this code :wink:

Looks like a 5x improvement in speed for the processing step! Of course this depends on the input source, etc.

I’m having performance issues reading the files though :frowning: so it has minimal impact on the overall processing time.


I’m no performance guru, but if you’re using slurp or similar to read the whole file into memory, definitely look into streaming the input (line-seq is an easy approach if you don’t have patterns that span lines).
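A minimal line-at-a-time sketch (`check-lines` is a hypothetical helper, and the file path is made up):

```clojure
(require '[clojure.java.io :as io])

(defn check-lines
  "Returns [line-number problem] pairs for every rule hit, reading
  the input one line at a time instead of slurping it whole."
  [rules rdr]
  (vec (for [[i line] (map-indexed vector (line-seq rdr))
             rule rules
             :when (re-find (:regex rule) line)]
         [(inc i) (:problem rule)])))

;; (with-open [rdr (io/reader "thesis/chapter1.lyx")]
;;   (check-lines illegals rdr))
```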

@magnus0re Just noticed the above implementation of re-literal-opts-str was slightly busted… it was order-dependent and would break on prefixes (“AB” could replace “ABC” entirely). Suggest replacing (assoc-in acc (butlast s) {(last s) nil}) with (assoc-in acc (conj (vec s) nil) nil). Full implementation here (as re-opt):

Thanks for the bugfix!
Found the performance issue. It was in finding the files. I’m running Clojure under WSL2, and the folder searched with file-seq was under Windows. For all practical purposes this means the directory was networked on localhost.
Under the hood, file-seq recursively calls isDirectory and listFiles, so in a large folder with many subfolders it is super slow in my particular configuration.
I swapped to glob and instantly got more than a 40x performance improvement.

Woah! Nice work! I would not have expected file-seq to be the cause! TIL.

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.