I’m currently writing a major technical work, in a field with lots of acronyms. I mean lots. The work is spread across many .lyx files, a text-based markup that transpiles to LaTeX.
So I wrote some small code to catch common mistakes, like my own personalized spell checker, built with Clojure and regexes.
The basic rule is a map with two keys: :regex, which matches the error, and :problem, which describes the issue more clearly.
(def illegals
  (concat
   [{:regex #"(?i)hardon" ; seriously, there are multiple articles on arXiv with this problem: https://arxiv.org/search/?query=large+hardon+collider&searchtype=all&source=header
     :problem "should be hadron"}
    {:regex #"\s\s"
     :problem "double space"}]
   (capitalization-rule "CERN" "FoCal" "SystemC" "ASIC" "FPGA" "CMS" "ATLAS" "LHCb")))
Now, the interesting part here, for me at least, was the regex. I wanted a regex that searches every occurrence of the acronym and matches only when it has a typo. This required a negative lookbehind.
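The original pattern isn't shown in the post, so here is a hedged reconstruction of the idea: match the word case-insensitively, then use a case-sensitive negative lookbehind to reject the one correct spelling.

```clojure
;; Hypothetical reconstruction (not the post's actual regex):
;; (?i:fpga) matches any casing of "FPGA", and the case-sensitive
;; lookbehind (?<!FPGA) then rejects exact matches, so only
;; mis-capitalized occurrences are flagged.
(def fpga-typo #"\b(?i:fpga)\b(?<!FPGA)")

(re-find fpga-typo "an Fpga board") ; => "Fpga"
(re-find fpga-typo "an FPGA board") ; => nil
```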
This is rather pedantic, but some people enjoy learning these things…technically these are all initialisms, not acronyms. Acronyms are only the ones we pronounce like words (e.g. “NASA”, “LASER”, but not “USA” or “FBI”). #funfact
I don’t know how much text you’re checking / if speed is a concern, but you could massively improve the speed through some small tweaks. Right now you’re repeatedly scanning all of the text for each pattern, but you could do one overall pass of the text for potentially problematic words—the case-insensitive union of all of your patterns—and then scan that (much smaller) set for actual problems.
It shouldn’t be that hard to implement. You could start by mashing all of your patterns into a big regex alternation (something like (re-pattern (str "(" (str/join "|" options) ")"))). And, if you wanted to, you could take that a step further by unifying the common prefixes of the regexes. Could be a fun bit of code to write!
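A sketch of that two-pass idea, with hypothetical rule names (not from the original code): one coarse case-insensitive union pass collects candidate matches, and only those candidates are run through the individual rules.

```clojure
(require '[clojure.string :as str])

;; Hypothetical rules in the same {:regex ... :problem ...} shape:
(def rules
  [{:regex #"hardon"   :problem "should be hadron"}
   {:regex #"\bFpga\b" :problem "should be FPGA"}])

;; Coarse pass: one case-insensitive alternation over all patterns.
(def union-re
  (re-pattern (str "(?i)(?:" (str/join "|" (map (comp str :regex) rules)) ")")))

;; Fine pass: test only the (much smaller) candidate set.
(defn scan [text]
  (for [candidate (distinct (re-seq union-re text))
        {:keys [regex problem]} rules
        :when (re-find regex candidate)]
    [candidate problem]))

(scan "the large hardon collider uses an Fpga")
;; => (["hardon" "should be hadron"] ["Fpga" "should be FPGA"])
```

Patterns that carry their own flags or capture groups would need a bit more care when unioned, but for plain literals this keeps the per-rule scans off the full text.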
Emacs has a built-in function, regexp-opt, which does exactly this and could show you the way:
Also, from a style perspective, I would probably skip the higher-arity version of capitalization-rule and just do the mapping inside of illegals. That keeps the focus of capitalization-rule on what it does best, and lets map do what it does best. Simplify and reuse general tools across various and sundry data: The Clojure Way™.
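Concretely, that refactor might look like this (the body of capitalization-rule is my guess, since the original isn't shown; only its single-arity shape matters here):

```clojure
;; Sketch: keep capitalization-rule single-arity and let `map` handle
;; the collection. The rule body is an assumption, not the original.
(defn capitalization-rule [word]
  {:regex   (re-pattern (str "\\b(?i:" word ")\\b(?<!" word ")"))
   :problem (str "should be " word)})

(def illegals
  (concat
   [{:regex #"\s\s" :problem "double space"}]
   (map capitalization-rule
        ["CERN" "FoCal" "SystemC" "ASIC" "FPGA" "CMS" "ATLAS" "LHCb"])))
```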
I decided to take a stab at implementing this since it seemed like a fun little challenge:
(ns camdez.re-opt
  (:require [clojure.walk :as walk]
            [clojure.string :as str]))
;; Convert strings to a graph (nested maps) representing common prefix
;; strings:
;;
;; A -> B -> C
;; -> O -> U -> T
;; -> P -> P -> L -> E
;;
;; Then collapse tails into strings and branches into regex
;; alternations.
(defn- re-literal-opts-str [opts]
  (->> opts
       (reduce (fn [acc s]
                 (assoc-in acc (butlast s) {(last s) nil}))
               {})
       (walk/postwalk (fn [x]
                        (cond
                          (not (map? x))  x
                          (= 1 (count x)) (apply str (first x))
                          :else (str "(?:" (str/join "|" (map (fn [[k v]] (str k v)) x)) ")"))))))
(defn re-literal-opts
  "Builds an optimized regular expression for matching any of the string
  literals in `opts`."
  [opts]
  (re-pattern (re-literal-opts-str opts)))

(defn re-literal-word-opts
  "Builds an optimized regular expression for matching any of the string
  literals in `opts` at word boundaries."
  [opts]
  (re-pattern (str "\\b(?:" (re-literal-opts-str opts) ")\\b")))
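To make the collapse concrete, here is the transformation applied to the strings from the comment block above (the helper is inlined verbatim so the snippet runs standalone):

```clojure
(require '[clojure.walk :as walk]
         '[clojure.string :as str])

;; Inlined copy of re-literal-opts-str so this example is self-contained:
(defn opts-str [opts]
  (->> opts
       (reduce (fn [acc s]
                 (assoc-in acc (butlast s) {(last s) nil}))
               {})
       (walk/postwalk (fn [x]
                        (cond
                          (not (map? x))  x
                          (= 1 (count x)) (apply str (first x))
                          :else (str "(?:" (str/join "|" (map (fn [[k v]] (str k v)) x)) ")"))))))

(opts-str ["ABC" "AOUT" "APPLE"])
;; => "A(?:BC|OUT|PPLE)"
```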
I’m no performance guru, but if you’re using slurp or similar to read the whole file into memory, definitely look into streaming the input (line-seq is an easy approach if you don’t have patterns that span lines).
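For instance, a line-oriented checker (the function name is mine) that never holds more than one line in memory:

```clojure
(require '[clojure.java.io :as io])

;; Sketch: stream a file line by line instead of slurping it whole.
;; Assumes rules shaped like the `illegals` entries above; reports
;; 1-based line numbers. Won't catch patterns that span lines.
(defn check-file [path rules]
  (with-open [rdr (io/reader path)]
    (doall
     (for [[i line] (map-indexed vector (line-seq rdr))
           {:keys [regex problem]} rules
           :when (re-find regex line)]
       {:line (inc i) :problem problem}))))
```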
@magnus0re Just noticed the above implementation of re-literal-opts-str was slightly busted…it was order-dependent and would break on prefixes (“AB” could replace “ABC” entirely). Suggest replacing (assoc-in acc (butlast s) {(last s) nil}) with (assoc-in acc (conj (vec s) nil) nil). Full implementation here (as re-opt):
Thanks for the bugfix!
Found the performance issue. It was in finding the files. I’m running Clojure under WSL2, and the folder searched with file-seq was on the Windows side. For all practical purposes this means the directory was network-mounted on localhost.
Under the hood, file-seq recursively calls isDirectory and listFiles, so in a large tree with many folders it is super slow in my particular configuration.
I swapped to a glob and instantly got a more than 40x speedup.
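The post doesn't say which glob implementation was used; one portable option is java.nio's PathMatcher driven by Files/walk, roughly:

```clojure
(import '[java.nio.file FileSystems Files Paths FileVisitOption])

;; Sketch of a glob-based replacement for file-seq (the actual
;; implementation used in the post isn't shown). Files/walk streams
;; the tree, and the matcher filters paths relative to the root.
(defn glob-files [root pattern]
  (let [root-path (Paths/get root (make-array String 0))
        matcher   (.getPathMatcher (FileSystems/getDefault)
                                   (str "glob:" pattern))]
    (with-open [stream (Files/walk root-path (make-array FileVisitOption 0))]
      (->> (iterator-seq (.iterator stream))
           (filter #(.matches matcher (.relativize root-path %)))
           (mapv str)))))

;; e.g. (glob-files "/path/to/thesis" "**.lyx")
```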