An acronym capitalization checker with regex

camdez · August 8, 2021, 4:36pm

Very cool! Nice and clean!

This is rather pedantic, but some people enjoy learning these things: technically, these are all initialisms, not acronyms. Acronyms are only the ones we pronounce as words (e.g. “NASA”, “LASER”, but not “USA” or “FBI”). #funfact

I don’t know how much text you’re checking or whether speed is a concern, but you could massively improve the speed with a few small tweaks. Right now you’re scanning the whole text once per pattern. Instead, you could make a single pass over the text to collect the potentially problematic words (anything matching the case-insensitive union of all of your patterns) and then check only that much smaller set for actual problems.

It shouldn’t be that hard to implement. You could start by mashing all of your patterns into one big regex alternation (something like (re-pattern (str "(" (str/join "|" options) ")"))). And, if you wanted to, you could take that a step further and optimize it by unifying the common prefixes of the regexes. Could be a fun bit of code to write!
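To make the two-pass idea concrete, here is a minimal sketch. It assumes the terms are plain literal strings (not regexes), and the term list and all names (canonical, by-lower, candidate-re, miscapitalized) are made up for illustration; they are not from the original checker.

```clojure
(require '[clojure.string :as str])
(import 'java.util.regex.Pattern)

;; Canonical spellings: a hypothetical sample list.
(def canonical ["CERN" "FoCal" "SystemC" "ASIC" "FPGA" "CMS" "ATLAS" "LHCb"])

;; Lookup from the lower-cased form to the canonical spelling.
(def by-lower
  (into {} (map (juxt str/lower-case identity)) canonical))

;; Pass 1 pattern: one case-insensitive alternation over every term,
;; bounded by \b so we only match whole words.
;; Pattern/quote escapes any regex metacharacters in a term.
(def candidate-re
  (re-pattern
   (str "(?i)\\b(?:"
        (str/join "|" (map #(Pattern/quote %) canonical))
        ")\\b")))

(defn miscapitalized
  "Returns [found expected] pairs for every term written with the wrong case."
  [text]
  (for [found (re-seq candidate-re text)                       ; pass 1: one scan
        :let [expected (by-lower (str/lower-case found))]      ; pass 2: map lookup
        :when (not= found expected)]
    [found expected]))

(miscapitalized "The Atlas and cms teams at CERN use an asic.")
```

The text is scanned once no matter how many terms there are, and the second step is just a map lookup per candidate.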

Emacs has a built-in function, regexp-opt, that does this and could show you the way:

(regexp-opt '("APPLE" "ABC" "ABOUT"))
;=> "\\(?:A\\(?:B\\(?:C\\|OUT\\)\\|PPLE\\)\\)"

(regexp-opt '("CERN" "FoCal" "SystemC" "ASIC" "FPGA" "CMS" "ATLAS" "LHCb"))
;=> "\\(?:A\\(?:SIC\\|TLAS\\)\\|C\\(?:ERN\\|MS\\)\\|F\\(?:PGA\\|oCal\\)\\|LHCb\\|SystemC\\)"

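(FWIW, since the first pass is case-insensitive anyway, I’d unify the case of these string literals before building the optimized regex, so that e.g. “FPGA” and “FoCal” can share more than their first letter.)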
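For the prefix-unifying step, one possible approach in Clojure is to build a trie of the words and render it back out as a regex. This is only a sketch of the idea, not a port of how Emacs implements regexp-opt; the names regex-opt, trie, emit and esc are hypothetical, and unlike regexp-opt it does not wrap the result in an outer group.

```clojure
(require '[clojure.string :as str])

(defn- trie
  "Builds a nested map keyed by character; ::end marks the end of a word."
  [words]
  (reduce (fn [t w] (assoc-in t (conj (vec w) ::end) true)) {} words))

(defn- esc
  "Backslash-escapes anything that is not a letter or digit."
  [ch]
  (if (Character/isLetterOrDigit (char ch))
    (str ch)
    (str "\\" ch)))

(defn- emit
  "Renders a trie node as a regex fragment."
  [node]
  (let [end?     (contains? node ::end)
        branches (for [[ch sub] (sort-by key (dissoc node ::end))]
                   (str (esc ch) (emit sub)))]
    (cond
      ;; Leaf: nothing more to match.
      (empty? branches) ""
      ;; Single continuation and no word ends here: plain concatenation.
      (and (= 1 (count branches)) (not end?)) (first branches)
      ;; Otherwise an alternation, optional if a word also ends here.
      :else (str "(?:" (str/join "|" branches) ")" (when end? "?")))))

(defn regex-opt
  "Returns a regex string matching exactly the given literal words,
  with shared prefixes factored out."
  [words]
  (emit (trie words)))

(regex-opt ["APPLE" "ABC" "ABOUT"])
;=> "A(?:B(?:C|OUT)|PPLE)"

;; Unify the case first, then let (?i) handle the matching.
(def candidate-re
  (re-pattern
   (str "(?i)\\b(?:"
        (regex-opt (map str/upper-case
                        ["CERN" "FoCal" "SystemC" "ASIC" "FPGA" "CMS" "ATLAS" "LHCb"]))
        ")\\b")))
```

Branches are sorted by character, so the output is deterministic for a given word list.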

Anyway, just spitballing. Nice work!