Normalizing non-ASCII7 characters

Hi all,
I sometimes need to “normalize” non-ASCII7 characters: given an input string like “La vita è bella”, I want “La vita e bella”, which is valid ASCII7 but retains as much information as possible (“La vita - bella” sucks). I usually do it manually with a set of replaces, but I’m sure somebody has already thought about this. Do you know of any library that does it?

EDIT: I came across this: where somebody did the heavy lifting at the JVM level, but it still requires a bit of work (and it’s not cross-platform).

You should have a look at transliteration. It’s the process of mapping text from one script or character set to another, for example non-ASCII characters to their closest ASCII equivalents.

Like this lib:

Give StringUtils.stripAccents a try.

$ clj -Sdeps '{:deps {org.apache.commons/commons-lang3 {:mvn/version "3.7"}}}'
Clojure 1.9.0
user=> (import 'org.apache.commons.lang3.StringUtils)
user=> (StringUtils/stripAccents "źdźbło")
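
If you would rather not pull in a dependency, a similar effect can probably be had with the JDK's own java.text.Normalizer: decompose the string to NFD, then drop the combining marks. This is only a minimal sketch in plain Java (callable from Clojure through interop). The class name AsciiFold is made up for illustration. Letters that have no canonical decomposition, such as ł, pass through unchanged and would still need a manual replacement.

```java
import java.text.Normalizer;

public class AsciiFold {
    // Decompose to NFD so an accented letter becomes base letter + combining mark,
    // then remove every combining mark (Unicode category M).
    public static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // U+00E8 is "e with grave accent"; written as an escape to keep the source ASCII-only.
        System.out.println(stripAccents("La vita \u00e8 bella")); // prints "La vita e bella"
    }
}
```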

Lucene’s character folding is pretty sophisticated, but I’ve never tried to use this functionality outside of Lucene.
