Normalizing non-ASCII7 characters

l3nz · July 14, 2018, 6:50am

Hi all,
I sometimes need to “normalize” non-ASCII characters: given an input string like “La vita è bella”, I want “La vita e bella”, which is valid 7-bit ASCII but retains as much information as possible (“La vita - bella” sucks). I usually do it manually with a set of replacements, but I’m sure somebody has already thought about it. Do you know of any library that does this?
Thanks

EDIT: I came across this: http://web.archive.org/web/20070917051642/http://java.sun.com/mailers/techtips/corejava/2007/tt0207.html#1 where somebody did the heavy lifting at the JVM level, but it still requires a bit of work (and it’s not cross-platform).
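
For reference, a minimal sketch of that JVM-level approach from Clojure, assuming the linked tip is about java.text.Normalizer. The name strip-accents is made up for illustration. It decomposes each character (NFD) and then drops the combining marks with a regex:

```clojure
(require '[clojure.string :as str])
(import '(java.text Normalizer Normalizer$Form))

(defn strip-accents
  "Hypothetical helper: decomposes s into base characters plus
  combining marks (NFD), then removes the combining marks."
  [s]
  (-> (Normalizer/normalize s Normalizer$Form/NFD)
      ;; \p{M} matches any Unicode mark, e.g. the grave accent split off from è
      (str/replace #"\p{M}+" "")))

(strip-accents "La vita è bella")
;; => "La vita e bella"
```

The “bit of work” that remains: letters with no canonical decomposition, such as ł or ø, pass through unchanged, so they would still need an explicit replacement table.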

didibus · July 14, 2018, 7:05am

You should have a look at transliteration. It’s the process of converting text from one script or character set to another.

Like this lib: http://userguide.icu-project.org/transforms/general

jan · July 14, 2018, 7:06am

Give StringUtils.stripAccents a try.

$ clj -Sdeps '{:deps {org.apache.commons/commons-lang3 {:mvn/version "3.7"}}}'
Clojure 1.9.0
user=> (import 'org.apache.commons.lang3.StringUtils)
org.apache.commons.lang3.StringUtils
user=> (StringUtils/stripAccents "źdźbło")
"zdzblo"

jmlsf · July 14, 2018, 3:40pm

Lucene’s character folding is pretty sophisticated, but I’ve never tried to use this functionality outside of Lucene: https://lucene.apache.org/core/3_1_0/api/core/org/apache/lucene/analysis/ASCIIFoldingFilter.html
