ASCII transliteration of Unicode strings

It’s sometimes useful, or even necessary, to represent strings containing accented letters, or other letters outside the US-ASCII set, as pure ASCII. For instance:

perché   ==> perche

This transliteration might be desirable for various reasons, mainly so that the string can be used where only ASCII is supported (or preferred). Some folks call this process “deaccenting”, as it’s commonly used to remove accents from words so that they can be compared. In practice, accents are not the only problem, and you’ll also want to handle things like:

straße   ==> strasse
Tromsø   ==> Tromso
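
To see why plain accent removal falls short on these, here is a minimal sketch that uses only the core Unicode::Normalize module: it decomposes each character and drops the combining marks. The strip_accents helper is a hypothetical name of mine, not part of any module.

```perl
use strict;
use warnings;
use utf8;                          # the source contains UTF-8 literals
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose characters (é becomes e + combining
# acute accent), then remove the combining marks.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}+//g;   # \p{Mn} matches nonspacing marks
    return $decomposed;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $word (qw/perché straße Tromsø/) {
    print strip_accents($word), "\n";
}
```

This prints perche, but leaves straße and Tromsø untouched: ß and ø have no decomposition into a base letter plus an accent, so there is nothing to strip. Handling them takes a real transliteration table.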

There’s a CPAN module which can help here: Text::Unidecode by Sean M. Burke.

use utf8;             # the source file contains UTF-8 literals
use Modern::Perl;     # enables strict, warnings and say
use Text::Unidecode;

for my $word (qw/Tromsø perché straße/) {
    # ASCII representation
    say unidecode($word);
}

This will print, as expected:

Tromso
perche
strasse

As the module documentation points out, it’s not meticulous: it transliterates character by character, without looking at context or language, so it doesn’t always do a good job. However, Text::Unidecode works nicely with Western European languages, along with some others.
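
When a language has its own conventions, one possible workaround is a small language-specific pass before the general transliteration. The sketch below assumes German text, where ä, ö and ü are conventionally written ae, oe and ue (as far as I know, Text::Unidecode maps them to plain a, o and u). The %german table and the german_prepass helper are hypothetical names used for illustration only.

```perl
use strict;
use warnings;
use utf8;    # the source contains UTF-8 literals

# Hypothetical German-only mapping; adapt or replace it per language.
my %german = (
    'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue',
    'Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue',
);

# Replace the umlauts according to the table, leave everything else alone.
sub german_prepass {
    my ($text) = @_;
    $text =~ s/([äöüÄÖÜ])/$german{$1}/g;
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print german_prepass('Müller fährt nach Köln'), "\n";
```

This prints Mueller faehrt nach Koeln. The result can then be handed to unidecode() to take care of whatever non-ASCII characters remain.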

The Cattle Grid

Splashes of digital ink by Michele Beltrame