Hidden Java gems: java.text.Normalizer

Java has a built-in java.text.Normalizer class to transform Unicode text into an equivalent composed or decomposed form. Dafuq?

The letter ‘Á’ can be represented in a composed form

U+00C1 LATIN CAPITAL LETTER A WITH ACUTE

and a decomposed form

U+0041    LATIN CAPITAL LETTER A
U+0301    COMBINING ACUTE ACCENT
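
Why this matters: both forms render as ‘Á’, yet Java treats them as different strings. A minimal sketch (the class name CompositionDemo and its constants are made up for illustration):

```java
public class CompositionDemo {
	// The single precomposed code point U+00C1
	static final String COMPOSED = "\u00C1";
	// Base letter U+0041 followed by the combining accent U+0301
	static final String DECOMPOSED = "A\u0301";

	public static void main(String[] args) {
		System.out.println(COMPOSED.length());           // 1
		System.out.println(DECOMPOSED.length());         // 2
		System.out.println(COMPOSED.equals(DECOMPOSED)); // false
	}
}
```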

Normalizer handles this for you:

import java.text.Normalizer;
import java.text.Normalizer.Form;

public class NormalizerExample {
	public static void main(String[] args) {
		// Decompose: U+00C1 becomes U+0041 followed by U+0301
		String s = Normalizer.normalize("Á", Form.NFD);
		System.out.println("Decomposed:");
		for (int i = 0; i < s.length(); ++i) {
			System.out.println(Integer.toHexString(s.charAt(i)));
		}
		// Compose again: the pair collapses back into U+00C1
		s = Normalizer.normalize(s, Form.NFC);
		System.out.println("Composed:");
		for (int i = 0; i < s.length(); ++i) {
			System.out.println(Integer.toHexString(s.charAt(i)));
		}
	}
}

Output:

Decomposed:
41
301
Composed:
c1

Normalizer has been available since Java 6.
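
The same class also offers Normalizer.isNormalized to check whether a string is already in a given form. Here is a small sketch of a normalization-aware comparison; NormalizedCompare and sameText are hypothetical names for illustration, not part of the JDK:

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class NormalizedCompare {
	// Hypothetical helper: compares two strings after bringing both into NFC
	static boolean sameText(String a, String b) {
		return Normalizer.normalize(a, Form.NFC)
			.equals(Normalizer.normalize(b, Form.NFC));
	}

	public static void main(String[] args) {
		String composed = "\u00C1";    // precomposed A with acute
		String decomposed = "A\u0301"; // A plus combining acute accent
		System.out.println(composed.equals(decomposed));                   // false
		System.out.println(sameText(composed, decomposed));                // true
		System.out.println(Normalizer.isNormalized(decomposed, Form.NFC)); // false
		System.out.println(Normalizer.isNormalized(decomposed, Form.NFD)); // true
	}
}
```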

What is this good for?

I use it to build nice slugs, seen here, like so:

String name = "Die Ärzte 2013!";

// Decompose Unicode characters
String slug = Normalizer.normalize(name.toLowerCase(), Form.NFD)
// remove all combining diacritical marks and everything that is neither a word nor a whitespace character
	.replaceAll("\\p{InCombiningDiacriticalMarks}|[^\\w\\s]", "")
// replace every run of whitespace or dashes with a single space
	.replaceAll("[\\s-]+", " ")
// trim the string
	.trim()
// and replace each remaining whitespace character with a dash
	.replaceAll("\\s", "-");
// slug is now "die-arzte-2013"
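
Wrapped into a reusable method, the chain above could look like the following sketch. The names Slugs and slugify are made up, and Locale.ROOT is my addition (an assumption, not in the original snippet) so that lowercasing does not depend on the JVM's default locale:

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.Locale;

public class Slugs {
	// Hypothetical helper wrapping the chain shown above
	static String slugify(String input) {
		return Normalizer.normalize(input.toLowerCase(Locale.ROOT), Form.NFD)
			// drop combining marks and anything that is not a word or whitespace character
			.replaceAll("\\p{InCombiningDiacriticalMarks}|[^\\w\\s]", "")
			// collapse runs of whitespace or dashes into a single space
			.replaceAll("[\\s-]+", " ")
			.trim()
			// turn the remaining spaces into dashes
			.replaceAll("\\s", "-");
	}

	public static void main(String[] args) {
		System.out.println(slugify("Die \u00C4rzte 2013!"));     // die-arzte-2013
		System.out.println(slugify("  Caf\u00E9   con leche ")); // cafe-con-leche
	}
}
```

Note that letters without a canonical decomposition, such as ‘ß’, are removed entirely by [^\w\s], because \w only matches ASCII word characters by default.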

4 comments

Luis wrote:

Hello Michael, thank you for sharing. Just a couple of questions; I’m just beginning to code, so I appreciate your support:

1. Where can I learn the syntax you used, like “\\p{InCombiningDiacriticalMarks}|[^\\w\\s]”?

2. I have used something similar to your Normalizer example in my code, but the letters with diacritical marks just disappear instead of being substituted with plain letters. Any idea why? (I copied some paragraphs from Word and pasted them into a SQL database using sqlite3. The text is in Spanish. In my code I extract the specific “cell” from the cursor and apply the normalizer to that string; the result on the screen is the text without the special characters.)

Thanks

Posted on January 13, 2015 at 12:37 AM

Michael wrote:

Hi Luis,

1. Check the Javadoc of the Pattern class. There you’ll find the section “Classes for Unicode scripts, blocks, categories and binary properties”. Block classes are the regex classes starting with \p{In…}. What comes after the In is the name of a Unicode block, without the spaces and in camel case. You’ll find all available Unicode blocks in Blocks.txt.

2. The replaceAll method call in my example above does exactly that: the normalizer first decomposes the letters with diacritics, and then the regex replaces all the diacritical marks with an empty string. Hope that helps.
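
A small sketch of both points; the class name BlockClassDemo and the sample word are made up. It shows a \p{In…} block class in action and one possible reason (an assumption, not confirmed in this thread) why accented letters could vanish entirely: if the text is not decomposed first, [^\w\s] removes the whole precomposed letter, because \w only matches ASCII by default.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class BlockClassDemo {
	// Same regex as in the slug example
	static final String REGEX = "\\p{InCombiningDiacriticalMarks}|[^\\w\\s]";

	public static void main(String[] args) {
		// U+0301 lives in the "Combining Diacritical Marks" block
		System.out.println("\u0301".matches("\\p{InCombiningDiacriticalMarks}")); // true

		String word = "canci\u00F3n"; // precomposed o with acute
		// Without decomposition the whole letter is dropped
		System.out.println(word.replaceAll(REGEX, "")); // cancin
		// Decomposed first, only the accent is dropped
		System.out.println(Normalizer.normalize(word, Form.NFD).replaceAll(REGEX, "")); // cancion
	}
}
```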

Posted on January 14, 2015 at 6:32 PM

atolagberidwan wrote:

How can I produce a letter with one diacritic mark and a dot underneath it?

Posted on April 20, 2015 at 1:10 AM

Michael wrote:

In which system? Just entering it somewhere?!

Posted on April 20, 2015 at 9:13 AM

info.michael-simons.eu