Java has a built-in java.text.Normalizer class to transform Unicode text into an equivalent composed or decomposed form. Dafuq?
The letter ‘Á’ can be represented in a composed form
U+00C1 LATIN CAPITAL LETTER A WITH ACUTE
and a decomposed form
U+0041 LATIN CAPITAL LETTER A U+0301 COMBINING ACUTE ACCENT
Normalizer handles this for you:
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class NormalizerExample {
    public static void main(String[] args) {
        // Decompose: one code unit becomes base letter + combining accent
        String s = Normalizer.normalize("Á", Form.NFD);
        System.out.println("Decomposed:");
        for (int i = 0; i < s.length(); ++i)
            System.out.println(Integer.toHexString((int) s.charAt(i)));

        // Compose again: the pair collapses back into a single code unit
        s = Normalizer.normalize(s, Form.NFC);
        System.out.println("Composed:");
        for (int i = 0; i < s.length(); ++i)
            System.out.println(Integer.toHexString((int) s.charAt(i)));
    }
}
Output:
Decomposed:
41
301
Composed:
c1
Normalizer has been available since JDK 6.
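Because the composed and decomposed forms are different character sequences, a plain equals() treats them as different strings even though they render identically. The following is a minimal sketch of comparing two strings after bringing both into the same form; the class name and the equalsNormalized helper are made up for illustration and are not part of the JDK.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class NormalizedEquals {

    // Hypothetical helper: compares two strings after normalizing both to NFC.
    static boolean equalsNormalized(String a, String b) {
        return Normalizer.normalize(a, Form.NFC)
                .equals(Normalizer.normalize(b, Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "\u00C1";    // LATIN CAPITAL LETTER A WITH ACUTE
        String decomposed = "A\u0301"; // LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT

        // The raw sequences differ, so equals() reports a mismatch.
        System.out.println(composed.equals(decomposed));                   // false
        // After normalizing both sides to the same form they match.
        System.out.println(equalsNormalized(composed, decomposed));        // true
        // isNormalized tells whether a string is already in a given form.
        System.out.println(Normalizer.isNormalized(decomposed, Form.NFC)); // false
    }
}
```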
What is this good for?
I use it to build nice slugs, seen here, like so:
String name = "Die Ärzte 2013!";
// Decompose Unicode characters
String slug = Normalizer.normalize(name.toLowerCase(), Form.NFD)
    // remove all combining diacritical marks and everything that isn't a word or whitespace character
    .replaceAll("\\p{InCombiningDiacriticalMarks}|[^\\w\\s]", "")
    // replace all occurrences of whitespace or dashes with a single blank
    .replaceAll("[\\s-]+", " ")
    // trim the string
    .trim()
    // and replace all blanks with a dash
    .replaceAll("\\s", "-");
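The chain above can be wrapped into a reusable method. This is only a sketch: the class name SlugExample and the method name slugify are made up here, and passing Locale.ROOT to toLowerCase is an added assumption (the original calls toLowerCase() without a locale) so that the result does not depend on the default locale of the machine.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.Locale;

public class SlugExample {

    // Hypothetical helper wrapping the replaceAll chain from the post.
    static String slugify(String input) {
        // Locale.ROOT is an assumption, not part of the original snippet.
        return Normalizer.normalize(input.toLowerCase(Locale.ROOT), Form.NFD)
            // drop combining marks and anything that is not a word or whitespace character
            .replaceAll("\\p{InCombiningDiacriticalMarks}|[^\\w\\s]", "")
            // collapse runs of whitespace or dashes into a single blank
            .replaceAll("[\\s-]+", " ")
            .trim()
            // turn the remaining blanks into dashes
            .replaceAll("\\s", "-");
    }

    public static void main(String[] args) {
        System.out.println(slugify("Die \u00C4rzte 2013!"));           // die-arzte-2013
        System.out.println(slugify("  Cr\u00E8me   br\u00FBl\u00E9e  ")); // creme-brulee
    }
}
```

Note that characters without a decomposition, such as 'ß', are not word characters for the default \w and are simply removed by this pattern.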
4 comments
Hello Michael, thank you for sharing, just a couple of questions, I’m beginning to code so I appreciate your support:
1. Where can I learn the syntax you used, like “\\p{InCombiningDiacriticalMarks}|[^\\w\\s]”?
2. I have used something similar to your Normalizer example in my code, but the letters with diacritical marks just disappear and are not substituted with the plain letters. Any idea why? (I copied some paragraphs from Word and pasted them into a SQL DB using sqlite3. The text is in Spanish. In my code I extract the specific “cell” from the cursor and apply the normalizer to this string; the result on the screen is the text without the special characters.)
Thanks
Hi Luis,
1. Check the Pattern Javadoc. There you’ll see the “Classes for Unicode scripts, blocks, categories and binary properties”. Those are regex classes starting with \p{In…}. What comes after the In is the name of a Unicode block, without the spaces and in camel case. You’ll find all available Unicode blocks here: Blocks.txt.
2. The replaceAll method call in my example above does exactly that: The normalizer first decomposes the diacritics and then the regex replaces them all. Hope that helps.
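Both answers can be tried out in a few lines. The sketch below is an illustration only: the class and helper names are made up, and the second part is just one possible explanation for letters vanishing completely, since Luis's code isn't shown. It assumes the text was not decomposed with Form.NFD before the regex ran, in which case the composed letter itself falls under [^\w\s] (the default \w only covers ASCII) and is removed as a whole.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

public class BlockClassDemo {

    // Hypothetical helper: does the string match the \p{In<block>} class?
    static boolean inBlock(String block, String s) {
        return Pattern.matches("\\p{In" + block + "}", s);
    }

    // Hypothetical helper: normalize to the given form, then apply the post's first regex.
    static String strip(String s, Form form) {
        return Normalizer.normalize(s, form)
            .replaceAll("\\p{InCombiningDiacriticalMarks}|[^\\w\\s]", "");
    }

    public static void main(String[] args) {
        // COMBINING ACUTE ACCENT lives in the "Combining Diacritical Marks" block.
        System.out.println(inBlock("CombiningDiacriticalMarks", "\u0301")); // true
        System.out.println(inBlock("CombiningDiacriticalMarks", "A"));      // false
        // GREEK SMALL LETTER ALPHA lives in the Greek block.
        System.out.println(inBlock("Greek", "\u03B1"));                     // true

        String text = "canci\u00F3n"; // Spanish word with a composed o-acute
        // Decomposed first: only the accent is removed, the base letter stays.
        System.out.println(strip(text, Form.NFD)); // cancion
        // Left composed: the whole accented letter is removed.
        System.out.println(strip(text, Form.NFC)); // cancin
    }
}
```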
How can I put a letter with one diacritic mark and a dot underneath it?
In which system? Just entering it somewhere?!