UTF-8 has always been a multi-byte encoding but you probably had to handle only 2 byte (16bit) UTF-8 characters. With the raise of Emojis 4 byte characters rose as well so handling 4 byte UTF-8 characters is not only of interest for handling exotic languages but also for the needs of average users who want to post fancy smilies with their phones.
I won’t go into detail too much but only note some tips and caveats for supporting 4 byte UTF-8 characters in a Java / MySQL ecosystem. You’ll find the basic setup for your MySQL database, considerations about MySQL performance, connecting your Java program to the database and finally a little information about handling 4 byte UTF-8 strings in java:
MySQL / utf8mb4
MySQL has two different UTF-8 datatypes, UTF8 and UTF8mb4. If you want to store all possible UTF-8 characters, you have to use UTF8mb4 and you must convert your tables and columns, not only changing the definition. If not, you get an error 1366 “Incorrect string value”.
Here is a nice article that guides you through the process: http://mathiasbynens.be/notes/mysql-utf8mb4.
Key points are
ALTER TABLE TABLE_NAME CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;
- Check the column length!
- More important, check the index length!
- Think about your collation.
- You can skip altering the column definition if you haven’t defined them in the first place, like i did, and chosen to go with the table default.
In contrast to the article i’ve chosen utf8mb4_bin as collation. The nicer ordering of utf8mb4_general_ci or better utf8mb4_unicode_ci is kinda nice but i always felt that having “Die Ärzte” = “die arzte” evaluating to true is just plain wrong. As there is no utf8mb4_unicode_cs collation i went for the binary.
Performance of 4-byte UTF-8 (utf8mb4) is slower than for 3-byte UTF-8 (utf8). If you do not want to incur this penalty, continue to use utf8.
See Upgrading from Previous to Current Unicode Support. I didn’t notice this penalty but what i did notice was a real slow down when joining to string columns with a different utf8 charset and collation (one column UTF8, the other UTF8mb4). My server load was exorbitant high.
Thinking about it it doesn’t surprise as MySQL must convert each row into a compatible format. In my first solution i wanted to only change the tables that i absolutely must to support the end users entering 4 byte UTF-8 characters.
If you want to support 4 byte UTF-8 in your database, you should convert all tables, otherwise any inner join between an UTF8 and UTF8mb4 table on varchar columns will be terrible slow.
After converting all tables i didn’t notice any further performance problems.
Java and MySQL Connector/J
First of all, the server settings. It is real important (as mentioned in the article above) to add the following snipped to your my.cnf:
[mysqld] character-set-server = utf8mb4 collation-server = utf8mb4_unicode_ci # or whatever utf8mb4 collation you choose
I omitted the “character-set-client-handshake = FALSE”, i want my server to negotiate the character set with the client.
The “character-set-server” is important for MySQL Connector/J. Without this global definition MySQL Connector/J won’t use UTF8mb4. Also, you need at least version 5.1.13 and you must specify
in the connection string.
And now comes the tricky part: In version 5.1.13 you must also specify
In higher versions (i couldn’t get the exact version, the behavior shows at least in 5.1.21 and 5.1.22), you must remove the explicit characterEncoding as otherwise the characters are stored wrong (which is very, very bad) or you possible get the “Incorrect string value” exception from your database layer (which is also bad). Either way, the solution is not that nice and i would have preferred an explicit value (but that’s what you get with a wrong design decision like having 2 UTF-8 datatypes)
Here’s what i am using with MySQL Connector/J 5.1.22
as part of a jdbc url:
or with a JDBC DataSource:
final MysqlDataSource rv = new MysqlDataSource(); rv.setServerName("localhost"); rv.setPortNumber(3306); rv.setDatabaseName("your_database"); rv.setUseUnicode(true);
Handling 4 byte UTF-8 characters in Java
I spoke about a wrong design decision within MySQL, having 2 UTF-8 character set. The Java designers choose the other way.
In Java strings are internally just an array of chars and a character is 16bit datatype which leaves us with a problem (i guess this is one of the reasons for the decision of MySQL): How to represent 32bit wide characters? Java represents those supplementary characters as pairs of surrogate chars, the first one is a high-surrogate, the second a low-surrogate.
What does that mean? Let me give you a little, commented example:
I’m expressing my frustration about several java libraries not being able to handle those characters well with the pile of poo… I rewrote part of JSoup and Mylyn/WikiText to supply correct rendering at Daily Fratze.
I see why the Java developers needed to use a construct like high- and low surrogate and i like this solution much better than having two separate character sets. What you absolutely need to get right when you’re dealing with strings at character level is wether you want to count codepoints or chars and you must respect character pairs.