Adi Drumea

Monday, March 10, 2008

[MySQL, PHP] Using UTF-8 correctly -- updated

Here is a simple recipe for people who need to work with UTF-8 only in PHP and MySQL:

In my.ini set the params below and restart the service:

[mysql]
default-character-set=utf8
[mysqld]
default-character-set=utf8

In php.ini set the params below and restart the web server.

default_charset = "utf-8"

When using mysql/i to make a connection from PHP, execute the following after connecting:

mysqli_set_charset($db, 'utf8');

Use mb_xxx functions for string functions in PHP. See this page for reference:

http://www.phpwact.org/php/i18n/charsets

Here is a summary of the functions calls you need to replace in PHP (incomplete):

mail() --> mb_send_mail()
strlen() --> mb_strlen()
strpos() --> mb_strpos()
stripos() --> mb_stripos()
strrpos() --> mb_strrpos()
substr() --> mb_substr()
strtolower() --> mb_strtolower()
strtoupper() --> mb_strtoupper()
ereg*() --> preg*() (In general preg* is recommended)
preg_* --> preg_* /ui (make sure you add /u switch for safety)
sprintf --> ??? (uncertain, probably sprintf works out of the box)
s[i] --> mb_substr(s, i, 1) (indexing gets byte, not char, at index)
strstr() --> mb_strstr()
stristr() --> mb_stristr()
split() --> mb_split()

For preg_* functions make sure character classes like [a-z] becomes [\pLl] and [A-Z] becomes [\pLu]

Also:

html_special_chars seems to work just fine. Not sure about addslashes. In general you should audit the correct usage of every string function throughout your application.

Bonus:

1. In MySQL, to sort correctly by a utf8 column, specify collation after order by (or specify collation on that column when you create the table). For example to order by a utf8 column with romanian collation:

select * from table order by column collate utf8_romanian_ci desc

This is meant as a simple guide for the situation when you use only utf8 throughout your database and your application, which should be fine for most languages.

2. I have not managed to make DBManager show UTF-8 characters in the results grid, so you will see garbage although the data is stored correctly. The same applies to the mysql command line client (on Windows).

3. There may be some problems with inserting blobs if the default charset is utf8, I am currently looking at this and update this.