MySQL + UTF-8 = Not So Obvious

Back in the day of MySQL 4, UTF-8 was a relatively straight-forward thing. You declared the charset to be utf-8 and behold! so it was. Fast-forward to today and ensuring your database is actually supporting UTF-8 is not so straight-forward.

I had noticed some issues with Kotoba that I thought resolved around collation (the linguistic strategy a database uses to sort strings) as evidenced by Japanese words not sorting in their natural-order (e.g. あいうえお等). Namely, I just chalked it up to a simple mis-configuration. However, in developing character (read Kanji!) functionality this past week (see the next post!), I noticed some more aberrant behavior. In particular, when searching for kanji I would get false positives. For example, if I searched for 会 I also got a few other kanji; a linguistically non-sensical result. I started to dig a bit deeper, and to my disappoint much of the underlying data in Kotoba was corrupted.

The issue, at it is core, is that while I had configured the database for UTF-8, I had missed a few spots that fundamentally enabled the database to store non-UTF-8 characters: a very subtle form of 文字化け.

The existing literature on the topic including possible fixes [here, here, here, here, here, here, horror story here, here] encourage the configuration of the MySQL daemon (mysqld) to use skip-character-set-client-handshake. This effectively ignores the client handshake with the server, and forces communications to use utf-8. This can be configured as below in your MySQL my.cnf file, or:

[client]
default-character-set=utf8

[mysqld]
default-character-set = utf8
skip-character-set-client-handshake
character-set-server = utf8
collation-server = utf8_general_ci
init-connect = SET NAMES utf8

However, you may not have access to configure your MySQL server as is currently the case with Kotoba’s shared hosting. So how does one get around this problem? In my case, ensure my client (Rails 2.2) strictly enforces utf-8.

Modify database.yml to include encoding: utf8, or:

development:
  adapter: mysql
  database: DATABASE
  username: USERNAME
  password: PASSWORD
  socket: /tmp/mysql.sock
  encoding: utf8
  timeout: 5000

Add to environment.rb the line:

$KCODE = "UTF8"

Add to application.rb the following:

  before_filter :set_charset

  def set_charset
    headers["Content-Type"] = "text/html; charset=UTF-8"
  end 

And if you really are paranoid, and you should be, then I would also add (blatant copy of Artuaz’s set_names_utf8) lib/my_app_utf8.rb:

module MyAppUtf8
  class SetNamesUtf8
    def self.filter(controller)
      suppress(ActiveRecord::StatementInvalid) do
        ActiveRecord::Base.connection.execute 'SET NAMES UTF8'
      end
      true
    end
  end
end

And also update init.rb to include:

##
## Ensure that all SQL queries use UTF8
##
ActionController::Base.send :prepend_before_filter, MyAppUtf8::SetNamesUtf8
suppress(ActiveRecord::StatementInvalid) do
  ActiveRecord::Base.connection.execute 'SET NAMES UTF8'
end

For awhile I tried to convert the data from its original charset to utf8; however, all the tools I used (including iconv and charguess) did not help me recover the data. In the end, I dumped all the corrupted tables and re-imported them. All data is now sufficiently remedied. Lesson learned? Check!

Author: Ward

I’m the creator and operator of this little corner of the internets, writing on all things related to art and more specifically my experiences trying to figure this whole thing out. I guess I’m trying to figure out life, too, but mostly I just post about art here.

0 thoughts on “MySQL + UTF-8 = Not So Obvious”

Breath some fire into this post!

This site uses Akismet to reduce spam. Learn how your comment data is processed.