Kotoba achieved another important milestone this evening. Characters! As alluded to in previous posts, my work has been focused on developing a language-agnostic model of language characters. The work is heavily inspired on Kanjidict which is also the source of the 10,000+ chinese characters that are now a part of Kotoba.
Author: Ward
MySQL + UTF-8 = Not So Obvious
Back in the day of MySQL 4, UTF-8 was a relatively straight-forward thing. You declared the charset to be utf-8 and behold! so it was. Fast-forward to today and ensuring your database is actually supporting UTF-8 is not so straight-forward.
I had noticed some issues with Kotoba that I thought resolved around collation (the linguistic strategy a database uses to sort strings) as evidenced by Japanese words not sorting in their natural-order (e.g. あいうえお等). Namely, I just chalked it up to a simple mis-configuration. However, in developing character (read Kanji!) functionality this past week (see the next post!), I noticed some more aberrant behavior. In particular, when searching for kanji I would get false positives. For example, if I searched for 会 I also got a few other kanji; a linguistically non-sensical result. I started to dig a bit deeper, and to my disappoint much of the underlying data in Kotoba was corrupted.
The issue, at it is core, is that while I had configured the database for UTF-8, I had missed a few spots that fundamentally enabled the database to store non-UTF-8 characters: a very subtle form of 文字化け.
The existing literature on the topic including possible fixes [here, here, here, here, here, here, horror story here, here] encourage the configuration of the MySQL daemon (mysqld) to use skip-character-set-client-handshake
. This effectively ignores the client handshake with the server, and forces communications to use utf-8. This can be configured as below in your MySQL my.cnf
file, or:
[client]
default-character-set=utf8
[mysqld]
default-character-set = utf8
skip-character-set-client-handshake
character-set-server = utf8
collation-server = utf8_general_ci
init-connect = SET NAMES utf8
However, you may not have access to configure your MySQL server as is currently the case with Kotoba’s shared hosting. So how does one get around this problem? In my case, ensure my client (Rails 2.2) strictly enforces utf-8.
Modify database.yml
to include encoding: utf8
, or:
development:
adapter: mysql
database: DATABASE
username: USERNAME
password: PASSWORD
socket: /tmp/mysql.sock
encoding: utf8
timeout: 5000
Add to environment.rb
the line:
$KCODE = "UTF8"
Add to application.rb
the following:
before_filter :set_charset
def set_charset
headers["Content-Type"] = "text/html; charset=UTF-8"
end
And if you really are paranoid, and you should be, then I would also add (blatant copy of Artuaz’s set_names_utf8) lib/my_app_utf8.rb
:
module MyAppUtf8
class SetNamesUtf8
def self.filter(controller)
suppress(ActiveRecord::StatementInvalid) do
ActiveRecord::Base.connection.execute 'SET NAMES UTF8'
end
true
end
end
end
And also update init.rb
to include:
##
## Ensure that all SQL queries use UTF8
##
ActionController::Base.send :prepend_before_filter, MyAppUtf8::SetNamesUtf8
suppress(ActiveRecord::StatementInvalid) do
ActiveRecord::Base.connection.execute 'SET NAMES UTF8'
end
For awhile I tried to convert the data from its original charset to utf8; however, all the tools I used (including iconv
and charguess
) did not help me recover the data. In the end, I dumped all the corrupted tables and re-imported them. All data is now sufficiently remedied. Lesson learned? Check!
Sexy Models
Well, maybe sexy is too strong a word for most people. But if you are like me, then data modeling is sexy. And if that is the case then Kotoba can be particularly sexy.
One of the main objectives at this point in Kotoba’s development is ensuring the entity-relationships are modeled correctly. While no easy task, it is something that has been growing organically through trial and error and a fair bit of research into other projects.
Remember, while Kotoba is currently geared toward Japanese one of the goals is to ensure, as best as possible, something that can be more universal. That said, ironically, the most challenging task thus far is to sufficiently normalize characters rather than for words due to the numerous normalized, language-based character attributes that one might wish to track.
How Do You Spell β?
In an attempt to help clarify where Kotoba is at, we have appropriately adorned Kotoba with a β (beta) moniker to help ensure everyone knows what to expect. And by “what to expect” I mean to expect Google’s definition of beta; usable but constantly growing in functionality.
For the curious, I used Gimp and this great tutorial to help create the images.
Kotoba Making Friends
Kotoba wants to be your friend!
In what will be an interesting experiment in social networking, Kotoba now befriends other Twitter users who make posts that match a predefined list of words or the word of the moment. The idea is for Kotoba to reach out to like-minded Twitters in hopes they are interested in learning a new language. Let me know what you think: for, against, or indifferent.