Multi-byte Strikes Again

Multi-byte support, while by and large now a well-supported staple of modern programming languages, is still something that trips a person up from time to time.  In particular, while UTF-8 is the de facto standard encoding, it does pose an issue when you get down to the character level.

I ran into this today while trying to shrink a potentially large bit of text to a manageable “chunk” of characters.  Ruby, it appears, separates a string into 8-bit characters.  When grabbing a substring from a string with mixed Japanese and English, you end up with the dreaded 文字化け (mo-ji-ba-ke, garbled characters); something now only whispered in darkened corners by veterans reliving the horror days before Unicode.
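To make the problem concrete, here is a minimal sketch for a current Ruby (1.9.3 or later). The helper name cut_at_bytes is hypothetical, and String#byteslice stands in for the byte-oriented slicing described above; that substitution is an assumption, since newer Rubies index strings by character rather than by byte.

```ruby
# Hypothetical helper: cut a string at a raw byte offset,
# ignoring character boundaries (an assumption standing in
# for the byte-oriented slicing described in the post).
def cut_at_bytes(text, byte_count)
  text.byteslice(0, byte_count)
end

text = "私のマルチバイト文です。My multi-byte sentence."

# Each of these Japanese characters occupies 3 bytes in UTF-8,
# so 5 bytes covers one whole character plus 2 bytes of the next.
broken = cut_at_bytes(text, 5)

puts "私".bytesize          # bytes in a single character
puts broken.valid_encoding? # false when the cut lands mid-character
```

The dangling partial character at the end of the cut is what shows up on screen as garbage.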

Well, the long and short of it is a nice bit of hackery provided by 山下英孝.  Basically, the trick is to grab the characters directly from the string and then run the result through a regular-expression slice.

However, while this is a fun hack, it is still a hack.  The better way to handle this is to use the chars instance method on the String class.  This ensures that each character is a logical character (e.g. ‘a’ or ‘あ’) and not the physical char returned by indexing the string directly.

In summary, you want to use:

multi_byte_string = "私のマルチバイト文です。My multi-byte sentence."
# this is a hack of the physical characters
# you can, of course, use this; but, you should not
multi_byte_string[0,5].slice(/\A.{0,}/m)
# Instead, this is a much better approach which uses
# all the UTF-8 goodness to get logical characters from the string
multi_byte_string.chars[0,5]
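
For anyone on a newer Ruby, the chars approach can be wrapped up as a small helper. This is only a sketch: the name chunk_chars is hypothetical, and the join at the end is an assumption that you want a String back rather than a list of characters.

```ruby
# Hypothetical helper: return at most `limit` logical characters
# of `text` as a new String.
def chunk_chars(text, limit)
  # chars yields whole characters, never partial bytes;
  # first(limit) is safe even when the text is shorter than limit.
  text.chars.first(limit).join
end

sample = "私のマルチバイト文です。My multi-byte sentence."
chunk = chunk_chars(sample, 5)

puts chunk                  # => 私のマルチ
puts chunk.valid_encoding?  # => true
```

On Ruby 1.9 and later, plain indexing such as sample[0, 5] also counts characters rather than bytes, so the helper is mostly a way to make the intent explicit.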

Author: Ward

