Multi-byte support, while by and large a well-supported staple of modern programming languages, is still something that trips a person up from time to time. In particular, while UTF-8 is the de facto standard for encoding, it does pose an issue when you get down to the character level.
I ran into this today while trying to shrink a potentially large bit of text down to a manageable “chunk” of characters. Ruby, it appears, treats a string as a sequence of 8-bit characters. When grabbing a substring from a string with mixed Japanese and English, you can end up with the dreaded 文字化け (mojibake, “garbled characters”); something now only whispered in darkened corners by veterans reliving the horror days before Unicode.
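To see why, compare the logical character count with the byte count. This little sketch isn’t from the original hack; it assumes a modern Ruby (1.9+), where String#byteslice reproduces the old byte-oriented slicing:

japanese_and_english = "文字化け test"
japanese_and_english.length     # => 9 logical characters
japanese_and_english.bytesize   # => 17 bytes (each kana/kanji is 3 bytes in UTF-8)
# slicing by bytes can cut a character in half, producing mojibake
japanese_and_english.byteslice(0, 5).valid_encoding?  # => false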
Well, the long and short of it is a nice bit of hackery provided by 山下英孝. Basically, the trick is to slice off complete characters with a regex after grabbing the raw bytes directly from the string.
However, while this is a fun hack, it is a hack. The better way to handle this is to use the chars instance method on the String class. This ensures that a character is a logical character (e.g. ‘a’ or ‘あ’) and not the physical byte returned by indexing the string directly.
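To illustrate the difference, here is a quick sketch (assuming a Ruby where String#chars yields logical characters, which Ruby 1.9+ does natively):

# chars gives you whole characters; bytes gives you the raw UTF-8 octets
"私のa".chars   # => ["私", "の", "a"]
"私のa".bytes   # => [231, 167, 129, 227, 129, 174, 97]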
In summary, you want to use:
multi_byte_string = "私のマルチバイト文です。My multi-byte sentence."
# this is a hack that works on the physical characters (bytes)
# you can, of course, use this; but you should not
multi_byte_string[0,5].slice(/\A.{0,}/m)
# Instead, this is a much better approach, which uses
# all the UTF-8 goodness to get logical characters from the string
multi_byte_string.chars[0,5]
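Note that chars returns an Array, so if you want a String back, join the slice. This continuation of the example above is mine, not part of the original snippet, and assumes Ruby 1.9+:

# rejoin the first five logical characters into a string
multi_byte_string.chars[0,5].join   # => "私のマルチ"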