4.22 Handling non-English characters
Although the American Standard Code for Information Interchange (ASCII) works for most English text, it defines only 128 characters, and the single-byte character sets built on top of it allow at most 256. That range, 0 to 255, is used because it is what fits in a "byte" - eight ones and zeroes in computing terminology. Languages such as Russian, Korean, and Japanese use characters that do not fit alongside the English ones in a single 256-character set, so encodings that cover them need more than one byte of space for some characters - you need multibyte characters.
Dealing with these characters is a little different from working with single-byte characters, because functions like substr() and strtoupper() assume precisely one byte per character, and can corrupt a multibyte string. Instead, you should use the multibyte equivalents of these functions, such as mb_strtoupper() instead of strtoupper(), mb_ereg() rather than ereg(), and mb_strlen() rather than strlen(). The parameters for these functions are the same as for the originals, except that most accept an optional extra parameter to specify the encoding.
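The difference between counting bytes and counting characters is easiest to see side by side. The following is a minimal sketch, assuming the script is saved as UTF-8 and the mbstring extension is loaded; the sample word and variable names are only illustrative.

```php
<?php
// Assumption: this file is saved as UTF-8 and mbstring is available.
$word = "привет"; // Russian for "hello": 6 characters, 12 bytes in UTF-8

// strlen() counts bytes; mb_strlen() counts characters.
$bytes = strlen($word);
$chars = mb_strlen($word, 'UTF-8');

// substr() cuts at a byte offset, so 3 bytes is all of the first letter
// plus the first half of the second one.
$broken = substr($word, 0, 3);

// mb_substr() cuts at a character offset, so this is the first 3 letters.
$clean = mb_substr($word, 0, 3, 'UTF-8');

// mb_strtoupper() knows how to change the case of Cyrillic letters.
$upper = mb_strtoupper($word, 'UTF-8');

echo "$bytes bytes, $chars characters\n";
echo $clean, "\n", $upper, "\n";

// mb_check_encoding() reports whether a string is valid in an encoding.
var_dump(mb_check_encoding($broken, 'UTF-8')); // the half-cut string is not
var_dump(mb_check_encoding($clean, 'UTF-8'));
```

Passing the encoding explicitly, as done here with 'UTF-8', avoids relying on whatever internal encoding the server happens to be configured with.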
So, working with multibyte strings is easy for the most part, with one exception: what do you do with an existing script you'd like to multibyte enable? To cope with that scenario, there's a special php.ini setting: mbstring.func_overload. By default this is set to 0, which means functions behave as you would expect them to. If you set it to 1, calling the mail() function gets silently rerouted to the mb_send_mail() function. If you set it to 2, the common string functions, such as strlen(), substr(), strpos(), and strtoupper(), get rerouted to their multibyte partners. If you set it to 4, the "ereg" functions get rerouted. You can combine these as you please by simply adding them - for example, for "mail" and "str" rerouting you add 1 and 2, giving 3, so you set mbstring.func_overload to 3 to overload these two. To overload everything, set it to 7: 1 ("mail") + 2 ("str") + 4 ("ereg"). Note that this setting was deprecated in PHP 7.2 and removed in PHP 8.0, so it is only an option on older installations.
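The values can be added like this because each one occupies its own bit. The sketch below shows that arithmetic; the constant names and the describeOverload() helper are hypothetical, made up for illustration, and are not part of PHP or mbstring.

```php
<?php
// Hypothetical names for the three flag values described above.
const OVERLOAD_MAIL = 1; // mail() -> mb_send_mail()
const OVERLOAD_STR  = 2; // string functions -> mb_* partners
const OVERLOAD_EREG = 4; // ereg functions -> mb_ereg* partners

// Adding (or OR-ing) flags combines them.
$mailAndStr = OVERLOAD_MAIL | OVERLOAD_STR;                 // 1 + 2
$everything = OVERLOAD_MAIL | OVERLOAD_STR | OVERLOAD_EREG; // 1 + 2 + 4

// Hypothetical helper: list which groups a given setting value overloads.
function describeOverload(int $value): array
{
    $names = [
        OVERLOAD_MAIL => 'mail',
        OVERLOAD_STR  => 'str',
        OVERLOAD_EREG => 'ereg',
    ];
    $groups = [];
    foreach ($names as $flag => $name) {
        // A group is active when its bit is set in the value.
        if ($value & $flag) {
            $groups[] = $name;
        }
    }
    return $groups;
}

echo $mailAndStr, ": ", implode(', ', describeOverload($mailAndStr)), "\n";
echo $everything, ": ", implode(', ', describeOverload($everything)), "\n";
```

In php.ini itself the result is written as a plain number, for example mbstring.func_overload = 3.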