Lately I’ve been working with a friend on a daily-deal aggregator. The Groupon-like sites are popping up everywhere and the market for aggregators is still fairly unfilled.
My project, Alladeals, targets the Swedish daily-deal market, so it needs to support Swedish characters.
In the future it might have to support other languages as well, so I decided that UTF-8 was the way to go.
Since most webpages are encoded in UTF-8 these days it has been fairly painless to actually work with UTF-8 in PHP, that is, until yesterday.
PHP does not natively support UTF-8. This is fairly important to keep in mind when dealing with UTF-8 encoded data in PHP. Usually I’m pretty good at remembering that, but yesterday I happened upon a bug that could easily have gone unnoticed for months if not for some random luck.
The bug manifested itself in the deal titles. The design is not well suited to really long titles, so we decided to make sure that titles did not exceed 140 characters. To cut the title, the following code was used:
$title = substr($deal['title'], 0, 140);
Catch the error? Remember that PHP does not natively support UTF-8? This means that functions like substr don’t count characters, despite what the PHP manual says:
“the string returned will contain at most length characters beginning from start.”
Rather, it counts bytes. This works fine for single-byte character encodings, but UTF-8 is a multi-byte encoding, meaning that some characters take more than 1 byte. If the 140th byte of a string falls in the middle of a multi-byte character, you cut that character in half, resulting in one of those lovely question-mark-on-a-black-background characters.
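To make the byte-versus-character difference concrete, here is a minimal sketch. It assumes PHP 7 or later (for the `\u{...}` escape) and that the mbstring extension is loaded; the variable name is only for illustration.

```php
<?php
// "å" is U+00E5, which UTF-8 encodes as the two bytes C3 A5.
$char = "\u{E5}";

echo strlen($char), "\n";             // 2 (bytes)
echo mb_strlen($char, 'UTF-8'), "\n"; // 1 (character)
echo bin2hex($char), "\n";            // c3a5

// Taking "one character" with substr() really takes one byte,
// which leaves a lone lead byte that is not valid UTF-8.
echo bin2hex(substr($char, 0, 1)), "\n"; // c3
```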
Luckily PHP has the multi-byte extension which implements a lot of the standard functions in a multi-byte safe way. This means that fixing our bug is as easy as converting our code to the following:
$title = mb_substr($deal['title'], 0, 140, 'UTF-8');
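The difference between the two calls can be checked with a made-up title whose 140th character is multi-byte. This is only a sketch: the title is hypothetical, and it assumes PHP 7+ with mbstring loaded.

```php
<?php
// Hypothetical title: 139 ASCII letters followed by "å" (U+00E5),
// so "å" occupies bytes 140 and 141 of the string.
$title = str_repeat('a', 139) . "\u{E5}";

$broken = substr($title, 0, 140);             // cuts "å" in half
$fixed  = mb_substr($title, 0, 140, 'UTF-8'); // keeps all 140 characters

var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($fixed, 'UTF-8'));  // bool(true)
var_dump(strlen($fixed));                      // int(141): 140 characters, 141 bytes
```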
To be honest, this is a stupid bug; one really should keep the mb_ functions in mind. But it happens, and I was lucky it showed up early, before it could affect too many visitors.
8 Comments
UTF8 in PHP is … ummm … interesting, isn’t it? You can configure the mbstring extension to actually overload the str* functions (a call to strlen actually calls mb_strlen) but I’ve never liked this as there is no way back to the original.
There are loads of gotchas besides substr that I guess you’ve encountered already: making sure your DB connection is in UTF8, sending UTF8 headers explicitly from PHP, that 3rd argument to htmlentities(), and the slash u modifier on all your regular expressions. /[a-z]/i becomes the weird /^\p{L}+$/u
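Two of those gotchas, the regex modifier and the htmlentities() argument, can be sketched like this. The sample word is made up, the code assumes PHP 7+ and a PCRE build with Unicode support, and the anchored ^...$ form of the ASCII pattern is used only so the two patterns are comparable.

```php
<?php
$name = "Sm\u{F6}rg\u{E5}s"; // "Smörgås", written with escapes to avoid file-encoding issues

// Without /u the subject is matched byte by byte, so the bytes of ö and å
// fall outside [a-z] and the match fails.
var_dump(preg_match('/^[a-z]+$/i', $name)); // int(0)

// With /u the subject is treated as UTF-8 and \p{L} matches any Unicode letter.
var_dump(preg_match('/^\p{L}+$/u', $name)); // int(1)

// The third argument tells htmlentities() which encoding the input uses.
echo htmlentities($name, ENT_QUOTES, 'UTF-8'), "\n"; // Sm&ouml;rg&aring;s
```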
Interesting is certainly one word for it. I didn’t actually know you could overload the normal string functions but I’m usually not a fan of magic in PHP so I’ll just stick to explicitly calling mb_ functions.
Thankfully most of the setup required for UTF-8 is hidden in my framework, even the 3rd argument to htmlentities, so I mostly need to worry about my own code, which, since I’m dealing with a scraper right now, involves those shitty regexes at times. =/
The language designers (not just PHP’s) certainly didn’t have other languages in mind 🙂
My favorite gotcha, or rather a nuisance that still haunts me after years, is the magic of UTF-8 itself, especially the non-BOM flavor. It goes like this: You are working on a new source file, in a quick coding run. Since you are in turbo mode, you don’t necessarily keep an eye on the status bar, which indicates that the file is ASCII, just before you save.
And after a moment (or a day), you come back for another edit, adding a message in your native language. And kaboom! Your message is decorated with lovely question diamonds 🙂 Since UTF-8 without BOM can’t be told apart from ASCII until the file contains actual non-ASCII characters, your file carries on in the legacy single-byte encoding of your language, e.g. ISO-8859-9. And UTF-8 with BOM is still not supported in some popular software.
And to top it off, even if you enable all UTF-8 settings in all of your software to be on by default, sooner or later you’ll need to edit a file that was not made by you, and you’ll forget to look at its encoding. Easy to catch if you are testing your things in a dev/test environment first, but still a nuisance 🙂
My favourite IDE feature is quite literally “Load as UTF-8” which force converts files to UTF-8. It’s so subtle but you end up using it so often without even knowing.
I’d be a frequent user of it too, only if I didn’t have to deal with legacy non-UTF8 code. Anyway, looks like I’m not working with much ASCII code nowadays, hmm, what happens if I give it a try… 🙂
I ran into the PHP UTF-8 thing a few years ago. I ended up writing a String class that forced everything to be UTF-8, and then used that class everywhere instead of bare strings. The problem was that I soon found myself rewriting almost everything. Now I’m back working mostly in Java which, from day 1, used Unicode natively and everywhere. PHP needs to remove every single function and refactor everything into namespaced classes.
I think that’s a bit drastic. Breaking backwards compatibility like that for something which is easily solved with a half decent editor?
Well, a lot of the time you’ll need to be working with characters, yet sometimes you’ll need to be working with bytes.
Sometimes you’ll need to parse a binary file with an alien format or a possibly miscoded text file.
The mb_* functions require you to know the character set and encoding of each one of your strings. This is ideal but not always the case.
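When the encoding is not known up front, one option is to validate before choosing between character and byte functions. The helper below is hypothetical (the name and return strings are made up for illustration) and assumes PHP 7+ with mbstring loaded; it only checks for valid UTF-8 and does not try to guess other encodings.

```php
<?php
// Hypothetical helper: count characters if the input is valid UTF-8,
// otherwise fall back to treating it as raw bytes.
function describe_input(string $data): string
{
    if (mb_check_encoding($data, 'UTF-8')) {
        return 'utf-8 text, ' . mb_strlen($data, 'UTF-8') . ' characters';
    }
    return 'not valid utf-8, ' . strlen($data) . ' bytes';
}

// "blåbär" encoded as UTF-8 (8 bytes, 6 characters).
echo describe_input("bl\u{E5}b\u{E4}r"), "\n"; // utf-8 text, 6 characters

// The same word as ISO-8859-1 bytes, which is not valid UTF-8.
echo describe_input("bl\xE5b\xE4r"), "\n";     // not valid utf-8, 6 bytes
```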