
I specialise in MySQL Server performance as well as in the performance of application stacks using MySQL, especially LAMP. Web sites handling millions of visitors a day, dealing with terabytes of data and hundreds of servers, are the kind of applications I love the most. Peter is a DZone MVB and is not an employee of DZone and has posted 272 posts at DZone. You can read more from them at their website.

CentOS 5 users: your UTF-8 data is in peril with Perl MySQL

02.25.2013

This post comes from the MySQL Performance Blog.

CentOS 5.8 and earlier use the Perl module DBD::mysql v3.0007, which has a bug that causes Perl not to flag UTF-8 data as being UTF-8. Assuming that the MySQL table/column is using UTF-8, and the Perl MySQL connection is also using UTF-8, a correct system returns:

PV = 0x9573840 "\343\203\213 \303\250"\0 [UTF8 "\x{30cb} \x{e8}"]

That’s a Devel::Peek look inside a Perl scalar variable, which clearly shows that Perl has recognized and flagged the data as UTF-8. With DBD::mysql v3.0007, however, an incorrect system returns:

PV = 0x92df9a8 "\343\203\213 \303\250"\0

Notice that it’s the same byte sequence (in octal), but there’s no UTF-8 flag.  As far as Perl is concerned, this is Latin1 data.
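To make the difference concrete, here is a minimal sketch of how the same byte sequence reads under each interpretation. It uses Python only as a convenient stand-in for the byte-level behaviour; it is not Perl or DBD::mysql code, and the variable names are made up for illustration.

```python
# The byte sequence from the Devel::Peek dumps, written with the same octal escapes.
raw = b"\343\203\213 \303\250"

# What a correctly flagged scalar represents: the bytes decoded as UTF-8.
as_utf8 = raw.decode("utf-8")

# What an unflagged scalar is taken to be: one Latin1 character per byte.
as_latin1 = raw.decode("latin-1")

print(len(raw))                         # 6 bytes
print(len(as_utf8))                     # 3 characters
print(len(as_latin1))                   # 6 characters
print([hex(ord(c)) for c in as_utf8])   # ['0x30cb', '0x20', '0xe8']
```

The UTF-8 reading gives the two code points shown in the flagged dump (U+30CB and U+00E8) separated by a space, while the Latin1 reading sees six unrelated single-byte characters.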

What does this mean for you?  In general, it means that Perl could corrupt the data by treating UTF-8 data as Latin1 data.  If the program doesn’t alter the data, then the problem is “overlooked” and compensated for by the fact that MySQL knows that the data is UTF-8 and treats it as such.  We have found, however, that a program can modify the data without corrupting it, but this is risky and really only works by luck, so you shouldn’t rely on it.
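The sketch below illustrates, under stated assumptions, three things a program might do with the misread data. Again Python stands in for the byte-level effect; it does not reproduce Perl's internals, and the helper is_valid_utf8 is a hypothetical name introduced for this example.

```python
raw = b"\343\203\213 \303\250"          # UTF-8 bytes as stored in MySQL
latin1_view = raw.decode("latin-1")     # how the unflagged data is interpreted

def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Case 1: the misread characters are encoded to UTF-8 on the way out.
# Every byte >= 0x80 becomes two bytes, so the data is double-encoded.
double_encoded = latin1_view.encode("utf-8")

# Case 2: the "lucky" path. Appending plain ASCII and writing the
# characters back one byte each leaves the original bytes intact.
lucky = (latin1_view + "!").encode("latin-1")

# Case 3: a character-aware operation such as uppercasing changes
# individual bytes (0xE3 becomes 0xC3) and breaks the UTF-8 sequence.
mangled = latin1_view.upper().encode("latin-1")

print(len(raw), len(double_encoded))    # 6 11
print(lucky == raw + b"!")              # True
print(is_valid_utf8(mangled))           # False
```

Case 2 is the situation where a modification happens to survive; cases 1 and 3 show how easily the same data is damaged once anything treats the bytes as Latin1 characters.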

I’d like to clarify two things. First, DBD::mysql v3.0007 was released in September 2006, but this very old problem still exists today because CentOS 5 is still a popular Linux distro. So this isn’t “breaking news”, and newer versions of Perl and DBD::mysql have handled UTF-8 correctly for nearly the last decade. Second, just a reminder: all Percona Toolkit tools that connect to MySQL have a --charset option and an “A” DSN option for setting the character set.

In conclusion, if you

  1. Run CentOS 5
  2. Have UTF-8 data in MySQL
  3. Use Perl to access that data
  4. Have not upgraded DBD::mysql (perl-DBD-MySQL)

then your UTF-8 data is in peril with Perl.  The solution is simple: upgrade to any newer version of DBD::mysql (except 4.014).



Published at DZone with permission of Peter Zaitsev, author and DZone MVB. (source)
