DBD::mysql – all your UTF-8 bugs are belong to us!!

2016-12-14

After a couple of years of more or less “maintenance mode” on DBD::mysql – we had a handful of people contributing occasional fixes and a whole slew of drive-by contributors – we now have a prolific contributor again: Pali Rohár.

It’s great to see some more long-standing issues taken care of!

This time around, the new development release 4.041_01, which is on CPAN now (https://metacpan.org/release/MICHIELB/DBD-mysql-4.041_01), contains important fixes for some Unicode-related issues that I would like to point out. I have distilled the sections below from the descriptions written by Pali.

Automatically converting to UTF-8 for bind parameters

Before this release, Perl scalars (statements or bind parameters) without the UTF8 status flag were not encoded to UTF-8 even if mysql_enable_utf8 was enabled. As a result, Perl scalars stored internally as Latin1 were sent to the MySQL server as Latin1 bytes, despite the setting.

Now all statements and bind parameters which are not a DBI binary type (SQL_BIT, SQL_BLOB, SQL_BINARY, SQL_VARBINARY or SQL_LONGVARBINARY) are automatically encoded to UTF-8 when mysql_enable_utf8 is enabled.
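To make “encoded to UTF-8” concrete for a scalar that Perl holds internally as Latin1, here is a minimal C sketch. The function latin1_to_utf8 is a hypothetical helper written for illustration; it is not the code used inside DBD::mysql. It only shows the byte-level transformation: every Latin1 byte from 0x80 upwards becomes a two-byte UTF-8 sequence.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of DBD::mysql): encode a Latin1 byte
 * string as UTF-8. The caller must provide dst with room for
 * 2 * len + 1 bytes. Returns the number of bytes written, not counting
 * the trailing NUL.
 */
size_t latin1_to_utf8(const unsigned char *src, size_t len, unsigned char *dst)
{
    size_t out = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];

        if (c < 0x80) {
            /* ASCII is identical in Latin1 and UTF-8. */
            dst[out++] = c;
        } else {
            /* U+0080..U+00FF become a lead byte plus a continuation byte. */
            dst[out++] = (unsigned char)(0xC0 | (c >> 6));
            dst[out++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    dst[out] = '\0';
    return out;
}
```

For example, the Latin1 string "café" (the é is the single byte 0xE9) comes out as five bytes, ending in 0xC3 0xA9. Sending the original four bytes to a connection that expects UTF-8 is exactly the kind of mismatch this release addresses.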

If mysql_enable_utf8 is not enabled and your statement or bind parameter contains a wide Unicode character, DBD::mysql emits a warning. It also warns if a binary parameter contains a wide Unicode character, much like print does when used without a :utf8 PerlIO layer (“Wide character in…”).
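A “wide” character here means a code point above U+00FF, one that cannot be represented as a single Latin1 byte. As a rough illustration, assuming the input is valid UTF-8, such characters can be spotted by their lead byte alone. The helper name utf8_has_wide_char is made up for this sketch and does not exist in DBD::mysql or Perl.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of DBD::mysql or Perl). Assumes buf holds
 * valid UTF-8. Code points U+0000..U+00FF use lead bytes no higher than
 * 0xC3, and continuation bytes are always 0x80..0xBF, so any byte >= 0xC4
 * starts a character above U+00FF.
 */
bool utf8_has_wide_char(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] >= 0xC4)
            return true;
    }
    return false;
}
```

With this check, "café" encoded as UTF-8 is not wide (it still fits in Latin1), while the euro sign U+20AC (bytes 0xE2 0x82 0xAC) is, and it is that second kind of data that triggers the warning.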

Perl’s SvPV() returns a char* for a Perl scalar, and a subsequent SvUTF8() call on that scalar tells you whether SvPV() returned the data in UTF-8 (true) or Latin1 (false).

Decoding of UTF-8 fields when mysql_enable_utf8 is enabled

For each fetched field, the MySQL server tells us its charset id. Before this release, when mysql_enable_utf8 was enabled, DBD::mysql UTF-8 decoded all fields with a charset id other than 63 (which means binary).

Now DBD::mysql UTF-8 decodes only those fields whose charset is utf8 or utf8mb4. By default the MySQL server sends data in the encoding specified by the SET NAMES command, which defaults to Latin1. So any received Latin1 data is no longer UTF-8 decoded.

The MySQL server sends a charset id, not a charset name. Each pair of charset name and collation has its own charset id. A new function, charsetnr_is_utf8(), hardcodes all utf8 and utf8mb4 charset ids taken from the source code of MySQL (up to 8.0.0) and MariaDB (up to 10.2.2). So far it looks like those ids have not changed since MySQL 5.0; only new ones have been added.
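The difference between the old and new decoding rule can be sketched in a few lines of C. This is an illustration only: the id list below is a small subset chosen for the example (the real charsetnr_is_utf8() covers many more collations), and the names charsetnr_is_utf8_sketch, old_should_decode and new_should_decode are invented here.

```c
#include <stdbool.h>
#include <stddef.h>

#define CHARSETNR_BINARY 63  /* charset id the server reports for binary data */

/* Illustrative subset of utf8/utf8mb4 collation ids, NOT the full list:
 * 33 utf8_general_ci, 83 utf8_bin, 192 utf8_unicode_ci,
 * 45 utf8mb4_general_ci, 46 utf8mb4_bin, 224 utf8mb4_unicode_ci. */
static const unsigned int utf8_ids[] = { 33, 45, 46, 83, 192, 224 };

/* Hypothetical stand-in for the driver's charsetnr_is_utf8(). */
bool charsetnr_is_utf8_sketch(unsigned int id)
{
    for (size_t i = 0; i < sizeof utf8_ids / sizeof utf8_ids[0]; i++) {
        if (utf8_ids[i] == id)
            return true;
    }
    return false;
}

/* Old rule: decode everything that is not binary. */
bool old_should_decode(bool enable_utf8, unsigned int charsetnr)
{
    return enable_utf8 && charsetnr != CHARSETNR_BINARY;
}

/* New rule: decode only fields the server marks as utf8 or utf8mb4. */
bool new_should_decode(bool enable_utf8, unsigned int charsetnr)
{
    return enable_utf8 && charsetnr_is_utf8_sketch(charsetnr);
}
```

Take a field reported with id 8 (latin1_swedish_ci): the old rule would have tried to UTF-8 decode it, the new rule leaves it alone. Fields with id 33 are decoded under both rules, and binary fields (63) under neither.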

Conclusion

We hope these changes make DBD::mysql a lot more consistent for you. Since the changes are rather big, we’d urge you to test the development release 4.041_01 which is on CPAN and give feedback NOW; this allows us to make changes if needed before we create an actual stable release with these features.

And of course, if you test it with your software and all is good, we’d like to hear that as well!

You can leave your feedback via the DBI-users mailing list, or using our GitHub page.

michiel