Karl Williamson wrote 2011-06-08 14:52 (-0600):
> Hi Juerd,
>
> In case you aren't following Perl 5 development these days, 5.14.0
> was released, and 5.14.1 is soon to be. Some of your advice I've
> seen on your website, for example, is now not really applicable,
> since 5.14 has fixed most of the issues with Unicode that you have
> brought up.
>
> 5.14 is the first Perl release where I don't think we have glaring
> outages with respect to the standard. We have a Google-funded
> student working on making the lexer UTF-8 safe for 5.16. I'm doing
> some more work to mostly allow easier security, such as to be able
> to restrict the characters matched in regexes to your chosen
> scripts; hopefully script runs, so that \w+ would only match
> characters from the same script as the first one; etc.
>
> Karl Williamson
Perl Unicode Advice
You may have read my tutorial "perlunitut" that's distributed with Perl. If not, begin there :)
Well, it will be distributed with the next stable versions of Perl, that is: 5.8.9 and 5.10.0. Until then, use these links:
Here's a short summary of my advice, including some stuff that isn't in perlunitut.
- Upgrade your perl
- Upgrade your modules
- Decode all incoming data
- Encode all outgoing data
- Keep text strings and byte strings/semantics fully separated
- Realise that perl can't tell you if a string is a text string
- Realise that text strings are unicode strings, not utf8 strings (!!)
- Pretend that the UTF8 flag doesn't exist
- Don't set the UTF8 flag manually (Encode::_utf8_on)
- Don't remove the UTF8 flag manually (Encode::_utf8_off)
- Don't query the UTF8 flag (Encode::is_utf8, utf8::is_utf8)
- Realise that while strings with UTF8 flag are all text strings, the reverse is not true
- If you *have to* downgrade a string (utf8::downgrade), usually something's wrong earlier on
- Do "use utf8" only to indicate that your source code is UTF-8 encoded.
- Don't "use bytes"
- Don't use functions from the bytes:: namespace
- Don't use the \C escape in regexes
- Don't "use encoding"
- Construct byte strings with pack "C*"
- Don't unpack a text string with a non-"U" template
- Don't unpack a byte string with a "U" template
- Don't use non-"U" (un)pack templates together with "U" templates
- Don't use "U0" or "C0" templates for pack
- utf8::upgrade before doing lc/lcfirst/uc
- utf8::upgrade before doing case insensitive matching
- utf8::upgrade before matching predefined character classes like \w and \s
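As a minimal sketch of the "decode all incoming data, encode all outgoing data" rules above: the sample bytes and variable names are made up for illustration, and UTF-8 is assumed as the encoding on both sides.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Incoming data is a byte string: here, the UTF-8 encoding of "België".
my $incoming_bytes = "Belgi\xC3\xAB";

# Decode once, at the boundary, to get a text string.
my $text = decode("UTF-8", $incoming_bytes);

# Text operations now count characters, not bytes.
my $char_count = length $text;            # 6 characters
my $byte_count = length $incoming_bytes;  # 7 bytes

# Work with the text string as text.
my $shout = uc $text;

# Encode once, at the boundary, when the data leaves the program.
my $outgoing_bytes = encode("UTF-8", $shout);
print $outgoing_bytes, "\n";
```

Everything between the decode and the encode only ever sees the text string, which is what keeps text strings and byte strings separated.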
These are rules of thumb. There may be occasions in which you find that manually messing around with the internals (like the UTF8 flag) is the only solution to your problem, because you work with modules that aren't Unicode-aware, or that are aware of Perl's Unicode support, but misuse it.
The rest of this page is just a collection of notes that may be useful to the reader, and especially to people googling for specific modules. I've collected experiences with modules to have a central place to look things up.
MIME headers should be ASCII text. For using non-ASCII characters, there is RFC 2047 encoding. A naïve but simple explanation is that words are encoded in some character encoding, and then in Quoted Printable or Base 64.
The encoded-word is defined as
"=?" charset "?" encoding "?" encoded-text "?="
Where charset is the full name of the character encoding, and encoding either a Q or a B.
Perl's Encode module can treat RFC 2047 encoding as a single encoding. It is called "MIME-Header". Use this, instead of doing it manually. It will make your life easier.
It is implemented in Encode::MIME::Header. When encoding, it always uses Base64 UTF-8.
    use Encode qw(encode decode);

    my $foo = decode(
        "MIME-Header",
        "Stappen in =?iso-8859-1?Q?Belgi=eb?="
    );
    # $foo is now the unicode string "Stappen in België"

    my $bar = encode("MIME-Header", $foo);
    # $bar is now "=?UTF-8?B?U3RhcHBlbiBpbiBCZWxnacOr?="
By default, XML::Parser will always return UTF-8 strings, converting from another encoding when necessary. But it doesn't set the UTF8 flag, so Perl thinks the bytes are ISO-8859-1 characters, and will eventually re-encode. To avoid ending up with double encoding, you have to decode XML::Parser's strings, or be bold and just set the UTF8 flag yourself. Note that neither is future proof, so you will want to abstract it properly in order to remove this hack later.
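To illustrate how the double encoding comes about, here is a small sketch that uses only Encode. The variable $from_parser is a stand-in for a string handed back by the parser (UTF-8 bytes without the UTF8 flag); it does not call XML::Parser itself.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-in for a string from the parser: the UTF-8 bytes for "België",
# without the UTF8 flag (an assumption made for this illustration).
my $from_parser = "Belgi\xC3\xAB";

# Wrong: encoding the undecoded bytes treats each byte as an ISO-8859-1
# character, so the two bytes of "ë" become four.
my $double_encoded = encode("UTF-8", $from_parser);

# Right: decode first, so that Perl knows these are six characters.
my $text         = decode("UTF-8", $from_parser);
my $encoded_once = encode("UTF-8", $text);

printf "double: %d bytes, once: %d bytes\n",
    length $double_encoded, length $encoded_once;
```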
Unicode specific modules
Although the name suggests that this downgrades strings, it does not. Downgrading is not always possible, but when it is, it changes the internal representation without changing the actual value. This module instead copies the internal string buffer to your scalar, which is just a rather inefficient way of doing _utf8_off. The result is a variable which is either latin1 or utf8 encoded, without the necessary metadata. Don't use this module.
When you want to actually downgrade, use utf8::downgrade. Note that downgrading is only possible for strings with no character values greater than 255. Downgrading should never be necessary, but can in some cases increase performance.
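A short sketch of what utf8::downgrade does and when it fails. The second argument asks it to return false instead of dying; the sample strings are made up.

```perl
use strict;
use warnings;

# "café": every character is below 256, so downgrading is possible.
my $latin = "caf\x{E9}";
utf8::upgrade($latin);                  # force the upgraded representation
my $before = $latin;                    # copy of the value, for comparison
my $ok = utf8::downgrade($latin, 1);    # true argument: fail quietly

# A string containing U+263A cannot be downgraded.
my $wide    = "smile \x{263A}";
my $wide_ok = utf8::downgrade($wide, 1);

print "latin downgraded: ", ($ok      ? "yes" : "no"), "\n";
print "wide downgraded:  ", ($wide_ok ? "yes" : "no"), "\n";
```

In both cases the value of the string, as seen by eq and length, stays the same; only the internal representation may change.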
This module warns when implicit upgrading happens. Do not use this in production code, because implicit upgrades are an important part of Perl's variable model. It happens to numbers and to strings. This module is a nice way to learn about when upgrades happen. If you find yourself downgrading for performance reasons, you will benefit even more if the upgrade never happened in the first place, and this module can tell you when it did. Also, this module lets you find out when you accidentally used your byte string with text semantics.
Be aware that encoding::warnings can cause a lot of output.
Handy module, especially the function is_sane_utf8, which has a misleading name (the utf8 here refers to the internals, not the conceptual string) but is very useful: it returns false when your string is UTF-8 encoded (like the result of encode_utf8), which may mean you should decode_utf8 it. If you did decode it, you might be dealing with double-encoded data.
Intentionally not discussed. Quoting its documentation: This module is intended to provide good Unicode support to versions of Perl prior to 5.8. If you are using Perl 5.8.0 or later, you probably want to be using the Encode module instead. This module does work with Perl 5.8, but Encode is the preferred method in that environment.
Much like Test::utf8::is_sane_utf8, but reversed, and stops after the first hit. Indicates that at least one character sequence is valid non-ASCII UTF-8. Might be useful for quick detection, but can easily report false positives.
Intentionally not discussed. Quoting its documentation: Provides UTF-8 conversion for perl versions from 5.00 and up. It was mainly written for use with perl 5.00 to 5.6.0 because those perl versions do not support Unicode::MapUTF8 or Encode.
Use this module whenever you do case insensitive matching, or character classes, and need stable unicode semantics. It works around a bug in Perl.
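The underlying behaviour can be seen without any module. The sketch below uses utf8::upgrade, as in the rules of thumb, rather than the module's own interface, and it assumes a scope without the unicode_strings feature; newer Perls with that feature enabled treat both copies alike.

```perl
use strict;
use warnings;

# U+00C9 (É), stored the way Perl stores it when no character is above 255.
my $native = "\xC9";

# The same character in the upgraded representation.
my $upgraded = $native;
utf8::upgrade($upgraded);

# The two are equal as strings...
my $same_value = $native eq $upgraded ? 1 : 0;

# ...but without the unicode_strings feature, \w and lc may only
# apply Unicode semantics to the upgraded copy.
my $native_is_word   = $native   =~ /\w/ ? 1 : 0;
my $upgraded_is_word = $upgraded =~ /\w/ ? 1 : 0;
my $lc_upgraded      = lc $upgraded;    # "\xE9" (é)

print "same value: $same_value\n";
print "native matches \\w: $native_is_word\n";
print "upgraded matches \\w: $upgraded_is_word\n";
```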
Very useful if you have Unicode data and must display it as ASCII only. Can also be a handy way to detect Unicode phishing attempts.
Unicode::String, Unicode::Map, Unicode::CharName, Unicode::Map8, Unicode::Lite
Intentionally not discussed. Intended for old Perl versions.
I'm not sure what this does that Encode doesn't do.
Perfectly Unicode aware. Doesn't restore the UTF-8 flag if the data can fit in ISO-8859-1, but that is allowed and perfectly normal. If you properly encode all your output, as you should, this is no problem.
Perfectly Unicode aware, and even stores and recovers the UTF-8 flag.
These modules are meant for binary data ONLY. Using them on text strings makes no sense whatsoever. If you really want it, you should probably explicitly encode your data first, and then pass it to these modules' functions.
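A sketch of "encode first, then pass the bytes", using Digest::MD5 and MIME::Base64 as examples of binary-only modules. That choice is an assumption for illustration, as is UTF-8 for the encoding.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use Digest::MD5 qw(md5_hex);
use MIME::Base64 qw(encode_base64 decode_base64);

my $text = "Belgi\x{EB} \x{263A}";      # a text string

# Pick an encoding explicitly; the digest and the Base64 output
# both depend on that choice.
my $bytes = encode("UTF-8", $text);

my $digest = md5_hex($bytes);
my $base64 = encode_base64($bytes, "");  # "" suppresses the line ending

# The reverse: Base64 decoding gives bytes, which still need decoding.
my $round_trip = decode("UTF-8", decode_base64($base64));

print "$digest\n$base64\n";
```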
The de facto default for URIs is to encode everything to UTF-8, and send it as individual %-encoded bytes. The Perl URI module is not aware of this, and does not decode the strings for you, so you may have to do this yourself. Keep in mind, though, that this data is never guaranteed to be valid UTF-8.
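Doing this yourself could look roughly like the following. The helper name uri_component_to_text is hypothetical, the %-decoding is done with a plain substitution instead of the URI module, and "+" is deliberately left alone.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn a %-encoded URI component into a text string,
# or return undef if the resulting bytes are not valid UTF-8.
sub uri_component_to_text {
    my ($escaped) = @_;

    # Undo the %-encoding; the result is a byte string.
    (my $bytes = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;

    # FB_CROAK makes decode die on malformed input; eval catches that.
    my $text = eval { decode("UTF-8", $bytes, Encode::FB_CROAK()) };
    return $text;
}

my $good = uri_component_to_text("Belgi%C3%AB");
my $bad  = uri_component_to_text("%FF");

print defined $good ? "good: decoded\n" : "good: invalid\n";
print defined $bad  ? "bad: decoded\n"  : "bad: invalid\n";
```

Returning undef for invalid input is just one possible choice; what to do with such data depends on your application.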
LWP is very Unicode aware in recent versions, but note that HTTP headers are binary strings, with no way of indicating the encoding. To get the message body as a Perl text string, use $mess->decoded_content.
Note that XML documents are NOT text documents, but character encoding information may be in Content-Type header's charset attribute. They are binary documents and should be treated as such: decoded_content cannot be used.
Decoding XML yourself is potentially dangerous: possibly the <?xml?> declaration no longer agrees with the actual content, because after decoding, it's a Perl text string, not utf-8 or iso-8859-1 (or whichever encoding). If your XML documents are UTF-8, this bug is unnoticeable because things will appear to be in order. After all, Perl's internal format for text strings is UTF-8 too (but you must never access the bytes individually!).
The CGI module is not Perl Unicode aware at all. It does let you set the charset, but you will still have to decode and encode everything yourself. It is not smart about the Content-Type of received POST data.
Mark Stosberg sent the following feedback in December 2010:
It does have a utf8 pragma, documented like this: "This makes CGI.pm treat all parameters as UTF-8 strings." It also has UTF-8 handling logic in CGI::Util::escape(). I don't profess to say that either of these bits are ideal, but CGI.pm does have some awareness of unicode, even if it may not have correctness yet. :)
The responsibility for Unicode-awareness is in the DBDs (database drivers), and DBI itself doesn't really care.
Ideally, one can specify in the *database* which encodings are to be used for certain columns. Unfortunately, database engines are hardly encoding aware to that extent.
From development version 3.0007_1, has the mysql_enable_utf8 attribute, which will make the module assume that ALL text columns are UTF-8 encoded. This feature is disabled by default.
Appears to not be aware of Perl Unicode, or character encodings in general.
Dominic Mitchell wrote 2008-03-29 7:39 (+0000):
> I just noticed your Perl Unicode Tips page. One thing on it is wrong
> -- DBD::Pg does support UTF-8. I know because I added it. :-) In
> fact, it's pretty similar to the MySQL support, which was based upon it.
> http://search.cpan.org/~turnstep/DBD-Pg-2.5.0/Pg.pm#pg_enable_utf8

Thanks! I'll add your message verbatim because currently I have no time to check how well it was implemented.

Juerd
Has the unicode attribute, which will make the module assume that ALL text columns are UTF-8 encoded. Has the same treatment for BLOB columns, which is wrong, as they are Binary Large OBjects. See the documentation for a workaround.
Unfortunately, doesn't upgrade strings going into the database. But when they come out, the UTF-8 flag is naively switched on, et voilà: malformed string. So you still need to encode_utf8 or utf8::upgrade for INSERT and UPDATE queries.
Oracle is more character set aware than the other databases, but in a very complicated way. DBD::Oracle plays well with this, but prepare to spend some time reading a lot of documentation, and debating with your DBA.
DB_File, GDBM_File, SDBM_File, ODBM_File, dbm*
Not encoding aware at all. You must decode and encode everything yourself.
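A minimal sketch of doing the encoding and decoding yourself, using SDBM_File in a temporary directory. The file name, key and value are made up, and UTF-8 is an assumed choice of encoding.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Fcntl qw(O_RDWR O_CREAT);
use File::Temp qw(tempdir);
use SDBM_File;

my $dir = tempdir(CLEANUP => 1);
tie my %db, 'SDBM_File', "$dir/example", O_RDWR | O_CREAT, 0666
    or die "Cannot tie SDBM file: $!";

my $key   = "land";
my $value = "Belgi\x{EB} \x{263A}";     # a text string

# Encode on the way in: the DBM file only ever sees bytes.
$db{ encode("UTF-8", $key) } = encode("UTF-8", $value);

# Decode on the way out.
my $stored = decode("UTF-8", $db{ encode("UTF-8", $key) });

untie %db;

print $stored eq $value ? "round trip ok\n" : "round trip failed\n";
```

Keys need the same treatment as values, because they are stored as bytes too.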
This distribution includes HTML::Entities. Recent versions are Unicode aware. If you have used this module, please let me know how well it worked out for you.
Wonderfully Unicode aware, even does RFC 2047 decoding and encoding for text in headers.
Lets you set charsets in Content-Type headers, but doesn't encode for you, so you still have to encode things yourself.