Perl Unicode Tutorial
Encodings are easy!
Data exchange
* Sender and receiver must use the same format
(Sometimes you can guess)
* Avoid guessing at any cost!
Text encoding
* Characters aren't bytes
* You need to convert text from/to bytes
* That's called "decoding"/"encoding"
* A very simple encoding: ASCII
8 bit encodings
* In an 8 bit encoding, one character maps to one byte
* That means you can have at most 256 different values
* Enough for the Latin characters and Cyrillic
* Enough for the Latin characters and Greek
* Not enough for Latin and Cyrillic and Greek
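A small sketch of why the sender and receiver must agree: the same single byte means a different character in each 8 bit encoding. The byte value 0xE4 is just an arbitrary illustration, and the three ISO-8859 variants are picked as examples.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One and the same byte, interpreted under three different 8 bit encodings.
my $byte = "\xE4";

my $latin    = decode("ISO-8859-1", $byte);  # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
my $cyrillic = decode("ISO-8859-5", $byte);  # U+0444 CYRILLIC SMALL LETTER EF
my $greek    = decode("ISO-8859-7", $byte);  # U+03B4 GREEK SMALL LETTER DELTA

# Print code points only, so the output itself stays plain ASCII.
printf "U+%04X U+%04X U+%04X\n", ord $latin, ord $cyrillic, ord $greek;
```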
Multi-byte encodings
* More bytes => more characters
* Fixed width, variable width
* Unicode encodings are all multi-byte
* UTF-8 is very popular on the internet
* UTF-16 is the internal encoding in MS Windows
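To make "fixed width, variable width" concrete, this sketch counts the bytes that three Unicode encodings need for three sample characters. The characters are arbitrary picks for illustration; the big-endian variants are used so that no byte order mark is added to the count.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Three sample characters: ASCII, Basic Multilingual Plane, beyond the BMP.
my @samples = ("A", "\x{20AC}", "\x{1F600}");

my %width;
for my $enc ("UTF-8", "UTF-16BE", "UTF-32BE") {
    # Encoded length in bytes of each sample character.
    $width{$enc} = [ map { length(encode($enc, $_)) } @samples ];
    printf "%-8s %d %d %d\n", $enc, @{ $width{$enc} };
}
```

UTF-8 and UTF-16 turn out to be variable width (1 to 4 bytes, and 2 or 4 bytes), while UTF-32 always uses 4 bytes.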
"Character set"
* "character set" is character <-> number
* Unicode is a charset
* "encoding" is number <-> bytes
* UTF-8 is an encoding
* MIME calls them both "charset"
* Perl calls them both "encoding"
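The two mappings can be seen separately in Perl: ord and chr go between character and number, Encode goes between number and bytes. A minimal sketch, using U+00E9 as an arbitrary example character:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\xE9";                 # LATIN SMALL LETTER E WITH ACUTE

# Character set: character <-> number
my $number = ord $char;            # 233
my $again  = chr $number;          # the same character

# Encoding: number <-> bytes (shown as hex digits)
my $utf8   = uc unpack "H*", encode("UTF-8",      $char);
my $latin1 = uc unpack "H*", encode("ISO-8859-1", $char);
my $utf16  = uc unpack "H*", encode("UTF-16BE",   $char);

print "number=$number utf8=$utf8 latin1=$latin1 utf16=$utf16\n";
```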
Two kinds of strings
* Perl has one string type
* The universe has several
* "text string" and "binary string"
* a.k.a. "character string" and "byte string"
* The computer doesn't know
* You should know
Unicode in Perl
* Text strings are Unicode strings, not UTF-8
* ISO-8859-1 maps to 0..255, useful!
* Perl keeps strings at ISO-8859-1 as long as possible.
* If that doesn't work, it upgrades to UTF-8 internally.
* If you mix the two kinds, UTF-8 wins.
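The internal switch can be observed with utf8::is_utf8, as in the sketch below. This only peeks at an implementation detail for illustration; it is not something to base program logic on, and the sample strings are made up.

```perl
use strict;
use warnings;

my $narrow = "caf\xE9";      # every character fits in 0..255
my $wide   = "\x{263A}";     # WHITE SMILING FACE, does not fit in one byte

# The internal flag: false = one byte per character, true = UTF-8 inside.
my $narrow_flag = utf8::is_utf8($narrow) ? 1 : 0;
my $wide_flag   = utf8::is_utf8($wide)   ? 1 : 0;

# Mixing the two: the result is upgraded, but nothing changes for the user.
my $mixed      = $narrow . $wide;
my $mixed_flag = utf8::is_utf8($mixed) ? 1 : 0;
my $mixed_len  = length $mixed;          # counted in characters

print "narrow=$narrow_flag wide=$wide_flag mixed=$mixed_flag length=$mixed_len\n";
```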
Prime rule
Do not mix byte strings with text strings
except if you explicitly convert between them
* decoding: bytes -> characters (binary to text)
* encoding: characters -> bytes (text to binary)
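What goes wrong when the rule is broken can be sketched as follows. The variable names and sample data are hypothetical; the byte string stands for UTF-8 input that arrived from outside and was never decoded.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "caf\xC3\xA9";   # byte string: "cafe" with e-acute, as UTF-8 bytes
my $text  = "\x{263A} ";     # text string: a smiley and a space

# Wrong: mixing. Perl treats each byte as one character, so the two bytes
# of the e-acute become two separate characters (mojibake).
my $wrong = $text . $bytes;

# Right: decode first, then combine text with text.
my $right = $text . decode("UTF-8", $bytes);

my $wrong_len = length $wrong;                      # 7 characters
my $right_len = length $right;                      # 6 characters
my $wrong_out = length encode("UTF-8", $wrong);     # 11 bytes on output
my $right_out = length encode("UTF-8", $right);     # 9 bytes on output

print "wrong: $wrong_len chars, $wrong_out bytes\n";
print "right: $right_len chars, $right_out bytes\n";
```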
Remember the first slide?
* All communication with "the outside world" is in bytes
* Something has to decode their binary input to text
* Something has to encode your text output to binary
Neat trick
Perl lets you use code points (character numbers) that do not yet officially exist: chr(999999).
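A quick check of the trick, as a sketch. Note that 999999 is U+F423F, which falls in a private-use plane: no official character is assigned there, yet Perl treats it as a perfectly ordinary single character.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = chr(999999);

my $code_point = ord $char;                      # round-trips to 999999
my $hex        = sprintf "U+%X", $code_point;    # U+F423F
my $chars      = length $char;                   # one character
my $bytes      = length encode("UTF-8", $char);  # four bytes in UTF-8

print "$hex: $chars character, $bytes bytes as UTF-8\n";
```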
In practice part I
use Encode;
my $text = decode("ASCII", $binary_input);
my $output = encode("KOI8-R", $text);
Did I tell you this is not a Unicode tutorial?
It's an encodings tutorial :)
In practice part I
use Encode;
my $text = decode("UTF-8", $binary_input);
my $output = encode("UTF-8", $text);
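A self-contained version of the decode/encode round trip might look like the sketch below. The input data is made up, and the strict check with FB_CROAK is an addition for illustration: it shows one way to find out that the assumed encoding is wrong instead of silently guessing.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK);

# Pretend this arrived from the outside world, documented as ISO-8859-1.
my $binary_input = "na\xEFve";

my $text   = decode("ISO-8859-1", $binary_input);   # bytes -> characters
my $output = encode("UTF-8", $text);                # characters -> bytes

# The same input is not valid UTF-8. With FB_CROAK, decode dies instead of
# substituting replacement characters. A copy is passed in because a CHECK
# argument lets decode modify its input.
my $copy = $binary_input;
my $is_utf8 = eval { decode("UTF-8", $copy, FB_CROAK); 1 } ? 1 : 0;

printf "%d characters in, %d bytes out, valid as UTF-8: %d\n",
    length $text, length $output, $is_utf8;
```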
In practice part II
* Let Perl do the hard work!
binmode STDIN, ":encoding(ISO-8859-1)";
binmode STDOUT, ":encoding(UTF-8)";
print while <STDIN>;
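The same layers work on any filehandle, via the mode argument of open. This sketch uses in-memory filehandles (a reference to a scalar in place of a file name) only so that it runs without touching the disk; with real files you would pass a path instead.

```perl
use strict;
use warnings;

my $latin1_bytes = "caf\xE9\n";   # pretend file contents, ISO-8859-1
my $utf8_bytes   = "";            # will receive the UTF-8 output

# Decode on the way in, encode on the way out.
open my $in,  "<:encoding(ISO-8859-1)", \$latin1_bytes or die "open in: $!";
open my $out, ">:encoding(UTF-8)",      \$utf8_bytes   or die "open out: $!";

# In between, the program only ever sees text strings.
print {$out} $_ while <$in>;

close $out or die "close out: $!";
close $in;

printf "%d bytes in, %d bytes out\n", length $latin1_bytes, length $utf8_bytes;
```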
Unicode semantics
* Perl has Unicode semantics
* lc, uc, lcfirst, ucfirst
* Case insensitivity
* Character classes like \w and \d
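A few of these operations on non-Latin text, as a sketch. The sample characters are arbitrary, and results are stored as 0/1 so that nothing non-ASCII is printed.

```perl
use strict;
use warnings;

my $greek = "\x{3B1}\x{3B2}\x{3B3}";    # alpha beta gamma, lowercase

# uc knows about Greek capitals
my $upper_ok = uc($greek) eq "\x{391}\x{392}\x{393}" ? 1 : 0;

# \w matches Greek letters
my $word_ok = $greek =~ /^\w+$/ ? 1 : 0;

# \d matches ARABIC-INDIC DIGIT THREE
my $digit_ok = "\x{663}" =~ /^\d$/ ? 1 : 0;

# /i matches capital omega against lowercase omega
my $case_ok = "\x{3A9}" =~ /^\x{3C9}$/i ? 1 : 0;

print "uc=$upper_ok \\w=$word_ok \\d=$digit_ok /i=$case_ok\n";
```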
ASCII semantics
* Perl also has ASCII semantics :(
* Hard to tell which semantics will be used for some operation
* utf8::upgrade($your_string) to ensure Unicode semantics
* If you want [A-Za-z0-9_], don't use \w
just use [A-Za-z0-9_] (say what you mean, not what happens to work)
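The difference between the two semantics shows up with characters in the 128..255 range. The sketch below assumes a plain script: no "use locale", no "use feature 'unicode_strings'" (which newer "use v5.12"-style declarations switch on), and no /a or /u regex flags. Any of those would change the results.

```perl
use strict;
use warnings;

my $e_acute = "\xE9";    # LATIN SMALL LETTER E WITH ACUTE, stored as one byte

# Before upgrading: ASCII semantics, so this is neither a word character
# nor something uc knows how to capitalize.
my $word_before = $e_acute =~ /^\w$/ ? 1 : 0;
my $uc_before   = uc($e_acute) eq "\xC9" ? 1 : 0;

# Same character, different internal representation.
utf8::upgrade($e_acute);

# After upgrading: Unicode semantics.
my $word_after = $e_acute =~ /^\w$/ ? 1 : 0;
my $uc_after   = uc($e_acute) eq "\xC9" ? 1 : 0;

# The explicit class means the same thing under either semantics.
my $explicit = $e_acute =~ /^[A-Za-z0-9_]$/ ? 1 : 0;

print "\\w: $word_before -> $word_after, uc: $uc_before -> $uc_after, explicit: $explicit\n";
```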
Further info
* In Perl 5.9.5: perlunitut
* In Perl 5.9.5: perlunifaq
* http://juerd.nl/perluniadvice
** XML::Parser, Data::Dumper, Storable, Digest::MD5, Digest::Base64, URI, LWP, CGI, DBI, DBD::mysql, DBD::Pg, DBD::SQLite, DBD::Oracle, *DB*_File, HTML::Parser, Mail::Box, MIME::Lite, additions welcome!
One last thing
* DON'T USE encoding.pm
* It is broken and cannot be fixed
* Using it will hurt
* I warned you!
Thank you!
Any questions?
"Perl Unicode Tutorial" — Juerd Waalboer <#####@juerd.nl>
YAPC::Europe 2007 (Vienna)