Perl Unicode Tutorial

Encodings are easy!

Data exchange

* Sender and receiver must use the same format
(Sometimes you can guess)
* Avoid guessing at any cost!

Text encoding

* Characters aren't bytes
* You need to convert text from/to bytes
* That's called "decoding"/"encoding"
* A very simple encoding: ASCII

8 bit encodings

* In an 8 bit encoding, one character maps to one byte
* That means you can have at most 256 different values
* Enough for the Latin characters and Cyrillic
* Enough for the Latin characters and Greek
* Not enough for Latin and Cyrillic and Greek

Multi-byte encodings

* More bytes => more characters
* Fixed width, variable width
* Unicode encodings are all multi-byte
* UTF-8 is very popular on the internet
* UTF-16 is the internal encoding in MS Windows

"Character set"

* "character set" is character <-> number
* Unicode is a charset
* "encoding" is number <-> bytes
* UTF-8 is an encoding
* MIME calls them both "charset"
* Perl calls them both "encoding"

Two kinds of strings

* Perl has one string type
* The universe has several
* "text string" and "binary string"
* a.k.a. "character string" and "byte string"
* The computer doesn't know
* You should know

Unicode in Perl

* Text strings are unicode strings, not UTF-8
* ISO-8859-1 maps to 0..255, useful!
* Perl keeps strings at ISO-8859-1 as long as possible.
* If that doesn't work, it upgrades to UTF-8 internally.
* If you mix the two kinds, UTF-8 wins.

Prime rule

Do not mix byte strings with text strings
except if you explicitly convert between them
* decoding: bytes -> characters (binary to text)
* encoding: characters -> bytes (text to binary)

Remember the first slide?

* All communication with "the outside world" is in bytes
* Something has to decode their binary input to text
* Something has to encode your text output to binary

Neat trick

Perl lets you use code points (character numbers) that do not yet officially exist: chr(999999).

In practice part I

use Encode;
my $text = decode("ASCII", $binary_input);
my $output = encode("KOI-8R", $text);
Did I tell you this is not a Unicode tutorial?
It's an encodings tutorial :)

In practice part I

use Encode; my $text = decode("UTF-8", $binary_input); my $output = encode("UTF-8", $text); Did I tell you this is not a Unicode tutorial? It's an encodings tutorial :)

In practice part II

* Let Perl do the hard work!
binmode STDIN, ":encoding(ISO-8859-1)";
binmode STDOUT, ":encoding(UTF-8)";
print while <>;

Unicode semantics

* Perl has unicode semantics
* lc, uc, lcfirst, ucfirst
* Case insensitivity
* Character classes like \w and \d

ASCII semantics

* Perl also has ASCII semantics :(
* Hard to tell which semantics will be used for some operation
* utf8::upgrade($your_string) to ensure Unicode semantics
* If you want [A-Za-z0-9_], don't use \w
just use [A-Za-z0-9_] (say what you mean, not what happens to work)

Further info

* In Perl 5.9.5: perlunitut
* In Perl 5.9.5: perlunifaq
* http://juerd.nl/perluniadvice
** XML::Parser, Data::Dumper, Storable, Digest::MD5, Digest::Base64, URI, LWP, CGI, DBI, DBD::mysql, DBD::Pg, DBD::SQLite, DBD::Oracle, *DB*_File, HTML::Parser, Mail::Box, MIME::Lite, additions welcome!

One last thing

* DON'T USE encoding.pm
* It is broken and cannot be fixed
* Using it will hurt
* I warned you!

Thank you!

Any questions?
~
~
"Perl Unicode Tutorial" — Juerd Waalboer <#####@juerd.nl>
YAPC::Europe 2007 (Vienna)