Perl Unicode Tutorial

Encodings are easy!

Data exchange

* Sender and receiver must use the same format

(Sometimes you can guess)

* Avoid guessing at all costs!

Text encoding

* Characters aren't bytes

* You need to convert text from/to bytes

* That's called "decoding"/"encoding"

* A very simple encoding: ASCII

8 bit encodings

* In an 8 bit encoding, one character maps to one byte

* That means you can have at most 256 different values

* Enough for the Latin characters and Cyrillic

* Enough for the Latin characters and Greek

* Not enough for Latin and Cyrillic and Greek
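A minimal sketch of why one byte is not enough to identify a character: the same byte decodes to a different character in each 8 bit encoding. The byte value 0xE4 and the three ISO-8859 encodings are picked for illustration only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One byte, three 8 bit encodings, three different characters.
my $byte = "\xE4";

my %code_point;
for my $name (qw(ISO-8859-1 ISO-8859-5 ISO-8859-7)) {
    my $char = decode($name, $byte);
    # Record the code point as U+XXXX so the output stays plain ASCII.
    $code_point{$name} = sprintf "U+%04X", ord $char;
    print "$name: byte 0xE4 is $code_point{$name}\n";
}
```

The byte is a Latin letter in ISO-8859-1, a Cyrillic letter in ISO-8859-5 and a Greek letter in ISO-8859-7.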

Multi-byte encodings

* More bytes => more characters

* Fixed width, variable width

* Unicode encodings are all multi-byte

* UTF-8 is very popular on the internet

* UTF-16 is the internal encoding in MS Windows
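To make "fixed width, variable width" concrete, here is a small sketch that counts the bytes each Unicode encoding needs for a few characters. The sample characters are arbitrary choices for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sample characters (illustrative): an ASCII letter, the euro sign,
# and a character outside the Basic Multilingual Plane.
my %sample = (
    ascii  => "A",
    euro   => "\x{20AC}",
    astral => "\x{1F600}",
);

my %bytes;
for my $label (sort keys %sample) {
    for my $name (qw(UTF-8 UTF-16BE UTF-32BE)) {
        # encode() returns a byte string, so length() counts bytes here.
        $bytes{$label}{$name} = length encode($name, $sample{$label});
        print "$label in $name: $bytes{$label}{$name} byte(s)\n";
    }
}
```

UTF-8 and UTF-16 vary in width per character; UTF-32 always uses four bytes.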

"Character set"

* "character set" is character <-> number

* Unicode is a charset

* "encoding" is number <-> bytes

* UTF-8 is an encoding

* MIME calls them both "charset"

* Perl calls them both "encoding"
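The two mappings can be seen separately in Perl: ord() gives the number (the character set side), encode() gives the bytes (the encoding side). A small sketch; hex_bytes is a hypothetical helper written for this example, not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes {
    my ($octets) = @_;
    return join " ", map { sprintf "%02X", ord($_) } split //, $octets;
}

my $char = "\x{E9}";                 # LATIN SMALL LETTER E WITH ACUTE

# character <-> number: the code point
my $number = ord $char;

# number <-> bytes: depends on the encoding chosen
my $as_utf8   = hex_bytes(encode("UTF-8",      $char));
my $as_latin1 = hex_bytes(encode("ISO-8859-1", $char));

print "code point: $number\n";
print "UTF-8:      $as_utf8\n";
print "ISO-8859-1: $as_latin1\n";
```

One number, two different byte sequences.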

Two kinds of strings

* Perl has one string type

* The universe has several

* "text string" and "binary string"

* a.k.a. "character string" and "byte string"

* The computer doesn't know

* You should know

Unicode in Perl

* Text strings are Unicode strings, not UTF-8

* ISO-8859-1 maps to code points 0..255, useful!

* Perl keeps strings in ISO-8859-1 internally for as long as possible.

* If that doesn't work, it upgrades to UTF-8 internally.

* If you mix the two kinds, UTF-8 wins.
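The internal format can be observed with utf8::is_utf8, although normal code should rarely need to look. A sketch, assuming string literals in a source file without "use utf8":

```perl
use strict;
use warnings;

my $narrow = "caf\xE9";      # every code point fits in 0..255
my $wide   = "\x{263A}";     # does not fit, so Perl stores it as UTF-8

my $narrow_flag = utf8::is_utf8($narrow) ? 1 : 0;
my $wide_flag   = utf8::is_utf8($wide)   ? 1 : 0;

# Mixing the two internal formats: the result is stored as UTF-8.
my $mixed      = $narrow . $wide;
my $mixed_flag = utf8::is_utf8($mixed) ? 1 : 0;

# Upgrading changes the storage only, not the characters.
my $copy = $narrow;
utf8::upgrade($copy);
my $same_text = ($copy eq $narrow) ? 1 : 0;

print "narrow=$narrow_flag wide=$wide_flag mixed=$mixed_flag same=$same_text\n";
print "length of mixed: ", length($mixed), "\n";
```

The string stays five characters long whichever way it is stored.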

Prime rule

Do not mix byte strings with text strings

except if you explicitly convert between them

* decoding: bytes -> characters (binary to text)

* encoding: characters -> bytes (text to binary)
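What goes wrong when the rule is broken, as a minimal sketch. The input bytes are an assumed example: the UTF-8 encoding of a single accented letter, as it might arrive from a file or socket.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "\xC3\xA9";      # byte string: UTF-8 for U+00E9, not yet decoded
my $text  = "\x{263A}";      # text string: one character

# Wrong: the two bytes are treated as two characters (U+00C3, U+00A9).
my $wrong = $text . $bytes;

# Right: decode first, then combine text with text.
my $right = $text . decode("UTF-8", $bytes);

my $wrong_out = encode("UTF-8", $wrong);
my $right_out = encode("UTF-8", $right);

print "wrong: ", length($wrong), " characters, ", length($wrong_out), " bytes out\n";
print "right: ", length($right), " characters, ", length($right_out), " bytes out\n";
```

The wrong version sends out doubly encoded data; no warning is given, because the computer doesn't know which kind of string it has.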

Remember the first slide?

* All communication with "the outside world" is in bytes

* Something has to decode the binary input to text

* Something has to encode your text output to binary

Neat trick

Perl lets you use code points (character numbers) that do not yet officially exist: chr(999999).
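A quick check of the trick; the variable names are illustrative:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = chr(999999);

my $round_trip = ord $char;                       # the number comes back intact
my $chars      = length $char;                    # still a single character
my $utf8_len   = length encode("UTF-8", $char);   # bytes needed in UTF-8

print "ord=$round_trip length=$chars utf8_bytes=$utf8_len\n";
```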

In practice part I

use Encode;

my $text = decode("ASCII", $binary_input);

my $output = encode("KOI8-R", $text);

Did I tell you this is not a Unicode tutorial?

It's an encodings tutorial :)

In practice part I

use Encode;

my $text = decode("UTF-8", $binary_input);

my $output = encode("UTF-8", $text);

In practice part II

* Let Perl do the hard work!

binmode STDIN, ":encoding(ISO-8859-1)";

binmode STDOUT, ":encoding(UTF-8)";

print while <>;
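The same layers work on any filehandle, given as part of the open mode. A self-contained sketch, assuming in-memory filehandles in place of real files so that the resulting bytes can be inspected:

```perl
use strict;
use warnings;

# Assumed sample input: one line of ISO-8859-1 encoded bytes.
my $latin1_bytes = "na\xEFve\n";
my $utf8_bytes   = "";

# The layer decodes on read ...
open(my $in, "<:encoding(ISO-8859-1)", \$latin1_bytes)
    or die "cannot open input: $!";

# ... and encodes on write.
open(my $out, ">:encoding(UTF-8)", \$utf8_bytes)
    or die "cannot open output: $!";

# In between, the program only ever sees text strings.
print {$out} $_ while <$in>;

close $in;
close $out or die "cannot close output: $!";

printf "%d bytes in, %d bytes out\n",
    length($latin1_bytes), length($utf8_bytes);
```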

Unicode semantics

* Perl has Unicode semantics

* lc, uc, lcfirst, ucfirst

* Case insensitivity

* Character classes like \w and \d
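A few of these in action on text strings. The Greek letters and the Arabic-Indic digit are sample characters chosen for illustration:

```perl
use strict;
use warnings;

my $lower = "\x{3B4}\x{3B5}";          # GREEK SMALL LETTER DELTA, EPSILON

# Case mapping works beyond ASCII.
my $upper_ok = (uc($lower) eq "\x{394}\x{395}") ? 1 : 0;

# Case-insensitive matching works too.
my $nocase_ok = ("\x{394}" =~ /\x{3B4}/i) ? 1 : 0;

# \w matches letters from any script, \d matches digits from any script.
my $word_ok  = ("\x{3B4}" =~ /^\w$/) ? 1 : 0;
my $digit_ok = ("\x{664}" =~ /^\d$/) ? 1 : 0;    # ARABIC-INDIC DIGIT FOUR

print "uc=$upper_ok nocase=$nocase_ok word=$word_ok digit=$digit_ok\n";
```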

ASCII semantics

* Perl also has ASCII semantics :(

* Hard to tell which semantics will be used for some operation

* utf8::upgrade($your_string) to ensure Unicode semantics

* If you want [A-Za-z0-9_], don't use \w

just use [A-Za-z0-9_] (say what you mean, not what happens to work)
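A sketch of the difference, assuming a plain script without pragmas that change the rules (for instance "use locale", or "use feature 'unicode_strings'" in later Perl versions). The same character behaves differently depending on how the string happens to be stored:

```perl
use strict;
use warnings;

# Assumption: no locale or unicode_strings pragma is in effect here.
my $native   = "\xE9";       # LATIN SMALL LETTER E WITH ACUTE, stored as one byte
my $upgraded = $native;
utf8::upgrade($upgraded);    # same character, now stored as UTF-8

# ASCII semantics: uc leaves the character alone, \w does not match it.
my $native_uc_changed = (uc($native) ne $native) ? 1 : 0;
my $native_is_word    = ($native =~ /^\w$/)      ? 1 : 0;

# Unicode semantics after the upgrade.
my $upgraded_uc_ok   = (uc($upgraded) eq "\xC9") ? 1 : 0;
my $upgraded_is_word = ($upgraded =~ /^\w$/)     ? 1 : 0;

# The explicit class means the same thing in both cases.
my $explicit_native   = ($native   =~ /^[A-Za-z0-9_]$/) ? 1 : 0;
my $explicit_upgraded = ($upgraded =~ /^[A-Za-z0-9_]$/) ? 1 : 0;

print "native:   uc changed=$native_uc_changed \\w=$native_is_word\n";
print "upgraded: uc ok=$upgraded_uc_ok \\w=$upgraded_is_word\n";
print "explicit class: $explicit_native / $explicit_upgraded\n";
```

The two strings compare equal with eq, yet uc and \w treat them differently. The explicit character class gives the same answer for both.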

Further info

* In Perl 5.9.5: perlunitut

* In Perl 5.9.5: perlunifaq

* http://juerd.nl/perluniadvice

** XML::Parser, Data::Dumper, Storable, Digest::MD5, MIME::Base64, URI, LWP, CGI, DBI, DBD::mysql, DBD::Pg, DBD::SQLite, DBD::Oracle, *DB*_File, HTML::Parser, Mail::Box, MIME::Lite, additions welcome!

One last thing

* DON'T USE encoding.pm

* It is broken and cannot be fixed

* Using it will hurt

* I warned you!

Thank you!

Any questions?
"Perl Unicode Tutorial", Juerd Waalboer <#####@juerd.nl>
YAPC::Europe 2007 (Vienna)