Working around *the* Unicode bug

(And please fix it before 5.10?)

Perl has Unicode strings.

* Yay! *

Let's try that, part 1

* We know that $foo is "œ"
print uc $foo;
* Do we get "Œ" or "œ"?
* Yes, it's "Œ".

Let's try that, part 2

* We know that $foo is "æ"
print uc $foo;
* Do we get "Æ" or "æ"?
* Indeed, it's either "Æ"... or "æ".

Subtle difference

This slide is for Yves :)

Subtle difference

* $foo is "æ" again
$foo .= chr(256);
chop $foo;
print uc $foo;
* So... what does this print?
Æ
* Predictable output: woot!
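
Why the append-and-chop trick works can be checked directly. This is a minimal sketch, not part of the original slides: it assumes a perl where the `unicode_strings` feature exists (5.12 or later) and switches it off so the old default is explicit, and it builds "æ" as `"\xE6"` so that the source file encoding does not matter.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # assumption: Perl 5.12+; keeps the classic behaviour explicit

my $foo = "\xE6";                               # "æ", stored as a single latin1 byte
my $flag_before = utf8::is_utf8($foo) ? 1 : 0;  # internal encoding is not UTF-8 yet

$foo .= chr(256);   # U+0100 does not fit in latin1, so Perl upgrades $foo
chop $foo;          # the extra character is gone, the upgrade is not undone

my $flag_after = utf8::is_utf8($foo) ? 1 : 0;   # internal encoding is now UTF-8
my $upper      = uc $foo;                       # Unicode semantics apply

printf "before=%d after=%d uc=U+%04X\n", $flag_before, $flag_after, ord $upper;
```

The string still holds exactly one character, "æ". Only its internal encoding changed, and that alone is what makes uc predictable here.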

What's going on?

Perl has two kinds of internal encoding:
* UTF-8
* latin1
And two kinds of character semantics:
* Unicode
* ASCII

What's going on?!

If the internal encoding is UTF-8, Perl uses Unicode semantics
If the internal encoding is latin1, Perl uses ASCII semantics
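
The rule above can be sketched by switching one string between the two internal encodings with `utf8::downgrade` and `utf8::upgrade`. This is an illustration only; it assumes Perl 5.12 or later with the `unicode_strings` feature switched off, and the variable names are made up for the example.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # assumption: Perl 5.12+; keeps the classic behaviour explicit

my $foo      = "\xE6";          # "æ"
my $original = $foo;            # untouched copy for comparison

utf8::downgrade($foo);          # force latin1 storage
my $uc_latin1 = uc $foo;        # ASCII semantics: "æ" is left alone

utf8::upgrade($foo);            # force UTF-8 storage
my $uc_utf8 = uc $foo;          # Unicode semantics: "æ" becomes "Æ"

# The string value never changed, only its storage did.
my $still_equal = ($foo eq $original) ? 1 : 0;

printf "latin1: U+%04X, utf8: U+%04X, equal: %d\n",
    ord $uc_latin1, ord $uc_utf8, $still_equal;
```

Two strings that compare equal with `eq` can therefore give different results from uc, which is the whole problem.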

Problem scope

* uc, lc, ucfirst, lcfirst
* \U, \L, \u, \l
* /\w/, /\d/, /\s/
* /\W/, /\D/, /\S/
* /[[:posix:]]/
* /.../i, /(?i:...)/
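
The regex side of the list can be checked the same way. A small sketch, under the same assumptions as before (Perl 5.12 or later, `unicode_strings` off, no `/u` or `/a` modifiers); the hash keys are just labels for this example.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # assumption: Perl 5.12+; keeps the classic behaviour explicit

my $latin1 = "\xE6";            # "æ" stored as latin1
my $utf8   = "\xE6";
utf8::upgrade($utf8);           # same "æ", stored as UTF-8

my %match = (
    word_latin1 => ($latin1 =~ /\w/     ? 1 : 0),  # ASCII semantics: not a word character
    word_utf8   => ($utf8   =~ /\w/     ? 1 : 0),  # Unicode semantics: a letter
    fold_latin1 => ($latin1 =~ /\xC6/i  ? 1 : 0),  # "Æ" does not case-fold to "æ"
    fold_utf8   => ($utf8   =~ /\xC6/i  ? 1 : 0),  # "Æ" does case-fold to "æ"
);

printf "%-12s %d\n", $_, $match{$_} for sort keys %match;
```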

Workaround

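use Unicode::Semantics;

# Either upgrade first, then match...
us($foo);
$foo =~ /\w/i;

# ...or do both in one expression.
us($foo) =~ /\w/i;
# Predictable semantics!

# How it works:
package Unicode::Semantics;
use base 'Exporter';
our @EXPORT = qw(us);

# Upgrade the argument in place to the internal UTF-8 encoding
# and return it, so the call can be chained.
sub us ($) {
    utf8::upgrade($_[0]);
    return $_[0];
}

1;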
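
To try the workaround without installing anything, the sub from the slide can be pasted into a script. This is a sketch only: `us` here is a local copy, not the installed module, and it again assumes Perl 5.12 or later with `unicode_strings` switched off.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # assumption: Perl 5.12+; keeps the classic behaviour explicit

# Local copy of the sub shown on the slide (stand-in for Unicode::Semantics::us).
sub us ($) {
    utf8::upgrade($_[0]);       # $_[0] is an alias, so the caller's variable is upgraded
    return $_[0];
}

my $foo = "\xE6";               # "æ" stored as latin1

my $plain   = ($foo     =~ /\w/) ? 1 : 0;   # ASCII semantics: no match
my $chained = (us($foo) =~ /\w/) ? 1 : 0;   # Unicode semantics: match
my $sticky  = ($foo     =~ /\w/) ? 1 : 0;   # $foo itself stays upgraded: match

print "plain=$plain chained=$chained sticky=$sticky\n";
```

Note the side effect: us() changes the variable you pass in, so one call is enough for everything that follows.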
"Unicode::Semantics" — Juerd Waalboer <#####@juerd.nl>
YAPC::Europe 2007 (Vienna)