Working around the Unicode bug

(And please fix it before 5.10?)

Perl has Unicode strings.

* Yay! *

Let's try that, part 1

* We know that $foo is "œ"

print uc $foo;

* Do we get "Œ" or "œ"?

* Yes, it's "Œ".
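A quick check of part 1, as a sketch. "œ" is U+0153, which does not fit in latin1, so Perl has to store it in its internal UTF-8 encoding. The variable name `$foo` follows the slide; `utf8::is_utf8` is used here only to peek at the internal encoding.

```perl
use strict;
use warnings;

my $foo = "\x{153}";    # "œ", a code point above 255

# Code points above 255 cannot be stored as latin1, so this string
# is always in the internal UTF-8 encoding.
print "internal UTF-8\n" if utf8::is_utf8($foo);

my $upper = uc $foo;
printf "uc gives U+%04X\n", ord $upper;    # U+0152 is "Œ"
```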

Let's try that, part 2

* We know that $foo is "æ"

print uc $foo;

* Do we get "Æ" or "æ"?

* Indeed, it's either "Æ"... or "æ".
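Both outcomes can be reproduced in one script. This is a minimal sketch: the names `$native` and `$upgraded` are made up for illustration, and it assumes the `unicode_strings` feature of later perls is not in effect.

```perl
use strict;
use warnings;

# No "use v5.12" or later here: that would switch on the
# unicode_strings feature and hide the effect.

my $native   = "\xE6";        # "æ", stored as a single latin1 byte
my $upgraded = "\xE6";
utf8::upgrade($upgraded);     # same character, stored as UTF-8 internally

# The two strings compare equal ...
print "strings are equal\n" if $native eq $upgraded;

# ... but uc treats them differently.
printf "native:   U+%04X\n", ord uc $native;      # stays U+00E6 ("æ")
printf "upgraded: U+%04X\n", ord uc $upgraded;    # becomes U+00C6 ("Æ")
```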

Subtle difference

This slide is for Yves :)

Subtle difference

* $foo is "æ" again

$foo .= chr(256);

chop $foo;

print uc $foo;

* So... what does this print?

* Predictable output: it's always "Æ". Woot!
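Why the detour through chr(256) makes the result predictable, as a sketch: appending a character above 255 forces the string into the internal UTF-8 encoding, and chop does not switch it back. `$before` and `$after` are illustrative names.

```perl
use strict;
use warnings;

my $foo = "\xE6";                             # "æ", stored as latin1
my $before = utf8::is_utf8($foo) ? 1 : 0;     # 0: not UTF-8 internally

$foo .= chr(256);    # U+0100 does not fit in latin1, so $foo is upgraded
chop $foo;           # removes U+0100 again, the encoding stays UTF-8

my $after = utf8::is_utf8($foo) ? 1 : 0;      # 1: UTF-8 internally

print "UTF-8 before: $before, after: $after\n";
printf "uc gives U+%04X\n", ord uc $foo;      # U+00C6 ("Æ")
```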

What's going on?

Perl has two kinds of internal encoding:
* utf8
* latin1

Two kinds of character semantics:
* Unicode
* ASCII

What's going on?!

If the internal encoding is UTF-8, Perl uses Unicode semantics

If the internal encoding is latin1, Perl uses ASCII semantics
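The coupling between encoding and semantics can be watched by flipping one string back and forth. A rough sketch: `describe` is a hypothetical helper written for this example, and `unicode_strings` is assumed to be off.

```perl
use strict;
use warnings;

# No "use v5.12" or later here: it would enable unicode_strings.

# Hypothetical helper: report the internal encoding of a string and
# which semantics lc applies to it.
sub describe {
    my ($str) = @_;
    my $encoding  = utf8::is_utf8($str) ? 'utf8' : 'latin1';
    my $semantics = lc($str) eq "\xE6" ? 'Unicode' : 'ASCII';
    return "$encoding: $semantics";
}

my $s = "\xC6";    # "Æ"
my @log;

push @log, describe($s);    # as written: latin1
utf8::upgrade($s);
push @log, describe($s);    # same characters, now utf8
utf8::downgrade($s);
push @log, describe($s);    # and back to latin1

print "$_\n" for @log;
```

The string holds the same single character at every step; only its internal representation changes.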

Problem scope

* uc, lc, ucfirst, lcfirst

* \U, \L, \u, \l

* /\w/, /\d/, /\s/

* /\W/, /\D/, /\S/

* /[[:posix:]]/

* /.../i, /(?i:...)/
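A few of the affected operations, side by side, as a sketch. The test character "æ" and the names `$native`, `$upgraded` and `%result` are chosen for illustration, and `unicode_strings` is assumed to be off.

```perl
use strict;
use warnings;

# No "use v5.12" or later here: it would enable unicode_strings.

my $native   = "\xE6";       # "æ" as a latin1 byte
my $upgraded = "\xE6";
utf8::upgrade($upgraded);    # same character, internal UTF-8

my %result;
for my $case ([native => $native], [upgraded => $upgraded]) {
    my ($label, $s) = @$case;
    my @answers = map { $_ ? 'yes' : 'no' } (
        ucfirst($s) eq "\xC6",          # does ucfirst change the case?
        "\U$s" eq "\xC6",               # does \U change the case?
        scalar($s =~ /\w/),             # is it a word character?
        scalar($s =~ /[[:alpha:]]/),    # does a POSIX class match?
        scalar($s =~ /\xC6/i),          # does "Æ" match it under /i?
    );
    $result{$label} = "@answers";
    printf "%-8s ucfirst=%s \\U=%s \\w=%s posix=%s /i=%s\n", $label, @answers;
}
```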

Work around

use Unicode::Semantics;

us($foo);
$foo =~ /\w/i;

us($foo) =~ /\w/i;

# Predictable semantics!
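To try the workaround without installing anything, the sketch below defines a local stand-in for `us` with the same body as the module's. The variables `$before`, `$inline` and `$after` are illustrative, and `unicode_strings` is assumed to be off.

```perl
use strict;
use warnings;

# No "use v5.12" or later here: it would enable unicode_strings.

# Local stand-in for Unicode::Semantics::us, so the sketch runs
# without the module installed.
sub us ($) {
    utf8::upgrade($_[0]);    # upgrades the caller's variable in place
    return $_[0];
}

my $foo = "\xE6";    # "æ", stored as latin1

my $before = ($foo =~ /\w/)      ? 1 : 0;    # ASCII semantics: no match
my $inline = (us($foo) =~ /\w/i) ? 1 : 0;    # upgraded first: match
my $after  = ($foo =~ /\w/)      ? 1 : 0;    # $foo itself is upgraded now

print "before=$before inline=$inline after=$after\n";
printf "uc gives U+%04X\n", ord uc $foo;
```

Because `us` works on `$_[0]`, it upgrades the variable it is given, so one call is enough for every later operation on `$foo`.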

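package Unicode::Semantics;
use strict;
use warnings;
use base 'Exporter';
our @EXPORT = qw(us);

# Upgrade the argument in place to Perl's internal UTF-8 encoding,
# so that it gets Unicode semantics, then return it.
sub us ($) {
    utf8::upgrade($_[0]);
    return $_[0];
}

1;

"Unicode::Semantics", by Juerd Waalboer <#####@juerd.nl>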
YAPC::Europe 2007 (Vienna)
