Working around *the* Unicode bug
(And please fix it before 5.10?)
Perl has Unicode strings.
* Yay! *
Let's try that, part 1
* We know that $foo is "œ"
print uc $foo;
* Do we get "Œ" or "œ"?
* Yes, it's "Œ".
Let's try that, part 2
* We know that $foo is "æ"
print uc $foo;
* Do we get "Æ" or "æ"?
* Indeed, it's either "Æ"... or "æ".
Subtle difference
This slide is for Yves :)
Subtle difference
* $foo is "æ" again
$foo .= chr(256);
chop $foo;
print uc $foo;
* So... what does this print?
Æ
* Predictable output: woot!
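The three experiments can be replayed in one script. This is only a sketch: it assumes the `unicode_strings` feature is not in effect (no `use feature 'unicode_strings'`, no `use v5.12` or later), and `uc_ord` is a helper made up for this example. It reports code points instead of printing the characters, so no output layer gets in the way.

```perl
use strict;
use warnings;

# Hypothetical helper: code point of the first character of uc($str).
sub uc_ord { return ord(uc($_[0])) }

my $oe = chr(0x153);     # "œ": does not fit in latin1, so always UTF-8 inside
my $ae = chr(0xE6);      # "æ": fits in latin1, and chr() stores it that way

my $ae2 = chr(0xE6);     # "æ" again
$ae2 .= chr(256);        # a character above 0xFF forces UTF-8 inside
chop $ae2;               # back to "æ", but still UTF-8 inside

printf "oe:  U+%04X\n", uc_ord($oe);    # U+0152, "Œ"
printf "ae:  U+%04X\n", uc_ord($ae);    # U+00E6, still "æ"
printf "ae2: U+%04X\n", uc_ord($ae2);   # U+00C6, "Æ"
print  "same string\n" if $ae eq $ae2;  # eq sees no difference
```

Note that `$ae` and `$ae2` compare equal, yet `uc` treats them differently.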
What's going on?
Perl has two kinds of internal encoding:
* utf8
* latin1
Two kinds of character semantics:
* Unicode
* ASCII
What's going on?!
If the internal encoding is UTF-8, Perl uses Unicode semantics
If the internal encoding is latin1, Perl uses ASCII semantics
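The internal encoding is invisible to `eq`, but the core functions `utf8::is_utf8`, `utf8::upgrade` and `utf8::downgrade` can inspect and switch it. A small sketch, assuming the `unicode_strings` feature is off; `describe` is a helper name invented here.

```perl
use strict;
use warnings;

# Hypothetical helper: report the internal encoding and what uc() does with it.
sub describe {
    my $encoding  = utf8::is_utf8($_[0]) ? 'utf8' : 'latin1';
    my $semantics = uc($_[0]) ne $_[0]   ? 'unicode' : 'ascii';
    return "$encoding/$semantics";
}

my $foo  = chr(0xE6);              # "æ", stored as latin1
my @seen = describe($foo);         # latin1/ascii

utf8::upgrade($foo);               # same characters, now UTF-8 inside
push @seen, describe($foo);        # utf8/unicode

utf8::downgrade($foo);             # and back to latin1
push @seen, describe($foo);        # latin1/ascii

print "$_\n" for @seen;
```

The characters in `$foo` never change; only the storage does, and the semantics follow the storage.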
Problem scope
* uc, lc, ucfirst, lcfirst
* \U, \L, \u, \l
* /\w/, /\d/, /\s/
* /\W/, /\D/, /\S/
* /[[:posix:]]/
* /.../i, /(?i:...)/
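The regex side of the list can be probed the same way. A sketch, assuming the patterns carry no `/u`, `/a` or `/l` modifier and `unicode_strings` is off; `probe` is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper: run three of the affected checks against $str.
sub probe {
    my ($str) = @_;
    my $word  = $str =~ /\w/          ? 1 : 0;
    my $alpha = $str =~ /[[:alpha:]]/ ? 1 : 0;
    my $fold  = $str =~ /\xC6/i       ? 1 : 0;   # "Æ", case-insensitively
    return "word=$word alpha=$alpha fold=$fold";
}

my $foo    = chr(0xE6);               # "æ", stored as latin1
my $before = probe($foo);             # word=0 alpha=0 fold=0

utf8::upgrade($foo);                  # same characters, UTF-8 inside
my $after  = probe($foo);             # word=1 alpha=1 fold=1

print "latin1: $before\n";
print "utf8:   $after\n";
```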
Work around
use Unicode::Semantics;

# Upgrade $foo in place, then match as usual
us($foo);
$foo =~ /\w/i;

# Or upgrade and match in one expression
us($foo) =~ /\w/i;
# Predictable semantics!
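package Unicode::Semantics;
use strict;
use warnings;
use base 'Exporter';
our @EXPORT = qw(us);

# Upgrade the string in place to UTF-8 (same characters), then return it.
sub us ($) {
    utf8::upgrade($_[0]);
    return $_[0];
}

1;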
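Because `us` works on `$_[0]`, which aliases the caller's variable, the upgrade sticks to the variable itself; the return value is a convenience for chaining. A sketch that inlines the same `us` so it runs without the module installed, assuming `unicode_strings` is off:

```perl
use strict;
use warnings;

# Same body as the us() shown above, inlined for a standalone run.
sub us ($) {
    utf8::upgrade($_[0]);
    return $_[0];
}

my $foo   = chr(0xE6);                      # "æ", stored as latin1
my $plain = ord(uc($foo));                  # 0xE6: uc() left it alone

my $chained = ord(uc(us($foo)));            # 0xC6: upgraded on the way in
my $sticky  = utf8::is_utf8($foo) ? 1 : 0;  # 1: $foo itself was upgraded
my $later   = ord(uc($foo));                # 0xC6: no second us() needed

printf "%02X %02X %d %02X\n", $plain, $chained, $sticky, $later;
```

One call to `us` per string is enough, as long as nothing downgrades the string afterwards.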
"Unicode::Semantics" — Juerd Waalboer <#####@juerd.nl>
YAPC::Europe 2007 (Vienna)