What is the influence of `use utf8` on `Encode`?
Question
I am hunting a wide-character problem in our DB routines and stumbled upon the following oddity:
use utf8;
use Encode;
my $x = "ö";
my $decoded  = Encode::decode('UTF-8', $x);
my $encoded = Encode::encode('UTF-8', $decoded);
my $redecoded = Encode::decode('UTF-8', $encoded);
{
    use bytes;
    printf "original: %vx\n", $x;
    printf "decoded: %vx\n", $decoded;
    printf "encoded: %vx\n", $encoded;
    printf "redecoded: %vx\n", $redecoded;
}
Executing this script gives:
original: c3.b6
decoded: ef.bf.bd
encoded: ef.bf.bd
redecoded: ef.bf.bd
I do not think that 0xef 0xbf 0xbd is valid UTF-8. If I remove the `use utf8` statement, the output is, as expected:
original: c3.b6
decoded: c3.b6
encoded: c3.b6
redecoded: c3.b6
So why is `use utf8;` changing the decoding semantics?
Answer 1
Score: 3
`use utf8` tells Perl to decode the script using UTF-8 instead of ASCII. It has no effect on `Encode`. You are getting different results because you are passing different strings to `decode`.
You failed to notice this because of your incorrect use of `use bytes;`. (To be clear, using `use bytes;` is always incorrect.) Let's remove that and run your program again.
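As an aside, here is a minimal sketch of why `use bytes;` hid the difference. It assumes only core Perl; the variable names are made up for illustration. Under `use bytes;`, `%vx` reports how the string happens to be stored internally, not which characters it contains, so the same one-character string can show up as either `c3.b6` or `f6`.

```perl
use strict;
use warnings;

my $s = "\xf6";            # one character, code point U+00F6

utf8::upgrade($s);         # store it internally as UTF-8 (two bytes)
my $chars_up = sprintf "%vx", $s;
my $bytes_up = do { use bytes; sprintf "%vx", $s };

utf8::downgrade($s);       # store it internally as a single byte
my $chars_down = sprintf "%vx", $s;
my $bytes_down = do { use bytes; sprintf "%vx", $s };

# The character view is identical; only the "use bytes" view changes.
print "upgraded:   chars=$chars_up bytes=$bytes_up\n";      # chars=f6 bytes=c3.b6
print "downgraded: chars=$chars_down bytes=$bytes_down\n";  # chars=f6 bytes=f6
```

In the question, the `c3.b6` printed for the original string under `use utf8;` was this internal storage, not evidence that the string held two bytes.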
Specifically, I'll use the following:
printf "original: %vx\n", $x;
printf "decode('UTF-8', %vx): %vx\n", $x,       $decoded;
printf "encode('UTF-8', %vx): %vx\n", $decoded, $encoded;
printf "decode('UTF-8', %vx): %vx\n", $encoded, $redecoded;
With `use utf8;`:
original: f6
decode('UTF-8', f6): fffd
encode('UTF-8', fffd): ef.bf.bd
decode('UTF-8', ef.bf.bd): fffd
With `use utf8;`, Perl expects your program to be encoded using UTF-8, so your program has the equivalent of `my $x = "\N{LATIN SMALL LETTER O WITH DIAERESIS}";`. Put differently, your string contains a single character with value 0xF6. `decode` treats its input as bytes, and a lone byte 0xF6 is not valid UTF-8, so you get garbage (U+FFFD REPLACEMENT CHARACTER) when you pass it to `decode 'UTF-8'`.
Replace `my $decoded  = Encode::decode('UTF-8', $x);` with `my $decoded = $x;` to get the expected results.
original: f6
decoded: f6
encode('UTF-8', f6): c3.b6
decode('UTF-8', c3.b6): f6
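The `use utf8;` case can be reproduced without depending on how the source file is saved. This is a sketch under the assumption that `"\x{f6}"` is an acceptable stand-in for the literal `"ö"` read under `use utf8;` (the two strings compare equal); the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode ();

my $x = "\x{f6}";   # one character, U+00F6, as "ö" would be under use utf8

# Decoding a string that is already characters: 0xF6 is taken as a byte,
# which is malformed UTF-8, so the replacement character comes back.
my $wrong = Encode::decode('UTF-8', $x);

# Skipping the bogus decode: encode the characters, then decode the bytes.
my $encoded   = Encode::encode('UTF-8', $x);
my $redecoded = Encode::decode('UTF-8', $encoded);

printf "decode('UTF-8', %vx): %vx\n", $x,       $wrong;      # f6 -> fffd
printf "encode('UTF-8', %vx): %vx\n", $x,       $encoded;    # f6 -> c3.b6
printf "decode('UTF-8', %vx): %vx\n", $encoded, $redecoded;  # c3.b6 -> f6
```

The point of the sketch is that `decode` belongs on bytes coming in from outside the program and `encode` on characters going out; a string literal under `use utf8;` is already characters.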
Without `use utf8;`:
original: c3.b6
decode('UTF-8', c3.b6): f6
encode('UTF-8', f6): c3.b6
decode('UTF-8', c3.b6): f6
Without `use utf8;`, Perl expects your program to be encoded using ASCII. It can't possibly contain ö as that is not an ASCII character. Instead, you have the equivalent of `my $x = "\xc3\xb6";` (because string literals are "8-bit clean"). This is valid UTF-8, and can thus be decoded.
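The same case written with explicit escapes, as a small sketch (variable names are illustrative). Here the input really is two bytes, so decoding is the right operation, and `length` shows the change from bytes to characters.

```perl
use strict;
use warnings;
use Encode ();

my $bytes = "\xc3\xb6";                        # two bytes: UTF-8 for U+00F6
my $chars = Encode::decode('UTF-8', $bytes);   # one character
my $back  = Encode::encode('UTF-8', $chars);   # two bytes again

printf "bytes: %vx (length %d)\n", $bytes, length $bytes;  # c3.b6 (length 2)
printf "chars: %vx (length %d)\n", $chars, length $chars;  # f6 (length 1)
printf "back:  %vx (length %d)\n", $back,  length $back;   # c3.b6 (length 2)
```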