What is the influence of `use utf8` on `Encode`?

Question

So I am hunting a wide-character problem in our DB routines, and stumbled upon the following oddity:

use utf8;

use Encode;

my $x = "ö";

my $decoded  = Encode::decode('UTF-8', $x);

my $encoded = Encode::encode('UTF-8', $decoded);

my $redecoded = Encode::decode('UTF-8', $encoded);

{
    use bytes;
    printf "original: %vx\n", $x;
    printf "decoded: %vx\n", $decoded;
    printf "encoded: %vx\n", $encoded;
    printf "redecoded: %vx\n", $redecoded;
}

Executing this script gives:

original: c3.b6
decoded: ef.bf.bd
encoded: ef.bf.bd
redecoded: ef.bf.bd

I do not think that 0xef 0xbf 0xbd is valid UTF-8. If I remove the `use utf8` statement, the output is as expected:

original: c3.b6
decoded: c3.b6
encoded: c3.b6
redecoded: c3.b6

So why is `use utf8;` changing the decoding semantics?

Answer 1

Score: 3

`use utf8` tells Perl to decode the source of the script as UTF-8 instead of treating it as ASCII. It has no effect on `Encode`. You are getting different results because you are passing different strings to `decode`.
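As a minimal sketch of that difference, the two hypothetical subs below contain the same literal and differ only in the lexically scoped pragma. The sketch assumes the file itself is saved as UTF-8; the sub names are made up for illustration.

```perl
use strict;
use warnings;

# Assumption: this source file is saved as UTF-8.

sub literal_with_utf8 {
    use utf8;       # lexically scoped: the literal below is decoded
    return "ö";     # one character, U+00F6
}

sub literal_without_utf8 {
    no utf8;        # the literal below keeps its raw source bytes
    return "ö";     # two bytes, 0xC3 0xB6
}

printf "with use utf8:    length %d, %vx\n",
    length(literal_with_utf8()), literal_with_utf8();
printf "without use utf8: length %d, %vx\n",
    length(literal_without_utf8()), literal_without_utf8();
```

`Encode` is not involved at all here; the two strings already differ before any call to `decode`.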

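You failed to notice this because of your incorrect use of `use bytes;`, which makes `printf` show how Perl happens to store each string internally rather than the characters it contains. (To be clear, using `use bytes;` is always incorrect.) Let's remove that and run your program again.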
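To illustrate why `use bytes;` misleads, here is a small sketch with two hypothetical helpers, `code_points` and `internal_bytes`. It formats the same one-character string twice: once stored as a single byte, and once after `utf8::upgrade` has changed only its internal storage.

```perl
use strict;
use warnings;

# Hypothetical helper: the characters of a string, regardless of
# how Perl stores the string internally.
sub code_points {
    my ($s) = @_;
    return sprintf "%vx", $s;
}

# Same formatting under `use bytes`, which exposes the internal buffer.
sub internal_bytes {
    my ($s) = @_;
    use bytes;
    return sprintf "%vx", $s;
}

my $plain    = "\xf6";       # one character, stored as a single byte
my $upgraded = "\xf6";
utf8::upgrade($upgraded);    # same character, stored internally as UTF-8

print "the two strings are equal\n" if $plain eq $upgraded;
printf "code_points:    %s / %s\n", code_points($plain),    code_points($upgraded);
printf "internal_bytes: %s / %s\n", internal_bytes($plain), internal_bytes($upgraded);
```

The two strings compare equal and `code_points` reports `f6` for both, while `internal_bytes` reports `f6` for one and `c3.b6` for the other. That is the same effect that made the "original" line in the question look identical in both runs.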

Specifically, I'll use the following:

printf "original: %vx\n", $x;
printf "decode('UTF-8', %vx): %vx\n", $x,       $decoded;
printf "encode('UTF-8', %vx): %vx\n", $decoded, $encoded;
printf "decode('UTF-8', %vx): %vx\n", $encoded, $redecoded;

With `use utf8;`:

original: f6
decode('UTF-8', f6): fffd
encode('UTF-8', fffd): ef.bf.bd
decode('UTF-8', ef.bf.bd): fffd

With `use utf8;`, Perl expects your program to be encoded using UTF-8, so your program has the equivalent of `my $x = "\N{LATIN SMALL LETTER O WITH DIAERESIS}";`. Put differently, your string contains a single character with value 0xF6. This is not valid UTF-8, so you get garbage (U+FFFD REPLACEMENT CHARACTER) when you pass it to `decode('UTF-8', ...)`.
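The silent substitution can be reproduced without any source-encoding concerns by writing the character with an escape. The sketch below also shows one way to make the mistake loud instead of silent: passing `Encode::FB_CROAK` as the optional third argument of `decode`, combined with `Encode::LEAVE_SRC` so the input is not modified. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode ();

my $chars = "\x{f6}";    # already a character string: U+00F6

# Default CHECK value: malformed input is replaced with U+FFFD.
my $lenient = Encode::decode('UTF-8', $chars);
printf "lenient: %vx\n", $lenient;

# FB_CROAK makes the same call die; LEAVE_SRC keeps $chars intact.
my $strict = eval {
    Encode::decode('UTF-8', $chars,
        Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
print "strict decode failed: $@" unless defined $strict;
```

The lenient call prints `fffd`, matching the output above, while the strict call throws an exception that `eval` catches.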

Replace `my $decoded = Encode::decode('UTF-8', $x);` with `my $decoded = $x;` to get the expected results:

original: f6
decoded: f6
encode('UTF-8', f6): c3.b6
decode('UTF-8', c3.b6): f6

Without `use utf8;`:

original: c3.b6
decode('UTF-8', c3.b6): f6
encode('UTF-8', f6): c3.b6
decode('UTF-8', c3.b6): f6

Without `use utf8;`, Perl expects your program to be encoded using ASCII. It can't possibly contain ö as that is not an ASCII character. Instead, you have the equivalent of `my $x = "\xc3\xb6";` (because string literals are "8-bit clean"). This is valid UTF-8, and can thus be decoded.
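A short sketch of this byte-string case, using escapes so it does not depend on how the file is saved (variable names are illustrative):

```perl
use strict;
use warnings;
use Encode ();

my $bytes = "\xc3\xb6";                         # two bytes: UTF-8 for U+00F6
my $chars = Encode::decode('UTF-8', $bytes);    # one character
my $again = Encode::encode('UTF-8', $chars);    # back to two bytes

printf "bytes: length %d, %vx\n", length($bytes), $bytes;
printf "chars: length %d, %vx\n", length($chars), $chars;
printf "again: length %d, %vx\n", length($again), $again;
```

Here `decode` receives what it is designed for, a string of encoded bytes, so the round trip is lossless: two bytes become one character and then the same two bytes again.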

huangapple
  • Posted on June 9, 2023, 01:27:05
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76434317.html