What is the influence of `use utf8` on `Encode`?
Question
So I am hunting a wide-character problem in our DB routines, and stumbled upon the following oddity:
use utf8;
use Encode;
my $x = "ö";
my $decoded = Encode::decode('UTF-8', $x);
my $encoded = Encode::encode('UTF-8', $decoded);
my $redecoded = Encode::decode('UTF-8', $encoded);
{
use bytes;
printf "original: %vx\n", $x;
printf "decoded: %vx\n", $decoded;
printf "encoded: %vx\n", $encoded;
printf "redecoded: %vx\n", $redecoded;
}
Executing this script gives:
original: c3.b6
decoded: ef.bf.bd
encoded: ef.bf.bd
redecoded: ef.bf.bd
I do not think that 0xef 0xbf 0xbd is valid UTF-8. If I remove the `use utf8` statement, the output is as expected:
original: c3.b6
decoded: c3.b6
encoded: c3.b6
redecoded: c3.b6
So why is `use utf8;` changing the decoding semantics?
Answer 1
Score: 3
`use utf8` tells Perl to decode the script using UTF-8 instead of ASCII. It has no effect on Encode. You are getting different results because you are passing different strings to `decode`.
You failed to notice this because of your incorrect use of `use bytes;`. (To be clear, using `use bytes;` is always incorrect.) Let's remove that and run your program again.
Specifically, I'll use the following:
printf "original: %vx\n", $x;
printf "decode('UTF-8', %vx): %vx\n", $x, $decoded;
printf "encode('UTF-8', %vx): %vx\n", $decoded, $encoded;
printf "decode('UTF-8', %vx): %vx\n", $encoded, $redecoded;
With `use utf8;`:
original: f6
decode('UTF-8', f6): fffd
encode('UTF-8', fffd): ef.bf.bd
decode('UTF-8', ef.bf.bd): fffd
With `use utf8;`, Perl expects your program to be encoded using UTF-8, so your program has the equivalent of `my $x = "\N{LATIN SMALL LETTER O WITH DIAERESIS}";`. Put differently, your string contains a single char with value 0xf6. This is not valid UTF-8, so you get garbage (U+FFFD REPLACEMENT CHARACTER) when you pass it to `decode 'UTF-8'`.
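The difference can be reproduced without depending on how the script file itself is encoded. A minimal sketch, assuming the two forms of the literal are spelled out with escapes (the variable names here are illustrative and not from the question):

```perl
use strict;
use warnings;
use Encode ();

# What the literal "ö" becomes under `use utf8;`:
# one character with code point U+00F6.
my $chars = "\x{f6}";

# What the same literal becomes without `use utf8;`:
# two characters holding the bytes of the UTF-8 encoding of U+00F6.
my $bytes = "\xc3\xb6";

# Decoding only makes sense for the second string; the first one is
# already decoded text, so its lone 0xf6 is treated as malformed UTF-8.
my $from_chars = Encode::decode('UTF-8', $chars);
my $from_bytes = Encode::decode('UTF-8', $bytes);

printf "chars: length %d, %vx -> %vx\n", length($chars), $chars, $from_chars;
printf "bytes: length %d, %vx -> %vx\n", length($bytes), $bytes, $from_bytes;
```

The first line should report a length of 1 and `f6 -> fffd`, the second a length of 2 and `c3.b6 -> f6`, matching the two runs shown in this answer.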
Replace `my $decoded = Encode::decode('UTF-8', $x);` with `my $decoded = $x;` to get the expected results.
original: f6
decoded: f6
encode('UTF-8', f6): c3.b6
decode('UTF-8', c3.b6): f6
Without `use utf8;`:
original: c3.b6
decode('UTF-8', c3.b6): f6
encode('UTF-8', f6): c3.b6
decode('UTF-8', c3.b6): f6
Without `use utf8;`, Perl expects your program to be encoded using ASCII. It can't possibly contain ö as that is not an ASCII character. Instead, you have the equivalent of `my $x = "\xc3\xb6";` (because string literals are "8-bit clean"). This is valid UTF-8, and can thus be decoded.
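When hunting this kind of bug, it may help to make `decode` fail loudly instead of quietly substituting U+FFFD. The following is only a sketch and not part of the original answer: `decode_strict` is a hypothetical helper name, while `Encode::FB_CROAK` and `Encode::LEAVE_SRC` are check flags documented by Encode.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode UTF-8 octets, returning undef when the
# input is malformed instead of inserting U+FFFD.
sub decode_strict {
    my ($octets) = @_;
    # FB_CROAK makes decode die on malformed input;
    # LEAVE_SRC keeps decode from modifying its argument.
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    my $text  = eval { Encode::decode('UTF-8', $octets, $check) };
    return $text;    # undef if decode died
}

# Encoded bytes decode fine; already-decoded text is rejected.
printf "%s\n", defined decode_strict("\xc3\xb6") ? "ok" : "malformed";
printf "%s\n", defined decode_strict("\x{f6}")   ? "ok" : "malformed";
```

With this, passing an already-decoded string such as the one produced under `use utf8;` shows up as an error at the call site rather than as a replacement character further down the line.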