2023年6月9日 01:27:05go评论84阅读模式

英文:

What is the influence of `use utf8` on `Encode`?

问题

所以我正在调查我们的数据库例程中的一个宽字符问题，并偶然发现以下奇怪之处：

use utf8;
use Encode;
my $x = "&#246;";
my $decoded = Encode::decode('UTF-8', $x);
my $encoded = Encode::encode('UTF-8', $decoded);
my $redecoded = Encode::decode('UTF-8', $encoded);
{
    use bytes;
    printf "original: %vx\n", $x;
    printf "decoded: %vx\n", $decoded;
    printf "encoded: %vx\n", $encoded;
    printf "redecoded: %vx\n", $redecoded;
}

执行此脚本会产生以下结果：

original: c3.b6
decoded: ef.bf.bd
encoded: ef.bf.bd
redecoded: ef.bf.bd

我认为0xef 0xbf 0xbd不是有效的UTF-8编码。如果我删除use utf8语句，输出如预期：

original: c3.b6
decoded: c3.b6
encoded: c3.b6
redecoded: c3.b6

那么为什么use utf8;会改变解码语义？

英文:

So I am hunting a wide-character problem in our DB routines, and stumbled upon the following oddity:

use utf8;
use Encode;
my $x = &quot;&#246;&quot;;
my $decoded  = Encode::decode(&#39;UTF-8&#39;, $x);
my $encoded = Encode::encode(&#39;UTF-8&#39;, $decoded);
my $redecoded = Encode::decode(&#39;UTF-8&#39;, $encoded);
{
    use bytes;
    printf &quot;original: %vx\n&quot;, $x;
    printf &quot;decoded: %vx\n&quot;, $decoded;
    printf &quot;encoded: %vx\n&quot;, $encoded;
    printf &quot;redecoded: %vx\n&quot;, $redecoded;
}

Executing this script gives:

original: c3.b6
decoded: ef.bf.bd
encoded: ef.bf.bd
redecoded: ef.bf.bd

I do not think that 0xef 0xbf 0xbd is valid UTF-8. If I remove the use utf8 statement, the output is, as expected:

original: c3.b6
decoded: c3.b6
encoded: c3.b6
redecoded: c3.b6

So why is use utf8; changing the decoding semantics?

答案1

得分: 3

use utf8 告诉 Perl 使用 UTF-8 解码脚本而不是 ASCII。这对 Encode 没有影响。你得到不同的结果是因为你向 decode 传递了不同的字符串。

你没有注意到这一点，因为你错误地使用了 use bytes;。（明确地说，使用 use bytes; 总是错误的。）让我们去掉它，然后再运行你的程序。

具体来说，我会使用以下代码：

printf "original: %vx\n", $x;
printf "decode('UTF-8', %vx): %vx\n", $x, $decoded;
printf "encode('UTF-8', %vx): %vx\n", $decoded, $encoded;
printf "decode('UTF-8', %vx): %vx\n", $encoded, $redecoded;

使用 use utf8;：

original: f6
decode('UTF-8', f6): fffd
encode('UTF-8', fffd): ef.bf.bd
decode('UTF-8', ef.bf.bd): fffd

使用 use utf8;，Perl 期望你的程序以 UTF-8 编码，因此你的程序相当于 my $x = "\N{LATIN SMALL LETTER O WITH DIAERESIS}";。换句话说，你的字符串包含一个值为 0xf6 的单个字符。这不是有效的 UTF-8，因此当你将其传递给 decode 'UTF-8' 时，会得到垃圾（U+FFFD 替代字符）。

将 my $decoded = Encode::decode('UTF-8', $x); 替换为 my $decoded = $x; 以获得预期的结果。

original: f6
decoded: f6
encode('UTF-8', f6): c3.b6
decode('UTF-8', c3.b6): f6

没有 use utf8;：

original: c3.b6
decode('UTF-8', c3.b6): f6
encode('UTF-8', f6): c3.b6
decode('UTF-8', c3.b6): f6

没有 use utf8;，Perl 期望你的程序以 ASCII 编码。它不可能包含 ö，因为那不是一个 ASCII 字符。相反，你有相当于 my $x = "\xc3\xb6";（因为字符串字面量是“8 位清洁的”）。这是有效的 UTF-8，因此可以被解码。

英文:

use utf8 tells Perl to decode the script using UTF-8 instead of ASCII. It has no effect on Encode. You are getting different results because you are passing different strings to decode.

You failed to notice this because of your incorrect use of use bytes;. (To be clear, using use bytes; is always incorrect.) Let's remove that and run your program again.

Specifically, I'll use the following:

printf &quot;original: %vx\n&quot;, $x;
printf &quot;decode(&#39;UTF-8&#39;, %vx): %vx\n&quot;, $x,       $decoded;
printf &quot;encode(&#39;UTF-8&#39;, %vx): %vx\n&quot;, $decoded, $encoded;
printf &quot;decode(&#39;UTF-8&#39;, %vx): %vx\n&quot;, $encoded, $redecoded;

With use utf8;:

original: f6
decode(&#39;UTF-8&#39;, f6): fffd
encode(&#39;UTF-8&#39;, fffd): ef.bf.bd
decode(&#39;UTF-8&#39;, ef.bf.bd): fffd

With use utf8;, Perl expects your program to be encoded using UTF-8, so your program has the equivalent of my $x = "\N{LATIN SMALL LETTER O WITH DIAERESIS}";. Put differently, your string contains a single char with value 0xf6. This is not valid UTF-8, so you get garbage (U+FFFD REPLACEMENT CHARACTER) when you pass it to decode 'UTF-8'.

Replace my $decoded = Encode::decode('UTF-8', $x); with my $decoded = $x; to get the expected results.

original: f6
decoded: f6
encode(&#39;UTF-8&#39;, f6): c3.b6
decode(&#39;UTF-8&#39;, c3.b6): f6

Without use utf8;:

original: c3.b6
decode(&#39;UTF-8&#39;, c3.b6): f6
encode(&#39;UTF-8&#39;, f6): c3.b6
decode(&#39;UTF-8&#39;, c3.b6): f6

Without use utf8;, Perl expects your program to be encoded using ASCII. It can't possibly contain ö as that is not an ASCII character. Instead, you have the equivalent of my $x = "\xc3\xb6"; (because string literals are "8-bit clean"). This is valid UTF-8, and can thus be decoded.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

`use utf8` 对 `Encode` 有什么影响？

问题

答案1

如何处理泰文字符串中的组合字符以及\p{L}模式？

获取文件编码 ASCII 或 EBCDIC 使用 Java

Pass MAGICK_THREAD_LIMIT to PerlMagick

Swift 编码：多种 .encode 策略？

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。