Perl problem with substituting a UTF-8 string on Windows

Question

I am trying to substitute substrings in a text file with Perl on Windows 10, using the command line.

C:\Windows\System32\chcp 65001 & type test.txt | c:\Strawberry\perl\bin\perl -CSD -pe "use open ':std', ':encoding(UTF-8)'; binmode(STDOUT, ':utf8'); binmode(STDIN, ':encoding(utf8)'); s/__compare_loan__/So sánh sản phẩm cho vay/g" 

File test.txt (saved as UTF-8):

Our benefit: __compare_loan__

Output:

> Active code page: 65001
> Our benefit: So sánh s?n ph?m cho vay

If I add use utf8; at the beginning of the Perl script, I get:

> Active code page: 65001
> Malformed UTF-8 character: \xe1\x6e\x68 (unexpected non-continuation byte 0x6e, immediately after start byte 0xe1; need 3 bytes, got 1) at -e line 1.
> Malformed UTF-8 character (fatal) at -e line 1.

How do I get rid of the question marks in the output?

Answer 1

Score: 4

The error you get when you add use utf8; to your one-liner suggests that the arguments to perl are being passed in CP-1252 or a similar code page, not in UTF-8 (in CP-1252, 0xE1 is á, 0x6E is n and 0x68 is h).
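The byte-level reasoning above can be checked with a short script. This is only an illustrative sketch using the core Encode module, not something from the original answer: it decodes the three bytes from the error message once as CP-1252 and once as strict UTF-8. The variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The three bytes reported in the "Malformed UTF-8 character" message.
my $bytes = "\xe1\x6e\x68";

# Read as CP-1252, each byte is one character: U+00E1 (a with acute), n, h.
my $as_cp1252   = decode('cp1252', $bytes);
my @code_points = map { sprintf('U+%04X', ord($_)) } split //, $as_cp1252;

# Read as strict UTF-8, the start byte 0xE1 must be followed by two
# continuation bytes, so decoding fails. FB_CROAK makes decode() die on
# malformed input; LEAVE_SRC keeps $bytes unmodified.
my $valid_utf8 = eval {
    decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
} ? 1 : 0;

print "as CP-1252: @code_points\n";
print "valid UTF-8: ", ($valid_utf8 ? 'yes' : 'no'), "\n";
```

The script reports the code points U+00E1 U+006E U+0068 for the CP-1252 reading and "no" for UTF-8 validity, which is consistent with the shell handing perl the text "ánh" in a single-byte code page.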

One portable fix is to use character escapes instead of trying to include the non-ASCII characters directly:

C:\Code\SO> chcp 65001 & type test.txt | perl -CSD -pe "s/__compare_loan__/So s\x{e1}nh s\x{1ea3}n ph\x{1ea9}m cho vay/g"
Active code page: 65001
Our benefit: So sánh sản phẩm cho vay

(Tested with Strawberry Perl 5.32.1 and the standard Windows 10 Command Prompt application)
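Writing the \x{...} escapes by hand gets tedious for longer phrases. The following is a minimal sketch of a helper that generates them; to_perl_escapes is a hypothetical name, not part of Perl or of the original answer, and the script is assumed to be saved as a UTF-8 file so that use utf8; applies to the literal. It only rewrites non-ASCII characters, so characters that are special in a substitution (such as /, $ or @) would still need escaping separately.

```perl
use strict;
use warnings;
use utf8;   # assumption: this source file is saved as UTF-8

# Hypothetical helper: rewrite every non-ASCII character as a \x{...}
# escape, so the result is pure ASCII and can be pasted into a one-liner.
sub to_perl_escapes {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('\\x{%x}', ord($1))/ge;
    return $text;
}

# The output is ASCII only, so no output layer is needed here.
print to_perl_escapes('So sánh sản phẩm cho vay'), "\n";
```

For the phrase from the question, this produces the same escaped text that is used in the one-liner above.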

Note that using -CSD means you don't need all that use open and binmode stuff; it's all implied by the arguments to -C.

huangapple
  • Posted on May 28, 2023, 23:24:17
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76352204.html