January 9, 2023, 03:25:33

Why doesn't sed's dot match ù in Latin-1 encoding?

Question

I have two files containing the text aùb. One, critic_utf8, is encoded in UTF-8 and the other, critic_latin1, in Latin-1, so their contents look like this:

$ od -a critic_utf8 
0000000   a   C   9   b  nl
0000005
$ od -a critic_latin1 
0000000   a   y   b  nl
0000004

Leaving aside that I don't know what the y in the second output (the one corresponding to ù) is (I'd like to understand it, so a subquestion is: what is that y?), it seems to me that sed's . doesn't match it:

$ sed 's/.*/x/' critic_latin1
xùb
$ sed 's/.*/x/' critic_utf8
x
$ sed 's/./x/g' critic_latin1
xùx
$ sed 's/./x/g' critic_utf8
xxx

What does this mean? That sed cannot work with Latin-1-encoded text files? I thought . would match every character except newline, but here it also fails to match something else. And ù is not being treated by . the way \n would be, as this shows:

$ sed -z 's/.*/x/' critic_latin1
xùb

I noticed this while playing around with *.idx and *.dat files (those with words and synonyms), when trying out what I found in this answer.

Answer 1

Score: 1

Two steps:

The sed command reads your file according to the LANG variable, whose value has the form language_COUNTRY.CHARSET.
The output of sed is then interpreted by your terminal according to the terminal's own configuration.
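
To see what the first step is working with on a given machine, the locale utility can be queried directly. This is only a minimal sketch: it assumes the POSIX locale command is installed, and it relies on the usual precedence in which LC_ALL overrides LC_CTYPE, which in turn overrides LANG.

```shell
# All locale categories as the current shell would pass them to sed.
locale

# Only the character encoding (charmap) of the active locale.
locale charmap

# LC_ALL takes precedence over LANG, so the charmap reported here is
# that of the C locale even though LANG names a UTF-8 locale.
LANG=fr_FR.UTF-8 LC_ALL=C locale charmap
```

If LC_ALL or LC_CTYPE is already set in your environment, changing LANG alone will not change how sed decodes its input.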

I can reproduce your output with LANG set to a UTF-8 charset and a terminal configured for ISO-8859-1 (Latin-1) encoding:

> export LANG=fr_FR.UTF-8; echo "latin1"; sed 's/.*/x/' critic_latin1 ; echo "utf-8"; sed 's/.*/x/' critic_utf8; echo "latin1/g"; sed 's/./x/g' critic_latin1; echo "utf-8/g"; sed 's/./x/g' critic_utf8
latin1
xùb
utf-8
x
latin1/g
xùx
utf-8/g
xxx

A LANG value with a UTF-8 charset tells sed to work with UTF-8 characters, but critic_latin1 contains a ù encoded in ISO-8859-1 as a single byte (0xF9). That byte is not a valid UTF-8 sequence, so sed's . does not match it and the byte passes through unchanged.
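
The byte-level picture can be checked with a hex dump. The sketch below recreates the question's bytes with printf (an assumption about how the files were produced) instead of reading the files, and uses iconv only as a validity check. It also bears on the subquestion about the y: od -a ignores the high-order bit of each byte, so 0xF9 is shown as 0x79 (y), and the UTF-8 pair 0xC3 0xB9 is shown as 0x43 0x39 (C and 9).

```shell
# ù as UTF-8 is two bytes (0xC3 0xB9); as Latin-1 it is one byte (0xF9).
# The octal escapes \303 \271 and \371 are those same byte values.
printf 'a\303\271b\n' | od -An -tx1   # bytes: 61 c3 b9 62 0a
printf 'a\371b\n'     | od -An -tx1   # bytes: 61 f9 62 0a

# iconv exits with a non-zero status when its input is not valid in the
# declared source encoding, so it can serve as a UTF-8 validity check.
if printf 'a\371b\n' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
    echo "Latin-1 sample: valid UTF-8"
else
    echo "Latin-1 sample: not valid UTF-8"
fi
```

Using od -tx1 rather than od -a avoids the high-bit masking and shows the real byte values.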

If you want to work with files whose encoding differs from your LANG setting, set LANG=... before running your commands, like this:

> export LANG=fr_FR.ISO-8859-1; echo "latin1"; sed 's/.*/x/' critic_latin1 ; echo "utf-8"; sed 's/.*/x/' critic_utf8; echo "latin1/g"; sed 's/./x/g' critic_latin1; echo "utf-8/g"; sed 's/./x/g' critic_utf8
latin1
x
utf-8
x
latin1/g
xxx
utf-8/g
xxxx

This is really useful with text data files (such as ISAM files).
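
As a variation on the export above, the locale can also be set for a single command, without changing the rest of the session. This sketch is not the method used in the answer: it swaps in LC_ALL=C (a byte-oriented locale that overrides LANG) because the fr_FR.ISO-8859-1 locale may not be installed everywhere. The helper names latin1_sample and utf8_sample are hypothetical stand-ins for the two files.

```shell
# Hypothetical helpers that emit the same bytes as the question's files.
latin1_sample() { printf 'a\371b\n'; }       # ù as one byte: 0xF9
utf8_sample()   { printf 'a\303\271b\n'; }   # ù as two bytes: 0xC3 0xB9

# The assignment applies to this one sed invocation only. In the C
# locale every byte counts as one character, so . also matches 0xF9.
latin1_sample | LC_ALL=C sed 's/./x/g'   # prints xxx
utf8_sample   | LC_ALL=C sed 's/./x/g'   # prints xxxx (ù is two bytes)

# Alternative: convert the data to UTF-8 first and leave the locale alone.
latin1_sample | iconv -f ISO-8859-1 -t UTF-8 | sed 's/.*/x/'   # prints x
```

The per-command form keeps the change local, whereas the iconv route is the better fit when the output should end up in UTF-8 anyway.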
