Why doesn't sed's dot match ù in latin1 encoding?
Question
I have two files containing the text aùb, but one, critic_utf8, is encoded in UTF-8 and the other, critic_latin1, in latin1, so their content is like this:
$ od -a critic_utf8
0000000 a C 9 b nl
0000005
$ od -a critic_latin1
0000000 a y b nl
0000004
Now, leaving aside that I don't know what that y (which corresponds to ù) in the second output is (and I'd like to understand, so a subquestion is: what is that y?), it seems to me that Sed's . doesn't match it:
$ sed 's/.*/x/' critic_latin1
xùb
$ sed 's/.*/x/' critic_utf8
x
$ sed 's/./x/g' critic_latin1
xùx
$ sed 's/./x/g' critic_utf8
xxx
What does this mean? That Sed cannot work with latin1-encoded text files? I thought . would match everything but newline, yet here there is something else it does not match. And I know that ù is not being treated by . the way \n would be, as shown by this:
$ sed -z 's/.*/x/' critic_latin1
xùb
I noticed this while playing around with *.idx and *.dat files (those with words and synonyms), when trying to experiment with what I found in this answer.
Answer 1
Score: 1
Two steps:
- The sed command reads your file according to the LANG variable, whose value has the form language_COUNTRY.CHARSET.
- The sed command's output is interpreted by your terminal according to its own configuration.
I reproduced your output with a LANG variable configured with the UTF-8 charset and a terminal configured with ISO-8859-1 (latin1) encoding:
> export LANG=fr_FR.UTF-8; echo "latin1"; sed 's/.*/x/' critic_latin1 ; echo "utf-8"; sed 's/.*/x/' critic_utf8; echo "latin1/g"; sed 's/./x/g' critic_latin1; echo "utf-8/g"; sed 's/./x/g' critic_utf8
latin1
xùb
utf-8
x
latin1/g
xùx
utf-8/g
xxx
A LANG value with a UTF-8 charset tells sed to work with UTF-8 characters, but in your critic_latin1 the ù character is encoded in ISO-8859-1 (a single byte). That byte is not a valid UTF-8 sequence, so sed's . does not match it and the byte passes through unchanged.
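To make the byte-level difference concrete, here is a small sketch that recreates both files and dumps them in hex. The temporary directory and the variable names are only illustrative. It also sheds light on the y in the question: od -a ignores the high-order bit of each byte, so the latin1 byte F9 is shown as 79, which is ASCII y (and likewise C3 and B9 are shown as C and 9).

```shell
# Recreate the two files in a temporary directory (illustrative location).
tmp=$(mktemp -d)
printf 'a\303\271b\n' > "$tmp/critic_utf8"    # ù as the two bytes C3 B9
printf 'a\371b\n'     > "$tmp/critic_latin1"  # ù as the single byte F9

# Dump in hex rather than with od -a, so the high bit of each byte stays visible.
utf8_hex=$(od -An -tx1 "$tmp/critic_utf8" | tr -d ' \n')
latin1_hex=$(od -An -tx1 "$tmp/critic_latin1" | tr -d ' \n')

echo "utf8:   $utf8_hex"      # utf8:   61c3b9620a
echo "latin1: $latin1_hex"    # latin1: 61f9620a

rm -rf "$tmp"
```

A lone F9 followed by the ASCII byte 62 cannot be decoded as UTF-8, which is why a sed running in a UTF-8 locale leaves it alone.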
If you want to work with files encoded differently from your LANG variable, set LANG=... before your commands, like this:
> export LANG=fr_FR.ISO-8859-1; echo "latin1"; sed 's/.*/x/' critic_latin1 ; echo "utf-8"; sed 's/.*/x/' critic_utf8; echo "latin1/g"; sed 's/./x/g' critic_latin1; echo "utf-8/g"; sed 's/./x/g' critic_utf8
latin1
x
utf-8
x
latin1/g
xxx
utf-8/g
xxxx
This is really useful with text data files (such as ISAM files).
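As a hedged alternative to exporting LANG for the whole session, the locale can be set for a single command, or the file can be converted to UTF-8 first. The sketch below assumes GNU sed and an iconv that knows ISO-8859-1; the temporary path and variable names are only illustrative. It uses LC_ALL rather than LANG because LC_ALL takes precedence: if LC_ALL is already set in your environment, changing LANG alone has no effect.

```shell
# Recreate the latin1 file in a temporary directory (illustrative location).
tmp=$(mktemp -d)
printf 'a\371b\n' > "$tmp/critic_latin1"

# Option 1: set the locale for this one command only. In the C locale
# every byte counts as one character, so . also matches the F9 byte.
bytewise=$(LC_ALL=C sed 's/./x/g' "$tmp/critic_latin1")
echo "$bytewise"         # xxx

# Option 2: convert the file to UTF-8, after which a sed running in a
# UTF-8 locale sees ù as one valid character. Here we only check the bytes.
converted_hex=$(iconv -f ISO-8859-1 -t UTF-8 "$tmp/critic_latin1" | od -An -tx1 | tr -d ' \n')
echo "$converted_hex"    # 61c3b9620a

rm -rf "$tmp"
```

The C locale works at the byte level, so it is convenient when the exact encoding does not matter; converting with iconv is the better fit when the pattern itself contains non-ASCII characters.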