2023-08-08 23:37:58

grepping for Russian characters using character ranges

Question

How can I grep a text file for lines containing 'Й' and 'й' using character ranges?

In Unicode, Russian capital characters (except 'Ё') are in the range from 0x410 to 0x42f in alphabetical order, and small characters (except 'ё') are in the range from 0x430 to 0x44f in alphabetical order. This means that [А-ИК-ЯЁ] should match all Russian characters except 'Й', and [а-ик-яё] should match all Russian characters except 'й'. But this turns out to be not quite the case.
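The code points quoted above can be checked from the shell. The sketch below is an illustration and not part of the original question: `codepoint` is a hypothetical helper that decodes a two-byte UTF-8 sequence by hand. It assumes the script is saved as UTF-8 and only handles characters from U+0080 to U+07FF, which covers every Russian letter.

```shell
# codepoint: hypothetical helper, not from the original post.
# Prints the Unicode code point of a character encoded as two UTF-8 bytes.
codepoint() {
  # od dumps the raw bytes in hex, so the result does not depend on the locale.
  set -- $(printf '%s' "$1" | od -An -tx1)
  # Two-byte UTF-8 layout: 110xxxxx 10yyyyyy -> code point xxxxxyyyyyy.
  printf 'U+%04X\n' $(( ((0x$1 & 0x1f) << 6) | (0x$2 & 0x3f) ))
}

for letter in А Й Я а й я Ё ё; do
  printf '%s ' "$letter"
  codepoint "$letter"
done
```

The loop reports U+0410 and U+042F for 'А' and 'Я', U+0430 and U+044F for 'а' and 'я', and U+0401 and U+0451 for 'Ё' and 'ё', which is why the two ё letters have to be listed separately in the bracket expressions.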

For experimenting, I created a function that outputs Russian characters [Ж-Мж-м], one per line:

rus () { for char in Ж З И Й К Л М ж з и й к л м; do echo "$char"; done; }

I also exported the appropriate collate setting:

export LC_COLLATE=ru_RU.UTF-8

Without character ranges everything worked as expected:

rus | grep -v "[АБВГДЕЁЖЗИКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя]"

and

rus | grep -v "[АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзиклмнопрстуфхцчшщъыьэюя]"

output 'Й' and 'й' respectively.

With character ranges, [А-ИК-ЯЁа-ик-яё] should match all Russian characters except 'Й' and 'й', and this turned out to be correct. But when I wanted to filter only 'Й' or only 'й', something interesting happened: on my system, both

rus | grep -v "[А-ЯЁа-ик-яё]"  # expected output: 'й'

and

rus | grep -v "[А-ИК-ЯЁа-яё]"  # expected output: 'Й'

output nothing!

'Й' and 'й' are not special in this respect; the analogous experiment with the letters 'П' and 'п' showed the same effect.

Is grep maybe, for some reason, handling Russian or Cyrillic characters case-insensitively by default in character ranges? No, it is not: adding --no-ignore-case to all those grep commands changed nothing.

What's going on? Have I found a bug in grep? Or am I missing something?

(I am using GNU grep 3.11 (built with pcre), and bash 5.1.16.)

Answer 1

Score: 3

First, you should quote the argument to grep; if you don't, and you have a file in the current directory whose name is a single Russian letter, that letter will be the only thing passed to grep.
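The quoting pitfall can be reproduced without grep at all. The following is a minimal sketch that uses an ASCII stand-in (a file named `b` and the pattern `[abc]`) so that the result does not depend on the locale; the variable names and the scratch directory are assumptions made for the illustration.

```shell
# Work in a scratch directory so the demonstration touches nothing else.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch b            # a file whose name is a single character from the set

# Unquoted: the shell expands the glob before the command ever runs.
unquoted=$(printf '%s\n' [abc])
# Quoted: the bracket expression is passed through untouched.
quoted=$(printf '%s\n' '[abc]')

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

With the file present, the unquoted form hands the command the file name `b` instead of the pattern, while the quoted form passes `[abc]` literally.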

But the problem is that grep without PCRE appears to work bytewise, regardless of locale settings. So I think you need to turn on Perl-compatible mode with -P:

$ rus | grep -Pv '[А-ЯЁа-ик-яё]'
й

Whenever you suspect problems interpreting the argument to grep, a good sanity check is to fall back to sending it pure-ASCII strings, using the \x{...} syntax for non-ASCII characters (which is also a feature of PCRE, so only works with -P):

$ rus | grep -Pv '[\x{0410}-\x{042f}Ё\x{0430}-\x{0438}\x{043a}-\x{044f}ё]'
й

Answer 2

Score: 3

If you want to match characters of the Cyrillic alphabet, you can use a Perl Unicode script property in the regex:

rus | grep -P "\p{Cyrillic}"

This should output only the lines that contain Cyrillic characters.

Answer 3

Score: 0

If speed is a concern, it is much faster to use this chain of octal codes and scan the lines for Cyrillic at the byte level:

7.115 secs for gnu-grep vs. 3.137 secs for mawk2 (2.27x)

(There are no spaces or line breaks within the actual regex; I added them here only for formatting.)

( time  ( pvE0 < "$m3t" |
  mawk2 '/([\320-\323]|\352\231)[\200-\277]|
            \324     [\200-\257]|
            \341(\262[\200-\210]|\264\253|[\265\267]\270)|
            \342(\267[\240-\277]|\271\203)|
            \352 \232[\200-\237]|
            \357 \270[\256\257]/' ) | pvE9 )
  in0: 1.85GiB 0:00:03 [ 606MiB/s] [ 606MiB/s] [========>] 100%
 out9: 60.7KiB 0:00:03 [19.5KiB/s] [19.5KiB/s] [<=> ]
( pvE 0.1 in0 < "$m3t" | mawk2 ; )
3.03s user 0.45s system 111% cpu 3.137 total
e23cb29c6564c3da39a67c0cb320f323  stdin

 out9: 60.7KiB 0:00:07 [8.56KiB/s] [8.56KiB/s] [ <=> ]
  in0: 1.85GiB 0:00:07 [ 266MiB/s] [ 266MiB/s] [========>] 100%

( pvE 0.1 in0 < "$m3t" | ggrep -P '\p{Cyrillic}'; )
7.02s user 0.46s system 105% cpu 7.115 total
e23cb29c6564c3da39a67c0cb320f323  stdin
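The same byte-level idea can be narrowed to the original question. This is a sketch under stated assumptions, not the answerer's command: it assumes UTF-8 input and a plain `awk` that accepts octal escapes in regular expressions, and it reuses the `rus` function from the question. The pattern lists the UTF-8 bytes of every Russian letter except 'й', so only the 'й' line survives the negated match.

```shell
# rus: the test function from the question, one Russian letter per line.
rus () { for char in Ж З И Й К Л М ж з и й к л м; do echo "$char"; done; }

# Drop every line containing a Russian letter other than 'й' (bytes \320\271).
#   \320[\201\220-\270\272-\277]  Ё, А-Я, а-и, к-п
#   \321[\200-\217\221]           р-я, ё
# LC_ALL=C makes awk treat the input as plain bytes.
rus | LC_ALL=C awk '!/\320[\201\220-\270\272-\277]|\321[\200-\217\221]/'
```

Because the match is done on bytes under `LC_ALL=C`, the collation settings of the Russian locale play no part in how the ranges are interpreted.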
