Grepping for Russian characters using character ranges

Question

How can I grep for lines containing 'Й' and 'й' in a text file, using character ranges?

In Unicode, Russian capital letters (except 'Ё') occupy the range 0x410 to 0x42f in alphabetical order, and small letters (except 'ё') occupy the range 0x430 to 0x44f, also in alphabetical order. This means that [А-ИК-ЯЁ] should match all Russian capital letters except 'Й', and [а-ик-яё] should match all Russian small letters except 'й'. But this turns out not to be quite the case.

For experimenting, I created a function that outputs Russian characters [Ж-Мж-м], one per line:

rus () { for char in Ж З И Й К Л М ж з и й к л м; do echo "$char"; done; }

I also exported the appropriate collate setting:

export LC_COLLATE=ru_RU.UTF-8
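
Note that LC_COLLATE only selects the collation order; the encoding used to turn bytes into characters is governed by LC_CTYPE, so it may be worth checking that category too. The following is a minimal sketch, assuming a POSIX-style shell; `effective_ctype` is a hypothetical helper name, not a standard command. It applies the POSIX precedence LC_ALL, then LC_CTYPE, then LANG.

```shell
# effective_ctype: hypothetical helper that reports which locale name
# governs LC_CTYPE, following the POSIX precedence
# LC_ALL > LC_CTYPE > LANG (an empty value counts as unset).
effective_ctype() {
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}

# Show the value in effect for the current shell.
effective_ctype
```

If this prints C or POSIX rather than a UTF-8 locale, the pattern and the input may be read as plain bytes, whatever LC_COLLATE says.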

Without character ranges everything worked as expected:

rus | grep -v "[АБВГДЕЁЖЗИКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя]"

and

rus | grep -v "[АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзиклмнопрстуфхцчшщъыьэюя]"

output 'Й' and 'й' respectively.

With character ranges, [А-ИК-ЯЁа-ик-яё] should match all Russian characters except 'Й' and 'й', and this turned out to be correct. But when I wanted to filter only 'Й' or only 'й', something interesting happened: on my system, both

rus | grep -v "[А-ЯЁа-ик-яё]"  # expected output: 'й'

and

rus | grep -v "[А-ИК-ЯЁа-яё]"  # expected output: 'Й'

output nothing!

'Й' and 'й' are not special in this respect; the analogous experiment with the letters 'П' and 'п' showed the same effect.

Is grep maybe, for some reason, handling Russian or Cyrillic characters case-insensitively by default in character ranges? No, it is not: adding --no-ignore-case to all those grep commands changed nothing.

What's going on? Have I found a bug in grep? Or am I missing something?

(I am using GNU grep 3.11 (built with pcre), and bash 5.1.16.)

Answer 1

Score: 3

First, you should quote the argument to grep; if you don't, and you have a file in the current directory whose name is a single Russian letter, that letter will be the only thing passed to grep.

But the problem is that grep without PCRE appears to work bytewise, regardless of locale settings. So I think you need to turn on Perl-compatible mode with -P:

$ rus | grep -Pv '[А-ЯЁа-ик-яё]'
й
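
One way to see why a bytewise reading would produce the behaviour in the question is to look at the UTF-8 bytes involved. The sketch below is only illustrative: `utf8_hex` is a hypothetical helper (not part of grep), and it assumes the script is saved as UTF-8 and that `od` and `xargs` are available.

```shell
# utf8_hex: hypothetical helper that prints the bytes of its argument
# as space-separated hex pairs (assumes the text is UTF-8 encoded).
utf8_hex() {
    printf '%s' "$1" | od -An -v -tx1 | xargs
}

utf8_hex 'А-Я'   # d0 90 2d d0 af
utf8_hex 'й'     # d0 b9
utf8_hex 'Й'     # d0 99
```

Read as bytes, [А-Я] is no longer a range of letters: it contains the byte d0, the byte range 90 to d0 (the hyphen now sits between 90 and d0), and the byte af. Both bytes of 'й' (d0 b9) fall inside that set, which would explain why grep -v prints nothing.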

Whenever you suspect problems interpreting the argument to grep, a good sanity check is to fall back to sending it pure-ASCII strings, using the \x{...} syntax for non-ASCII characters (which is also a feature of PCRE, so only works with -P):

$ rus | grep -Pv '[\x{0410}-\x{042f}Ё\x{0430}-\x{0438}\x{043a}-\x{044f}ё]'
й
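
Writing the \x{...} escapes by hand is error-prone, so a small helper can generate them. The sketch below is an illustration under assumptions: `to_pcre_escape` is a hypothetical function name, it needs a POSIX-style shell with `od`, and it expects its argument to be valid UTF-8. It decodes each UTF-8 sequence to its code point and prints it in the \x{...} form that grep -P accepts.

```shell
# to_pcre_escape: hypothetical helper that prints each character of its
# argument as a PCRE \x{....} escape. The body runs in a subshell so the
# positional parameters and variables it uses do not leak to the caller.
to_pcre_escape() (
    # Split the decimal byte values printed by od into positional parameters.
    set -- $(printf '%s' "$1" | od -An -v -tu1)
    out=''
    while [ "$#" -gt 0 ]; do
        b=$1; shift
        # The lead byte tells how many continuation bytes follow.
        if   [ "$b" -lt 128 ]; then cp=$b;          n=0
        elif [ "$b" -lt 224 ]; then cp=$((b & 31)); n=1
        elif [ "$b" -lt 240 ]; then cp=$((b & 15)); n=2
        else                        cp=$((b & 7));  n=3
        fi
        # Each continuation byte contributes its low six bits.
        while [ "$n" -gt 0 ]; do
            cp=$(( (cp << 6) | ($1 & 63) ))
            shift; n=$((n - 1))
        done
        out="$out$(printf '\\x{%04x}' "$cp")"
    done
    printf '%s\n' "$out"
)

to_pcre_escape 'й'    # \x{0439}
to_pcre_escape 'АЯ'   # \x{0410}\x{042f}
```

The helper is meant for individual letters. A hyphen passed to it would come back as \x{002d}, which PCRE treats as a literal hyphen rather than a range operator, so the range hyphens still have to be typed between the escapes.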

Answer 2

Score: 3

If you want to match characters of the Cyrillic alphabet, you can use a perl Unicode script regex:

rus | grep -P "\p{Cyrillic}"

This should output only the lines that contain Cyrillic characters.

Answer 3

Score: 0

If speed is of any concern, it is much faster to use this chain of octal escapes and scan rows for Cyrillic at the byte level:

7.115 secs for gnu-grep vs. 3.137 secs for mawk2 (2.27x)

(There are no spaces or line breaks within the actual regex; I added them here only for formatting.)

( time  ( pvE0 < "$m3t" | 

  mawk2 '/([\320-\323]|\352\231)[\200-\277]|
            \324     [\200-\257]|
            \341(\262[\200-\210]|\264\253|[\265\267]\270)|
            \342(\267[\240-\277]|\271\203)|
            \352 \232[\200-\237]|
            \357 \270[\256\257]/' ) | pvE9 )

  in0: 1.85GiB 0:00:03 [ 606MiB/s] [ 606MiB/s] [========>] 100%
 out9: 60.7KiB 0:00:03 [19.5KiB/s] [19.5KiB/s] [<=> ]
( pvE 0.1 in0 < "$m3t" | mawk2 ; )

3.03s user 0.45s system 111% cpu 3.137 total

e23cb29c6564c3da39a67c0cb320f323  stdin

 out9: 60.7KiB 0:00:07 [8.56KiB/s] [8.56KiB/s] [ <=> ]
  in0: 1.85GiB 0:00:07 [ 266MiB/s] [ 266MiB/s] [========>] 100%

( pvE 0.1 in0 < "$m3t" | ggrep -P '\p{Cyrillic}'; )

7.02s user 0.46s system 105% cpu 7.115 total

e23cb29c6564c3da39a67c0cb320f323  stdin
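
For readers without mawk2 or the pvE helpers used above, the same byte-level idea can be tried with any awk. The sketch below is a simplified assumption, not the full regex from this answer: it covers only the basic Cyrillic block U+0400 to U+04FF (lead bytes \320 to \323 followed by one continuation byte), and `has_cyrillic` is a hypothetical name. LC_ALL=C is set so that awk treats the input as bytes.

```shell
# has_cyrillic: hypothetical filter that keeps only the lines containing
# at least one character from U+0400..U+04FF, matched as raw UTF-8 bytes.
has_cyrillic() {
    LC_ALL=C awk '/[\320-\323][\200-\277]/'
}

# Prints the second and third lines; the first has no Cyrillic.
printf 'abc\nЖук\n123 й\n' | has_cyrillic
```

The extra alternatives in the full regex above extend the match to Cyrillic ranges outside this block.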

huangapple
  • Published on August 8, 2023, 23:37:58
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76861113.html