How to get the `sort` shell command to compare raw bytes?

Question

It seems like the POSIX sort command-line utility does some fancy locale-based shenanigans to compare the given strings.

I scanned the man page but could not find a way to make it use the raw byte values instead.
Is there a way to get sort (I have the GNU coreutils version) to behave like
qsort(array_of_my_strings, N, strcmp) would in C? Solutions using a tool other than sort would be fine too.

For demonstration, I currently get:

printf "\xC3\xBC\n\x76\n" | sort
ü
v

because the German umlaut ü seems to be compared as u, which comes before v, despite \xC3 being larger than \x76.

What I want is:

printf "\xC3\xBC\n\x76\n" | sort --raw-bytes-please
v
ü

Answer 1

Score: 6

Collation order and (multi-byte) character type are influenced by your locale. The locale name for disabling multibyte and locale-aware behaviors is C.

Thus:

LC_COLLATE=C LC_CTYPE=C sort

...will set only the character type and the collation order. This assumes LC_ALL isn't set; if it is, it overrides both variables and they are ignored.
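To guard against LC_ALL being set in the environment, one option is to unset it in a subshell before setting the two categories. This is only a sketch run against the question's input: the variable name sorted is made up for illustration, and the input bytes are written as octal escapes (\303\274 is ü) because not every printf understands \x escapes.

```shell
# Unset LC_ALL inside a subshell so that LC_COLLATE and LC_CTYPE take effect
# there, without touching the calling shell's environment.
sorted=$( (unset LC_ALL; printf '\303\274\nv\n' | LC_COLLATE=C LC_CTYPE=C sort) )

# Print the sorted lines, one per line.
printf '%s\n' "$sorted"
```

With byte-wise collation this should print v first and ü second, which is the order asked for in the question.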


As a big hammer, you can also use:

LC_ALL=C sort

albeit with side effects, such as changing the language of error messages etc. to the untranslated strings originally written by sort's developers.
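If this comes up often, the LC_ALL=C form could be wrapped in a small shell function. The sketch below assumes a POSIX-style shell; bytesort is a hypothetical name, not an existing command, and the sample input is invented to show that byte order also puts uppercase ASCII before lowercase, as strcmp would.

```shell
# bytesort is a hypothetical helper name, not a standard command.
# It forwards all arguments to sort, with the C locale forced for that call only.
bytesort() {
    LC_ALL=C sort "$@"
}

# Uppercase ASCII (0x41, 0x42) sorts before lowercase (0x61, 0x62, 0x76),
# and the UTF-8 bytes of "ü" (0xC3 0xBC, written as octal escapes) sort last.
result=$(printf 'b\n\303\274\nA\nv\nB\na\n' | bytesort)
printf '%s\n' "$result"
```

This should print A, B, a, b, v, ü, one per line. Because LC_ALL=C is applied only to the sort invocation inside the function, the rest of the shell session keeps its normal locale.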


huangapple
  • Posted on May 25, 2023 at 19:28:04
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76331773.html