Strange behaviour of string:length function
Question
Why does the presence of the character 'â' cause this to fail?
> Bin = <<"â Hello">>.
> string:length(Bin).
** exception error: bad argument: <<"â Hello">>
in function string:length_1/2 (string.erl, line 557)
Whereas if the binary is converted to a list, it works fine:
> Str = binary_to_list(Bin).
> string:length(Str).
7
Answer 1
Score: 4
The argument for string:length() can be a list of integers (where the integers are between 0...1114111) or a binary in which clumps of integers must form UTF-8 characters.
There are many ways to represent the character â (small letter a with circumflex), and two of them are:
- The Latin-1 integer code 226.
- The UTF-8 representation: 195, 162 (or in hex: C3 A2)
5> <<195, 162>>.
<<"â"/utf8>>
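When you typed in your binary, the character â was stored as the single byte 226, its Latin-1 code: a quoted string inside a binary stores one byte per character unless you add the /utf8 type specifier. In UTF-8, 226 (hex E2) is only valid as the first byte of a three-byte sequence, and the byte that follows it here is 32 (a space), which is not a continuation byte, so erlang gave you a bad argument error.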
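The difference between the two byte layouts can be checked directly. The following is a minimal sketch, not part of the original answer: the module name utf8_demo and its function names are made up for illustration, and it assumes OTP 20 or later for string:length/1. It relies only on unicode:characters_to_binary/3, which returns the binary itself for valid input and an error tuple otherwise.

```erlang
-module(utf8_demo).
-export([latin1_bytes/0, utf8_bytes/0, is_valid_utf8/1]).

%% Code point 226 stored as a single byte, which is what <<"â Hello">> produces.
latin1_bytes() ->
    <<226, " Hello">>.

%% The /utf8 specifier encodes code point 226 as the two bytes 195,162.
utf8_bytes() ->
    <<226/utf8, " Hello">>.

%% Returns true only if Bin is a complete, valid UTF-8 byte sequence.
%% For invalid or truncated input, characters_to_binary/3 returns
%% {error, _, _} or {incomplete, _, _} instead of a binary.
is_valid_utf8(Bin) when is_binary(Bin) ->
    case unicode:characters_to_binary(Bin, utf8, utf8) of
        Bin -> true;
        _ -> false
    end.
```

With this module compiled, string:length(utf8_demo:utf8_bytes()) succeeds, while the same call on utf8_demo:latin1_bytes() raises the bad argument error from the question.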
Next, why does converting the binary to a list work? In erlang, double quotes are a shortcut for creating a list of integers:
6> "abc" =:= [97,98,99].  %% exactly equal
true
Whenever you see double quotes in erlang, you should be thinking: "This is a list." The one exception to that rule is when you use double quotes inside a binary: instead of creating a list, you create a comma-separated series of integers:
8> <<"abc", 0>>.
<<97,98,99,0>>
Adding a 0 is a trick that forces the shell to show you what you really have. Or, you can tell erlang to quit trying to fool you with the double quotes and just show you the truth:
18> shell:strings(false).
true
19> "abc".
[97,98,99]
20> <<226, "Hello">>.
<<226,72,101,108,108,111>>
When you convert a binary to a list (binaries can only contain integers between 0...255), you get a list containing those same integers, and any list of integers between 0...1114111 is a valid argument for string:length(). The integers in a list are taken as character codes rather than UTF-8 bytes, so 226 on its own is accepted.
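If you know that the bytes in a binary are Latin-1 codes, an alternative to going through a list is to re-encode the binary as UTF-8 so that it becomes a valid argument in its own right. This is a small sketch under that assumption; the module name latin1_demo and the function to_utf8/1 are hypothetical names, while unicode:characters_to_binary/3 is the standard library call doing the work.

```erlang
-module(latin1_demo).
-export([to_utf8/1]).

%% Re-encode a binary whose bytes are Latin-1 character codes as UTF-8.
%% Every byte 0..255 is a valid Latin-1 code, so this cannot fail on
%% the encoding itself; it is only meaningful if the input really is Latin-1.
to_utf8(Latin1Bin) when is_binary(Latin1Bin) ->
    unicode:characters_to_binary(Latin1Bin, latin1, utf8).
```

For example, latin1_demo:to_utf8(<<226, " Hello">>) turns the single byte 226 into the pair 195,162, and string:length/1 then accepts the result.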
Finally, note that string:length() doesn't merely return the number of bytes in a binary:
23> string:length(<<195,162,97,98,99>>).
4
string:length() recognizes that the first two bytes, i.e. 195, 162, are the UTF-8 code for small letter a with circumflex, and therefore it only counts the two integers/bytes as one character. On the other hand, if you convert to a list first, string:length() returns the number of integers in the list:
24> string:length(binary_to_list(<<195,162,97,98,99>>)).
5
...which is the same answer you get with byte_size(Binary).
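To see the different notions of "length" side by side, here is a rough sketch that reports all of them for one UTF-8 binary. The module name len_demo and the map keys are invented for this example, and it assumes OTP 20 or later. Note that string:length/1 counts grapheme clusters, so a base letter followed by a combining mark counts once even though it is two code points.

```erlang
-module(len_demo).
-export([lengths/1]).

%% Report three different lengths for a valid UTF-8 binary.
%% If Utf8Bin is not valid UTF-8, this function raises an error
%% (characters_to_list/2 returns an error tuple and length/1 fails).
lengths(Utf8Bin) when is_binary(Utf8Bin) ->
    #{bytes => byte_size(Utf8Bin),
      code_points => length(unicode:characters_to_list(Utf8Bin, utf8)),
      graphemes => string:length(Utf8Bin)}.
```

For <<195,162,97,98,99>> this gives 5 bytes, 4 code points and 4 grapheme clusters. For <<195,159,226,134,145,101,204,138>>, which ends in "e" plus a combining ring, it gives 8 bytes, 4 code points and 3 grapheme clusters.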
Answer 2
Score: 2
Response to comment:
> I can get ASCII, latin or unicode chars - what is the best way to find
> the length of string I got?
-module(a).
-compile(export_all).

get_response() ->
    [226,97,98,99,195,162].  %% Remember a double quoted string is just
                             %% a shortcut for creating a list of integers.
%%    ^            ^   ^
%%    |            |   |
%% latin-1         utf-8
%% small letter a with circumflex

count(Bin) ->
    count(Bin, 0).

count(<<_Head/utf8, Tail/binary>>, Count) ->
    count(Tail, Count+1);
count(<<_Head/integer, Tail/binary>>, Count) ->
    count(Tail, Count+1);
count(<<>>, Count) ->
    Count.
In the shell:
32> string:length(list_to_binary(a:get_response())).
** exception error: bad argument: <<226,97,98,99,195,162>>
in function string:length_1/2 (string.erl, line 557)
33> string:length(list_to_binary([97,98,99,195,162])).
4
34> Bin = list_to_binary(a:get_response()).
<<226,97,98,99,195,162>>
35> a:count(Bin).
5
Note that get_response() returns a list containing the integer code for small letter a with circumflex in latin-1, the integer codes for a, b and c in ascii, and the two integers representing small letter a with circumflex in UTF-8. Binaries allow you to specify a utf8 type, which will match a clump of integers that represents one character in UTF-8. The utf8 type will also match single integers in the range 0-127, i.e. the codes for ascii characters. The second clause of the count() function specifies an integer type, which matches any single byte; because it comes second, it only sees bytes that the utf8 clause rejected, e.g. a latin-1 character with a code above 127.
Binaries also allow you to specify the types utf16 and utf32 to match clumps of integers representing characters in those encodings.
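The same two-clause pattern can be used to rebuild a mixed binary as valid UTF-8, so that string:length() and the other string functions accept it. This is only a sketch under a loud assumption: any byte that does not start a valid UTF-8 sequence is guessed to be a Latin-1 code. The module name mixed_demo and the function normalize/1 are hypothetical. The guess cannot be verified from the bytes alone; for instance, the Latin-1 text "Ã¢" has exactly the same bytes as UTF-8 "â" and will be read as the latter.

```erlang
-module(mixed_demo).
-export([normalize/1]).

%% Rebuild Bin as valid UTF-8, treating stray bytes as Latin-1 codes.
normalize(Bin) when is_binary(Bin) ->
    normalize(Bin, <<>>).

%% A valid UTF-8 sequence: copy the decoded code point back out as UTF-8.
normalize(<<Char/utf8, Tail/binary>>, Acc) ->
    normalize(Tail, <<Acc/binary, Char/utf8>>);
%% Any other byte (always above 127 here): assume Latin-1 and encode it as UTF-8.
normalize(<<Byte, Tail/binary>>, Acc) ->
    normalize(Tail, <<Acc/binary, Byte/utf8>>);
normalize(<<>>, Acc) ->
    Acc.
```

Applied to <<226,97,98,99,195,162>>, this yields <<195,162,97,98,99,195,162>>, for which string:length/1 returns 5, matching the result of count/1.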