How do I return 3 characters either side of a null byte in Elixir?
Question
If I have a string, for example "hello this isa<<0>>string.", how do I return the three characters either side of the null byte, including the null byte, e.g. "isa<<0>>str"?
I was trying something like:
~r/(?<=.{0,2})(.{3}).*?<<0>>(.{3})(?=.{0,2})/
Answer 1
Score: 3
In Elixir, you don't need a regex for this. Recursion over the binary with pattern matching works better, is faster, and is far more readable.
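defmodule NullByte do
  # Empty input: no null byte with three bytes before it was found.
  def get_3_around(""),
    do: ""
  # Three bytes, a null byte, then at least three more bytes.
  def get_3_around(<<pre::binary-size(3), 0, post::binary-size(3), _::binary>>),
    do: pre <> <<0>> <> post
  # Three bytes, a null byte, then fewer than three bytes: keep what is there.
  def get_3_around(<<pre::binary-size(3), 0, post::binary>>),
    do: pre <> <<0>> <> post
  # No match at this position: drop one byte and try again.
  def get_3_around(<<_::binary-size(1), rest::binary>>),
    do: get_3_around(rest)
  def test, do: get_3_around("hello this is a " <> <<0>> <> "string")
end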
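The clauses above rely on binary pattern matching with fixed-size segments. As a minimal sketch of a single match without the recursion (the sample string is the one from the question; the variable names are illustrative):

```elixir
# The question's sample tail, built with a real null byte (one byte, value 0).
input = "isa" <> <<0>> <> "string."

# binary-size(3) takes exactly three bytes; the literal 0 must match the null byte.
<<pre::binary-size(3), 0, post::binary-size(3), rest::binary>> = input

IO.inspect(pre)    # "isa"
IO.inspect(post)   # "str"
IO.inspect(rest)   # "ing."
```

The match only succeeds when the null byte sits exactly at offset 3, which is why the answer's last clause drops one byte at a time and retries.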
Answer 2
Score: 0
Strings in Elixir use the UTF-8 encoding, which means a single character can be longer than one byte, so it's better to design the function to handle UTF-8 characters. Three characters could be up to 12 bytes long, so rather than assuming every character is 1 byte long, you can match a single UTF-8 character using var_name::utf8. Unfortunately, you cannot specify a size with the utf8 type in a binary, so you can't match multiple UTF-8 characters by simply writing var_name::utf8-size(3). Instead you have to explicitly write out three different "segments" (which is a complete pain, and arguably an oversight in the language that should be corrected), for example:
<<char1::utf8, char2::utf8, char3::utf8, ....>>
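To make the byte-versus-character distinction concrete, here is a small sketch. The sample string is an assumption chosen for illustration: it holds one character each of 1, 2, 3 and 4 bytes, written with escapes so the encoding is unambiguous.

```elixir
# "a" (1 byte), U+00F1 (2 bytes), U+20AC (3 bytes), U+1F600 (4 bytes)
str = "a\u00F1\u20AC\u{1F600}"

IO.inspect(byte_size(str))       # 10
IO.inspect(String.length(str))   # 4

# Three utf8 segments consume three characters, however many bytes each takes.
<<c1::utf8, c2::utf8, c3::utf8, rest::binary>> = str

IO.inspect({c1, c2, c3})         # {97, 241, 8364}
IO.inspect(byte_size(rest))      # 4
```

A byte-based segment such as binary-size(3) would have cut this string in the middle of the third character.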
Next, a null byte is a non-printing character, and Elixir won't print a null byte as <<0>>. However, you can explicitly print the string "<<0>>", e.g.
iex(7)> IO.puts "<<0>>"
<<0>>
But you should be aware that "<<0>>" is 5 bytes long, not 1 byte.
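The difference between the five-character string and the one-byte binary can be checked directly; a minimal sketch:

```elixir
# The quoted form is five ordinary characters: < < 0 > >
IO.inspect(byte_size("<<0>>"))   # 5

# The binary literal is a single byte with value 0.
IO.inspect(byte_size(<<0>>))     # 1

# "\0" is the string escape for that same single byte.
IO.inspect(<<0>> == "\0")        # true
```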
In the following example, the binary syntax will look up the UTF-8 integer character codes for each character between the double quotes:
iex(17)> str = <<"123"::utf8, 0::utf8, "456"::utf8>>
<<49, 50, 51, 0, 52, 53, 54>>
iex(13)> IO.puts str
123^@456 <-- the shell uses "caret notation" to display non-printing chars
:ok
iex(14)> IO.inspect str
<<49, 50, 51, 0, 52, 53, 54>>
<<49, 50, 51, 0, 52, 53, 54>>
If a string/binary contains non-printing characters, then Elixir won't output it in double-quoted format:
iex(2)> IO.inspect <<97,98>>
"ab"
"ab"
iex(3)> IO.inspect <<97, 0, 98>>
<<97, 0, 98>>
<<97, 0, 98>>
Here's how to match UTF-8 characters in Elixir:
defmodule My do
  # Look for a match starting at the beginning of the string:
  def grab_3_chars_either_side_of_null(<<char1::utf8,
                                         char2::utf8,
                                         char3::utf8,
                                         0::utf8,      # Tries to match a null byte
                                         char4::utf8,
                                         char5::utf8,
                                         char6::utf8,
                                         _rest::binary>>) do
    <<char1::utf8, char2::utf8, char3::utf8,
      "<<0>>",   # Your desired output, which is 5 bytes long.
                 # Change to 0::utf8 if you only want one byte
      char4::utf8, char5::utf8, char6::utf8>>
  end
  # If a match isn't found at the beginning of the string above,
  # then drop the first UTF-8 character, `_::utf8`, and look for a match at
  # the start of the rest of the string (the recursive function call):
  def grab_3_chars_either_side_of_null(<<_::utf8, rest::binary>>) do
    grab_3_chars_either_side_of_null(rest)
  end
  # If all the UTF-8 characters have been dropped off the front of the string,
  # then the string is empty, and no matches were found, so return the atom
  # `:no_match`. This clause must sit inside the module, before its `end`:
  def grab_3_chars_either_side_of_null(<<>>), do: :no_match
end
I'll leave it as an exercise to define other branches of grab_3_chars_either_side_of_null/1 as you see fit.
Note:
- char1, char2, etc. will actually be bound to integers (the code points), and in order to convert an integer back to a UTF-8 character in a string, you have to write <<char1::utf8>>.
- rest::binary is like a greedy .* in a regex: it matches whatever remains, from zero bytes upward, and it can only be placed at the end of a binary.
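As a small illustration of the first note, the sketch below matches one character, shows that the bound value is an integer, and converts it back. The sample string is an assumption, written with escapes so the code point is unambiguous.

```elixir
# "\u00E9t\u00E9" spells the French word for "summer" with two accented e's.
<<c::utf8, rest::binary>> = "\u00E9t\u00E9"

IO.inspect(c)                      # 233, the code point of the first character
IO.inspect(<<c::utf8>> == "\u00E9")  # true: back to a one-character string
IO.inspect(byte_size(<<c::utf8>>)) # 2
IO.inspect(byte_size(rest))        # 3, i.e. "t" plus the two-byte last character
```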
If all that is too confusing, you could also use String.split/3 to split on the null byte, then use String.split_at/2 on each piece to get the last three characters (-3) of the first piece and the first three characters (3) of the second piece.
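The split-based approach could be sketched as follows. The module and function names (SplitSketch, around_null/1) and the :no_match return value are assumptions made for illustration; only String.split/3 and String.split_at/2 come from the suggestion above.

```elixir
defmodule SplitSketch do
  # Hypothetical helper: returns up to three characters on each side of the
  # first null byte, joined by a real null byte, or :no_match if there is none.
  def around_null(str) do
    case String.split(str, <<0>>, parts: 2) do
      [before, rest] ->
        # A negative position counts from the end of the string.
        {_, pre} = String.split_at(before, -3)
        {post, _} = String.split_at(rest, 3)
        pre <> <<0>> <> post

      [_no_null] ->
        :no_match
    end
  end
end

IO.inspect(SplitSketch.around_null("hello this isa" <> <<0>> <> "string."))
IO.inspect(SplitSketch.around_null("foo"))   # :no_match
```

Unlike the pattern-matching versions, this sketch does not insist on exactly three characters per side: if fewer are available, it returns whatever is there.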
Answer 3
Score: 0
This is a simplification of Aleksei's answer, modified to return a 2-tuple.
defmodule NullByte do
  def get_3_around(<<>>), do: nil
  def get_3_around(<<pre::binary-3, 0, post::binary-3, _::binary>>), do: {pre, post}
  def get_3_around(<<_::binary-1, rest::binary>>), do: get_3_around(rest)
end
Usage:
iex(1)> NullByte.get_3_around("aaa" <> <<0>> <> "bbb")
{"aaa", "bbb"}
iex(2)> NullByte.get_3_around("aaabbb" <> <<0>> <> "cccddd")
{"bbb", "ccc"}
iex(3)> NullByte.get_3_around("aaa" <> <<0>> <> "b")
nil
iex(4)> NullByte.get_3_around("foo")
nil