How do I return 3 characters either side of a null byte in Elixir?

Question

If I have a string, for example, hello this isa<<0>>string., how do I return the three characters either side of the null byte, including the null byte, e.g., isa<<0>>str?

I was trying something like:

~r/(?<=.{0,2})(.{3}).*?<<0>>(.{3})(?=.{0,2})/

Answer 1

Score: 3

In Elixir, you don't need a regex to do that. Recursion with binary pattern matching works better, is faster, and is far more readable.

defmodule NullByte do
  # Empty input: no null byte was found.
  def get_3_around(""),
    do: ""
  # Three bytes, a null byte, then at least three more bytes.
  def get_3_around(<<pre::binary-size(3), 0, post::binary-size(3), _::binary>>),
    do: pre <> <<0>> <> post
  # Fewer than three bytes remain after the null byte: return what is left.
  def get_3_around(<<pre::binary-size(3), 0, post::binary>>),
    do: pre <> <<0>> <> post
  # No match at this position: drop one byte and try again.
  def get_3_around(<<_::binary-size(1), rest::binary>>),
    do: get_3_around(rest)

  def test, do: get_3_around("hello this is a " <> <<0>> <> "string")
end
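
As a non-recursive alternative (not part of the answer above), the null byte can be located with Erlang's :binary.match/2 and the window cut out with binary_part/3. This is only a minimal sketch: the module name NullByteWindow and the function around/1 are hypothetical, and it counts bytes, so it assumes single-byte characters around the null byte.

```elixir
defmodule NullByteWindow do
  # Hypothetical helper: returns up to 3 bytes on each side of the first
  # null byte (including the null byte itself), or nil if there is none.
  def around(bin) when is_binary(bin) do
    case :binary.match(bin, <<0>>) do
      :nomatch ->
        nil

      {pos, 1} ->
        # Clamp the window so it never runs past either end of the binary.
        start = max(pos - 3, 0)
        stop = min(pos + 4, byte_size(bin))
        binary_part(bin, start, stop - start)
    end
  end
end

IO.inspect(NullByteWindow.around("hello this isa" <> <<0>> <> "string."))
```

Unlike the recursive version, this sketch also returns a shorter window when fewer than three bytes precede the null byte.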

Answer 2

Score: 0

Strings in Elixir use the UTF-8 encoding, which means a single character can be longer than one byte, so it's better to design the function to handle UTF-8 characters. Three characters could be up to 12 bytes long, so rather than assuming every character is 1 byte long, you can match a single UTF-8 character (one code point) using var_name::utf8. Unfortunately, you can't specify a size with the utf8 type in a binary, so you can't match multiple UTF-8 characters by simply writing var_name::utf8-size(3). Instead, you have to explicitly write out three separate "segments" (which is a real pain, and in my opinion an oversight in the language that should be corrected), for example:

<<char1::utf8, char2::utf8, char3::utf8, ....>>
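
To illustrate why byte-sized matching is not enough, here is a small sketch with a made-up input string: each ::utf8 segment consumes one whole code point, however many bytes that code point occupies.

```elixir
# Hypothetical input: three 3-byte characters, a null byte, then ASCII text.
str = "日本語" <> <<0>> <> "abc"

# Each ::utf8 segment consumes one code point, regardless of its byte length.
<<c1::utf8, c2::utf8, c3::utf8, rest::binary>> = str

IO.inspect(byte_size("日本語"))
IO.inspect(<<c1::utf8, c2::utf8, c3::utf8>> == "日本語")
IO.inspect(rest == <<0>> <> "abc")
```

A pattern such as pre::binary-size(3) would instead stop after the first of these three characters.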

Next, a null byte is a non-printing character, and Elixir won't print a null byte as <<0>>. However, you can explicitly print the string "<<0>>", e.g.

iex(7)> IO.puts "<<0>>"
<<0>>

But you should be aware that "<<0>>" is 5 bytes long, not 1 byte.
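
The difference between the two can be checked directly. This is a small sketch; the variable names are only for illustration.

```elixir
one_byte = <<0>>        # an actual null byte
five_bytes = "<<0>>"    # the five printable characters <, <, 0, >, >

IO.inspect(byte_size(one_byte))
IO.inspect(byte_size(five_bytes))
IO.inspect(String.printable?(one_byte))
IO.inspect(five_bytes == <<60, 60, 48, 62, 62>>)
```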

In the following example, the binary syntax will look up the UTF-8 integer character codes for each character between the double quotes:

iex(17)> str = <<"123"::utf8, 0::utf8, "456"::utf8>>
<<49, 50, 51, 0, 52, 53, 54>>

iex(13)> IO.puts str
123^@456   <-- shell uses "caret notation" to display non-printing chars
:ok

iex(14)> IO.inspect str
<<49, 50, 51, 0, 52, 53, 54>>
<<49, 50, 51, 0, 52, 53, 54>>

If a string/binary contains non-printing characters, then Elixir won't output it in double-quoted format:

iex(2)> IO.inspect <<97,98>>
"ab"
"ab"

iex(3)> IO.inspect <<97, 0, 98>>
<<97, 0, 98>>
<<97, 0, 98>>

Here's how to match UTF-8 characters in Elixir:

defmodule My do

  # Look for a match starting at the beginning of the string:

  def grab_3_chars_either_side_of_null(<<char1::utf8,
                                         char2::utf8,
                                         char3::utf8,
                                         0::utf8,   # Tries to match a null byte
                                         char4::utf8,
                                         char5::utf8,
                                         char6::utf8,
                                         _rest::binary>>
                                      ) do

     <<char1::utf8, char2::utf8, char3::utf8,
       "<<0>>",   # Your desired output, which is 5 bytes long.
                  # Change to 0::utf8 if you only want one byte
       char4::utf8, char5::utf8, char6::utf8>>
  end

  # If a match isn't found at the beginning of the string above,
  # then drop the first UTF-8 character, `_::utf8`, and look for a match at
  # the start of the rest of the string (the recursive function call):

  def grab_3_chars_either_side_of_null(<<_::utf8,
                                         rest::binary>>
                                   ) do

    grab_3_chars_either_side_of_null(rest)
  end

  # If all the UTF-8 characters have been dropped off the front of the string,
  # then the string is empty, and no matches were found, so return the atom
  # `:no_match`. This clause must be inside the module:

  def grab_3_chars_either_side_of_null(<<>>), do: :no_match

end

I'll leave it as an exercise to define other branches of grab_3_chars_either_side_of_null/1 as you see fit.

Note:

  1. char1, char2, etc. will actually be assigned integers (code points), and in order to convert an integer back to a UTF-8 character in a string, you have to write <<char1::utf8>>.

  2. rest::binary is like a greedy .* in a regex: it matches zero or more of the remaining bytes, and a binary segment without a size can only be placed at the end of a binary pattern.
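
A small sketch of note 1, using a made-up string: matching with ::utf8 binds an integer, and that integer has to be re-encoded with ::utf8 to get a valid string back.

```elixir
# "\u00E9" is the character é; the input string is only an illustration.
<<c::utf8, rest::binary>> = "\u00E9lan"

IO.inspect(c)                      # the integer code point, not a string
IO.inspect(<<c::utf8>>)            # re-encoded as a UTF-8 string
IO.inspect(String.valid?(<<c>>))   # a single raw byte is not valid UTF-8 here
IO.inspect(rest)
```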

If all that is too confusing, you could also use String.split/3 to split on the null byte, then use String.split_at/2 on each piece to get the last three characters (-3) of the first piece and the first three characters (3) of the second piece.
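
The String.split/3 approach might look like the following. This is a minimal sketch under a few assumptions: the module name SplitAround and the function around/1 are hypothetical, only the first null byte is considered, and :no_match is returned when there is none.

```elixir
defmodule SplitAround do
  # Hypothetical helper: splits on the first null byte, then keeps the last
  # three characters before it and the first three characters after it.
  def around(str) do
    case String.split(str, <<0>>, parts: 2) do
      [before, rest] ->
        {_, pre} = String.split_at(before, -3)
        {post, _} = String.split_at(rest, 3)
        pre <> <<0>> <> post

      [_no_null_byte] ->
        :no_match
    end
  end
end

IO.inspect(SplitAround.around("hello this isa" <> <<0>> <> "string."))
```

Note that String.split_at/2 counts graphemes rather than bytes, so multi-byte characters are kept whole, and a side with fewer than three characters is returned as is.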

Answer 3

Score: 0

This is a simplification of Aleksei's answer, modified to return a 2-tuple.

defmodule NullByte do
  def get_3_around(<<>>), do: nil
  def get_3_around(<<pre::binary-3, 0, post::binary-3, _::binary>>), do: {pre, post}
  def get_3_around(<<_::binary-1, rest::binary>>), do: get_3_around(rest)
end

Usage:

iex(1)> NullByte.get_3_around("aaa" <> <<0>> <> "bbb")
{"aaa", "bbb"}
iex(2)> NullByte.get_3_around("aaabbb" <> <<0>> <> "cccddd")
{"bbb", "ccc"}
iex(3)> NullByte.get_3_around("aaa" <> <<0>> <> "b")
nil
iex(4)> NullByte.get_3_around("foo")
nil

huangapple
  • Posted on June 6, 2023 at 00:32:51
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76408390.html