How do you make a function detect whether a string is binary safe or not?

Question

How does one detect if a string is binary safe or not in Go?

A function like:

IsBinarySafe(str) // returns true if it's safe and false if it's not

Everything after this is just what I have thought about or tried in order to solve this:


I assumed a library for this must already exist, but I had a hard time finding one. If there isn't one, how would you implement it?

I thought of a few solutions but wasn't convinced they were good ones.
One was to iterate over the bytes and keep a hash map of all the illegal byte sequences.
I also thought of writing a regex that matches all the illegal strings, but wasn't sure that was a good solution either.
I was also unsure whether a sequence of bytes from another language counts as binary safe. Take the typical Go example:

世界

Would this:

IsBinarySafe("世界") // true or false?

return true or false? I was assuming that every character in a binary-safe string should use only 1 byte. So I would iterate over it in the following way:

const nihongo = "日本語abc日本語"
for i, w := 0, 0; i < len(nihongo); i += w {
    runeValue, width := utf8.DecodeRuneInString(nihongo[i:])
    fmt.Printf("%#U starts at byte position %d\n", runeValue, i)
    w = width
}

and return false whenever the width is greater than 1. These are just some ideas I had in case there isn't already a library for this, but I'm not sure about them.
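The width check described above could be sketched roughly as follows. This is only an illustration of the idea in the question, not an established definition of binary safety. The name `allSingleByte` is made up for this sketch, and the explicit rejection of invalid bytes is an added assumption, since `utf8.DecodeRuneInString` reports a width of 1 for them.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// allSingleByte is a hypothetical helper: it reports whether every rune in s
// is encoded in exactly one byte, which for valid UTF-8 means s is pure ASCII.
func allSingleByte(s string) bool {
	for i := 0; i < len(s); {
		r, width := utf8.DecodeRuneInString(s[i:])
		// An invalid byte decodes to utf8.RuneError with width 1,
		// so it has to be rejected explicitly.
		if width > 1 || r == utf8.RuneError {
			return false
		}
		i += width
	}
	return true
}

func main() {
	fmt.Println(allSingleByte("abc"))
	fmt.Println(allSingleByte("日本語abc日本語"))
}
```

Note that this only tells you whether the string is single-byte UTF-8. Control characters such as a null byte or a tab are one byte wide too, so they pass this check.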

Answer 1

Score: 4

Binary safety has nothing to do with how wide a character is. It is mostly about non-printable characters, such as null bytes.

From Wikipedia:

> Binary-safe is a computer programming term mainly used in connection
> with string manipulating functions. A binary-safe function is
> essentially one that treats its input as a raw stream of data without
> any specific format. It should thus work with all 256 possible values
> that a character can take (assuming 8-bit characters).
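To make the quoted definition concrete, here is a small sketch showing that a Go string can hold all 256 byte values, including the null byte, and that basic operations such as `len`, slicing, and `strings.IndexByte` treat them as ordinary data. The helper name `everyByteValue` is invented for this illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// everyByteValue is a hypothetical helper that returns a string holding
// each of the 256 possible byte values exactly once, in ascending order.
func everyByteValue() string {
	b := make([]byte, 256)
	for i := range b {
		b[i] = byte(i)
	}
	return string(b)
}

func main() {
	s := everyByteValue()

	// len counts bytes; the NUL byte does not terminate the string.
	fmt.Println(len(s))

	// Searching for NUL or 0xff works like searching for any other byte.
	fmt.Println(strings.IndexByte(s, 0x00))
	fmt.Println(strings.IndexByte(s, 0xff))

	// Slicing and concatenation preserve every byte.
	fmt.Println(s[:128]+s[128:] == s)
}
```

In that sense the property belongs to the functions operating on the data, not to a particular string.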

I'm not sure what your goal is, since almost all languages handle UTF-8/UTF-16 just fine now. For your specific question, however, there's a rather simple solution:

package main

import (
	"fmt"
	"unicode"
)

// IsAsciiPrintable reports whether s is ASCII and printable, i.e. it contains
// no tab, backspace, or other control characters.
func IsAsciiPrintable(s string) bool {
	for _, r := range s {
		if r > unicode.MaxASCII || !unicode.IsPrint(r) {
			return false
		}
	}
	return true
}

func main() {
	// s was not defined in the original snippet; the string from the question is assumed here.
	s := "世界"
	fmt.Printf("len([]rune(s)) = %d, len([]byte(s)) = %d\n", len([]rune(s)), len([]byte(s)))

	fmt.Println(IsAsciiPrintable(s), IsAsciiPrintable("test"))
}

From unicode.IsPrint:

> IsPrint reports whether the rune is defined as printable by Go. Such
> characters include letters, marks, numbers, punctuation, symbols, and
> the ASCII space character, from categories L, M, N, P, S and the ASCII
> space character. This categorization is the same as IsGraphic except
> that the only spacing character is ASCII space, U+0020.
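As a quick illustration of the quoted description, the sketch below compares `unicode.IsPrint` and `unicode.IsGraphic` on a few runes. The helper `classify` and the choice of sample runes are assumptions made for this example. U+00A0 (no-break space) is included because it is a spacing character other than ASCII space.

```go
package main

import (
	"fmt"
	"unicode"
)

// classify is a hypothetical helper that returns the results of
// unicode.IsPrint and unicode.IsGraphic for a single rune.
func classify(r rune) (printable, graphic bool) {
	return unicode.IsPrint(r), unicode.IsGraphic(r)
}

func main() {
	// A letter, ASCII space, tab, NUL, a CJK character, and no-break space.
	for _, r := range []rune{'a', ' ', '\t', '\x00', '世', '\u00a0'} {
		p, g := classify(r)
		fmt.Printf("%U IsPrint=%t IsGraphic=%t\n", r, p, g)
	}
}
```

Since `IsPrint` accepts non-ASCII letters such as 世, the `r > unicode.MaxASCII` test in `IsAsciiPrintable` is what restricts the result to ASCII.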

huangapple
  • Posted on July 10, 2014 at 13:29:48
  • When reposting, please keep the link to this article: https://go.coder-hub.com/24669084.html