2023-05-07 22:53:34

Detecting multibyte character sequences?

Question

I'm writing a parser which parses UTF-8 strings. Characters outside of the ASCII range can only occur inside of string literals, which begin and end with ' or ". The rest of the language may only contain ASCII characters, so I can simply return an error if I find a byte outside the ASCII range.

The problem I can't seem to figure out is, when I encounter a non-ASCII character inside of a string literal, how can I detect how many bytes to skip for that character? My concern is that if a multi-byte character contains a ' or " as one of the bytes, my parser would end the string literal early.

Perhaps a shorter way to ask this is, if I encounter a byte in the 0x80-0xFF range, how can I detect how many bytes are in that character in a UTF-8 encoded string?

I'm writing this parser in C but I suspect that doesn't matter.

Answer 1

Score: 6

> My concern is that if a multi-byte character contains a ' or " as one of the bytes, my parser would end the string literal early.

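Ah, this is your misunderstanding. The brilliance of UTF-8 is that this cannot happen. In UTF-8, the byte 0x27 can only mean APOSTROPHE, and likewise 0x22 can only mean QUOTATION MARK. Neither can ever be part of a multi-byte sequence, because every byte of a multi-byte sequence, lead byte and continuation bytes alike, has its high bit set to 1 and so falls in the range 0x80-0xFF.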

A major design goal of UTF-8 is that existing and naïve ASCII implementations will work identically when parsing UTF-8 streams, even if the stream includes non-ASCII bytes. You can safely parse for " and continue to accumulate bytes until you reach " (and use \ to escape internal "), and never have to worry about whether there are multi-byte characters involved with UTF-8. ASCII parsers do not need to understand UTF-8 or perform any UTF-8 decoding in order to work correctly.
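As a rough illustration of the point above, here is a minimal C sketch of a byte-wise scan for the end of a string literal. The function name `find_closing_quote`, its return convention, and the rule that a backslash escapes exactly one following byte are assumptions made for this example, not part of the question's parser.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: s[0] must be the opening quote (' or ").
 * Returns the index of the matching closing quote, or 0 if the
 * literal is not terminated within the n bytes available.
 * Assumption: a backslash escapes the single byte that follows it.
 * No UTF-8 decoding is done; bytes >= 0x80 can never compare equal
 * to the quote or to the backslash, so they are simply passed over.
 */
size_t find_closing_quote(const char *s, size_t n)
{
    assert(s != NULL && n > 0);
    char quote = s[0];
    for (size_t i = 1; i < n; i++) {
        if (s[i] == '\\') {
            i++;                /* skip the escaped byte */
        } else if (s[i] == quote) {
            return i;           /* index of the closing quote */
        }
    }
    return 0;                   /* unterminated literal */
}
```

Skipping only one byte after a backslash is still safe when the escaped character is multi-byte, since its remaining bytes are all in 0x80-0xBF and cannot match the quote or the backslash.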

Beyond that, if you decide you really do want to know the answer to your question, the number of leading 1 bits in the first byte tells you the length of the sequence, with two exceptions: zero leading 1s means a 1-byte (ASCII) character, and exactly one leading 1 marks a continuation byte.

0x00 - 0x7F -> 1 byte
0x80 - 0xBF -> (continuation)
0xC0 - 0xDF -> 2 bytes
0xE0 - 0xEF -> 3 bytes
0xF0 - 0xF7 -> 4 bytes
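The table translates fairly directly into C. The following is a sketch only: the name `utf8_seq_len` and the return values 0 and -1 are choices made for this example, and it classifies the first byte without validating the continuation bytes that follow.

```c
/*
 * Sketch: classify a byte by its leading 1 bits.
 * Returns 1..4 for a lead byte (the length of the whole sequence),
 * 0 for a continuation byte, and -1 for 0xF8-0xFF, which never
 * appears in valid UTF-8.
 */
int utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;     /* 0xxxxxxx: ASCII */
    if (b < 0xC0) return 0;     /* 10xxxxxx: continuation byte */
    if (b < 0xE0) return 2;     /* 110xxxxx */
    if (b < 0xF0) return 3;     /* 1110xxxx */
    if (b < 0xF8) return 4;     /* 11110xxx */
    return -1;                  /* 11111xxx: not valid UTF-8 */
}
```

A stricter parser would also check that the expected number of continuation bytes actually follows and would reject overlong encodings; that is left out here to keep the sketch short.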

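If you only need to get past the non-ASCII text, you can also just keep scanning until you find a byte in the range 0x00-0x7F. To stop at the next character boundary instead, scan until you find a byte outside the continuation range 0x80-0xBF.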
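The scanning approach might look like the sketch below. The helper name `skip_to_ascii` and its index-based interface are assumptions for illustration. Note that it may pass over several consecutive non-ASCII characters in one call, which is fine when the parser only cares about ASCII delimiters.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: starting at index i in buf (n bytes long),
 * return the index of the first byte in the range 0x00-0x7F,
 * or n if no such byte remains.
 */
size_t skip_to_ascii(const unsigned char *buf, size_t n, size_t i)
{
    assert(buf != NULL);
    while (i < n && buf[i] >= 0x80) {
        i++;                    /* lead or continuation byte: keep going */
    }
    return i;
}
```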
