Question

I'm currently developing a parser according to a specification, and I can't find any information about text encoding anywhere in the docs. It seems odd to me that the docs of a lexing library wouldn't mention text encoding at all, so I hope I haven't just missed it.

Does ocamllex force a text encoding?
If not, how do I set which one to use?
Do ocamllex regexes work with Unicode code points internally, or with a specific encoding?

Answer 1

Score: 2

Ocamllex works at the byte level and leaves all questions of encoding to its users.

More precisely, working at the "byte level" means that ocamllex treats its input as a sequence of 8-bit words (bytes). The ocamllex regex engine then analyzes this sequence of 8-bit words.

Unicode encodings can be seen as a layer of interpretation on top of this raw sequence of 8-bit words. But the ocamllex lexer is unaware of this higher layer of interpretation and just looks at the raw sequence of 8-bit words (which is not that surprising, since ocamllex and the first version of Unicode were developed around the same time, in the early nineties).
In particular, graphical characters in the lexer definition are interpreted using their ASCII encoding, and thus the character class

let digit = ['0'-'9']

means one byte in the interval [0x30, 0x39].
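
To make the byte-level view concrete, here is a minimal sketch in plain OCaml (not ocamllex itself). The names `is_digit` and `e_acute` are made up for this illustration. It relies on the fact that an OCaml `char` is a single byte, which is also the unit ocamllex matches on:

```ocaml
(* Byte-level equivalent of the ocamllex class ['0'-'9']:
   true when the byte lies in the interval [0x30, 0x39]. *)
let is_digit (c : char) : bool =
  let b = Char.code c in
  b >= 0x30 && b <= 0x39

(* U+00E9 ("e" with acute accent) encoded in UTF-8 is the two bytes
   0xC3 0xA9. A byte-level lexer sees two separate units here,
   not one character. *)
let e_acute = "\xc3\xa9"

let () =
  assert (is_digit '7');
  assert (not (is_digit 'a'));
  (* String.length counts bytes, not Unicode code points. *)
  assert (String.length e_acute = 2);
  (* Print each byte of the string in hexadecimal. *)
  String.iter (fun c -> Printf.printf "0x%02X " (Char.code c)) e_acute;
  print_newline ()
```

So a rule such as `_` (any character) in ocamllex would consume only the first byte of a multi-byte UTF-8 sequence.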

If you want a lexer that is aware of Unicode character classes and encodings, you can look at sedlex.
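
If you stay with ocamllex and your input happens to be UTF-8, one possible approach is to spell out the encoding yourself as byte patterns. The following is only a sketch under that assumption: the `token` type and the names `utf8_cont`, `utf8_2`, `utf8_3`, `utf8_4` are hypothetical, and the patterns are deliberately loose (they accept some byte sequences that strict UTF-8 forbids, such as overlong forms and encoded surrogates).

```ocaml
{
  (* Header section: ordinary OCaml. This token type is hypothetical
     and exists only for this sketch. *)
  type token =
    | ASCII of char          (* a single byte below 0x80 *)
    | NON_ASCII of string    (* the raw bytes of one multi-byte sequence *)
    | INVALID_BYTE of char   (* a byte that fits no pattern above *)
    | EOF
}

(* Decimal escapes: '\128' is 0x80, '\191' is 0xBF, and so on. *)
let utf8_cont = ['\128'-'\191']                             (* 0x80-0xBF *)
let utf8_2    = ['\194'-'\223'] utf8_cont                   (* 0xC2-0xDF lead *)
let utf8_3    = ['\224'-'\239'] utf8_cont utf8_cont         (* 0xE0-0xEF lead *)
let utf8_4    = ['\240'-'\244'] utf8_cont utf8_cont utf8_cont (* 0xF0-0xF4 lead *)

rule token = parse
  | ['\000'-'\127'] as c            { ASCII c }
  | (utf8_2 | utf8_3 | utf8_4) as s { NON_ASCII s }
  | _ as c                          { INVALID_BYTE c }
  | eof                             { EOF }
```

Decoding the matched bytes into a code point is still left to your own code. The lexer only groups the bytes.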
