Specifying ocamllex encoding

Question

I'm currently developing a parser according to a specification, and I can't find any information about text encoding anywhere in the docs. It seems odd that the docs of a lexing library wouldn't mention text encoding at all, so I hope I haven't simply missed it.

  • Does ocamllex enforce a text encoding?
  • If not, how do I set which one to use?
  • Do ocamllex regexes work with Unicode code points internally, or with a specific encoding?

Answer 1

Score: 2

ocamllex works at the byte level and leaves all questions of encoding to its users.

More precisely, working at the "byte level" means that ocamllex treats its input as a sequence of 8-bit words. The ocamllex regex engine then analyzes this sequence of 8-bit words.
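
To make the byte-level view concrete, here is a minimal sketch in plain OCaml (no ocamllex involved) that compares what a byte-oriented tool sees with what a UTF-8 decoder sees. The helper name `count_code_points` is made up for this illustration, and the decoding relies on `String.get_utf_8_uchar`, so it assumes OCaml 4.14 or later.

```ocaml
(* Assumes OCaml >= 4.14 (String.get_utf_8_uchar, Uchar.utf_decode_length). *)

(* Hypothetical helper: count the code points of a UTF-8 encoded string.
   Invalid byte sequences still advance by at least one byte, so the
   loop always terminates. *)
let count_code_points (s : string) : int =
  let rec go i n =
    if i >= String.length s then n
    else
      let d = String.get_utf_8_uchar s i in
      go (i + Uchar.utf_decode_length d) (n + 1)
  in
  go 0 0

(* "café" written with explicit UTF-8 bytes for the accented letter. *)
let cafe = "caf\xc3\xa9"

(* What a byte-level lexer sees: one item per byte. *)
let byte_count = String.length cafe

(* What a Unicode-aware layer sees: one item per code point. *)
let code_point_count = count_code_points cafe
```

Here `byte_count` and `code_point_count` differ because the accented letter occupies two bytes; a byte-level lexer would see those two bytes as two separate input items.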

Unicode encodings can be seen as a layer of interpretation on top of this raw sequence of 8-bit words. But the ocamllex lexer is unaware of this higher layer of interpretation and only looks at the raw sequence of 8-bit words (which is not that surprising, since ocamllex and the first version of Unicode were developed around the same time, in the early nineties).
In particular, graphical characters in the lexer are interpreted using their ASCII encoding, and thus the character class

let digit = ['0'-'9']

means one byte in the interval [0x30, 0x39].
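
As a rough illustration of that byte interval in plain OCaml (not ocamllex itself), the sketch below mirrors the `digit` class with a hypothetical predicate `is_digit_byte` and applies it to the UTF-8 bytes of a non-ASCII digit. The choice of the Arabic-Indic digit three (U+0663) is only an example.

```ocaml
(* Hypothetical predicate mirroring the class ['0'-'9']:
   true exactly for bytes in the interval [0x30, 0x39]. *)
let is_digit_byte (c : char) : bool = c >= '0' && c <= '9'

(* The bounds of the class are the ASCII codes of '0' and '9'. *)
let lower_bound = Char.code '0'
let upper_bound = Char.code '9'

(* Arabic-Indic digit three (U+0663), encoded in UTF-8 as bytes D9 A3. *)
let arabic_three = "\xd9\xa3"

(* Check whether any byte of a string falls in the digit interval. *)
let has_digit_byte (s : string) : bool =
  let found = ref false in
  String.iter (fun c -> if is_digit_byte c then found := true) s;
  !found

let ascii_matches = has_digit_byte "x7"
let arabic_matches = has_digit_byte arabic_three
```

Neither byte of `arabic_three` lies in the interval, so a byte-level `digit` class would not recognise it as a digit even though Unicode classifies it as one.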

If you want a lexer that is aware of Unicode character classes and encodings, you can look at sedlex.
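
If staying with ocamllex is preferable, one possible workaround is to spell out the byte patterns of the encoding by hand. The following is only a sketch of an ocamllex specification, assuming the input is UTF-8; the names `cont`, `utf8_2`, `utf8_3`, `utf8_4` and the rule `count` are invented for this example, and the patterns are deliberately loose (they do not reject overlong forms or surrogates).

```ocaml
(* utf8_count.mll: an ocamllex specification, to be processed with
   `ocamllex utf8_count.mll`. Assumes UTF-8 input. Decimal escapes are
   used for the byte values; the hex equivalents are given in comments. *)

let ascii  = ['\000'-'\127']               (* 0x00-0x7F *)
let cont   = ['\128'-'\191']               (* 0x80-0xBF, continuation byte *)
let utf8_2 = ['\194'-'\223'] cont          (* lead byte 0xC2-0xDF *)
let utf8_3 = ['\224'-'\239'] cont cont     (* lead byte 0xE0-0xEF *)
let utf8_4 = ['\240'-'\244'] cont cont cont (* lead byte 0xF0-0xF4 *)

(* Count the characters of the input, one multi-byte sequence at a time.
   Returns None on a byte that does not start a recognised sequence. *)
rule count n = parse
  | ascii | utf8_2 | utf8_3 | utf8_4 { count (n + 1) lexbuf }
  | eof                              { Some n }
  | _                                { None }
```

With the generated module, a call such as `count 0 (Lexing.from_string s)` would walk the string one encoded character at a time. The regular expressions still operate on bytes; the encoding knowledge lives entirely in the patterns written by the user.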

huangapple
  • Posted on June 29, 2023 at 17:39:58
  • Link to this article: https://go.coder-hub.com/76579864.html