How does tokenization relate to formalism, lexical grammar, and regular language?

Question

I am reading Bob Nystrom's Crafting Interpreters, and in chapter 5 it says this:

> In the last chapter, the formalism we used for defining the lexical grammar—
the rules for how characters get grouped into tokens—was called a regular
language. That was fine for our scanner, which emits a flat sequence of tokens.
But regular languages aren’t powerful enough to handle expressions which can
nest arbitrarily deeply.

The thing is, in the last chapter he refers to, a tokenizer was implemented, the core of which was a switch statement in Java.

You can find the implementation he refers to here

I am failing to see how such an implementation relates to

> the formalism we used for defining the lexical grammar—
the rules for how characters get grouped into tokens—was called a regular
language

I mean, I did not have the feeling a formalism was defined? I also did not have the feeling I defined a lexical grammar. And apparently what I implemented alongside the book also somehow constitutes a regular language?

I guess my core issue is that I am failing to see the connection between the switch-statement-driven implementation I replicated and concepts like formalism, lexical grammar, regular language, etc.

I would appreciate further elucidation of these concepts and how they relate to the code in the book.

Answer 1

Score: 3

What you've implemented is effectively a maximal munch tokenisation algorithm, written by hand.

The connection to regular languages is that tokens can often be defined as belonging to regular languages, which can be defined using regular expressions. As a result, many lexer generators take regular expressions as input.
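For example (these patterns are my own illustration, not definitions taken from the book), a few typical token classes written as Java regular expressions might look like this:

import java.util.regex.Pattern;

// Illustrative token definitions only; the class name and exact patterns are assumptions.
final class TokenPatterns {
  static final Pattern NUMBER     = Pattern.compile("[0-9]+");
  static final Pattern IDENTIFIER = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*");
  static final Pattern STRING     = Pattern.compile("\"[^\"]*\"");
}

A lexer generator would take descriptions like these and build the recogniser for you; in a handwritten scanner, the same recognisers end up encoded directly in code, as we'll see below.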

Regular languages have the desirable property that membership testing, "is the string x a member of regular language L?", is decidable. This can be proved by construction; we can, for any (vanilla) regular expression, create a deterministic finite automaton (DFA) that, when simulated on input, can terminate and tell us whether the string is a member of the language (if we are left in an "accepting" state). I thoroughly recommend Sipser's "Introduction to The Theory of Computation" for an introductory prerequisite for regular languages and their properties.

So, if we know that some kind of state machine (simulating the idea of a deterministic finite automaton) can be constructed to tell us whether a string belongs to a given regular language, we can go about implementing such a state machine in code. You mentioned the usage of switch statements; these can be used to implement the transition ("delta") function between states of a DFA (where each case is the current input symbol - typically a single character or Unicode codepoint, if your input is, say, UTF-8).

For example, suppose we defined the language of integer literals as being the regular expression: [0-9]+. We could then go about creating a DFA for this regular expression. It's worth noting that R+ is really sugar for RR*, which models the idea of "at least one R". Without going through the hassle of describing the algorithms one can use for constructing the related DFA, I can say that the state machine below is sufficient:

[Figure: a DFA with two states, 0 and 1; a [0-9] edge from state 0 to state 1, a [0-9] self-loop on state 1, and state 1 (double-circled) as the accepting state.]

The nodes are states and the labelled edges between them represent valid transitions. The double circled node represents an accepting state. In other words, if we exhaust the input and the final state we were in is 1, we can accept.

If we wished to implement this DFA explicitly in code, we could do it in many ways. We could explicitly represent the current state using an integer (where the valid states are 0 and 1, as above). Then, to represent the transitions (labelled edges in the diagram above), we could use a hardcoded table, a switch statement (with a case for each in '0', '1', '2', ..., '9'), or we could use a range check (such as a predicate like isdigit from the C standard library). This transformation is straightforward; a rough sketch follows.
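For concreteness (this is my own illustration in Java, not code from the book), an explicit two-state recogniser for [0-9]+ could look like:

// Illustrative sketch of the DFA above, not code from the book.
// The current state is an int (0 = start, 1 = accepting), and the switch over
// the input character plays the role of the labelled transition edges.
static boolean isIntegerLiteral(String input) {
  int state = 0;
  for (int i = 0; i < input.length(); i++) {
    switch (input.charAt(i)) {
      case '0': case '1': case '2': case '3': case '4':
      case '5': case '6': case '7': case '8': case '9':
        state = 1;        // the [0-9] edge from 0 to 1, and the self-loop on 1
        break;
      default:
        return false;     // no valid transition: reject
    }
  }
  return state == 1;      // accept only if we finished in the accepting state
}

A range check such as isDigit(c) would collapse those ten cases into one, which is essentially what the book's scanner does.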

The way it's done in Crafting Interpreters is more typical of handwritten lexers, but effectively models the same kind of idea. If we look at the code, we can find the following predicate test (which captures the idea of [0-9] as a regular expression):

// ...
if (isDigit(c)) {
   number();
// ...

If we imagine that the top-level switch from which this code is taken is where the initial state is, then the call to number() would be a transition into the state expecting to see more of the same ([0-9]*).

If we look at number(), we find that it roughly models what we'd expect of the second state from the diagram:

private void number() {
  while (isDigit(peek())) advance();

  // ... omitted part that handles decimal/fractional point ...

  addToken(NUMBER,
      Double.parseDouble(source.substring(start, current)));
}

The while loop directly models the idea of having a [0-9] transition that stays in the same state. The actual number() in Crafting Interpreters allows for lexing fractional components of doubles, so I've omitted it above. However, a regular expression could model the same.
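For what it's worth (this regex is mine, not quoted from the book), the lexemes that number() accepts, an integer part followed by an optional fractional part, can be described as:

import java.util.regex.Pattern;

class NumberPattern {
  // Illustrative regular expression for the book's NUMBER lexemes:
  // one or more digits, optionally followed by '.' and one or more digits.
  static final Pattern NUMBER = Pattern.compile("[0-9]+(\\.[0-9]+)?");
}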

I recommend playing around with regular expressions, constructing DFAs (you can refer to the algorithms in the dragon book, "Compilers: Principles, Techniques, and Tools", or to tutorials about Brzozowski derivatives), and then implementing them in code (by hand).

Of course, the algorithm of tokenisation is a bit more involved than simply implementing a DFA for a single regular expression. Programming languages typically consist of more than one type of token, and so the DFA would need to model the alternation of many different regular expressions. This is not very difficult; one need only adapt the DFA construction algorithm to work over many regular expressions at the same time. In practice, of course, we have to worry about switching lexer modes (effectively having many lexers) to deal with comments (which can be further complicated by the fact that some languages permit comments to be nested, so seeing /* and then waiting for the next */ is not sufficient for recognising a comment - this is equivalent to the balanced parentheses problem, which is not a regular language!), strings (with their character escape sequences), etc.

The next major problem with the idea of DFAs as the ideal formalisation of tokenisation is that the extent of the input for a DFA is typically known. In the theory, we know the extent of the input for simulating a DFA (its length). In the tokenisation of programming languages, we don't, and, for practical reasons, we want to get the largest possible lexeme each time (this is called "maximal munch tokenisation"). Therefore, the typical approach is to keep simulating the DFA until it fails to yield another state transition (there is no valid transition), at which point we hope that we've just seen a valid token, reset the DFA to its initial state, and continue going. This is a simplification of how maximal munch tokenisation algorithms operate, but it's not far off. I recommend Reps' Maximal Munch Tokenisation in Linear Time paper for pseudocode and related discussion.
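As a rough sketch (illustrative Java, not taken from the book or the paper; step and accepting are hypothetical helpers standing in for the DFA's transition function and accepting-state test), the loop described above looks something like this:

import java.util.ArrayList;
import java.util.List;

// Simplified maximal munch loop; step(state, c) is assumed to return the next
// state or -1 when there is no valid transition, and accepting(state) is
// assumed to test whether a state is accepting.
List<String> tokenize(String source) {
  List<String> tokens = new ArrayList<>();
  int start = 0;
  while (start < source.length()) {
    int state = 0;           // reset the DFA to its initial state for each lexeme
    int lastAccept = -1;     // end index of the longest accepted prefix seen so far
    int i = start;
    while (i < source.length()) {
      int next = step(state, source.charAt(i));
      if (next == -1) break; // the DFA is stuck: stop extending this lexeme
      state = next;
      i++;
      if (accepting(state)) lastAccept = i;
    }
    if (lastAccept == -1) {
      throw new RuntimeException("Unexpected character at index " + start);
    }
    tokens.add(source.substring(start, lastAccept)); // emit the longest lexeme seen
    start = lastAccept;      // continue scanning right after it
  }
  return tokens;
}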

What I hope to have explained is the connection between regular languages, their concrete implementation (simulating a DFA constructed from something describing the regular language - regular expressions), and the kind of skeletal algorithm we place such a transition function into in order to tokenise programming languages (the tokenisation algorithm). The fact is that it's not the ideal model, but it's approximately suitable. Many people use lexer generators with some handwritten, side-effectual, code to deal with the aforementioned problems with lexing real programming languages. Others, such as the author of Crafting Interpreters, prefer to implement it all in code. Either way, the logic of implementing a state machine for a bunch of regular lexemes, with some other bells and whistles for maximal munch and error handling, shines through.
