Why are strings considered tokens in C while arrays aren't?

Question

This is quite a basic theoretical question. I started learning the C language and came across the topic of tokens in C.

Quoting from geeksforgeeks.org,

> A token is the smallest element of a program that is meaningful to the compiler. Tokens can be classified as follows:
>
> 1. Keywords
> 2. Identifiers
> 3. Constants
> 4. Strings
> 5. Special Symbols
> 6. Operators

Why are strings considered tokens while arrays aren't?

Answer 1

Score: 4

Geeksforgeeks is almost as bad a source for learning as ChatGPT.

It is true that strings in C consist of null-terminated character arrays. But what the article means to say is string literals, "these things". That is, a constant used to initialize a character array or used as a read-only string.

Similarly, "constants" does not refer to things like const int x=1; but rather just the number part, 1. This is what formal C means when it refers to an integer constant (sometimes also called an "integer literal", although that term is strictly speaking not correct).
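The distinction between the token and the object can be sketched in a few lines of C. This is only an illustration; the names greeting, message, answer and capitalize_greeting are made up for the example and do not come from the answer.

```c
#include <stddef.h>

/* "hello" is one string-literal token. Here it initializes a modifiable
   array object of 6 chars: five letters plus the terminating null. */
static char greeting[] = "hello";

/* The same kind of token, this time used directly as a read-only string. */
static const char *const message = "hello";

/* 1 is an integer-constant token. answer is a const-qualified object,
   which is not a "constant" in the lexical sense. */
static const int answer = 1;

/* The array exists at run time and can be modified; the token only
   exists in the source text. Returns the array size in bytes (6). */
size_t capitalize_greeting(void)
{
    greeting[0] = 'H';
    return sizeof greeting;
}
```

After calling capitalize_greeting(), the array holds "Hello" while message still points to "hello".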

Note that tokens are mostly a concept that matters when writing macros; it's not something beginners usually have to worry about. The formal grammar (C17 6.4, "Lexical elements") groups everything in C into these groups/sub-chapters:

  • Keywords
  • Identifiers
  • Universal character names
  • Constants
  • String literals
  • Punctuators
  • Header names
  • Preprocessing numbers
  • Comments
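As a rough illustration of why tokens matter for macros, the preprocessor operators # and ## work on tokens rather than on characters. The macro and function names below (STRINGIFY, PASTE, foobar and so on) are hypothetical and only serve the sketch.

```c
/* # turns the argument's token sequence into a string literal.
   Whitespace between the tokens collapses to a single space. */
#define STRINGIFY(tokens) #tokens

/* ## pastes two tokens together into a single new token. */
#define PASTE(a, b) a##b

/* PASTE(foo, bar) produces the single identifier token foobar. */
static const int PASTE(foo, bar) = 42;

const char *stringified_sum(void)
{
    return STRINGIFY(x   +   y);    /* the string literal "x + y" */
}

int pasted_number(void)
{
    return PASTE(12, 34);           /* the single token 1234 */
}

int pasted_identifier(void)
{
    return foobar;                  /* the object defined above */
}
```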

Answer 2

Score: 4

A token is an indivisible parsing unit.

  • ; is a token.
  • + is a token.
  • == is a token.
  • Decimal integer constant 4 is a token.
  • Decimal integer constant 12 is a token.
  • String literal "abc" is a token.
  • Identifier foo is a token.
  • Keyword int is a token.

However,

  • Strings aren't tokens because strings aren't pieces of code. (But see string literals above.)
  • Arrays aren't tokens because arrays aren't pieces of code.
  • Array declarations (e.g. int a[4];) aren't tokens because they are made of multiple other tokens.
  • Array initializers (e.g. { 4, 5, i+2 }) aren't tokens because they are made of multiple other tokens.

You can generally put spaces between tokens, but never within.

  • 12 is not the same as 1 2.
  • "abc" is not the same as "a b c".
  • foo is not the same as f o o.
  • i+2 is the same as i + 2.
  • {4,5,i+2} is the same as { 4, 5, i + 2 }.
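The whitespace rule can be checked with a small sketch. The function names are invented for the example; the two sum functions contain the same tokens and differ only in the spacing between them.

```c
#include <string.h>

/* No optional whitespace at all. The space in "int a" must stay,
   because inta would be lexed as one identifier token. */
int sum_compact(int i){int a[]={4,5,i+2};return a[0]+a[1]+a[2];}

/* The same tokens with whitespace between them. */
int sum_spaced ( int i ) { int a [ ] = { 4 , 5 , i + 2 } ; return a [ 0 ] + a [ 1 ] + a [ 2 ] ; }

/* Whitespace inside a string-literal token changes the token:
   "abc" and "a b c" are different strings. Returns 1 if they differ. */
int literals_differ(void)
{
    return strcmp("abc", "a b c") != 0;
}
```

Both sum functions return the same value for the same argument, whereas writing 1 2 in place of 12, or f o o in place of foo, would not compile.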

Answer 3

Score: 3

When the compiler processes source code, it first splits it into tokens. Example:

printf("%d", 4 << 2);

This is turned into the following tokens:

  • printf
  • (
  • "%d" -- a string literal
  • ,
  • 4
  • <<
  • 2
  • )
  • ;

An array declaration like int a[] = {1, 2, 3}; consists of multiple tokens, therefore it's not a token itself. The a here is a token, but it's not specifically an "array token"; more generally it's an identifier (a "variable name").
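The splitting step can be sketched as a toy tokenizer. This is a heavily simplified illustration, not how a real compiler lexes C: it only assumes identifiers, decimal digits, plain string literals and a handful of punctuators, and the names next_token and count_tokens are made up for the example.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copies the next token of src, starting at *pos, into out and advances
   *pos past it. Returns 1 if a token was found, 0 at the end of input.
   Assumes out_size is at least 1; longer tokens are truncated. */
int next_token(const char *src, size_t *pos, char *out, size_t out_size)
{
    size_t i = *pos, start, len;

    while (isspace((unsigned char)src[i]))      /* whitespace separates tokens */
        i++;
    if (src[i] == '\0')
        return 0;

    start = i;
    if (isalpha((unsigned char)src[i]) || src[i] == '_') {
        /* identifier or keyword */
        while (isalnum((unsigned char)src[i]) || src[i] == '_')
            i++;
    } else if (isdigit((unsigned char)src[i])) {
        /* decimal integer constant */
        while (isdigit((unsigned char)src[i]))
            i++;
    } else if (src[i] == '"') {
        /* string literal, including both quotes */
        i++;
        while (src[i] != '\0' && src[i] != '"') {
            if (src[i] == '\\' && src[i + 1] != '\0')
                i++;                            /* skip the escaped character */
            i++;
        }
        if (src[i] == '"')
            i++;
    } else if ((src[i] == '<' || src[i] == '>' || src[i] == '=')
               && src[i + 1] == src[i]) {
        i += 2;                                 /* two-character <<, >>, == */
    } else {
        i++;                                    /* any other single punctuator */
    }

    len = i - start;
    if (len >= out_size)
        len = out_size - 1;
    memcpy(out, src + start, len);
    out[len] = '\0';
    *pos = i;
    return 1;
}

/* Counts the tokens in src by calling next_token() until it returns 0. */
size_t count_tokens(const char *src)
{
    char buf[64];
    size_t pos = 0, n = 0;

    while (next_token(src, &pos, buf, sizeof buf))
        n++;
    return n;
}
```

Under these assumptions, the printf call from the example yields 9 tokens, and int a[] = {1, 2, 3}; yields 13, none of which is an "array".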

Side note on printf(): That function itself will also kind-of tokenize the string it receives as its first argument. The only distinction is whether a character is part of a % placeholder or not, so it's much simpler. The principle stays the same though.
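A minimal sketch of that idea follows. It only separates ordinary characters from % placeholders and counts the latter; it is not how any actual printf() implementation is written, and count_placeholders is a hypothetical helper.

```c
#include <stddef.h>

/* Counts the % placeholders in a format string. "%%" is treated as a
   literal percent sign, and a lone '%' at the very end is ignored.
   Flags, widths and length modifiers are not validated. */
size_t count_placeholders(const char *fmt)
{
    size_t n = 0;

    for (; *fmt != '\0'; fmt++) {
        if (*fmt != '%')
            continue;               /* ordinary character */
        if (fmt[1] == '\0')
            break;                  /* stray '%' at the end */
        if (fmt[1] == '%') {
            fmt++;                  /* "%%" prints one '%', no argument */
            continue;
        }
        n++;                        /* start of a placeholder such as %d */
    }
    return n;
}
```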

Answer 4

Score: 0

It is often better to go directly to the source. The C language syntax is described semi-formally in the C Standard (C17), Annex A. The first section, A.1.1 (Lexical elements), states:

  token:
      keyword
      identifier
      constant
      string-literal
      punctuator

Notice that "string-literal" is explicitly listed as a kind of token, while nothing array-related appears in the list.
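The five alternatives of that grammar rule map naturally onto an enumeration. The classifier below is only a rough sketch under simplifying assumptions: the keyword table is deliberately partial, prefixed literals such as L"abc" are not handled, and the names token_kind and classify_token are invented for the example.

```c
#include <ctype.h>
#include <string.h>

/* One enumerator per alternative of the "token" grammar rule. */
enum token_kind {
    TK_KEYWORD,
    TK_IDENTIFIER,
    TK_CONSTANT,
    TK_STRING_LITERAL,
    TK_PUNCTUATOR
};

/* Classifies an already-separated token by looking at its spelling. */
enum token_kind classify_token(const char *tok)
{
    /* Partial keyword list, for illustration only. */
    static const char *const keywords[] = {
        "char", "const", "if", "int", "return", "while"
    };
    size_t i;

    if (tok[0] == '"')
        return TK_STRING_LITERAL;
    if (isdigit((unsigned char)tok[0]) || tok[0] == '\'')
        return TK_CONSTANT;         /* number or character constant */
    if (isalpha((unsigned char)tok[0]) || tok[0] == '_') {
        for (i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
            if (strcmp(tok, keywords[i]) == 0)
                return TK_KEYWORD;
        return TK_IDENTIFIER;
    }
    return TK_PUNCTUATOR;
}
```

There is no enumerator for arrays, because an array is never a single lexical element: a declaration such as int a[4]; is classified piece by piece as keyword, identifier, punctuator, constant, punctuator, punctuator.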

As to why: the C language is built up from the same layers as a natural language: letters, words and sentences. When the compiler reads a program file, it reads the characters of the file and groups them into tokens the same way we group letters into words when reading a book. It then interprets the program as a sequence of tokens the same way as we interpret a text as a sequence of words.

A string-literal being a token is simply a decision taken by the designers of the C language because it makes sense when describing the syntax and semantics of the language.

huangapple
  • Posted on March 1, 2023, 15:13:31
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75600544.html