Matching a regex metacharacter literally

Question


My understanding of regexes in AWK is that in order to match a regex metacharacter literally (for example +, $, ^, *, etc.), you must escape it, like so:

awk -F '\\+' 'program here'

However, I've noticed that you don't actually need to do this with certain metacharacters, such as the "+".

Input file:

this|is|a|line
this+is+a+line
this?is?a?line
this*is*a*line
this$is$a$line
this.is.a.line

AWK program:


#!/usr/bin/awk -f
BEGIN { FS = "+|^"}

{print $1,$2,$3,$4 }

Expected output (because the + is not escaped):

this|is|a|line
this+is+a+line
this?is?a?line
this*is*a*line
this$is$a$line
this.is.a.line

Actual output:


this|| is|| a|| line
his|is|a|line
this is a line
his?is?a?line
his*is*a*line
his$is$a$line
his.is.a.line

I don't understand how this is working. I'm giving AWK blatantly bad code by not escaping the metacharacter (to make it literal), yet AWK matches successfully anyway.

I own a copy of "The AWK Programming Language", so I went through the section on regexes just to make sure I'm not going mad, and it states the following:

> In a matching expression, a quoted string like "^[0-9]+$" can normally be used interchangeably with a regular expression enclosed in slashes, such as /^[0-9]+$/. There is one exception, however. If the string in quotes is to match a literal occurrence of a regular expression metacharacter, one extra backslash is needed to protect the protecting backslash itself. That is,
>
> $0 ~ /(\+|-)[0-9]+/
>
> and
>
> $0 ~ "(\\+|-)[0-9]+"
>
> are equivalent.
>
> This behavior may seem arcane, but it arises because one level of protecting backslashes is removed when a quoted string is parsed by awk. If a backslash is needed in front of a metacharacter to turn off its special meaning in a regular expression, then that backslash needs a preceding backslash to protect it in a string.
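The equivalence described in the quoted passage can be checked directly. The following is a minimal sketch; the function names match_literal and match_string are made up for this illustration and are not from the book. Both functions should report a match only when a sign is followed by at least one digit.

```shell
# Regex literal: a single backslash makes the + literal.
match_literal() {
    printf '%s\n' "$1" | awk '$0 ~ /(\+|-)[0-9]+/ { print "match" }'
}

# Quoted string: awk strips one level of backslashes while parsing the
# string, so "\\+" reaches the regex engine as \+ .
match_string() {
    printf '%s\n' "$1" | awk '$0 ~ "(\\+|-)[0-9]+" { print "match" }'
}

match_literal '+42'   # signed number: matches
match_string  '+42'   # same result via the string form
match_literal '42'    # no sign, so nothing is printed
match_string  '42'    # likewise prints nothing
```

The awk programs are wrapped in single quotes so that the shell passes the backslashes through unchanged; only awk's own string parsing removes a level.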

Can someone explain what I'm missing here?

Answer 1

Score: 1

The + is at the start of the pattern, so there is nothing before it for it to repeat ("one or more" of a nonexistent preceding character is meaningless). awk therefore interprets it as a literal + character rather than as a quantifier.

From the gawk manual, in the section on regexp operator details:

> In POSIX awk and gawk, the ‘*’, ‘+’, and ‘?’ operators stand for themselves when there is nothing in the regexp that precedes them. For example, /+/ matches a literal plus sign. However, many other versions of awk treat such a usage as a syntax error.
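Since the manual notes that a leading unescaped + is not handled the same way by every awk, it may be safer not to rely on it. Here is a small sketch of two spellings that should not depend on that special case; the function names split_escaped and split_bracket are hypothetical helpers written for this example.

```shell
# Escaped form, as in the question: awk turns the -F argument '\\+'
# into the regex \+ , which matches a literal plus sign.
split_escaped() {
    printf '%s\n' "$1" | awk -F '\\+' '{ print $1, $2, $3, $4 }'
}

# Bracket expression: inside [...] the + has no special meaning,
# so no backslashes are needed at all.
split_bracket() {
    printf '%s\n' "$1" | awk -F '[+]' '{ print $1, $2, $3, $4 }'
}

split_escaped 'this+is+a+line'
split_bracket 'this+is+a+line'
```

The bracket form is arguably the easier one to read, because it avoids counting how many layers (shell, awk string, regex) each backslash has to survive.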

huangapple
  • Posted on 2023-05-21 06:22:40
  • Link to this article: https://go.coder-hub.com/76297565.html