I am looking for a regular expression in Python that identifies all the function bodies in a C file

Question

I am looking for a regular expression (RE) in Python that identifies all the C functions in a file.

I want to automate the insertion of a comment at the start of each function.
For example, a function looks like this:

static my_struct1* alloc_mem (my_struct2* a)
{

...

}

I want to insert a comment so that it looks like this:

static my_struct1* alloc_mem (my_struct2* a)
{
/* my comment */
...

}

So I want to identify the head of every function body (the part ending with {) and insert the comment there.

I tried the code below:

import re

def insert_comment(content):
    comment = "/* my comment */"  # not inserted yet; this attempt only prints the matches
    pattern = r'[^(if|else|switch|for|if\s+|else\s+|switch\s+|for\s+)]\(.*\)(\s|\n)*\{'
    matches = list(re.finditer(pattern, content))
    for match in matches:
        print('*****')
        print(match.group())

with open(filename, "r") as i:  # filename: path to the C source file
    content = i.read()
insert_comment(content)

But this also matches if and else statements that contain nested parenthesized expressions.

For example,

if(MACRO(expa) && MACRO(expb)) {

the matched text will be

O(expb)) {

What would be a better RE for finding the start of a function body?

Answer 1

Score: 0

As others have commented, C syntax is too complex to analyze rigorously
without a dedicated parser.
For your requirements, however, some under- or over-detection looks
acceptable (harmless), because merely inserting comments does not introduce
defects into the original source code.

Here is a simple starting point that inserts comments at the desired points
under limited conditions:

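#!/usr/bin/python

import sys
import regex  # third-party module: pip install regex

filename = sys.argv[1]  # assumption: the C source path is passed on the command line

with open(filename) as f:
    s = f.read()

# \1 re-emits the matched function head; without it the head itself would be replaced
m = regex.sub(r'\b(?:if|switch|for|while|until)\b(*SKIP)(*FAIL)|(([A-Za-z_]\w*\s*\((?:[^()]+|(?2))*\))\s*{)', r'\1\n/* my comment */', s)
print(m)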

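[Explanations]

  • regex is a PyPI module with additional regex functionality
    (install it with pip install regex).
  • (?:if|switch|for|while|until) lists the reserved words whose syntax
    resembles a C function call (to be excluded). Note that until is not
    actually a C keyword, so it can be dropped from the list.
  • \b(?:if|switch|for|while|until)\b(*SKIP)(*FAIL) discards these matches:
    the engine fails at the keyword and resumes searching after it.
  • (([A-Za-z_]\w*\s*\((?:[^()]+|(?2))*\))\s*{) is a recursive regex that
    matches a C function name followed by balanced parentheses and the
    opening {; (?2) recurses into group 2 to handle nested parentheses.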
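
If the third-party regex module is not available, the same idea (skip the reserved words, require balanced parentheses before the opening brace) can be approximated with the standard re module and a manual parenthesis counter. This is only a rough sketch, not the approach used in this answer: the helper names (matching_paren, find_body_starts, insert_comment) and the keyword set are assumptions made for illustration, parentheses inside string literals or comments are not handled, and function-like macros such as TAILQ_FOREACH are still detected.

```python
import re

# Reserved words that look like calls but never start a function body
# (assumed list; extend as needed).
KEYWORDS = {"if", "switch", "for", "while"}

# An identifier followed by optional whitespace and an opening parenthesis.
CALL_LIKE = re.compile(r'\b([A-Za-z_]\w*)\s*\(')


def matching_paren(src, open_idx):
    """Return the index of the ')' that closes src[open_idx], or -1."""
    depth = 0
    for i in range(open_idx, len(src)):
        if src[i] == '(':
            depth += 1
        elif src[i] == ')':
            depth -= 1
            if depth == 0:
                return i
    return -1


def find_body_starts(src):
    """Return the offsets just past each '{' that opens a function body."""
    starts = []
    pos = 0
    while True:
        m = CALL_LIKE.search(src, pos)
        if m is None:
            return starts
        pos = m.end()  # the next search resumes right after this '('
        close_idx = matching_paren(src, m.end() - 1)
        if close_idx == -1 or m.group(1) in KEYWORDS:
            continue
        # Skip whitespace between ')' and a possible '{'.
        j = close_idx + 1
        while j < len(src) and src[j].isspace():
            j += 1
        if j < len(src) and src[j] == '{':
            starts.append(j + 1)


def insert_comment(src, comment="/* my comment */"):
    """Insert the comment on its own line after every detected '{'."""
    out = []
    last = 0
    for p in find_body_starts(src):
        out.append(src[last:p])
        out.append("\n" + comment)
        last = p
    out.append(src[last:])
    return "".join(out)


if __name__ == "__main__":
    sample = "static my_struct1* alloc_mem (my_struct2* a)\n{\n...\n}\n"
    print(insert_comment(sample))
```

Counting parentheses by hand trades the compactness of the recursive pattern for code that runs on a stock Python installation and is easier to step through in a debugger.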

[Edit]
If you are sure that macro names consist only of uppercase letters and underscores, you can exclude them by tweaking the regex as follows:

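#!/usr/bin/python

import sys
import regex  # third-party module: pip install regex

filename = sys.argv[1]  # assumption: the C source path is passed on the command line

with open(filename) as f:
    s = f.read()

# [A-Z_]+ joins the exclusion list; \1 re-emits the matched function head
m = regex.sub(r'\b(?:if|switch|for|while|until|[A-Z_]+)\b(*SKIP)(*FAIL)|(([A-Za-z_]\w*\s*\((?:[^()]+|(?2))*\))\s*{)', r'\1\n/* my comment */', s)
print(m)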

Input example:

static my_struct1* alloc_mem (my_struct2* a)
{ 
// function
}

some_function(foo) {
// function
}

if(MACRO(expa) && MACRO(expb)) {
// reserved word
}

TAILQ_FOREACH(entry, hent, next) {
// macro
}

while(true) {
// reserved word
}

Output:

static my_struct1* alloc_mem (my_struct2* a)
{
/* my comment */
// function
}

some_function(foo) {
/* my comment */
// function
}

if(MACRO(expa) && MACRO(expb)) {
// reserved word
}

TAILQ_FOREACH(entry, hent, next) {
// macro
}

while(true) {
// reserved word
}

[Explanation]

The regex [A-Z_]+, which matches names made up entirely of uppercase letters
and underscores, is appended to the exclusion list:

(?:if|switch|for|while|until|[A-Z_]+)
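
To see which identifiers this extra alternative excludes, here is a small sketch using only the standard re module. Because the alternative sits between two \b anchors, it behaves roughly like a full match against the whole identifier. The helper is_excluded and the names LIST2_FOREACH and INIT are hypothetical, chosen only to show the edge cases: a macro name containing a digit is not excluded, and a function whose name is entirely uppercase would be skipped.

```python
import re

# The alternative that the edited pattern adds to the exclusion list.
UPPERCASE_NAME = re.compile(r'[A-Z_]+')


def is_excluded(name):
    """True if the whole identifier is uppercase letters and underscores."""
    return UPPERCASE_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for name in ["TAILQ_FOREACH", "MACRO", "alloc_mem",
                 "some_function", "LIST2_FOREACH", "INIT"]:
        print(name, is_excluded(name))
```

If the code base uses digits in macro names, widening the class to something like [A-Z_][A-Z0-9_]* may be a better fit, at the cost of excluding more all-uppercase function names.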
