Looking for a way to split strings in pandas dataframe depending on OR, AND and parentheses.

Question

I have a pandas dataframe with a column of Components and a column of Constraints. The constraints decide which category each component is filtered into. They are not very straightforward, so I'm looking for a way to split them into several smaller, more readable constraints. For example, if a constraint is 'A and (B or C)', I want to split it into two rows, 'A and B' and 'A and C'. Not all constraints are as easy as this example, though.

Here's what a small selection of the dataframe might look like:

Component Constraint
123 A and (B or C)
456 ((MIRROR='ELECTRIC' and MIRRORCAMERA!='NO') or (MIRROR='MANUAL' and (MIRROR_RIGHT!='NO' or MIRROR_LEFT!='NO'))) and STEERWHEEL_LOCK='NO'
789 LENGTH='122' or (LENGTH='135' and BATTERY='551') or LENGTH='149' or (LENGTH='181' and (BATTERY='674' or (BATTERY='551' and CHARGER!='NO')))

or, as code:

```python
import pandas as pd
dataex = {'Component': [123,
                        456,
                        789],
          'Constraint': ["A and (B or C)",
                         "((MIRROR='ELECTRIC' and MIRRORCAMERA!='NO') or (MIRROR='MANUAL' and (MIRROR_RIGHT!='NO' or MIRROR_LEFT!='NO'))) and STEERWHEEL_LOCK='NO'",
                         "LENGTH='122' or (LENGTH='135' and BATTERY='551') or LENGTH='149' or (LENGTH='181' and (BATTERY='674' or (BATTERY='551' and CHARGER!='NO')))"]}
df_example = pd.DataFrame(data=dataex)
```

As I said, I'm hoping to split all of these into multiple rows (where needed), depending on the ands, ors and parentheses in the constraint. This is the result I have in mind:

Component Constraint
123 A and B
123 A and C
456 STEERWHEEL_LOCK='NO' and MIRROR='ELECTRIC' and MIRRORCAMERA!='NO'
456 STEERWHEEL_LOCK='NO' and MIRROR='MANUAL' and MIRROR_RIGHT!='NO'
456 STEERWHEEL_LOCK='NO' and MIRROR='MANUAL' and MIRROR_LEFT!='NO'
789 LENGTH='122'
789 LENGTH='135' and BATTERY='551'
789 LENGTH='149'
789 LENGTH='181' and BATTERY='674'
789 LENGTH='181' and BATTERY='551' and CHARGER!='NO'

or, as code:

```python
import pandas as pd
datares = {'Component': [123, 123, 456, 456, 456, 789, 789, 789, 789, 789],
           'Constraint': ["A and B",
                          "A and C",
                          "STEERWHEEL_LOCK='NO' and MIRROR='ELECTRIC' and MIRRORCAMERA!='NO'",
                          "STEERWHEEL_LOCK='NO' and MIRROR='MANUAL' and MIRROR_RIGHT!='NO'",
                          "STEERWHEEL_LOCK='NO' and MIRROR='MANUAL' and MIRROR_LEFT!='NO'",
                          "LENGTH='122'",
                          "LENGTH='135' and BATTERY='551'",
                          "LENGTH='149'",
                          "LENGTH='181' and BATTERY='674'",
                          "LENGTH='181' and BATTERY='551' and CHARGER!='NO'"
                          ]}
df_result = pd.DataFrame(data=datares)
```

I've tried splitting the constraints on 'or', dividing them into arrays and looping over them to get the result, but with the more difficult constraints you get arrays inside arrays inside arrays, and it gets very messy after a while.
I've also tried building a sort of logic tree, but I haven't gotten that to work in Python yet.

I'm hoping some of you have a good idea or a module to help me with my problem.
Thanks!

Answer 1

Score: 0

From your description, I think you need to put the expression in "disjunctive normal form" (DNF), which looks like Or(And(v1,v2), And(v1,v4), ...). The code below does this using an extra package, pyeda (DNF is common in electronic design automation). To install the package, run `pip3 install pyeda`.
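
To make the target form concrete, here is a small standard-library sketch of what "Or of Ands" means for the first constraint. It is only an illustration, not what pyeda does internally; the helper names `dnf_or` and `dnf_and` and the list-of-lists representation are assumptions made for this example.

```python
from itertools import product

# A DNF is modelled as a list of conjunctions; each conjunction is a list of terms.
def dnf_or(left, right):
    # OR simply collects the conjunctions of both sides.
    return left + right

def dnf_and(left, right):
    # AND distributes over OR: pair every conjunction on the left
    # with every conjunction on the right.
    return [l + r for l, r in product(left, right)]

a, b, c = [["A"]], [["B"]], [["C"]]
rows = dnf_and(a, dnf_or(b, c))            # A and (B or C)
print([" and ".join(conj) for conj in rows])  # ['A and B', 'A and C']
```

Each inner list becomes one output row, which is why a constraint in DNF maps directly onto the rows you asked for.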

The code that splits an expression into the corresponding And expressions is below.

Notes:

  • the regular expressions matching the filters might need adjusting if the filters contain more than numbers and letters (question marks, etc.)
  • you gave no example of how to negate a standalone variable, so I used "not(A)"
  • if you had no standalone variables (only comparisons), the code would be much simpler
  • if you have more operators (less than, etc.), the code will be slightly more complex
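
As a rough illustration of the first and last notes, the sketch below shows one way the comparison pattern could be widened. The operator set and the name `COMPARISON` are assumptions for this example and are not part of the code in this answer; the quoted value is matched as "anything up to the next quote", so characters such as question marks are accepted.

```python
import re

# Hypothetical operator set; two-character operators are listed first
# so that "<=" is not read as "<" followed by "=".
COMPARISON = re.compile(r"([A-Za-z0-9_]+)\s*(!=|<=|>=|=|<|>)\s*('[^']*')")

text = "LENGTH>='135' and (BATTERY!='551' or CHARGER='NO?')"
print(COMPARISON.findall(text))
# [('LENGTH', '>=', "'135'"), ('BATTERY', '!=', "'551'"), ('CHARGER', '=', "'NO?'")]
```

Capturing the name, operator and value separately would also make it easier to negate a comparison later, since only the operator has to be swapped.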

Overall idea:

  • transform your expression into a boolean expression of the form (x & y) | z . This is done by making all comparisons a variable.
  • transform the boolean expression into DNF
  • replace the variables with the original comparisons again.

```python
from pyeda.inter import expr
import re
import pandas as pd

dataex = {'Component': [123,
                        456,
                        789],
          'Constraint': ["A and (B or C)",
                         "((MIRROR='ELECTRIC' and MIRRORCAMERA!='NO') or (MIRROR='MANUAL' and (MIRROR_RIGHT!='NO' or MIRROR_LEFT!='NO'))) and STEERWHEEL_LOCK='NO'",
                         "LENGTH='122' or (LENGTH='135' and BATTERY='551') or LENGTH='149' or (LENGTH='181' and (BATTERY='674' or (BATTERY='551' and CHARGER!='NO')))"]}
df_example = pd.DataFrame(data=dataex)

def transform_expr(input_expression):

    def move_not_before(x):
        # Rewrite NAME!='V' as ~NAME='V' so the negation becomes a boolean NOT
        if '!=' in x.group(1):
            return '~' + x.group(1).replace("!=", "=")
        else:
            return x.group(1)

    expr1 = re.sub(r"([a-zA-Z0-9'_]*!=[a-zA-Z0-9'_]*)", move_not_before, input_expression)

    variables = {}   # placeholder name -> original comparison
    values = {}      # original comparison -> placeholder name
    current_key = 0
    expr2 = ""
    last_idx = 0
    # Make transformations to reach a boolean form
    for idx in re.finditer(r"([a-zA-Z0-9'_]*=[a-zA-Z0-9'_]*)", expr1):
        expr2 += expr1[last_idx:idx.span(1)[0]]
        if idx[1] in values:
            expr2 += values[idx[1]]
        else:
            expr2 += f'v{current_key}'
            variables[f'v{current_key}'] = idx[1]
            values[idx[1]] = f'v{current_key}'
            current_key += 1
        last_idx = idx.span(1)[1]
    expr2 += expr1[last_idx:]

    # Word boundaries keep "and"/"or" inside a standalone name from being replaced
    expr3 = re.sub(r"\band\b", "&", expr2)
    expr4 = re.sub(r"\bor\b", "|", expr3)
    expr5 = expr(expr4).to_dnf()

    result = []
    # expr5 is normally like Or(And(...),And(...)...), and xs has the children.
    # If there is no top-level Or, treat the whole expression as a single term.
    terms = expr5.xs if str(expr5).startswith("Or(") else [expr5]
    for v in terms:
        # We remove the And(...)
        if "," in str(v):
            arr = str(v)[4:-1].replace(" ", "").split(",")
        else:
            # For cases in which you have only one variable
            arr = [str(v)]
        r = []
        for x in arr:
            if x[0] == "~":
                variable_name = x[1:]
            else:
                variable_name = x

            if variable_name not in variables:
                # How do we negate a standalone variable?
                if x[0] == "~":
                    variable_value = f"not({variable_name})"
                else:
                    variable_value = variable_name
            else:
                variable_value = variables[variable_name]

            if x[0] == "~":
                r.append(variable_value.replace("=", "!="))
            else:
                r.append(variable_value)

        result.append(" and ".join(r))

    return result

df_example['Constraint'] = df_example['Constraint'].map(transform_expr)

# One row per And expression
df_result = df_example.explode('Constraint')

pd.set_option("max_colwidth", None)
print(df_result)
```

And it will print:

   Component                                                         Constraint
0        123                                                            A and B
0        123                                                            A and C
1        456  MIRROR='ELECTRIC' and MIRRORCAMERA!='NO' and STEERWHEEL_LOCK='NO'
1        456    MIRROR='MANUAL' and MIRROR_RIGHT!='NO' and STEERWHEEL_LOCK='NO'
1        456     MIRROR='MANUAL' and MIRROR_LEFT!='NO' and STEERWHEEL_LOCK='NO'
2        789                                                       LENGTH='122'
2        789                                                       LENGTH='149'
2        789                                     LENGTH='135' and BATTERY='551'
2        789                                     LENGTH='181' and BATTERY='674'
2        789                   BATTERY='551' and LENGTH='181' and CHARGER!='NO'
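
If installing pyeda is not an option, the same splitting can be sketched with the standard library alone. This is a minimal sketch and not the method used above: it assumes the grammar seen in the question (only `and`, `or`, parentheses, standalone names, and quoted `=`/`!=` comparisons), and the names `TOKEN` and `split_constraint` are made up for illustration. It does not simplify redundant terms and does not handle negation of a whole sub-expression.

```python
import re
from itertools import product

# Tokens: parentheses, the keywords, a quoted comparison, or a standalone name.
TOKEN = re.compile(
    r"\(|\)|\band\b|\bor\b|[A-Za-z0-9_]+\s*(?:!=|=)\s*'[^']*'|[A-Za-z0-9_]+"
)

def split_constraint(text):
    """Return the constraint as a list of 'x and y and ...' strings (its DNF)."""
    tokens = TOKEN.findall(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of constraint")
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        # or_expr := and_expr ('or' and_expr)*  -> concatenate the alternatives
        terms = parse_and()
        while peek() == "or":
            take()
            terms = terms + parse_and()
        return terms

    def parse_and():
        # and_expr := atom ('and' atom)*  -> distribute AND over the alternatives
        terms = parse_atom()
        while peek() == "and":
            take()
            right = parse_atom()
            terms = [l + r for l, r in product(terms, right)]
        return terms

    def parse_atom():
        tok = take()
        if tok == "(":
            inner = parse_or()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return inner
        if tok in (")", "and", "or"):
            raise ValueError(f"unexpected token {tok!r}")
        return [[tok]]

    rows = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return [" and ".join(conj) for conj in rows]

print(split_constraint("A and (B or C)"))  # ['A and B', 'A and C']
```

With the dataframe from the question, `df_example['Constraint'].map(split_constraint)` followed by `explode('Constraint')` should give one row per conjunction, with the terms in the order they appear in the original constraint.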

huangapple
  • Posted on May 11, 2023 at 14:27:29
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76224708.html