How can I create a new column in a pandas dataframe using an f-string template with unknown column names?

Question

I want to write a script/function to process a fairly generic pandas dataframe. The data frame is the result of a pd.merge() operation between two data frames, one of which is supplied by the user in the form of a CSV file with arbitrary columns. I know all the values are text but that's about it.

I want to create a new column in the dataframe based on the values of a combination of other columns. This is a fairly easy problem when the column names are known, but in this case I know neither the names of the columns nor how they are to be combined in advance.

Suppose the dataframe df looks like this:

print(df)
     col1 col2
row1 abc  def
row2 ghi  jkl

Suppose also that the desired manipulation is "create a new column by adding the suffix _x to the values in col2". It would be nice if the user could express a template for this manipulation using f-strings. In this case, the template might be ...

template = 'f"{col2}_x"'

The double-quoting is intentional -- I want to delay the evaluation of the template until it is applied to the dataframe. Note that the user will know the names of at least some columns (the ones that they supplied via the CSV file), so they can specify a template based on these column names.

I was hoping that I could simply use pd.eval() or more specifically df.eval() which evaluates an expression using the namespace provided by the column names.

Something like

df["new_col"] = df.eval(template)

But this returns ...

AttributeError: 'PythonExprVisitor' object has no attribute 'visit_JoinedStr'

... and I gather that pd.eval() does not support f-strings as of March 2023: https://github.com/pandas-dev/pandas/issues/52310

I guess I could export the dataframe to a dictionary and then iterate over the rows, but I was really hoping for a neat, pandas-esque solution.
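For reference, a minimal sketch of that dictionary-based fallback. It assumes the user supplies a plain format string such as "{col2}_x" rather than a quoted f-string, so str.format_map() can fill it in without eval(). The name plain_template is hypothetical and used only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": ["abc", "ghi"], "col2": ["def", "jkl"]},
    index=["row1", "row2"],
)

# Hypothetical user input: a plain format string, not a quoted f-string.
plain_template = "{col2}_x"

# Each record is a dict mapping column name -> value for one row.
records = df.to_dict("records")

# Fill the template once per row; the list is assigned positionally.
df["new_col"] = [plain_template.format_map(rec) for rec in records]

print(df)
```

This still iterates row by row, so it is not vectorised, but it needs no knowledge of the column names ahead of time.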

Update:

I've also been trying to achieve something similar with pd.DataFrame.apply().

In the concrete case, I can iterate over the dataframe as follows:

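ser = pd.Series(dtype=object)
for index, row in df.iterrows():
    # Inject this row's values as variables so the f-string can see them.
    # This mutates the dict returned by locals(), which is not dependable inside a function.
    locals().update(row.to_dict())
    ser[index] = eval(template)
df["new_col"] = ser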

This might have to do, but locals().update() feels like a hack and all the advice I have read recommends vectorising over iterating. I'm also having a tough time wrapping this in a function (named or lambda) to pass into pd.DataFrame.apply() because again I don't know the column names ahead of time.
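One way the loop might be wrapped for pd.DataFrame.apply() without locals().update() is to pass each row's values to eval() as its local namespace. This is a sketch under assumptions: apply_template is a hypothetical helper name, the column names must be valid Python identifiers, and eval() on user-supplied text remains unsafe.

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": ["abc", "ghi"], "col2": ["def", "jkl"]},
    index=["row1", "row2"],
)

# The quoted f-string from the question.
template = 'f"{col2}_x"'


def apply_template(row, template=template):
    """Evaluate the template with the row's values as local names."""
    # The empty dict is the globals namespace; eval() adds __builtins__ to it.
    # row.to_dict() maps column name -> value, so {col2} resolves per row.
    return eval(template, {}, row.to_dict())


# axis=1 calls the function once per row, so no column names are hard-coded.
df["new_col"] = df.apply(apply_template, axis=1)

print(df)
```

The work is still done row by row, so this tidies the loop rather than vectorising it.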

Answer 1

Score: 0

Here is one way to do it with Python's built-in function eval, which must be used with caution:

import pandas as pd

df = pd.DataFrame({"col1": ["abc", "ghi"], "col2": ["def", "jkl"]})

template = 'df[f"{name}"]+"_x"'

name = "col2"
df["new_col"] = eval(template)

Then:

print(df)
# Output

  col1 col2 new_col
0  abc  def   def_x
1  ghi  jkl   jkl_x
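As an illustration of the caution mentioned above, the same expression can be evaluated with explicit namespaces so that it only sees the names it needs. This is a sketch, not a sandbox: the namespace dict is a hypothetical name, and eval() on untrusted input remains unsafe even with builtins removed.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["abc", "ghi"], "col2": ["def", "jkl"]})

template = 'df[f"{name}"] + "_x"'

# Only these names are visible to the evaluated expression.
namespace = {"df": df, "name": "col2"}

# An empty __builtins__ hides functions such as open() and __import__()
# from simple expressions, but it does not make eval() safe in general.
df["new_col"] = eval(template, {"__builtins__": {}}, namespace)

print(df)
```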

Answer 2

Score: 0

I should have added that this is for a command line tool, and the user won't know the name of the internal database any more than they know the names of the internal columns. (Only the columns that they themselves provide in a CSV file.)

Taking inspiration from @Laurent, I can ask the user to supply an "f-string-like" string that is read in as a regular string. Then I can wrangle it into a format that will work with Python's eval().

So, if the user supplies "{col2}_x" and this input is available as a variable called template ...

import re

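# Turn each {name} placeholder into a df["name"] lookup joined by string concatenation,
# so "{col2}_x" becomes the expression: "" + df["col2"] + "_x"
template = re.sub(r'\{', '" + df["', template)
template = re.sub(r'\}', '"] + "', template)
template = '"' + template + '"'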

df['new_col'] = eval(template)

Pretty ugly, but should work.

Still hoping for a neat pandas-based solution, but not going to hold my breath.
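One possible eval-free alternative, offered as a sketch rather than a definitive solution: the standard library's string.Formatter can split the user's template into literal text and field names, and the pieces can then be joined with vectorised Series concatenation. The names render_template and user_template are hypothetical. The sketch assumes every field names an existing column and ignores any format specs or conversions in the template.

```python
import string

import pandas as pd

df = pd.DataFrame({"col1": ["abc", "ghi"], "col2": ["def", "jkl"]})

# Hypothetical user input, read in as a regular string.
user_template = "{col2}_x"


def render_template(frame, template):
    """Build a string column from a '{column}literal' style template."""
    # Start from a Series of empty strings aligned with the frame's index.
    result = pd.Series("", index=frame.index)
    # parse() yields (literal_text, field_name, format_spec, conversion).
    for literal, field, _spec, _conv in string.Formatter().parse(template):
        result = result + literal
        if field is not None:
            # Vectorised concatenation; a missing column raises KeyError.
            result = result + frame[field].astype(str)
    return result


df["new_col"] = render_template(df, user_template)

print(df)
```

Because nothing is evaluated as code, column names do not have to be valid Python identifiers, and a template cannot execute arbitrary expressions.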

huangapple
  • Posted on 2023-05-29 17:52:30
  • Original link: https://go.coder-hub.com/76356324.html