Vectorize addition of multiple pandas columns

# Question

I have a sample dataframe:

```text
   val1  val2  val3  val4  total
0     1     2     3     4     50
1     5     6     4     8     65
2     2     3     6     5     74
```

I would like to create some new columns based on simple addition and division of existing columns. Each result is a proportion (ratio), so I set up a quick function to do this:

```python
def vectorize(df, value_cols, total_col):
    return df[value_cols] / df[total_col]
```

Now I would like to get the proportion of `val1` in relation to `total`:

```python
total = 'total'
values = 'val1'

df['result'] = vectorize(df, values, total)
```

```text
   val1  val2  val3  val4  total    result
0     1     2     3     4     50  0.020000
1     5     6     4     8     65  0.076923
2     2     3     6     5     74  0.027027
```

That looks great.

But when I want to add multiple columns before dividing by `total`, I run into problems.

For example, I try to find the proportion of `val1 + val2` in relation to `total`:

```python
total = 'total'
values = ['val1', 'val2']

df['result2'] = vectorize(df, values, total)
```

```text
ValueError: Wrong number of items passed 5, placement implies 1
```

This doesn't work because my column names are now in a list; the function returns a dataframe of `NaN`s.

Is there an easy way around this?

I tried adding `sum()` to the function, but I'm not getting the correct results:

```python
return df[value_cols].sum() / df[total_col]
```

I also tried using `enumerate` to add the columns together before dividing by `total`:

```python
for i, col in enumerate(value_cols):
    sums = df[col] += df[col]
    return sums / df[total_col]
```

But none of this is working.

I am basically trying to automate this in the function:

```python
(df['val1'] + df['val2']) / df['total']
```

But it should also allow for single columns, i.e. just `val1` alone.

My real-world dataframe has hundreds of columns, and I would like to add one or multiple columns together like this. I could type it all out by hand, but I wanted to try to speed it up by creating a vectorized function.

My df for reproducibility:

```python
import pandas as pd

df = pd.DataFrame({
    'val1': [1, 5, 2],
    'val2': [2, 6, 3],
    'val3': [3, 4, 6],
    'val4': [4, 8, 5],
    'total': [50, 65, 74],
})
```
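The `ValueError` in the question can be traced to label alignment. The sketch below is only an illustration (the names `bad` and `good` are not from the original post): dividing a two-column DataFrame by a Series with `/` matches the Series index labels (0, 1, 2) against the DataFrame's column labels, so nothing lines up and the result has five all-NaN columns. Summing across each row first produces a Series that shares its index with `total`.

```python
import pandas as pd

df = pd.DataFrame({
    'val1': [1, 5, 2],
    'val2': [2, 6, 3],
    'val3': [3, 4, 6],
    'val4': [4, 8, 5],
    'total': [50, 65, 74],
})

# DataFrame / Series aligns the Series index (0, 1, 2) with the
# DataFrame columns ('val1', 'val2'); no label matches, so every cell is NaN.
bad = df[['val1', 'val2']] / df['total']
print(bad.shape)               # (3, 5)
print(bad.isna().all().all())  # True

# Summing across columns first yields a Series indexed like df['total'],
# so the division is element-wise per row.
good = df[['val1', 'val2']].sum(axis=1) / df['total']
print(good.round(6).tolist())  # [0.06, 0.169231, 0.067568]
```

Five columns cannot be assigned to the single column `df['result2']`, which is consistent with the "passed 5, placement implies 1" message.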


# Answer 1
**Score**: 1

Here's one way to do it:

```python
def vectorize(df, value_cols, total_col):
    # for multiple columns: sum across each row, then divide
    if isinstance(value_cols, list):
        return df[value_cols].sum(axis=1) / df[total_col]
    # for a single column
    return df[value_cols] / df[total_col]
```
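A possible variation, offered as a sketch rather than as part of the answer: wrap a single column name in a list up front so that both cases share one code path. The function name `ratio_of_sum` is hypothetical, and the sketch assumes column labels are strings.

```python
import pandas as pd

def ratio_of_sum(df, value_cols, total_col):
    """Sum one or more columns row-wise, then divide by total_col."""
    # A bare string means a single column; wrap it so df[...] returns a DataFrame.
    if isinstance(value_cols, str):
        value_cols = [value_cols]
    return df[list(value_cols)].sum(axis=1) / df[total_col]

df = pd.DataFrame({
    'val1': [1, 5, 2],
    'val2': [2, 6, 3],
    'val3': [3, 4, 6],
    'val4': [4, 8, 5],
    'total': [50, 65, 74],
})

df['result'] = ratio_of_sum(df, 'val1', 'total')
df['result2'] = ratio_of_sum(df, ['val1', 'val2'], 'total')
print(df)
```

Because the input is passed through `list(...)`, a tuple of column names is handled the same way as a list.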

# Answer 2
**Score**: 1

Just tweak your function:

    def vectorize(df, value_cols, total_col):
        if isinstance(value_cols, list):
            return df[value_cols].apply(sum, axis=1).div(df[total_col])
        else:
            return df[value_cols].div(df[total_col])

Output:

       val1  val2  val3  val4  total    result   result2
    0     1     2     3     4     50  0.020000  0.060000
    1     5     6     4     8     65  0.076923  0.169231
    2     2     3     6     5     74  0.027027  0.067568
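As a quick check (a sketch, with illustrative variable names), `apply(sum, axis=1)` and the built-in reduction `sum(axis=1)` give the same values on the sample data. `apply` with `axis=1` generally invokes the function once per row, so on a frame with many rows the `sum(axis=1)` form is usually the safer choice for speed; no timings are claimed here.

```python
import pandas as pd

df = pd.DataFrame({
    'val1': [1, 5, 2],
    'val2': [2, 6, 3],
    'val3': [3, 4, 6],
    'val4': [4, 8, 5],
    'total': [50, 65, 74],
})

cols = ['val1', 'val2']

# Row-wise sum via apply, as in the answer above.
via_apply = df[cols].apply(sum, axis=1).div(df['total'])

# Row-wise sum via the DataFrame reduction.
via_sum = df[cols].sum(axis=1).div(df['total'])

print(via_apply.equals(via_sum))  # True
```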

# Answer 3
**Score**: 0

Using `.sum()` was the right idea, but you need to specify the axis along which to add. By default, `sum()` uses `axis=0`, which adds down the rows and returns one total per column. You want `axis=1`, which adds across the columns and returns one total per row. This is what you need:

    return df[value_cols].sum(axis=1) / df[total_col]
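To make the axis difference concrete, here is a small sketch on the sample data from the question (the variable names are illustrative). The default reduction returns a Series indexed by column name, which shares no labels with `total`, while `axis=1` returns one value per row.

```python
import pandas as pd

df = pd.DataFrame({
    'val1': [1, 5, 2],
    'val2': [2, 6, 3],
    'val3': [3, 4, 6],
    'val4': [4, 8, 5],
    'total': [50, 65, 74],
})

cols = ['val1', 'val2']

# Default axis=0: one total per column, indexed by column name.
per_column = df[cols].sum()
print(per_column.to_dict())  # {'val1': 8, 'val2': 11}

# axis=1: one total per row, indexed like the DataFrame rows.
per_row = df[cols].sum(axis=1)
print(per_row.tolist())      # [3, 11, 5]

# Dividing the per-column totals by 'total' aligns 'val1'/'val2' with 0/1/2,
# so no labels match and every entry is NaN.
print((per_column / df['total']).isna().all())  # True
```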

# Answer 4
**Score**: 0

Since you need to perform many divisions, possibly summing columns first, I would slightly modify the function. Pass it a `DataFrame`, your denominator column, and a list of lists in which each sublist names the columns to add together.

```python
import pandas as pd

def sum_then_divide(df, total_col, numer_col_list):
    """
    df : pd.DataFrame
    total_col : str, denominator column
    numer_col_list : list of lists
        Sum all columns in each sublist before dividing
    """
    # One summed column per sublist, named e.g. 'val1+val2'
    u = pd.concat([df[cols].sum(axis=1).rename('+'.join(cols)) for cols in numer_col_list], axis=1)
    # Divide every summed column by the denominator, row by row
    return u.divide(df[total_col], axis=0)

sum_then_divide(df, 'total', [['val1'], ['val1', 'val2'], ['val1', 'val3', 'val4']])
#        val1  val1+val2  val1+val3+val4
# 0  0.020000   0.060000        0.160000
# 1  0.076923   0.169231        0.261538
# 2  0.027027   0.067568        0.175676
```
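If the ratios should end up as columns of the original frame, one option is to join the result back. This is a sketch that repeats `sum_then_divide` so it runs on its own; the `_ratio` suffix and the name `out` are assumptions chosen for illustration. A suffix is needed because a result column named `val1` would otherwise clash with the existing `val1` column in `join`.

```python
import pandas as pd

def sum_then_divide(df, total_col, numer_col_list):
    # One summed column per sublist, then divide each by the denominator.
    u = pd.concat(
        [df[cols].sum(axis=1).rename('+'.join(cols)) for cols in numer_col_list],
        axis=1,
    )
    return u.divide(df[total_col], axis=0)

df = pd.DataFrame({
    'val1': [1, 5, 2],
    'val2': [2, 6, 3],
    'val3': [3, 4, 6],
    'val4': [4, 8, 5],
    'total': [50, 65, 74],
})

ratios = sum_then_divide(df, 'total', [['val1'], ['val1', 'val2']])

# Rename the ratio columns before joining to avoid overlapping column names.
out = df.join(ratios.add_suffix('_ratio'))
print(out.columns.tolist())
# ['val1', 'val2', 'val3', 'val4', 'total', 'val1_ratio', 'val1+val2_ratio']
```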

huangapple
  • Published on January 3, 2020 at 23:31:24
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59581226.html