Find the sum of values in one column over the rows where another column is NaN in pandas
Question
I have a dataframe with columns A and B. Column A is not continuous: some of its rows are NaN, while B has a value in every row. I would like to create a third column C where, for each run of NaN rows in A, the first valid row after the run holds the sum of B over that run plus the value of B in that valid row.
In all other rows, C should be NaN where A is NaN, and equal to B where a valid number in A follows another valid number.
Example:
data = {
'A': [1, 1, None, None, 2, 5, None, None, 3, 4, 3, None, 5],
'B': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130]
}
Everything works except for the rows that need the sum of B over the NaN run plus the next valid value of B.
This is the code I have so far, but it has become a mess:
mask = df['A'].isnull()
# sum B per group, where a new group starts at each NaN in A
result = df.groupby(mask.cumsum())['B'].sum().reset_index()
df_result = pd.DataFrame({'C': result['B']})
df_result.loc[1:, 'C'] -= result.loc[0, 'B']
# rows with a valid A take the value of B
df.loc[~mask, 'C'] = df.loc[~mask, 'B']
# valid rows that directly follow a NaN row
valid_rows_after_nan = df['A'].notnull() & mask.shift(1).fillna(False)
df.loc[valid_rows_after_nan, 'C'] = df_result
print(df)
I would like the output to look like this:
data = {
'A': [1, 1, None, None, 2, 5, None, None, 3, 4, 3, None, 5],
'B': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130],
'C': [10, 20, None, None, 120, 60, None, None, 240, 100, 110, None, 5]
}
Answer 1
Score: 4
A simple version using groupby.transform:
# identify the non-NA and reverse
m = df.loc[::-1, 'A'].notna()
# group the preceding NA, sum, mask where NA
df['C'] = df.groupby(m.cumsum())['B'].transform('sum').where(m)
Output:
      A    B      C
0   1.0   10   10.0
1   1.0   20   20.0
2   NaN   30    NaN
3   NaN   40    NaN
4   2.0   50  120.0
5   5.0   60   60.0
6   NaN   70    NaN
7   NaN   80    NaN
8   3.0   90  240.0
9   4.0  100  100.0
10  3.0  110  110.0
11  NaN  120    NaN
12  5.0  130  250.0
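To make the grouping easier to follow, here is a self-contained sketch of the same approach that also keeps the intermediate group key. The `key` column is only an illustrative helper and is not part of the snippet above; the data is the example from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, None, None, 2, 5, None, None, 3, 4, 3, None, 5],
    'B': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130],
})

# True where A holds a valid number, evaluated from the last row upward
m = df.loc[::-1, 'A'].notna()

# Counting valid rows from the bottom gives every valid row a new key;
# the NaN rows directly above a valid row inherit that row's key.
key = m.cumsum()

# Assignment and groupby both align on the index, so the reversed order
# of `key` and `m` does not matter from here on.
df['key'] = key  # illustrative helper column, kept only for inspection
df['C'] = df.groupby(key)['B'].transform('sum').where(m)

print(df)
```

Because the cumulative count runs bottom-up, rows 2, 3 and 4 share one key, so row 4 gets 30 + 40 + 50 = 120, and `where(m)` blanks out the NaN rows. Under the same rule the last row gets 120 + 130 = 250.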