Iterating on group of columns in a dataframe from custom list - pandas
Question
I have a DataFrame df like this:
TxnId TxnDate TxnCount
100 2023-02-01 2
500 2023-02-01 1
400 2023-02-01 4
100 2023-02-02 3
500 2023-02-02 5
100 2023-02-03 3
500 2023-02-03 5
400 2023-02-03 2
I have the following custom lists:
datelist = [datetime.date(2023, 2, 3), datetime.date(2023, 2, 2)]
txnlist = [400, 500]
I want to iterate over df according to the logic below (pseudocode):
for every txn in txnlist:
    sum = 0
    for every date in datelist:
        sum += df[txn][date].TxnCount
I would also like to understand how to find the average of TxnCount for the filtered TxnIds.
After the sum step, based on the above input and filters:
TxnId  TxnCount
400    2
500    10
Average for TxnId 400 = (2+0)/2 = 1
Average for TxnId 500 = (5+5)/2 = 5
If the average is greater than 3, add the row to breachList:
breachList = [[500, 10]]
How can I do this in pandas?
Answer 1
Score: 2
Filter the DataFrame by both lists first, using boolean indexing with Series.isin:
df1 = df[df['TxnId'].isin(txnlist) & pd.to_datetime(df['TxnDate']).dt.date.isin(datelist)]
print (df1)
TxnId TxnDate TxnCount
4 500 2023-02-02 5
6 500 2023-02-03 5
7 400 2023-02-03 2
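To make the filtering step reproducible, here is a minimal, self-contained sketch. It rebuilds the sample df from the question (assuming TxnDate is stored as strings, which the question does not state) and splits the mask into two named parts; in_txns and in_dates are illustrative names, not part of the original answer.

```python
import datetime
import pandas as pd

# Sample data copied from the question; TxnDate as strings is an assumption.
df = pd.DataFrame({
    "TxnId": [100, 500, 400, 100, 500, 100, 500, 400],
    "TxnDate": ["2023-02-01", "2023-02-01", "2023-02-01", "2023-02-02",
                "2023-02-02", "2023-02-03", "2023-02-03", "2023-02-03"],
    "TxnCount": [2, 1, 4, 3, 5, 3, 5, 2],
})
datelist = [datetime.date(2023, 2, 3), datetime.date(2023, 2, 2)]
txnlist = [400, 500]

# One boolean mask per list, combined element-wise with &.
in_txns = df["TxnId"].isin(txnlist)
in_dates = pd.to_datetime(df["TxnDate"]).dt.date.isin(datelist)
df1 = df[in_txns & in_dates]
print(df1)
```

Converting with .dt.date keeps the comparison between datetime.date objects on both sides, which matches how datelist is defined in the question.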
Then sum the TxnCount column per group:
out = df1.groupby('TxnId', as_index=False)['TxnCount'].sum()
print (out)
TxnId TxnCount
0 400 2
1 500 10
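For comparison, the nested loop from the question can be written out literally and checked against the groupby result. This is only a sketch to show that both give the same totals on the sample data; the names dates, totals and grouped are illustrative, and the vectorized version is normally preferable.

```python
import datetime
import pandas as pd

# Sample data copied from the question; TxnDate as strings is an assumption.
df = pd.DataFrame({
    "TxnId": [100, 500, 400, 100, 500, 100, 500, 400],
    "TxnDate": ["2023-02-01", "2023-02-01", "2023-02-01", "2023-02-02",
                "2023-02-02", "2023-02-03", "2023-02-03", "2023-02-03"],
    "TxnCount": [2, 1, 4, 3, 5, 3, 5, 2],
})
datelist = [datetime.date(2023, 2, 3), datetime.date(2023, 2, 2)]
txnlist = [400, 500]

dates = pd.to_datetime(df["TxnDate"]).dt.date

# Explicit loop: one pass over df per (txn, date) pair.
totals = {}
for txn in txnlist:
    total = 0
    for date in datelist:
        mask = (df["TxnId"] == txn) & (dates == date)
        total += df.loc[mask, "TxnCount"].sum()  # empty selection sums to 0
    totals[txn] = int(total)

# Vectorized equivalent: filter once, then group.
df1 = df[df["TxnId"].isin(txnlist) & dates.isin(datelist)]
grouped = df1.groupby("TxnId")["TxnCount"].sum().to_dict()

print(totals)
print(grouped)
```

The loop scans the whole frame once per pair, while the filter plus groupby touches the data a fixed number of times regardless of how long the lists are.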
EDIT: If you need to filter TxnId by the average (here, greater than 4), use:
df1 = df[df['TxnId'].isin(txnlist) & pd.to_datetime(df['TxnDate']).dt.date.isin(datelist)]
print (df1)
TxnId TxnDate TxnCount
4 500 2023-02-02 5
6 500 2023-02-03 5
7 400 2023-02-03 2
# compute the average per TxnId
out = df1.groupby('TxnId')['TxnCount'].mean()
print (out)
TxnId
400    2.0
500    5.0
Name: TxnCount, dtype: float64
# get the TxnId values whose average is greater than 4
TxnId = out[out > 4].index
print (TxnId)
Int64Index([500], dtype='int64', name='TxnId')
Filter rows in df or df1:
df2 = df[df['TxnId'].isin(TxnId)]
print(df2)
TxnId TxnDate TxnCount
1 500 2023-02-01 1
4 500 2023-02-02 5
6 500 2023-02-03 5
df3 = df1[df1['TxnId'].isin(TxnId)]
print(df3)
TxnId TxnDate TxnCount
4 500 2023-02-02 5
6 500 2023-02-03 5
EDIT1: For the expected output, first filter by the lists (to avoid processing all rows):
df1 = df[df['TxnId'].isin(txnlist) & pd.to_datetime(df['TxnDate']).dt.date.isin(datelist)]
print (df1)
TxnId TxnDate TxnCount
4 500 2023-02-02 5
6 500 2023-02-03 5
7 400 2023-02-03 2
Pivot to get all TxnDate/TxnId combinations:
out = df1.pivot_table(index='TxnId',
                      columns='TxnDate',
                      values='TxnCount',
                      aggfunc='sum',
                      fill_value=0)
print (out)
TxnDate 2023-02-02 2023-02-03
TxnId
400 0 2
500 5 5
Finally, keep the row sums whose per-row mean is greater than 3 and convert them to a list of lists:
breachList = out.sum(axis=1)[out.mean(axis=1).gt(3)].reset_index().to_numpy().tolist()
print (breachList)
[[500, 10]]
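One caveat with the pivot: its columns only contain dates that actually occur in df1, so a date in datelist with no rows at all would not count toward the mean. A possible alternative, sketched below under the assumption that the average should always be taken over len(datelist), divides the grouped sums by the list length instead. The names sums, averages, threshold and breach_list are illustrative.

```python
import datetime
import pandas as pd

# Sample data copied from the question; TxnDate as strings is an assumption.
df = pd.DataFrame({
    "TxnId": [100, 500, 400, 100, 500, 100, 500, 400],
    "TxnDate": ["2023-02-01", "2023-02-01", "2023-02-01", "2023-02-02",
                "2023-02-02", "2023-02-03", "2023-02-03", "2023-02-03"],
    "TxnCount": [2, 1, 4, 3, 5, 3, 5, 2],
})
datelist = [datetime.date(2023, 2, 3), datetime.date(2023, 2, 2)]
txnlist = [400, 500]
threshold = 3  # breach limit taken from the question

mask = (df["TxnId"].isin(txnlist)
        & pd.to_datetime(df["TxnDate"]).dt.date.isin(datelist))

# Total per TxnId; ids from txnlist with no matching rows get a total of 0.
sums = (df[mask].groupby("TxnId")["TxnCount"].sum()
        .reindex(txnlist, fill_value=0))

# Average over every requested date, whether or not it has rows.
averages = sums / len(datelist)

breach_list = [[int(txn), int(total)]
               for txn, total in sums[averages > threshold].items()]
print(breach_list)  # [[500, 10]] for the sample data
```

On the sample data this matches the pivot-based result, because both dates in datelist are present in df1.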
Answer 2
Score: 0
The fact that you are using a nested loop is reminiscent of a 2D pivot_table (or crosstab):
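df['TxnDate'] = pd.to_datetime(df['TxnDate'])
out = (df.pivot_table(index='TxnId', columns='TxnDate',
                      values='TxnCount', aggfunc='sum',
                      fill_value=0)
         # pass index/columns by keyword; convert datelist to match the datetime64 columns
         .reindex(index=txnlist, columns=pd.to_datetime(datelist))
      )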
Output:
TxnDate 2023-02-03 2023-02-02
TxnId
400 2 0
500 5 5
And if you want to aggregate further per Id (use axis=0 to aggregate per date instead):
out.sum(axis=1)
TxnId
400 2
500 10
dtype: int64
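Since crosstab is mentioned as an alternative, here is a minimal sketch of what that variant might look like. It assumes TxnDate starts out as strings and uses pd.crosstab with the values and aggfunc arguments; the name ct is illustrative. Combinations with no rows come back as NaN, so they are filled with 0 after reindexing.

```python
import datetime
import pandas as pd

# Sample data copied from the question; TxnDate as strings is an assumption.
df = pd.DataFrame({
    "TxnId": [100, 500, 400, 100, 500, 100, 500, 400],
    "TxnDate": ["2023-02-01", "2023-02-01", "2023-02-01", "2023-02-02",
                "2023-02-02", "2023-02-03", "2023-02-03", "2023-02-03"],
    "TxnCount": [2, 1, 4, 3, 5, 3, 5, 2],
})
datelist = [datetime.date(2023, 2, 3), datetime.date(2023, 2, 2)]
txnlist = [400, 500]

ct = (pd.crosstab(index=df["TxnId"],
                  columns=pd.to_datetime(df["TxnDate"]),
                  values=df["TxnCount"],
                  aggfunc="sum")
        # keep only the requested ids and dates, in the order of the lists
        .reindex(index=txnlist, columns=pd.to_datetime(datelist))
        .fillna(0)
        .astype(int))

print(ct)
print(ct.sum(axis=1))
```

Unlike pivot_table, crosstab takes the row and column keys as separate Series rather than column names, which can be convenient when the keys are derived values such as the converted dates here.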