Iterating over a group of columns in a DataFrame from custom lists - pandas

Question

I have a DataFrame df like this:

TxnId     TxnDate           TxnCount
  100     2023-02-01      2
  500     2023-02-01      1
  400     2023-02-01      4
  100     2023-02-02      3
  500     2023-02-02      5
  100     2023-02-03      3
  500     2023-02-03      5
  400     2023-02-03      2

I have the following custom lists:

datelist = [datetime.date(2023, 2, 3), datetime.date(2023, 2, 2)]
txnlist = [400,500]

I want to iterate the df as per below logic:

for every txn in txnlist:
     sum = 0
     for every date in datelist:
           sum += df[txn][date].TxnCount

I would also be interested to understand how to find average of TxnCount for filtered TxnIds.

After the sum step, based on the above input and filters:

 TxnId         TxnCount
  400          2
  500          10 

Average corresponding to TxnId 400 = (2+0)/2 = 1

Average corresponding to TxnId 500 = (5+5)/2 = 5

If the average is greater than 3, add the row to breachList:

breachList = [[500, 10]]

How can I do this in pandas?

Answer 1

Score: 2

First filter the DataFrame by both lists, using boolean indexing with Series.isin:

df1 = df[df['TxnId'].isin(txnlist) & pd.to_datetime(df['TxnDate']).dt.date.isin(datelist)]
print (df1)
   TxnId     TxnDate  TxnCount
4    500  2023-02-02         5
6    500  2023-02-03         5
7    400  2023-02-03         2

Then sum the TxnCount column per group:

out = df1.groupby('TxnId', as_index=False)['TxnCount'].sum()
print (out)
   TxnId  TxnCount
0    400         2
1    500        10
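
For readers who want to run the filter-and-sum step end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the question's table (TxnDate is assumed to be stored as strings), and the names mask and sums are only illustrative.

```python
import datetime

import pandas as pd

# Sample data copied from the question; TxnDate is assumed to be stored as strings.
df = pd.DataFrame({
    "TxnId": [100, 500, 400, 100, 500, 100, 500, 400],
    "TxnDate": ["2023-02-01", "2023-02-01", "2023-02-01", "2023-02-02",
                "2023-02-02", "2023-02-03", "2023-02-03", "2023-02-03"],
    "TxnCount": [2, 1, 4, 3, 5, 3, 5, 2],
})
datelist = [datetime.date(2023, 2, 3), datetime.date(2023, 2, 2)]
txnlist = [400, 500]

# Keep only rows whose TxnId and calendar date both appear in the lists.
mask = df["TxnId"].isin(txnlist) & pd.to_datetime(df["TxnDate"]).dt.date.isin(datelist)

# One row per TxnId with the summed TxnCount.
sums = df[mask].groupby("TxnId", as_index=False)["TxnCount"].sum()
print(sums)
```

Both conditions are evaluated on whole columns at once, so no explicit Python loop over txnlist or datelist is needed.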

EDIT: To filter TxnId by average (here, greater than 4), use:

df1 = df[df['TxnId'].isin(txnlist) & pd.to_datetime(df['TxnDate']).dt.date.isin(datelist)]
print (df1)
   TxnId     TxnDate  TxnCount
4    500  2023-02-02         5
6    500  2023-02-03         5
7    400  2023-02-03         2

# compute the average per TxnId
out = df1.groupby('TxnId')['TxnCount'].mean()
print (out)
TxnId
400    2.0
500    5.0
Name: TxnCount, dtype: float64

# get the TxnIds whose average is greater than 4
TxnId = out[out > 4].index
print (TxnId)
Int64Index([500], dtype='int64', name='TxnId')

Then filter the rows of df or df1:

df2 = df[df['TxnId'].isin(TxnId)]
print(df2)
   TxnId     TxnDate  TxnCount
1    500  2023-02-01         1
4    500  2023-02-02         5
6    500  2023-02-03         5

df3 = df1[df1['TxnId'].isin(TxnId)]
print(df3)
   TxnId     TxnDate  TxnCount
4    500  2023-02-02         5
6    500  2023-02-03         5
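
Note that a groupby mean averages only the rows that exist, so TxnId 400 gets 2 rather than the (2+0)/2 = 1 the question expects. A minimal sketch of the difference, assuming the average should be taken over every date in datelist (group_mean and list_mean are illustrative names):

```python
import datetime

import pandas as pd

# Sample data copied from the question; TxnDate is assumed to be stored as strings.
df = pd.DataFrame({
    "TxnId": [100, 500, 400, 100, 500, 100, 500, 400],
    "TxnDate": ["2023-02-01", "2023-02-01", "2023-02-01", "2023-02-02",
                "2023-02-02", "2023-02-03", "2023-02-03", "2023-02-03"],
    "TxnCount": [2, 1, 4, 3, 5, 3, 5, 2],
})
datelist = [datetime.date(2023, 2, 3), datetime.date(2023, 2, 2)]
txnlist = [400, 500]

mask = df["TxnId"].isin(txnlist) & pd.to_datetime(df["TxnDate"]).dt.date.isin(datelist)
filtered = df[mask]

# Mean over the rows that exist: TxnId 400 has a single matching row.
group_mean = filtered.groupby("TxnId")["TxnCount"].mean()

# Mean over every requested date: a missing date counts as 0.
sums = filtered.groupby("TxnId")["TxnCount"].sum().reindex(index=txnlist, fill_value=0)
list_mean = sums / len(datelist)

print(group_mean)
print(list_mean)
```

Dividing the per-TxnId sum by len(datelist) reproduces the question's arithmetic; which definition is appropriate depends on whether a missing date should count as zero.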

EDIT1: For the expected output:

First filter by both lists (to avoid processing all rows):

df1 = df[df['TxnId'].isin(txnlist) & pd.to_datetime(df['TxnDate']).dt.date.isin(datelist)]
print (df1)
   TxnId     TxnDate  TxnCount
4    500  2023-02-02         5
6    500  2023-02-03         5
7    400  2023-02-03         2

Then pivot to get every TxnDate/TxnId combination, filling missing pairs with 0:

out = df1.pivot_table(index='TxnId', 
                      columns='TxnDate', 
                      values='TxnCount', 
                      aggfunc='sum', 
                      fill_value=0)
print (out)
TxnDate  2023-02-02  2023-02-03
TxnId                          
400               0           2
500               5           5

Finally, keep the per-row sums whose per-row mean is greater than 3 and convert them to a list:

breachList = out.sum(axis=1)[out.mean(axis=1).gt(3)].reset_index().to_numpy().tolist()
print (breachList)
[[500, 10]]
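
One caveat: the pivot only has columns for dates that survive the filter. If a date in datelist had no rows at all for the selected TxnIds, its column would be missing and the row mean would be taken over fewer dates. A hedged sketch that reindexes against both lists to guard against this; the names data, wanted_dates, and table are illustrative, and the cast to int64 is only there to keep the printed values as integers.

```python
import datetime

import pandas as pd

# Sample data copied from the question; TxnDate is assumed to be stored as strings.
df = pd.DataFrame({
    "TxnId": [100, 500, 400, 100, 500, 100, 500, 400],
    "TxnDate": ["2023-02-01", "2023-02-01", "2023-02-01", "2023-02-02",
                "2023-02-02", "2023-02-03", "2023-02-03", "2023-02-03"],
    "TxnCount": [2, 1, 4, 3, 5, 3, 5, 2],
})
datelist = [datetime.date(2023, 2, 3), datetime.date(2023, 2, 2)]
txnlist = [400, 500]

# Work on a copy with real datetimes so rows and lists share one type.
data = df.copy()
data["TxnDate"] = pd.to_datetime(data["TxnDate"])
wanted_dates = pd.to_datetime(datelist)

subset = data[data["TxnId"].isin(txnlist) & data["TxnDate"].isin(wanted_dates)]

# Pivot, then force one row per requested TxnId and one column per requested date.
table = (subset.pivot_table(index="TxnId", columns="TxnDate",
                            values="TxnCount", aggfunc="sum", fill_value=0)
               .reindex(index=txnlist, columns=wanted_dates, fill_value=0)
               .astype("int64"))

totals = table.sum(axis=1)
means = table.mean(axis=1)

# Keep [TxnId, total] pairs whose mean over the requested dates exceeds 3.
breach_list = totals[means.gt(3)].reset_index().to_numpy().tolist()
print(breach_list)
```

With the question's data the result matches the expected breachList; the reindex only changes the outcome when a requested TxnId or date is absent from the filtered rows.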

Answer 2

Score: 0

The fact that you are using a nested loop is reminiscent of a 2D pivot_table (or crosstab):

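df['TxnDate'] = pd.to_datetime(df['TxnDate'])

out = (df.pivot_table(index='TxnId', columns='TxnDate',
                      values='TxnCount', aggfunc='sum',
                      fill_value=0)
         # keyword arguments; the dates are converted so they match the datetime columns
         .reindex(index=txnlist, columns=pd.to_datetime(datelist))
       )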

Output:

TxnDate  2023-02-03  2023-02-02
TxnId                          
400               2           0
500               5           5

If you want to aggregate further by Id (or by date, with axis=0):

out.sum(axis=1)

TxnId
400     2
500    10
dtype: int64
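
As a cross-check, the question's nested-loop logic can also be written out literally. This is a minimal sketch for verifying the vectorized results on small data, not a recommended approach for large frames; loop_totals and dates are illustrative names, and the average is assumed to be taken over len(datelist).

```python
import datetime

import pandas as pd

# Sample data copied from the question; TxnDate is assumed to be stored as strings.
df = pd.DataFrame({
    "TxnId": [100, 500, 400, 100, 500, 100, 500, 400],
    "TxnDate": ["2023-02-01", "2023-02-01", "2023-02-01", "2023-02-02",
                "2023-02-02", "2023-02-03", "2023-02-03", "2023-02-03"],
    "TxnCount": [2, 1, 4, 3, 5, 3, 5, 2],
})
datelist = [datetime.date(2023, 2, 3), datetime.date(2023, 2, 2)]
txnlist = [400, 500]

# Calendar dates as Python date objects, comparable with the entries of datelist.
dates = pd.to_datetime(df["TxnDate"]).dt.date

loop_totals = {}
for txn in txnlist:
    total = 0
    for day in datelist:
        # Sum TxnCount for this (TxnId, date) pair; an empty selection sums to 0.
        total += int(df.loc[(df["TxnId"] == txn) & (dates == day), "TxnCount"].sum())
    loop_totals[txn] = total

# A TxnId breaches when its average over the requested dates exceeds 3.
breach_list = [[txn, total] for txn, total in loop_totals.items()
               if total / len(datelist) > 3]
print(loop_totals)
print(breach_list)
```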

huangapple
  • Posted on February 24, 2023 at 15:41:47
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75553784.html