如何有效地使用pandas根据唯一标识符和一个条件来更改值?

huangapple go评论116阅读模式
英文:

How to effectively use pandas to change values based unique identifier and one condition?

问题

  1. import pandas as pd
  2. from datetime import datetime
  3. data = {
  4. 'identifier': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 13, 7: 14, 8: 15},
  5. 'loan_identifier': {0: 'a111', 1: 'a111', 2: 'a111', 3: 'a111', 4: 'a111', 5: 'b222', 6: 'b222', 7: 'b222', 8: 'b222'},
  6. 'cashflow_date': {0: '15/07/2023', 1: '01/07/2023', 2: '15/07/2023', 3: '20/07/2023', 4: '11/06/2023', 5: '10/07/2023', 6: '10/07/2023', 7: '10/07/2023', 8: '13/06/2023'},
  7. 'cashflow_type': {0: 'funding', 1: 'interest_repayment', 2: 'interest_repayment', 3: 'interest_repayment', 4: 'principal_repayment', 5: 'funding', 6: 'interest_repayment', 7: 'interest_repayment', 8: 'principal_repayment'},
  8. 'amount': {0: -195.71, 1: 3.11, 2: 0.04, 3: 0.04, 4: 195.33, 5: -3915.45, 6: 0.73, 7: 0.73, 8: 3906.5}
  9. }
  10. df = pd.DataFrame(data)
  11. # Convert cashflow_date to datetime
  12. df['cashflow_date'] = pd.to_datetime(df['cashflow_date'], format='%d/%m/%Y')
  13. # Create a dictionary to store the earliest funding dates for each loan_identifier
  14. earliest_funding_dates = df[df['cashflow_type'] == 'funding'].groupby('loan_identifier')['cashflow_date'].min().to_dict()
  15. # Function to update cashflow_date based on the earliest funding date
  16. def update_cashflow_date(row):
  17. if row['cashflow_type'] != 'funding':
  18. earliest_date = earliest_funding_dates.get(row['loan_identifier'])
  19. if earliest_date and row['cashflow_date'] < earliest_date:
  20. return earliest_date
  21. return row['cashflow_date']
  22. # Apply the update_cashflow_date function to the DataFrame
  23. df['cashflow_date'] = df.apply(update_cashflow_date, axis=1)
  24. # Convert cashflow_date back to the original format
  25. df['cashflow_date'] = df['cashflow_date'].dt.strftime('%d/%m/%Y')
  26. # Print the updated DataFrame
  27. print(df)

这是你要的代码的翻译部分。

英文:

I have the following data frame:

  1. identifier loan_identifier cashflow_date cashflow_type amount
  2. 0 1 a111 15/07/2023 funding -195.71
  3. 1 2 a111 01/07/2023 interest_repayment 3.11
  4. 2 3 a111 15/07/2023 interest_repayment 0.04
  5. 3 4 a111 20/07/2023 interest_repayment 0.04
  6. 4 5 a111 11/06/2023 principal_repayment 195.33
  7. 5 6 b222 10/07/2023 funding -3915.45
  8. 6 13 b222 10/07/2023 interest_repayment 0.73
  9. 7 14 b222 10/07/2023 interest_repayment 0.73
  10. 8 15 b222 13/06/2023 principal_repayment 3906.50

cashflow_date of funding cashflow_type should be the earliest for each loan_identifier. Each row with cashflow_type != funding whose date is earlier than funding row of that loan_identifer should be set to funding cashflow_date.

So the above dataframe should look like this:

  1. identifier loan_identifier cashflow_date cashflow_type amount
  2. 0 1 a111 15/07/2023 funding -195.71
  3. 1 2 a111 15/07/2023 interest_repayment 3.11
  4. 2 3 a111 15/07/2023 interest_repayment 0.04
  5. 3 4 a111 20/07/2023 interest_repayment 0.04
  6. 4 5 a111 15/07/2023 principal_repayment 195.33
  7. 5 6 b222 10/07/2023 funding -3915.45
  8. 6 13 b222 10/07/2023 interest_repayment 0.73
  9. 7 14 b222 10/07/2023 interest_repayment 0.73
  10. 8 15 b222 10/07/2023 principal_repayment 3906.50

Here's the data as a dictionary:

  1. data = {
  2. &#39;identifier&#39;: {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 13, 7: 14, 8: 15},
  3. &#39;loan_identifier&#39;: {0: &#39;a111&#39;, 1: &#39;a111&#39;, 2: &#39;a111&#39;, 3: &#39;a111&#39;, 4: &#39;a111&#39;, 5: &#39;b222&#39;, 6: &#39;b222&#39;, 7: &#39;b222&#39;, 8: &#39;b222&#39;},
  4. &#39;cashflow_date&#39;: {0: &#39;15/07/2023&#39;, 1: &#39;01/07/2023&#39;, 2: &#39;15/07/2023&#39;, 3: &#39;20/07/2023&#39;, 4: &#39;11/06/2023&#39;, 5: &#39;10/07/2023&#39;, 6: &#39;10/07/2023&#39;, 7: &#39;10/07/2023&#39;, 8: &#39;13/06/2023&#39;},
  5. &#39;cashflow_type&#39;: {0: &#39;funding&#39;, 1: &#39;interest_repayment&#39;, 2: &#39;interest_repayment&#39;, 3: &#39;interest_repayment&#39;, 4: &#39;principal_repayment&#39;, 5: &#39;funding&#39;, 6: &#39;interest_repayment&#39;, 7: &#39;interest_repayment&#39;, 8: &#39;principal_repayment&#39;},
  6. &#39;amount&#39;: {0: -195.71, 1: 3.11, 2: 0.04, 3: 0.04, 4: 195.33, 5: -3915.45, 6: 0.73, 7: 0.73, 8: 3906.5}
  7. }

How can I do this transformation using full advantage of pandas?

答案1

得分: 4

你可以使用布尔索引:

  1. # 确保日期是datetime格式
  2. df['cashflow_date'] = pd.to_datetime(df['cashflow_date'], dayfirst=True)
  3. # 获取“funding”类型的行
  4. # 注意:如果每个ID有多个“funding”类型的行,需要删除重复项
  5. m1 = df['cashflow_type'].eq('funding')
  6. # 获取每个ID的参考日期
  7. ref = df['loan_identifier'].map(df.loc[m1].set_index('loan_identifier')
  8. ['cashflow_date'])
  9. # 如果日期低于参考日期,则进行更改
  10. m2 = df['cashflow_date'].lt(ref)
  11. df.loc[m2, 'cashflow_date'] = ref[m2]

另一种方法是使用自定义的groupby.applyclip方法:

  1. df['cashflow_date'] = pd.to_datetime(df['cashflow_date'], dayfirst=True)
  2. df['cashflow_date'] = (df.groupby('loan_identifier', group_keys=False)
  3. .apply(lambda g: g['cashflow_date']
  4. .clip(lower=g.loc[g['cashflow_type'].eq('funding'),
  5. 'cashflow_date'].min())
  6. )
  7. )

输出:

  1. identifier loan_identifier cashflow_date cashflow_type amount
  2. 0 1 a111 2023-07-15 funding -195.71
  3. 1 2 a111 2023-07-15 interest_repayment 3.11
  4. 2 3 a111 2023-07-15 interest_repayment 0.04
  5. 3 4 a111 2023-07-20 interest_repayment 0.04
  6. 4 5 a111 2023-07-15 principal_repayment 195.33
  7. 5 6 b222 2023-07-10 funding -3915.45
  8. 6 13 b222 2023-07-10 interest_repayment 0.73
  9. 7 14 b222 2023-07-10 interest_repayment 0.73
  10. 8 15 b222 2023-07-10 principal_repayment 3906.50
英文:

You can use boolean indexing:

  1. # ensure datetime
  2. df[&#39;cashflow_date&#39;] = pd.to_datetime(df[&#39;cashflow_date&#39;], dayfirst=True)
  3. # get rows with &quot;funding&quot;
  4. # NB. if more than one per ID, you need to drop duplicates
  5. m1 = df[&#39;cashflow_type&#39;].eq(&#39;funding&#39;)
  6. # get reference date per ID
  7. ref = df[&#39;loan_identifier&#39;].map(df.loc[m1].set_index(&#39;loan_identifier&#39;)
  8. [&#39;cashflow_date&#39;])
  9. # change if lower than reference
  10. m2 = df[&#39;cashflow_date&#39;].lt(ref)
  11. df.loc[m2, &#39;cashflow_date&#39;] = ref[m2]

Another approach with a custom groupby.apply and clip:

  1. df[&#39;cashflow_date&#39;] = pd.to_datetime(df[&#39;cashflow_date&#39;], dayfirst=True)
  2. df[&#39;cashflow_date&#39;] = (df.groupby(&#39;loan_identifier&#39;, group_keys=False)
  3. .apply(lambda g: g[&#39;cashflow_date&#39;]
  4. .clip(lower=g.loc[g[&#39;cashflow_type&#39;].eq(&#39;funding&#39;),
  5. &#39;cashflow_date&#39;].min())
  6. )
  7. )

Output:

  1. identifier loan_identifier cashflow_date cashflow_type amount
  2. 0 1 a111 2023-07-15 funding -195.71
  3. 1 2 a111 2023-07-15 interest_repayment 3.11
  4. 2 3 a111 2023-07-15 interest_repayment 0.04
  5. 3 4 a111 2023-07-20 interest_repayment 0.04
  6. 4 5 a111 2023-07-15 principal_repayment 195.33
  7. 5 6 b222 2023-07-10 funding -3915.45
  8. 6 13 b222 2023-07-10 interest_repayment 0.73
  9. 7 14 b222 2023-07-10 interest_repayment 0.73
  10. 8 15 b222 2023-07-10 principal_repayment 3906.50

答案2

得分: 2

  1. # 转换为日期时间格式
  2. df['cashflow_date'] = pd.to_datetime(df['cashflow_date'], format='%d/%m/%Y')
  3. # 按贷款标识符分组以识别最早的现金流日期,以便贷款标识符与早期日期分组
  4. earliest_dates = df[df['cashflow_type'] == 'funding'].groupby('loan_identifier')['cashflow_date'].min()
  5. # a111 2023-07-15
  6. # b222 2023-07-10
  7. # 然后使用.loc索引器更新'cashflow_date'
  8. for loan_id, earliest_date in earliest_dates.items():
  9. df.loc[(df['loan_identifier'] == loan_id) & (df['cashflow_type'] != 'funding') & (df['cashflow_date'] < earliest_date), 'cashflow_date'] = earliest_date
  10. print(df)

输出:

  1. identifier loan_identifier cashflow_date cashflow_type amount
  2. 0 1 a111 2023-07-15 funding -195.71
  3. 1 2 a111 2023-07-15 interest_repayment 3.11
  4. 2 3 a111 2023-07-15 interest_repayment 0.04
  5. 3 4 a111 2023-07-20 interest_repayment 0.04
  6. 4 5 a111 2023-07-15 principal_repayment 195.33
  7. 5 6 b222 2023-07-10 funding -3915.45
  8. 6 13 b222 2023-07-10 interest_repayment 0.73
  9. 7 14 b222 2023-07-10 interest_repayment 0.73
  10. 8 15 b222 2023-07-10 principal_repayment 3906.50

请注意,上面的翻译是代码的一部分,用于解释代码的功能。

英文:
  1. # converting to datetime format
  2. df[&#39;cashflow_date&#39;] = pd.to_datetime(df[&#39;cashflow_date&#39;], format=&#39;%d/%m/%Y&#39;)
  3. #groupby loan identifier to identify earliest cashflow date so that loan_identifier gets grouped with early date
  4. earliest_dates = df[df[&#39;cashflow_type&#39;] == &#39;funding&#39;].groupby(&#39;loan_identifier&#39;)[&#39;cashflow_date&#39;].min()
  5. # a111 2023-07-15
  6. # b222 2023-07-10
  7. # then updating &#39;cashflow_date&#39; using .loc indexer
  8. for loan_id, earliest_date in earliest_dates.items():
  9. df.loc[(df[&#39;loan_identifier&#39;] == loan_id) &amp; (df[&#39;cashflow_type&#39;] != &#39;funding&#39;) &amp; (df[&#39;cashflow_date&#39;] &lt; earliest_date), &#39;cashflow_date&#39;] = earliest_date
  10. print(df)

Output:

  1. identifier loan_identifier cashflow_date cashflow_type amount
  2. 0 1 a111 2023-07-15 funding -195.71
  3. 1 2 a111 2023-07-15 interest_repayment 3.11
  4. 2 3 a111 2023-07-15 interest_repayment 0.04
  5. 3 4 a111 2023-07-20 interest_repayment 0.04
  6. 4 5 a111 2023-07-15 principal_repayment 195.33
  7. 5 6 b222 2023-07-10 funding -3915.45
  8. 6 13 b222 2023-07-10 interest_repayment 0.73
  9. 7 14 b222 2023-07-10 interest_repayment 0.73
  10. 8 15 b222 2023-07-10 principal_repayment 3906.50

huangapple
  • 本文由 发表于 2023年8月9日 17:01:54
  • 转载请务必保留本文链接:https://go.coder-hub.com/76866138-2.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定