英文:
How to effectively use pandas to change values based unique identifier and one condition?
问题
import pandas as pd
from datetime import datetime
data = {
'identifier': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 13, 7: 14, 8: 15},
'loan_identifier': {0: 'a111', 1: 'a111', 2: 'a111', 3: 'a111', 4: 'a111', 5: 'b222', 6: 'b222', 7: 'b222', 8: 'b222'},
'cashflow_date': {0: '15/07/2023', 1: '01/07/2023', 2: '15/07/2023', 3: '20/07/2023', 4: '11/06/2023', 5: '10/07/2023', 6: '10/07/2023', 7: '10/07/2023', 8: '13/06/2023'},
'cashflow_type': {0: 'funding', 1: 'interest_repayment', 2: 'interest_repayment', 3: 'interest_repayment', 4: 'principal_repayment', 5: 'funding', 6: 'interest_repayment', 7: 'interest_repayment', 8: 'principal_repayment'},
'amount': {0: -195.71, 1: 3.11, 2: 0.04, 3: 0.04, 4: 195.33, 5: -3915.45, 6: 0.73, 7: 0.73, 8: 3906.5}
}
df = pd.DataFrame(data)
# Convert cashflow_date to datetime
df['cashflow_date'] = pd.to_datetime(df['cashflow_date'], format='%d/%m/%Y')
# Create a dictionary to store the earliest funding dates for each loan_identifier
earliest_funding_dates = df[df['cashflow_type'] == 'funding'].groupby('loan_identifier')['cashflow_date'].min().to_dict()
# Function to update cashflow_date based on the earliest funding date
def update_cashflow_date(row):
if row['cashflow_type'] != 'funding':
earliest_date = earliest_funding_dates.get(row['loan_identifier'])
if earliest_date and row['cashflow_date'] < earliest_date:
return earliest_date
return row['cashflow_date']
# Apply the update_cashflow_date function to the DataFrame
df['cashflow_date'] = df.apply(update_cashflow_date, axis=1)
# Convert cashflow_date back to the original format
df['cashflow_date'] = df['cashflow_date'].dt.strftime('%d/%m/%Y')
# Print the updated DataFrame
print(df)
这是你要的代码的翻译部分。
英文:
I have the following data frame:
identifier loan_identifier cashflow_date cashflow_type amount
0 1 a111 15/07/2023 funding -195.71
1 2 a111 01/07/2023 interest_repayment 3.11
2 3 a111 15/07/2023 interest_repayment 0.04
3 4 a111 20/07/2023 interest_repayment 0.04
4 5 a111 11/06/2023 principal_repayment 195.33
5 6 b222 10/07/2023 funding -3915.45
6 13 b222 10/07/2023 interest_repayment 0.73
7 14 b222 10/07/2023 interest_repayment 0.73
8 15 b222 13/06/2023 principal_repayment 3906.50
cashflow_date
of funding
cashflow_type should be the earliest for each loan_identifier
. Each row with cashflow_type != funding
whose date is earlier than funding
row of that loan_identifer
should be set to funding
cashflow_date.
So the above dataframe should look like this:
identifier loan_identifier cashflow_date cashflow_type amount
0 1 a111 15/07/2023 funding -195.71
1 2 a111 15/07/2023 interest_repayment 3.11
2 3 a111 15/07/2023 interest_repayment 0.04
3 4 a111 20/07/2023 interest_repayment 0.04
4 5 a111 15/07/2023 principal_repayment 195.33
5 6 b222 10/07/2023 funding -3915.45
6 13 b222 10/07/2023 interest_repayment 0.73
7 14 b222 10/07/2023 interest_repayment 0.73
8 15 b222 10/07/2023 principal_repayment 3906.50
Here's the data as a dictionary:
data = {
'identifier': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 13, 7: 14, 8: 15},
'loan_identifier': {0: 'a111', 1: 'a111', 2: 'a111', 3: 'a111', 4: 'a111', 5: 'b222', 6: 'b222', 7: 'b222', 8: 'b222'},
'cashflow_date': {0: '15/07/2023', 1: '01/07/2023', 2: '15/07/2023', 3: '20/07/2023', 4: '11/06/2023', 5: '10/07/2023', 6: '10/07/2023', 7: '10/07/2023', 8: '13/06/2023'},
'cashflow_type': {0: 'funding', 1: 'interest_repayment', 2: 'interest_repayment', 3: 'interest_repayment', 4: 'principal_repayment', 5: 'funding', 6: 'interest_repayment', 7: 'interest_repayment', 8: 'principal_repayment'},
'amount': {0: -195.71, 1: 3.11, 2: 0.04, 3: 0.04, 4: 195.33, 5: -3915.45, 6: 0.73, 7: 0.73, 8: 3906.5}
}
How can I do this transformation using full advantage of pandas?
答案1
得分: 4
你可以使用布尔索引:
# 确保日期是datetime格式
df['cashflow_date'] = pd.to_datetime(df['cashflow_date'], dayfirst=True)
# 获取“funding”类型的行
# 注意:如果每个ID有多个“funding”类型的行,需要删除重复项
m1 = df['cashflow_type'].eq('funding')
# 获取每个ID的参考日期
ref = df['loan_identifier'].map(df.loc[m1].set_index('loan_identifier')
['cashflow_date'])
# 如果日期低于参考日期,则进行更改
m2 = df['cashflow_date'].lt(ref)
df.loc[m2, 'cashflow_date'] = ref[m2]
另一种方法是使用自定义的groupby.apply
和clip
方法:
df['cashflow_date'] = pd.to_datetime(df['cashflow_date'], dayfirst=True)
df['cashflow_date'] = (df.groupby('loan_identifier', group_keys=False)
.apply(lambda g: g['cashflow_date']
.clip(lower=g.loc[g['cashflow_type'].eq('funding'),
'cashflow_date'].min())
)
)
输出:
identifier loan_identifier cashflow_date cashflow_type amount
0 1 a111 2023-07-15 funding -195.71
1 2 a111 2023-07-15 interest_repayment 3.11
2 3 a111 2023-07-15 interest_repayment 0.04
3 4 a111 2023-07-20 interest_repayment 0.04
4 5 a111 2023-07-15 principal_repayment 195.33
5 6 b222 2023-07-10 funding -3915.45
6 13 b222 2023-07-10 interest_repayment 0.73
7 14 b222 2023-07-10 interest_repayment 0.73
8 15 b222 2023-07-10 principal_repayment 3906.50
英文:
You can use boolean indexing:
# ensure datetime
df['cashflow_date'] = pd.to_datetime(df['cashflow_date'], dayfirst=True)
# get rows with "funding"
# NB. if more than one per ID, you need to drop duplicates
m1 = df['cashflow_type'].eq('funding')
# get reference date per ID
ref = df['loan_identifier'].map(df.loc[m1].set_index('loan_identifier')
['cashflow_date'])
# change if lower than reference
m2 = df['cashflow_date'].lt(ref)
df.loc[m2, 'cashflow_date'] = ref[m2]
Another approach with a custom groupby.apply
and clip
:
df['cashflow_date'] = pd.to_datetime(df['cashflow_date'], dayfirst=True)
df['cashflow_date'] = (df.groupby('loan_identifier', group_keys=False)
.apply(lambda g: g['cashflow_date']
.clip(lower=g.loc[g['cashflow_type'].eq('funding'),
'cashflow_date'].min())
)
)
Output:
identifier loan_identifier cashflow_date cashflow_type amount
0 1 a111 2023-07-15 funding -195.71
1 2 a111 2023-07-15 interest_repayment 3.11
2 3 a111 2023-07-15 interest_repayment 0.04
3 4 a111 2023-07-20 interest_repayment 0.04
4 5 a111 2023-07-15 principal_repayment 195.33
5 6 b222 2023-07-10 funding -3915.45
6 13 b222 2023-07-10 interest_repayment 0.73
7 14 b222 2023-07-10 interest_repayment 0.73
8 15 b222 2023-07-10 principal_repayment 3906.50
答案2
得分: 2
# 转换为日期时间格式
df['cashflow_date'] = pd.to_datetime(df['cashflow_date'], format='%d/%m/%Y')
# 按贷款标识符分组以识别最早的现金流日期,以便贷款标识符与早期日期分组
earliest_dates = df[df['cashflow_type'] == 'funding'].groupby('loan_identifier')['cashflow_date'].min()
# a111 2023-07-15
# b222 2023-07-10
# 然后使用.loc索引器更新'cashflow_date'
for loan_id, earliest_date in earliest_dates.items():
df.loc[(df['loan_identifier'] == loan_id) & (df['cashflow_type'] != 'funding') & (df['cashflow_date'] < earliest_date), 'cashflow_date'] = earliest_date
print(df)
输出:
identifier loan_identifier cashflow_date cashflow_type amount
0 1 a111 2023-07-15 funding -195.71
1 2 a111 2023-07-15 interest_repayment 3.11
2 3 a111 2023-07-15 interest_repayment 0.04
3 4 a111 2023-07-20 interest_repayment 0.04
4 5 a111 2023-07-15 principal_repayment 195.33
5 6 b222 2023-07-10 funding -3915.45
6 13 b222 2023-07-10 interest_repayment 0.73
7 14 b222 2023-07-10 interest_repayment 0.73
8 15 b222 2023-07-10 principal_repayment 3906.50
请注意,上面的翻译是代码的一部分,用于解释代码的功能。
英文:
# converting to datetime format
df['cashflow_date'] = pd.to_datetime(df['cashflow_date'], format='%d/%m/%Y')
#groupby loan identifier to identify earliest cashflow date so that loan_identifier gets grouped with early date
earliest_dates = df[df['cashflow_type'] == 'funding'].groupby('loan_identifier')['cashflow_date'].min()
# a111 2023-07-15
# b222 2023-07-10
# then updating 'cashflow_date' using .loc indexer
for loan_id, earliest_date in earliest_dates.items():
df.loc[(df['loan_identifier'] == loan_id) & (df['cashflow_type'] != 'funding') & (df['cashflow_date'] < earliest_date), 'cashflow_date'] = earliest_date
print(df)
Output:
identifier loan_identifier cashflow_date cashflow_type amount
0 1 a111 2023-07-15 funding -195.71
1 2 a111 2023-07-15 interest_repayment 3.11
2 3 a111 2023-07-15 interest_repayment 0.04
3 4 a111 2023-07-20 interest_repayment 0.04
4 5 a111 2023-07-15 principal_repayment 195.33
5 6 b222 2023-07-10 funding -3915.45
6 13 b222 2023-07-10 interest_repayment 0.73
7 14 b222 2023-07-10 interest_repayment 0.73
8 15 b222 2023-07-10 principal_repayment 3906.50
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论