2023年8月9日 17:01:54go评论101阅读模式

英文:

How to effectively use pandas to change values based unique identifier and one condition?

问题

你可以使用Pandas的一些功能来完成这个转换。首先，将数据字典转换为数据帧：

import pandas as pd

data = {
    'identifier': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 13, 7: 14, 8: 15},
    'loan_identifier': {0: 'a111', 1: 'a111', 2: 'a111', 3: 'a111', 4: 'a111', 5: 'b222', 6: 'b222', 7: 'b222', 8: 'b222'},
    'cashflow_date': {0: '15/07/2023', 1: '01/07/2023', 2: '15/07/2023', 3: '20/07/2023', 4: '11/06/2023', 5: '10/07/2023', 6: '10/07/2023', 7: '10/07/2023', 8: '13/06/2023'},
    'cashflow_type': {0: 'funding', 1: 'interest_repayment', 2: 'interest_repayment', 3: 'interest_repayment', 4: 'principal_repayment', 5: 'funding', 6: 'interest_repayment', 7: 'interest_repayment', 8: 'principal_repayment'},
    'amount': {0: -195.71, 1: 3.11, 2: 0.04, 3: 0.04, 4: 195.33, 5: -3915.45, 6: 0.73, 7: 0.73, 8: 3906.5}
}

df = pd.DataFrame(data)

然后，将cashflow_date列转换为日期时间类型：

df['cashflow_date'] = pd.to_datetime(df['cashflow_date'], format='%d/%m/%Y')

接下来，按loan_identifier分组，并将每个组内的cashflow_date最小值应用于整个组的cashflow_date列：

df['cashflow_date'] = df.groupby('loan_identifier')['cashflow_date'].transform('min')

最后，根据条件将cashflow_date列更新为最小值：

df.loc[df['cashflow_type'] != 'funding', 'cashflow_date'] = df['cashflow_date'].min()

完成后，你将得到转换后的数据帧df，它符合你的要求。

英文:

I have the following data frame:

   identifier loan_identifier cashflow_date        cashflow_type   amount
0           1            a111    15/07/2023              funding  -195.71
1           2            a111    01/07/2023   interest_repayment     3.11
2           3            a111    15/07/2023   interest_repayment     0.04
3           4            a111    20/07/2023   interest_repayment     0.04
4           5            a111    11/06/2023  principal_repayment   195.33
5           6            b222    10/07/2023              funding -3915.45
6          13            b222    10/07/2023   interest_repayment     0.73
7          14            b222    10/07/2023   interest_repayment     0.73
8          15            b222    13/06/2023  principal_repayment  3906.50

cashflow_date of funding cashflow_type should be the earliest for each loan_identifier. Each row with cashflow_type != funding whose date is earlier than funding row of that loan_identifer should be set to funding cashflow_date.

So the above dataframe should look like this:

   identifier loan_identifier cashflow_date        cashflow_type   amount
0           1            a111    15/07/2023              funding  -195.71
1           2            a111    15/07/2023   interest_repayment     3.11
2           3            a111    15/07/2023   interest_repayment     0.04
3           4            a111    20/07/2023   interest_repayment     0.04
4           5            a111    15/07/2023  principal_repayment   195.33
5           6            b222    10/07/2023              funding -3915.45
6          13            b222    10/07/2023   interest_repayment     0.73
7          14            b222    10/07/2023   interest_repayment     0.73
8          15            b222    10/07/2023  principal_repayment  3906.50

Here's the data as a dictionary:

data = {
    &#39;identifier&#39;: {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 13, 7: 14, 8: 15},
    &#39;loan_identifier&#39;: {0: &#39;a111&#39;, 1: &#39;a111&#39;, 2: &#39;a111&#39;, 3: &#39;a111&#39;, 4: &#39;a111&#39;, 5: &#39;b222&#39;, 6: &#39;b222&#39;, 7: &#39;b222&#39;, 8: &#39;b222&#39;},
    &#39;cashflow_date&#39;: {0: &#39;15/07/2023&#39;, 1: &#39;01/07/2023&#39;, 2: &#39;15/07/2023&#39;, 3: &#39;20/07/2023&#39;, 4: &#39;11/06/2023&#39;, 5: &#39;10/07/2023&#39;, 6: &#39;10/07/2023&#39;, 7: &#39;10/07/2023&#39;, 8: &#39;13/06/2023&#39;},
    &#39;cashflow_type&#39;: {0: &#39;funding&#39;, 1: &#39;interest_repayment&#39;, 2: &#39;interest_repayment&#39;, 3: &#39;interest_repayment&#39;, 4: &#39;principal_repayment&#39;, 5: &#39;funding&#39;, 6: &#39;interest_repayment&#39;, 7: &#39;interest_repayment&#39;, 8: &#39;principal_repayment&#39;},
    &#39;amount&#39;: {0: -195.71, 1: 3.11, 2: 0.04, 3: 0.04, 4: 195.33, 5: -3915.45, 6: 0.73, 7: 0.73, 8: 3906.5}
}

How can I do this transformation using full advantage of pandas?

答案1

得分: 4

你可以使用布尔索引进行操作：

# 确保日期格式正确
df['cashflow_date'] = pd.to_datetime(df['cashflow_date'], dayfirst=True)
# 获取包含"funding"的行
# 注意：如果一个ID对应多个"funding"，你需要删除重复项
m1 = df['cashflow_type'].eq('funding')

# 获取每个ID对应的参考日期
ref = df['loan_identifier'].map(df.loc[m1].set_index('loan_identifier')
                          ['cashflow_date'])

# 如果日期小于参考日期，则进行更改
m2 = df['cashflow_date'].lt(ref)
df.loc[m2, 'cashflow_date'] = ref[m2]

另一种方法是使用自定义的groupby.apply和clip函数：

df['cashflow_date'] = pd.to_datetime(df['cashflow_date'], dayfirst=True)

df['cashflow_date'] = (df.groupby('loan_identifier', group_keys=False)
                         .apply(lambda g: g['cashflow_date']
                                 .clip(lower=g.loc[g['cashflow_type'].eq('funding'),
                                                   'cashflow_date'].min())
                               )
                      )

输出结果为：

   identifier loan_identifier cashflow_date        cashflow_type   amount
0           1            a111    2023-07-15              funding  -195.71
1           2            a111    2023-07-15   interest_repayment     3.11
2           3            a111    2023-07-15   interest_repayment     0.04
3           4            a111    2023-07-20   interest_repayment     0.04
4           5            a111    2023-07-15  principal_repayment   195.33
5           6            b222    2023-07-10              funding -3915.45
6          13            b222    2023-07-10   interest_repayment     0.73
7          14            b222    2023-07-10   interest_repayment     0.73
8          15            b222    2023-07-10  principal_repayment  3906.50

英文:

You can use boolean indexing:

# ensure datetime
df[&#39;cashflow_date&#39;] = pd.to_datetime(df[&#39;cashflow_date&#39;], dayfirst=True)
# get rows with &quot;funding&quot;
# NB. if more than one per ID, you need to drop duplicates
m1 = df[&#39;cashflow_type&#39;].eq(&#39;funding&#39;)

# get reference date per ID
ref = df[&#39;loan_identifier&#39;].map(df.loc[m1].set_index(&#39;loan_identifier&#39;)
                          [&#39;cashflow_date&#39;])

# change if lower than reference
m2 = df[&#39;cashflow_date&#39;].lt(ref)
df.loc[m2, &#39;cashflow_date&#39;] = ref[m2]

Another approach with a custom groupby.apply and clip:

df[&#39;cashflow_date&#39;] = pd.to_datetime(df[&#39;cashflow_date&#39;], dayfirst=True)

df[&#39;cashflow_date&#39;] = (df.groupby(&#39;loan_identifier&#39;, group_keys=False)
                         .apply(lambda g: g[&#39;cashflow_date&#39;]
                                 .clip(lower=g.loc[g[&#39;cashflow_type&#39;].eq(&#39;funding&#39;),
                                                   &#39;cashflow_date&#39;].min())
                               )
                      )

Output:

   identifier loan_identifier cashflow_date        cashflow_type   amount
0           1            a111    2023-07-15              funding  -195.71
1           2            a111    2023-07-15   interest_repayment     3.11
2           3            a111    2023-07-15   interest_repayment     0.04
3           4            a111    2023-07-20   interest_repayment     0.04
4           5            a111    2023-07-15  principal_repayment   195.33
5           6            b222    2023-07-10              funding -3915.45
6          13            b222    2023-07-10   interest_repayment     0.73
7          14            b222    2023-07-10   interest_repayment     0.73
8          15            b222    2023-07-10  principal_repayment  3906.50

答案2

得分: 2

# 转换为日期时间格式
df['cashflow_date'] = pd.to_datetime(df['cashflow_date'], format='%d/%m/%Y')

# 按贷款标识符分组，以确定最早的现金流日期，以便将贷款标识符与早期日期分组
earliest_dates = df[df['cashflow_type'] == 'funding'].groupby('loan_identifier')['cashflow_date'].min()
# a111   2023-07-15
# b222   2023-07-10

# 然后使用.loc索引器更新'cashflow_date'
for loan_id, earliest_date in earliest_dates.items():
    df.loc[(df['loan_identifier'] == loan_id) & (df['cashflow_type'] != 'funding') & (df['cashflow_date'] < earliest_date), 'cashflow_date'] = earliest_date

print(df)

输出:

identifier loan_identifier cashflow_date        cashflow_type   amount
0           1            a111    2023-07-15              funding  -195.71
1           2            a111    2023-07-15   interest_repayment     3.11
2           3            a111    2023-07-15   interest_repayment     0.04
3           4            a111    2023-07-20   interest_repayment     0.04
4           5            a111    2023-07-15  principal_repayment   195.33
5           6            b222    2023-07-10              funding -3915.45
6          13            b222    2023-07-10   interest_repayment     0.73
7          14            b222    2023-07-10   interest_repayment     0.73
8          15            b222    2023-07-10  principal_repayment  3906.50

英文:

# converting to  datetime format
df[&#39;cashflow_date&#39;] = pd.to_datetime(df[&#39;cashflow_date&#39;], format=&#39;%d/%m/%Y&#39;)

#groupby loan identifier to identify earliest cashflow date so that loan_identifier gets grouped with early date
earliest_dates = df[df[&#39;cashflow_type&#39;] == &#39;funding&#39;].groupby(&#39;loan_identifier&#39;)[&#39;cashflow_date&#39;].min()
# a111   2023-07-15
# b222   2023-07-10

# then updating &#39;cashflow_date&#39; using .loc indexer
for loan_id, earliest_date in earliest_dates.items():
    df.loc[(df[&#39;loan_identifier&#39;] == loan_id) &amp; (df[&#39;cashflow_type&#39;] != &#39;funding&#39;) &amp; (df[&#39;cashflow_date&#39;] &lt; earliest_date), &#39;cashflow_date&#39;] = earliest_date

print(df)

Output:

identifier loan_identifier cashflow_date        cashflow_type   amount
0           1            a111    2023-07-15              funding  -195.71
1           2            a111    2023-07-15   interest_repayment     3.11
2           3            a111    2023-07-15   interest_repayment     0.04
3           4            a111    2023-07-20   interest_repayment     0.04
4           5            a111    2023-07-15  principal_repayment   195.33
5           6            b222    2023-07-10              funding -3915.45
6          13            b222    2023-07-10   interest_repayment     0.73
7          14            b222    2023-07-10   interest_repayment     0.73
8          15            b222    2023-07-10  principal_repayment  3906.50

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

如何有效地使用pandas根据唯一标识符和一个条件来更改值？

问题

答案1

答案2

Flask应用正在正确运行功能，但未呈现模板。

找到了但未调用的测试装置。

DRF web界面不像Postman那样工作

在Python中使用re.finditer方法提取匹配的片段如何实现？

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论