June 27, 2023, 16:55:39

Creating a new column in a Pandas DataFrame based on the previous quarter and the same ID in another DataFrame

Question

What I have is two very large datasets that I want to combine, but before I do, I want to make sure that the same columns with correct values are found in both. One of them is missing a column titled 'file', which should be based on values found in this column in the other dataframe and values found in a list. My code looks something like this:

import pandas as pd
quarters = ['2021q1', '2021q2', '2021q3', '2021q4', '2022q1',
            '2022q2', '2022q3', '2022q4', '2023q1', '2023q2']
# reference dataframe: one file name per (ident, quarter)
df_test = pd.DataFrame(data=None, columns=['file', 'quarter', 'ident'])
df_test.file = ['file_1', 'file_1_v2', 'file_2_old', 'file_2_new', 'file_3']
df_test.quarter = ['2022q4', '2023q2', '2022q2', '2022q3', '2023q1']
df_test.ident = [1, 1, 2, 2, 3]
# dataframe that is missing the 'file' column
df_test2 = pd.DataFrame(data=None, columns=['quarter', 'ident'])
df_test2.quarter = ['2023q1', '2022q4', '2023q2']
df_test2.ident = [1, 2, 3]

In the second dataframe df_test2, I want to insert a column 'file' whose values are taken from the 'file' column in df_test: for each row, the value for the same id-number 'ident' and the quarter before the one shown in df_test2. For example, 'ident' = 1 has quarters '2022q4' and '2023q2' in df_test and '2023q1' in df_test2. This means that I want the 'file' column in df_test2 to read 'file_1', since this was the file name for the previous quarter, and not 'file_1_v2'. The end result should be a column in df_test2 that reads:

['file_1', 'file_2_new', 'file_3']

My idea is to look for the same id-number in both dataframes, compare the 'quarter' value in df_test2 with the previous quarter value in df_test and set the file name to be the same, but I'm not sure how to do this. Any help is really appreciated, thanks!

Answer 1

Score: 1

You can use quarter periods to make things easier (converting with to_datetime + to_period), and then merge your dataframes:

# use quarter periods instead of strings
df_test['quarter'] = pd.to_datetime(df_test['quarter']).dt.to_period('Q')
df_test2['quarter'] = pd.to_datetime(df_test2['quarter']).dt.to_period('Q')
# merge on the previous period
out = df_test2.merge(df_test.drop(columns='quarter'), how='left',
                     left_on=['ident', 'quarter'],
                     right_on=['ident', df_test['quarter'].add(1)])

Output:

  quarter  ident        file
0  2023Q1      1      file_1
1  2022Q4      2  file_2_new
2  2023Q2      3      file_3
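
To see why adding 1 to the quarters of df_test lines them up with the quarters of df_test2, here is a minimal sketch of quarterly period arithmetic. The variable names (q, next_q, prev_q, idx, shifted) are illustrative only and not part of the answer above.

```python
import pandas as pd

# A quarterly period; adding an integer shifts it by that many quarters.
q = pd.Period("2022Q4", freq="Q")
next_q = q + 1  # rolls over into the next year
prev_q = q - 1

print(q, next_q, prev_q)

# The same arithmetic works element-wise on a PeriodIndex.
idx = pd.PeriodIndex(["2022Q4", "2023Q2"], freq="Q")
shifted = idx + 1
print(list(shifted.astype(str)))
```

Shifting the df_test quarters forward by one means a row recorded for 2022Q4 is matched against a df_test2 row for 2023Q1, which is the "file of the previous quarter" lookup.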

Note that you can keep your strings and pass the periods as keys in the merge (keeping all columns here for the demo):

out = df_test2.merge(df_test, how='left',
                     suffixes=('_1', '_2'),
                     left_on=['ident', pd.to_datetime(df_test2['quarter'])
                                         .dt.to_period('Q')],
                     right_on=['ident', pd.to_datetime(df_test['quarter'])
                                          .dt.to_period('Q').add(1)])

Output:

  quarter_1   key_1  ident        file quarter_2
0    2023q1  2023Q1      1      file_1    2022q4
1    2022q4  2022Q4      2  file_2_new    2022q3
2    2023q2  2023Q2      3      file_3    2023q1
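
The idea described in the question (look up the previous quarter in the quarters list, then match on 'ident') can also be sketched without any date parsing. This is only a sketch, assuming the quarters list is sorted and has no gaps; prev_quarter, keyed and the temporary 'prev' column are hypothetical names introduced here.

```python
import pandas as pd

quarters = ['2021q1', '2021q2', '2021q3', '2021q4', '2022q1',
            '2022q2', '2022q3', '2022q4', '2023q1', '2023q2']

df_test = pd.DataFrame({
    'file': ['file_1', 'file_1_v2', 'file_2_old', 'file_2_new', 'file_3'],
    'quarter': ['2022q4', '2023q2', '2022q2', '2022q3', '2023q1'],
    'ident': [1, 1, 2, 2, 3],
})
df_test2 = pd.DataFrame({
    'quarter': ['2023q1', '2022q4', '2023q2'],
    'ident': [1, 2, 3],
})

# Map every quarter to the one before it (assumes a sorted, gap-free list).
prev_quarter = dict(zip(quarters[1:], quarters[:-1]))

# Temporary key holding the previous quarter of each row in df_test2.
keyed = df_test2.assign(prev=df_test2['quarter'].map(prev_quarter))

# Left merge so rows without a match keep NaN in 'file'.
out = (keyed.merge(df_test.rename(columns={'quarter': 'prev'}),
                   how='left', on=['ident', 'prev'])
            .drop(columns='prev'))
print(out)
```

The first quarter in the list has no predecessor, so such rows would get NaN in 'prev' and therefore NaN in 'file'. The period-based merge above avoids the need for the list entirely.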
