Creating a new column in a Pandas DataFrame based on the previous quarter and the same ID in another DataFrame
Question
What I have is two very large datasets that I want to combine, but before I do, I want to make sure that the same columns with correct values are found in both. One of them is missing a column titled 'file', which should be based on values found in this column in the other dataframe and values found in a list. My code looks something like this:
import pandas as pd
quarters = ['2021q1', '2021q2', '2021q3', '2021q4', '2022q1',
'2022q2', '2022q3', '2022q4', '2023q1', '2023q2']
df_test = pd.DataFrame(data=None, columns=['file', 'quarter', 'ident'])
df_test.file = ['file_1', 'file_1_v2', 'file_2_old', 'file_2_new', 'file_3']
df_test.quarter = ['2022q4', '2023q2', '2022q2', '2022q3', '2023q1']
df_test.ident = [1, 1, 2, 2, 3]
df_test2 = pd.DataFrame(data=None, columns = ['quarter', 'ident'])
df_test2.quarter = ['2023q1', '2022q4', '2023q2']
df_test2.ident = [1, 2, 3]
In the second dataframe df_test2, I want to insert a column 'file' whose values come from the 'file' column of df_test, taken from the quarter before the one shown in df_test2 for the same id-number 'ident'. For example, 'ident' = 1 has quarters '2022q4' and '2023q2' in df_test and '2023q1' in df_test2. This means the 'file' column in df_test2 should read 'file_1', since this was the file name for the previous quarter, and not 'file_1_v2'. The end result should be a column in df_test2 that reads:
['file_1', 'file_2_new', 'file_3']
My idea is to look for the same id-number in both dataframes, compare the 'quarter' value in df_test2 with the previous quarter value in df_test and set the file name to be the same, but I'm not sure how to do this. Any help is really appreciated, thanks!
Answer 1
Score: 1
You can use quarter periods to make things easier (converting with to_datetime + to_period), and then merge your dataframes:
# use quarter periods instead of strings
df_test['quarter'] = pd.to_datetime(df_test['quarter']).dt.to_period('Q')
df_test2['quarter'] = pd.to_datetime(df_test2['quarter']).dt.to_period('Q')

# merge on the previous period: shifting df_test's quarter forward by one
# makes each file line up with the quarter that follows it
out = df_test2.merge(df_test.drop(columns='quarter'), how='left',
                     left_on=['ident', 'quarter'],
                     right_on=['ident', df_test['quarter'].add(1)])
Output:
quarter ident file
0 2023Q1 1 file_1
1 2022Q4 2 file_2_new
2 2023Q2 3 file_3
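The merge works because adding 1 to a quarterly period moves it forward by one quarter, so a file recorded in '2022q4' ends up keyed on '2023q1'. A minimal sketch of that arithmetic follows; it builds the periods with pd.PeriodIndex rather than to_datetime + to_period, which is a substitution made here for brevity, and the variable names are illustrative only.

```python
import pandas as pd

# A quarter string such as '2022q4' parses into a quarterly Period.
q = pd.Period('2022q4', freq='Q')

# Adding 1 moves one quarter forward and rolls over the year boundary.
nxt = q + 1
print(q, nxt)  # 2022Q4 2023Q1

# The same shift applied to a whole column of quarterly periods.
s = pd.Series(pd.PeriodIndex(['2022q4', '2022q3', '2023q1'], freq='Q'))
shifted = s + 1
print(shifted.astype(str).tolist())  # ['2023Q1', '2022Q4', '2023Q2']
```

Shifting the df_test side forward is equivalent to shifting the df_test2 side back by one quarter; either way the two keys meet on the same period.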
Note that you can keep your strings and pass the periods as keys in the merge (keeping all columns here for the demo):
out = df_test2.merge(df_test, how='left',
                     suffixes=('_1', '_2'),
                     left_on=['ident', pd.to_datetime(df_test2['quarter'])
                                         .dt.to_period('Q')],
                     right_on=['ident', pd.to_datetime(df_test['quarter'])
                                          .dt.to_period('Q').add(1)])
Output:
quarter_1 key_1 ident file quarter_2
0 2023q1 2023Q1 1 file_1 2022q4
1 2022q4 2022Q4 2 file_2_new 2022q3
2 2023q2 2023Q2 3 file_3 2023q1
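If the goal is only the 'file' column on df_test2 with the original quarter strings left untouched, the same idea can be sketched end to end as below. This is one possible variant rather than the answer's exact code: it rebuilds the sample data from the question, uses pd.PeriodIndex for the conversion, and introduces a helper column named match_q, which is a hypothetical name chosen for this example.

```python
import pandas as pd

# Sample data from the question.
df_test = pd.DataFrame({
    'file': ['file_1', 'file_1_v2', 'file_2_old', 'file_2_new', 'file_3'],
    'quarter': ['2022q4', '2023q2', '2022q2', '2022q3', '2023q1'],
    'ident': [1, 1, 2, 2, 3],
})
df_test2 = pd.DataFrame({
    'quarter': ['2023q1', '2022q4', '2023q2'],
    'ident': [1, 2, 3],
})

# match_q (hypothetical helper column): the quarter after each file's own
# quarter, written back in the same lowercase string form as df_test2.
next_q = pd.PeriodIndex(df_test['quarter'].tolist(), freq='Q') + 1
lookup = df_test.assign(match_q=next_q.astype(str).str.lower())
lookup = lookup[['ident', 'match_q', 'file']]

# Left merge keeps every row of df_test2, in its original order.
result = df_test2.merge(lookup, how='left',
                        left_on=['ident', 'quarter'],
                        right_on=['ident', 'match_q']).drop(columns='match_q')

print(result['file'].tolist())  # ['file_1', 'file_2_new', 'file_3']
```

Rows in df_test2 with no file in the immediately preceding quarter would get NaN in 'file' under this approach, since the match is exact rather than "most recent earlier quarter".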