Creating a new column in a Pandas DataFrame based on the previous quarter and the same ID in another DataFrame

huangapple go评论95阅读模式
英文:

Creating a new column in a Pandas DataFrame based on the previous quarter and the same ID in another DataFrame

问题

在第二个数据框 df_test2 中,我想要插入一个名为 'file' 的列,其值由 df_test 中具有相同 'ident' 的相应 'quarter' 值的 'file' 值来定义。所以,例如,'ident' = 1 在 df_test 中有 '2022q4' 和 '2023q2',而在 df_test2 中有 '2023q1'。这意味着我希望 df_test2 中的 'file' 列读取 'file_1',因为这是前一个季度的文件名,而不是 'file1_v2'。最终结果应该是 df_test2 中的一列,如下所示:

['file_1', 'file_2_new', 'file_3']

我的想法是在两个数据框中查找相同的 'ident',将 df_test 中的 'quarter' 值与 df_test2 中的前一个季度值进行比较,并将文件名设置为相同的值,但我不确定如何做到这一点。非常感谢您的任何帮助!

英文:

What I have is two very large datasets that I want to combine, but before I do, I want to make sure that the same columns with correct values are found in both. One of them is missing a column titled 'file', which should be based on values found in this column in the other dataframe and values found in a list. My code looks something like this:

import pandas as pd

quarters = ['2021q1', '2021q2', '2021q3', '2021q4', '2022q1',
            '2022q2', '2022q3', '2022q4', '2023q1', '2023q2']


df_test = pd.DataFrame(data=None, columns=['file', 'quarter', 'ident'])
df_test.file = ['file_1', 'file_1_v2', 'file_2_old', 'file_2_new', 'file_3']
df_test.quarter = ['2022q4', '2023q2', '2022q2', '2022q3', '2023q1']
df.ident = [1, 1, 2, 2, 3]

df_test2 = pd.DataFrame(data=None, columns = ['quarter', 'ident'])
df_test2.quarter = ['2023q1', '2022q4', '2023q2']
df_test2.ident = [1, 2, 3]

In the second dataframe df_test2, I want to insert a column 'file' with values defined by the 'file' values in df_test for the quarter before the one shown in df_test2 for the same id-number 'ident'. So, for example, the 'ident' = 1 has quarter '2022q4' and '2023q2' in df_test and '2023q1' in df_test2. This means that I want the 'file' column to read 'file_1' in df_test2 since this was the file name for the previous quarter, and not 'file1_v2'. The end result should be a column in df_test2 that reads:

['file_1', 'file_2_new', 'file_3']

My idea is to look for the same id-number in both dataframes, compare the 'quarter' value in df_test2 with the previous quarter value in df_test and set the file name to be the same, but I'm not sure how to do this. Any help is really appreciated, thanks!

答案1

得分: 1

你可以使用季度周期来简化操作(使用 to_datetime+to_period 进行转换),然后将你的数据框合并 merge

# 使用季度周期代替字符串
df_test['quarter'] = pd.to_datetime(df_test['quarter']).dt.to_period('Q')
df_test2['quarter'] = pd.to_datetime(df_test2['quarter']).dt.to_period('Q')

# 在前一个周期上合并
out = df_test2.merge(df_test.drop(columns='quarter'), how='left',
                     left_on=['ident', 'quarter'],
                     right_on=['ident', df_test['quarter'].add(1)])

输出:

  quarter  ident        file
0  2023Q1      1      file_1
1  2022Q4      2  file_2_new
2  2023Q2      3      file_3

请注意,你可以保留字符串并将周期作为 merge 中的键(这里为了演示,保留了所有列):

out = df_test2.merge(df_test, how='left',
                     suffixes=('_1', '_2'),
                     left_on=['ident', pd.to_datetime(df_test2['quarter'])
                                         .dt.to_period('Q')],
                     right_on=['ident', pd.to_datetime(df_test['quarter'])
                                          .dt.to_period('Q').add(1)])

输出:

  quarter_1   key_1  ident        file quarter_2
0    2023q1  2023Q1      1      file_1    2022q4
1    2022q4  2022Q4      2  file_2_new    2022q3
2    2023q2  2023Q2      3      file_3    2023q1
英文:

You can use quarter periods to makes things easier (converting with to_datetime+to_period), and then merge your dataframes:

# use quarter periods instead of strings
df_test['quarter'] = pd.to_datetime(df_test['quarter']).dt.to_period('Q')
df_test2['quarter'] = pd.to_datetime(df_test2['quarter']).dt.to_period('Q')

# merge on the previous period
out = df_test2.merge(df_test.drop(columns='quarter'), how='left',
                     left_on=['ident', 'quarter'],
                     right_on=['ident', df_test['quarter'].add(1)])

Output:

  quarter  ident        file
0  2023Q1      1      file_1
1  2022Q4      2  file_2_new
2  2023Q2      3      file_3

Note that you can keep your strings and pass the periods as keys in the merge (keeping all columns here for the demo):

out = df_test2.merge(df_test, how='left',
                     suffixes=('_1', '_2'),
                     left_on=['ident', pd.to_datetime(df_test2['quarter'])
                                         .dt.to_period('Q')],
                     right_on=['ident', pd.to_datetime(df_test['quarter'])
                                          .dt.to_period('Q').add(1)])

Output:

  quarter_1   key_1  ident        file quarter_2
0    2023q1  2023Q1      1      file_1    2022q4
1    2022q4  2022Q4      2  file_2_new    2022q3
2    2023q2  2023Q2      3      file_3    2023q1

huangapple
  • 本文由 发表于 2023年6月27日 16:55:39
  • 转载请务必保留本文链接:https://go.coder-hub.com/76563207.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定