Creating a new column in a Pandas DataFrame based on the previous quarter and the same ID in another DataFrame

huangapple go评论110阅读模式
英文:

Creating a new column in a Pandas DataFrame based on the previous quarter and the same ID in another DataFrame

问题

在第二个数据框 df_test2 中,我想要插入一个名为 'file' 的列,其值由 df_test 中具有相同 'ident' 的相应 'quarter' 值的 'file' 值来定义。所以,例如,'ident' = 1 在 df_test 中有 '2022q4' 和 '2023q2',而在 df_test2 中有 '2023q1'。这意味着我希望 df_test2 中的 'file' 列读取 'file_1',因为这是前一个季度的文件名,而不是 'file1_v2'。最终结果应该是 df_test2 中的一列,如下所示:

  1. ['file_1', 'file_2_new', 'file_3']

我的想法是在两个数据框中查找相同的 'ident',将 df_test 中的 'quarter' 值与 df_test2 中的前一个季度值进行比较,并将文件名设置为相同的值,但我不确定如何做到这一点。非常感谢您的任何帮助!

英文:

What I have is two very large datasets that I want to combine, but before I do, I want to make sure that the same columns with correct values are found in both. One of them is missing a column titled 'file', which should be based on values found in this column in the other dataframe and values found in a list. My code looks something like this:

  1. import pandas as pd
  2. quarters = ['2021q1', '2021q2', '2021q3', '2021q4', '2022q1',
  3. '2022q2', '2022q3', '2022q4', '2023q1', '2023q2']
  4. df_test = pd.DataFrame(data=None, columns=['file', 'quarter', 'ident'])
  5. df_test.file = ['file_1', 'file_1_v2', 'file_2_old', 'file_2_new', 'file_3']
  6. df_test.quarter = ['2022q4', '2023q2', '2022q2', '2022q3', '2023q1']
  7. df.ident = [1, 1, 2, 2, 3]
  8. df_test2 = pd.DataFrame(data=None, columns = ['quarter', 'ident'])
  9. df_test2.quarter = ['2023q1', '2022q4', '2023q2']
  10. df_test2.ident = [1, 2, 3]

In the second dataframe df_test2, I want to insert a column 'file' with values defined by the 'file' values in df_test for the quarter before the one shown in df_test2 for the same id-number 'ident'. So, for example, the 'ident' = 1 has quarter '2022q4' and '2023q2' in df_test and '2023q1' in df_test2. This means that I want the 'file' column to read 'file_1' in df_test2 since this was the file name for the previous quarter, and not 'file1_v2'. The end result should be a column in df_test2 that reads:

  1. ['file_1', 'file_2_new', 'file_3']

My idea is to look for the same id-number in both dataframes, compare the 'quarter' value in df_test2 with the previous quarter value in df_test and set the file name to be the same, but I'm not sure how to do this. Any help is really appreciated, thanks!

答案1

得分: 1

你可以使用季度周期来简化操作(使用 to_datetime+to_period 进行转换),然后将你的数据框合并 merge

  1. # 使用季度周期代替字符串
  2. df_test['quarter'] = pd.to_datetime(df_test['quarter']).dt.to_period('Q')
  3. df_test2['quarter'] = pd.to_datetime(df_test2['quarter']).dt.to_period('Q')
  4. # 在前一个周期上合并
  5. out = df_test2.merge(df_test.drop(columns='quarter'), how='left',
  6. left_on=['ident', 'quarter'],
  7. right_on=['ident', df_test['quarter'].add(1)])

输出:

  1. quarter ident file
  2. 0 2023Q1 1 file_1
  3. 1 2022Q4 2 file_2_new
  4. 2 2023Q2 3 file_3

请注意,你可以保留字符串并将周期作为 merge 中的键(这里为了演示,保留了所有列):

  1. out = df_test2.merge(df_test, how='left',
  2. suffixes=('_1', '_2'),
  3. left_on=['ident', pd.to_datetime(df_test2['quarter'])
  4. .dt.to_period('Q')],
  5. right_on=['ident', pd.to_datetime(df_test['quarter'])
  6. .dt.to_period('Q').add(1)])

输出:

  1. quarter_1 key_1 ident file quarter_2
  2. 0 2023q1 2023Q1 1 file_1 2022q4
  3. 1 2022q4 2022Q4 2 file_2_new 2022q3
  4. 2 2023q2 2023Q2 3 file_3 2023q1
英文:

You can use quarter periods to makes things easier (converting with to_datetime+to_period), and then merge your dataframes:

  1. # use quarter periods instead of strings
  2. df_test['quarter'] = pd.to_datetime(df_test['quarter']).dt.to_period('Q')
  3. df_test2['quarter'] = pd.to_datetime(df_test2['quarter']).dt.to_period('Q')
  4. # merge on the previous period
  5. out = df_test2.merge(df_test.drop(columns='quarter'), how='left',
  6. left_on=['ident', 'quarter'],
  7. right_on=['ident', df_test['quarter'].add(1)])

Output:

  1. quarter ident file
  2. 0 2023Q1 1 file_1
  3. 1 2022Q4 2 file_2_new
  4. 2 2023Q2 3 file_3

Note that you can keep your strings and pass the periods as keys in the merge (keeping all columns here for the demo):

  1. out = df_test2.merge(df_test, how='left',
  2. suffixes=('_1', '_2'),
  3. left_on=['ident', pd.to_datetime(df_test2['quarter'])
  4. .dt.to_period('Q')],
  5. right_on=['ident', pd.to_datetime(df_test['quarter'])
  6. .dt.to_period('Q').add(1)])

Output:

  1. quarter_1 key_1 ident file quarter_2
  2. 0 2023q1 2023Q1 1 file_1 2022q4
  3. 1 2022q4 2022Q4 2 file_2_new 2022q3
  4. 2 2023q2 2023Q2 3 file_3 2023q1

huangapple
  • 本文由 发表于 2023年6月27日 16:55:39
  • 转载请务必保留本文链接:https://go.coder-hub.com/76563207.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定