Map two dataframes, based on their group/id, with closer values

Question

I have two dataframes:

    # Load the required libraries
    import pandas as pd
    import matplotlib.pyplot as plt

    # Create dataset 1
    data_set_1 = {
        'id': [1, 2, 3, 4, 5],
        'Available_Salary': [10, 20, 30, 40, 50],
    }

    # Convert to dataframe 1
    df_1 = pd.DataFrame(data_set_1)
    print("\n df_1 = \n", df_1)

    # Create dataset 2
    data_set_2 = {
        'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'Expected_Salary': [9, 49, 18, 19, 29, 41, 4, 57, 42, 3],
    }

    # Convert to dataframe 2
    df_2 = pd.DataFrame(data_set_2)
    print("\n df_2 = \n", df_2)

Here, visually, I can say that the 'Expected_Salary' values 9 (id=1), 4 (id=7) and 3 (id=10) are closest to 'Available_Salary' 10 (id=1).

Likewise, the 'Expected_Salary' values 49 (id=2) and 57 (id=8) are closest to 'Available_Salary' 50 (id=5), and so on.

The image below illustrates this:

[Image: Map two dataframes, based on their group/id, with closer values]

Now I need to generate two new columns in df_2, 'Salary_from_df_1' and 'id_from_df_1', that map each row of df_2 to the row of df_1 with the closest salary.

For example, since the 'Expected_Salary' values 9 (id=1), 4 (id=7) and 3 (id=10) are closest to 'Available_Salary' 10 (id=1), those rows will have 'Salary_from_df_1' equal to 10 and 'id_from_df_1' equal to 1. This looks like:

[Image: Map two dataframes, based on their group/id, with closer values]

The same logic applies when mapping the other ids of df_2 to df_1.

Can somebody please let me know how to achieve this in Python?

Answer 1

Score: 1

Calculate the absolute difference between each salary in df_2 and every available salary in df_1, use argmin to get the index of the smallest absolute difference in each row, then use that index to pull the matching id values from df_1 into df_2:

    import numpy as np

    # Pairwise absolute differences, shape (len(df_2), len(df_1))
    i = np.abs(np.subtract.outer(df_2['Expected_Salary'].values, df_1['Available_Salary'].values)).argmin(axis=1)
    df_2['id_df1'] = df_1['id'].values[i]

Result:

       id  Expected_Salary  id_df1
    0   1                9       1
    1   2               49       5
    2   3               18       2
    3   4               19       2
    4   5               29       3
    5   6               41       4
    6   7                4       1
    7   8               57       5
    8   9               42       4
    9  10                3       1
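
For completeness, here is a minimal self-contained sketch of the same pairwise-difference idea that fills both columns the question asks for ('Salary_from_df_1' and 'id_from_df_1'). It rebuilds the two dataframes from the question and uses NumPy broadcasting in place of np.subtract.outer, which should be equivalent here. The variable names diff and nearest are only illustrative. Note that the difference matrix has len(df_2) x len(df_1) entries, so this is best suited to small or medium inputs.

```python
import numpy as np
import pandas as pd

# Dataframes from the question.
df_1 = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                     'Available_Salary': [10, 20, 30, 40, 50]})
df_2 = pd.DataFrame({'id': list(range(1, 11)),
                     'Expected_Salary': [9, 49, 18, 19, 29, 41, 4, 57, 42, 3]})

# Pairwise |expected - available| matrix, shape (len(df_2), len(df_1)).
diff = np.abs(df_2['Expected_Salary'].to_numpy()[:, None]
              - df_1['Available_Salary'].to_numpy()[None, :])

# Position in df_1 of the smallest difference for each row of df_2.
# On an exact tie, argmin keeps the first (leftmost) candidate.
nearest = diff.argmin(axis=1)

# Pull both the salary and the id of the nearest df_1 row into df_2.
df_2['Salary_from_df_1'] = df_1['Available_Salary'].to_numpy()[nearest]
df_2['id_from_df_1'] = df_1['id'].to_numpy()[nearest]

print(df_2)
```

Because the lookup goes through positional arrays rather than index labels, it should not matter whether df_1 has a default index.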

Answer 2

Score: 1

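You can use pandas merge_asof with direction='nearest'. Both frames must be sorted by their join key first (the names below follow the question's df_1 and df_2):

    # Match each expected salary to the nearest available salary
    pd.merge_asof(df_2.sort_values('Expected_Salary'), df_1.sort_values('Available_Salary'), left_on='Expected_Salary', right_on='Available_Salary', direction='nearest').sort_values('id_x')

Result:

    id_x  Expected_Salary  id_y  Available_Salary
       1                9     1                10
       2               49     5                50
       3               18     2                20
       4               19     2                20
       5               29     3                30
       6               41     4                40
       7                4     1                10
       8               57     5                50
       9               42     4                40
      10                3     1                10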
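
As a hedged, self-contained sketch of the merge_asof approach, the version below renames the df_1 columns up front so the result carries the column names requested in the question instead of the id_x / id_y suffixes. The dataframes are rebuilt from the question, and left, right and result are illustrative names only.

```python
import pandas as pd

# Dataframes from the question.
df_1 = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                     'Available_Salary': [10, 20, 30, 40, 50]})
df_2 = pd.DataFrame({'id': list(range(1, 11)),
                     'Expected_Salary': [9, 49, 18, 19, 29, 41, 4, 57, 42, 3]})

# Rename df_1's columns to the requested output names, then sort by the key.
right = (df_1.rename(columns={'id': 'id_from_df_1',
                              'Available_Salary': 'Salary_from_df_1'})
             .sort_values('Salary_from_df_1'))

# merge_asof requires the left key to be sorted as well.
left = df_2.sort_values('Expected_Salary')

# direction='nearest' picks the closest key on either side of each left value.
result = (pd.merge_asof(left, right,
                        left_on='Expected_Salary',
                        right_on='Salary_from_df_1',
                        direction='nearest')
            .sort_values('id')           # restore df_2's original row order
            .reset_index(drop=True))

result = result[['id', 'Expected_Salary', 'Salary_from_df_1', 'id_from_df_1']]
print(result)
```

Unlike the pairwise-difference matrix, merge_asof does not build a len(df_2) x len(df_1) array, so it is likely the better fit when both frames are large.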

huangapple
  • Posted on June 18, 2023, 20:39:35
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76500590.html