Map two dataframes, based on their group/id, with closer values

Question

I have two dataframes like this:

# Load the required libraries
import pandas as pd
import matplotlib.pyplot as plt

# Create dataset_1
data_set_1 = {'id': [1, 2, 3, 4, 5],
              'Available_Salary': [10, 20, 30, 40, 50]}

# Convert to dataframe_1
df_1 = pd.DataFrame(data_set_1)
print("\n df_1 = \n", df_1)

# Create dataset_2
data_set_2 = {'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
              'Expected_Salary': [9, 49, 18, 19, 29, 41, 4, 57, 42, 3]}

# Convert to dataframe_2
df_2 = pd.DataFrame(data_set_2)
print("\n df_2 = \n", df_2)

Here, visually I can say that 'Expected_Salary' 9 (id=1), 'Expected_Salary' 4 (id=7) and 'Expected_Salary' 3 (id=10) are closest to 'Available_Salary' 10 (id=1).

Likewise, 'Expected_Salary' 49 (id=2) and 'Expected_Salary' 57 (id=8) are closest to 'Available_Salary' 50 (id=5), and so on.

The image below shows this more clearly:

(image not available)

Now I need to generate two new columns in df_2, 'Salary_from_df_1' and 'id_from_df_1', that hold the salary and the id of the df_1 row with the closest salary.

For example, since 'Expected_Salary' 9 (id=1), 'Expected_Salary' 4 (id=7) and 'Expected_Salary' 3 (id=10) are closest to 'Available_Salary' 10 (id=1), those rows will have 'Salary_from_df_1' equal to 10 and 'id_from_df_1' equal to 1. It looks like this:

(image not available)

The same logic applies to the other ids of df_2 when mapping them to df_1.

Can somebody please let me know how to achieve this in Python?

Answer 1

Score: 1

Calculate the absolute difference between each salary in df_2 and every available salary in df_1, use argmin to get the position of the smallest absolute difference for each row, then use those positions to pull the matching id values from df_1 into df_2:

import numpy as np

i = np.abs(np.subtract.outer(df_2['Expected_Salary'].values, df_1['Available_Salary'].values)).argmin(axis=1)
df_2['id_df1'] = df_1['id'].values[i]

   id  Expected_Salary  id_df1
0   1                9       1
1   2               49       5
2   3               18       2
3   4               19       2
4   5               29       3
5   6               41       4
6   7                4       1
7   8               57       5
8   9               42       4
9  10                3       1
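
The answer above only fills in the id column. As a minimal sketch, assuming the data from the question and using plain broadcasting in place of np.subtract.outer, the same idea can produce both of the requested columns. The intermediate names (expected, available, diff, nearest) are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Data copied from the question
df_1 = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                     'Available_Salary': [10, 20, 30, 40, 50]})
df_2 = pd.DataFrame({'id': list(range(1, 11)),
                     'Expected_Salary': [9, 49, 18, 19, 29, 41, 4, 57, 42, 3]})

expected = df_2['Expected_Salary'].to_numpy()
available = df_1['Available_Salary'].to_numpy()

# diff[r, c] = |expected[r] - available[c]|, shape (len(df_2), len(df_1))
diff = np.abs(expected[:, None] - available[None, :])

# Position in df_1 of the closest available salary for each df_2 row.
# argmin returns the first minimum, so an exact tie goes to the earlier df_1 row.
nearest = diff.argmin(axis=1)

df_2['Salary_from_df_1'] = available[nearest]
df_2['id_from_df_1'] = df_1['id'].to_numpy()[nearest]
print(df_2)
```

Note that the difference matrix has one cell per pair of rows, so memory grows with len(df_2) * len(df_1); for large frames a sorted approach is likely the better fit.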

Answer 2

Score: 1

You can use pandas merge_asof with direction='nearest'. Both frames must be sorted on their key column first:

pd.merge_asof(df_2.sort_values('Expected_Salary'), df_1.sort_values('Available_Salary'), left_on='Expected_Salary', right_on='Available_Salary', direction='nearest').sort_values('id_x')

 id_x  Expected_Salary  id_y  Available_Salary
    1                9     1                10
    2               49     5                50
    3               18     2                20
    4               19     2                20
    5               29     3                30
    6               41     4                40
    7                4     1                10
    8               57     5                50
    9               42     4                40
   10                3     1                10
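
To get the column names the question asks for, one option is to rename the df_1 columns before merging, which also avoids the id_x / id_y suffixes. This is a minimal sketch assuming the data from the question; the names left, right and result are illustrative.

```python
import pandas as pd

# Data copied from the question
df_1 = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                     'Available_Salary': [10, 20, 30, 40, 50]})
df_2 = pd.DataFrame({'id': list(range(1, 11)),
                     'Expected_Salary': [9, 49, 18, 19, 29, 41, 4, 57, 42, 3]})

# merge_asof requires both frames to be sorted on their key columns
left = df_2.sort_values('Expected_Salary')
right = (df_1.rename(columns={'id': 'id_from_df_1',
                              'Available_Salary': 'Salary_from_df_1'})
             .sort_values('Salary_from_df_1'))

result = (pd.merge_asof(left, right,
                        left_on='Expected_Salary',
                        right_on='Salary_from_df_1',
                        direction='nearest')
            .sort_values('id')           # restore the original df_2 order
            .reset_index(drop=True))
print(result)
```

Because the keys are sorted, this avoids building a full pairwise difference matrix, which may matter when both frames are large.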

huangapple
  • Posted on June 18, 2023, 20:39:35
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76500590.html