Pandas Vlookup True

huangapple go评论71阅读模式
英文:

Pandas Vlookup True

问题

我试图将在Excel中进行的模型转移到Python,以便未来使用。

我的Python经验大约是在工作需要时断断续续的6个月,我相信我已经把问题弄得比必要复杂,因为尝试了不同的途径。

我基本上正在尝试复制以下内容:

=vlookup('Sheet1'!B2, 'Sheet2'!A1:D100, if('Sheet1'!C2='A',2, if('Sheet1'!C2='B',3,4), TRUE))

所以Sheet1是客户和距离的列表。

Python模拟数据:

df1 = pd.DataFrame({'Client': ['ABC', 'XYZ', 'KLM'], 'Distance': [0.137, 0.103, 0.205], 'Type':['A','B','C']})

  Client  Distance Type
0    ABC     0.137    A
1    XYZ     0.103    B
2    KLM     0.205    C
df2 = pd.DataFrame({'Distance': [0.05, 0.1, 0.15, 0.20, 0.25],'A': [1,2,3,4,5], 'B':[0.5,1,1.5,2,2.5], 'C': [2,2.2,2.4,2.6,2.8]})

   Distance  A    B    C
0      0.05  1  0.5  2.0
1      0.10  2  1.0  2.2
2      0.15  3  1.5  2.4
3      0.20  4  2.0  2.6
4      0.25  5  2.5  2.8

预期输出:

df1 = pd.DataFrame({'Client': ['ABC', 'XYZ', 'KLM'], 'Distance': [0.137, 0.154, 0.205], 'Type':['A','B','C'], 'Df2val':[3,1, 2.6 ]})

  Client  Distance Type  Df2val
0    ABC     0.137    A     3.0
1    XYZ     0.154    B     1.0
2    KLM     0.205    C     2.6

原始数据大约有25,000行,我已经将其减少到约500行,通过删除距离参数之外的行来减少计算。

我有一个参考点列表,因此df['Distance']将在运行网格叠加时重新计算441次。

但是,我希望将这个vlookup和后续的计算嵌套在一个循环/lambda下,因为它在这些参考点上运行。

我尝试使用np.argmin(),但是一直出现形状错误(一个维度[Df2['val']列和两个维度[df2[['Distance', 'A']]。

我还尝试使用np.select,将'Type'中的唯一值列表作为条件,然后将参数设置为.loc,但是它一直出错,因为.loc似乎无法正确过滤系列。

我目前的思路是使用.loc找到最近距离的索引,然后使用另一个.loc[索引号,np.select用于列)。

英文:

i'm trying to convert a model undertaken in excel over to python for future proofing.

my python experience is about 6months on and off when required for work purposes and believe i've made the issue more complex then it needs to be, due to trying different avenues.

i'm essentially trying to replicate the below:

=vlookup('Sheet1'!B2, 'Sheet2'!A1:D100, if('Sheet1'!C2='A',2, if('Sheet1'!C2='B',3,4), TRUE))

so sheet 1 is a list of clients and distances.

python mock of data:

df1 = pd.DataFrame({'Client': ['ABC', 'XYZ', 'KLM'],'Distance': [0.137, 0.103, 0.205], 'Type':['A','B','C']})

  Client  Distance Type
0    ABC     0.137    A
1    XYZ     0.103    B
2    KLM     0.205    C
df2 = pd.DataFrame({'Distance': [0.05, 0.1, 0.15, 0.20, 0.25],'A': [1,2,3,4,5], 'B':[0.5,1,1.5,2,2.5], 'C': [2,2.2,2.4,2.6,2.8]})

   Distance  A    B    C
0      0.05  1  0.5  2.0
1      0.10  2  1.0  2.2
2      0.15  3  1.5  2.4
3      0.20  4  2.0  2.6
4      0.25  5  2.5  2.8

Expected output:

df1 = pd.DataFrame({'Client': ['ABC', 'XYZ', 'KLM'],'Distance': [0.137, 0.154, 0.205], 'Type':['A','B','C'], 'Df2val':[3,1, 2.6 ]})

  Client  Distance Type  Df2val
0    ABC     0.137    A     3.0
1    XYZ     0.154    B     1.0
2    KLM     0.205    C     2.6

the originally data is ~25k rows, i've reduced this to ~500 based dropping rows that are outside the distance parameters to reduce calculations.

i do have a list of reference points so df['Distance'] will be recalculated 441 times as it runs through the grid overlay.

but am hoping to nest this vlookup and subsequent calculation under a loop/lambda as it runs through these reference points.

i have tried using np.argmin() however kept getting a shape error (one dimension [Df2['val'] column and two dimension [df2[['Distance', 'A']]

i have also looked at np.select using a list of unique values in 'Type' as the conditions and then the arguments to be .loc but that kept erroring as the .loc didnt seem to filter the series correctly.

my current thought process is to use .loc to find the index of the nearest distance and then use another .loc[index number, np.select for the column)

答案1

得分: 1

你需要结合 melt 来将 df2 重塑成长格式,以及 merge_asof 来根据 Type 上的最近值进行合并:

out = pd.merge_asof(df1.reset_index().sort_values(by='Distance'),
                    df2.melt('Distance', var_name='Type', value_name='Df2val')
                       .sort_values(by='Distance'),
                    on='Distance', by='Type', direction='nearest'
                    ).set_index('index').reindex(df1.index)

输出结果:

  Client  Distance Type  Df2val
0    ABC     0.137    A     3.0
1    XYZ     0.103    B     1.0
2    KLM     0.205    C     2.6
英文:

You need a combination of melt to reshape df2 to a long format, and merge_asof to merge on the nearest value by Type:

out = pd.merge_asof(df1.reset_index().sort_values(by='Distance'),
                    df2.melt('Distance', var_name='Type', value_name='Df2val')
                       .sort_values(by='Distance'),
                    on='Distance', by='Type', direction='nearest'
                    ).set_index('index').reindex(df1.index)

Output:

  Client  Distance Type  Df2val
0    ABC     0.137    A     3.0
1    XYZ     0.103    B     1.0
2    KLM     0.205    C     2.6

huangapple
  • 本文由 发表于 2023年7月13日 20:22:24
  • 转载请务必保留本文链接:https://go.coder-hub.com/76679332.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定