Pandas Vlookup True
Question
I'm trying to convert a model built in Excel over to Python to future-proof it.
My Python experience is about 6 months, on and off as required for work, and I believe I've made the issue more complex than it needs to be by trying different avenues.
I'm essentially trying to replicate the below:
=VLOOKUP('Sheet1'!B2, 'Sheet2'!A1:D100, IF('Sheet1'!C2="A", 2, IF('Sheet1'!C2="B", 3, 4)), TRUE)
So Sheet1 is a list of clients and distances.
Python mock of the data:
df1 = pd.DataFrame({'Client': ['ABC', 'XYZ', 'KLM'],'Distance': [0.137, 0.103, 0.205], 'Type':['A','B','C']})
Client Distance Type
0 ABC 0.137 A
1 XYZ 0.103 B
2 KLM 0.205 C
df2 = pd.DataFrame({'Distance': [0.05, 0.1, 0.15, 0.20, 0.25],'A': [1,2,3,4,5], 'B':[0.5,1,1.5,2,2.5], 'C': [2,2.2,2.4,2.6,2.8]})
Distance A B C
0 0.05 1 0.5 2.0
1 0.10 2 1.0 2.2
2 0.15 3 1.5 2.4
3 0.20 4 2.0 2.6
4 0.25 5 2.5 2.8
Expected output:
df1 = pd.DataFrame({'Client': ['ABC', 'XYZ', 'KLM'], 'Distance': [0.137, 0.154, 0.205], 'Type': ['A', 'B', 'C'], 'Df2val': [3, 1, 2.6]})
Client Distance Type Df2val
0 ABC 0.137 A 3.0
1 XYZ 0.154 B 1.0
2 KLM 0.205 C 2.6
The original data is ~25k rows; I've reduced this to ~500 by dropping rows that are outside the distance parameters, to cut down on calculations.
I do have a list of reference points, so df['Distance'] will be recalculated 441 times as it runs through the grid overlay.
I'm hoping to nest this vlookup and the subsequent calculation inside a loop/lambda as it runs through these reference points.
I have tried using np.argmin(), but kept getting a shape error (one dimension for the Df2['val'] column versus two dimensions for df2[['Distance', 'A']]).
I have also looked at np.select, using the list of unique values in 'Type' as the conditions and .loc lookups as the choices, but that kept erroring because .loc didn't seem to filter the Series correctly.
My current thought process is to use .loc to find the index of the nearest distance, and then use another .loc[index number, np.select for the column].
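A minimal sketch of that idea, using the mock data above: find the nearest-distance index with idxmin, then pick the column named by Type (the helper name nearest_lookup is illustrative, not code from the actual model):

```python
import pandas as pd

df1 = pd.DataFrame({'Client': ['ABC', 'XYZ', 'KLM'],
                    'Distance': [0.137, 0.103, 0.205],
                    'Type': ['A', 'B', 'C']})
df2 = pd.DataFrame({'Distance': [0.05, 0.1, 0.15, 0.20, 0.25],
                    'A': [1, 2, 3, 4, 5],
                    'B': [0.5, 1, 1.5, 2, 2.5],
                    'C': [2, 2.2, 2.4, 2.6, 2.8]})

def nearest_lookup(row):
    # index of the df2 row whose Distance is closest to this client's
    idx = (df2['Distance'] - row['Distance']).abs().idxmin()
    # pick the column named by this client's Type ('A', 'B' or 'C')
    return df2.loc[idx, row['Type']]

df1['Df2val'] = df1.apply(nearest_lookup, axis=1)
print(df1)
```

This works row by row, but a row-wise apply over ~500 rows repeated 441 times may be slow compared to a vectorised merge.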
Answer 1
Score: 1
You need a combination of melt to reshape df2 to a long format, and merge_asof to merge on the nearest value by Type:
out = pd.merge_asof(df1.reset_index().sort_values(by='Distance'),
df2.melt('Distance', var_name='Type', value_name='Df2val')
.sort_values(by='Distance'),
on='Distance', by='Type', direction='nearest'
).set_index('index').reindex(df1.index)
Output:
Client Distance Type Df2val
0 ABC 0.137 A 3.0
1 XYZ 0.103 B 1.0
2 KLM 0.205 C 2.6
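For reference, the long format produced by the melt step looks like this (a small sketch using the mock df2 above; the variable name long_df2 is illustrative):

```python
import pandas as pd

df2 = pd.DataFrame({'Distance': [0.05, 0.1, 0.15, 0.20, 0.25],
                    'A': [1, 2, 3, 4, 5],
                    'B': [0.5, 1, 1.5, 2, 2.5],
                    'C': [2, 2.2, 2.4, 2.6, 2.8]})

# One (Distance, Type, Df2val) row per cell of the original wide table:
# 5 distances x 3 type columns = 15 rows, ready for merge_asof's by='Type'.
long_df2 = df2.melt('Distance', var_name='Type', value_name='Df2val')
print(long_df2)
```

merge_asof then matches each df1 row against only the long rows sharing its Type, taking the nearest Distance (direction='nearest'); note that both sides must be sorted on the merge key, hence the sort_values calls.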