July 13, 2023, 20:22:24

Pandas Vlookup True

Question

I'm trying to convert a model built in Excel to Python for future-proofing.

My Python experience is about 6 months, on and off, as required for work, and I believe I've made the issue more complex than it needs to be by trying different avenues.

I'm essentially trying to replicate the following:

=VLOOKUP('Sheet1'!B2, 'Sheet2'!A1:D100, IF('Sheet1'!C2="A", 2, IF('Sheet1'!C2="B", 3, 4)), TRUE)

Sheet1 is a list of clients and distances.

Python mock of the data:

import pandas as pd

df1 = pd.DataFrame({'Client': ['ABC', 'XYZ', 'KLM'], 'Distance': [0.137, 0.103, 0.205], 'Type': ['A', 'B', 'C']})
  Client  Distance Type
0    ABC     0.137    A
1    XYZ     0.103    B
2    KLM     0.205    C

df2 = pd.DataFrame({'Distance': [0.05, 0.1, 0.15, 0.20, 0.25], 'A': [1, 2, 3, 4, 5], 'B': [0.5, 1, 1.5, 2, 2.5], 'C': [2, 2.2, 2.4, 2.6, 2.8]})
   Distance  A    B    C
0      0.05  1  0.5  2.0
1      0.10  2  1.0  2.2
2      0.15  3  1.5  2.4
3      0.20  4  2.0  2.6
4      0.25  5  2.5  2.8

Expected output:

df1 = pd.DataFrame({'Client': ['ABC', 'XYZ', 'KLM'], 'Distance': [0.137, 0.103, 0.205], 'Type': ['A', 'B', 'C'], 'Df2val': [3, 1, 2.6]})
  Client  Distance Type  Df2val
0    ABC     0.137    A     3.0
1    XYZ     0.103    B     1.0
2    KLM     0.205    C     2.6

The original data is ~25k rows; I've reduced this to ~500 by dropping rows that fall outside the distance parameters, to reduce calculations.

I have a list of reference points, so df['Distance'] will be recalculated 441 times as it runs through the grid overlay.

I'm hoping to nest this VLOOKUP and the subsequent calculation in a loop/lambda as it runs through these reference points.

I have tried np.argmin(), but kept getting a shape error (one dimension for the Df2['val'] column versus two dimensions for df2[['Distance', 'A']]).

I have also looked at np.select, using the unique values in 'Type' as the conditions and .loc lookups as the choices, but that kept erroring because the .loc didn't seem to filter the series correctly.

My current thinking is to use .loc to find the index of the nearest distance, and then a second .loc[index number, column chosen via np.select].
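The two-step idea described above (nearest row first, then the column picked by Type) can be sketched with NumPy broadcasting, which also sidesteps the 1-D versus 2-D shape mismatch. This is only an illustration on the mock data: the helper name nearest_lookup is hypothetical, and it assumes every value in 'Type' is a column of df2.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Client': ['ABC', 'XYZ', 'KLM'],
                    'Distance': [0.137, 0.103, 0.205],
                    'Type': ['A', 'B', 'C']})
df2 = pd.DataFrame({'Distance': [0.05, 0.1, 0.15, 0.20, 0.25],
                    'A': [1, 2, 3, 4, 5],
                    'B': [0.5, 1, 1.5, 2, 2.5],
                    'C': [2, 2.2, 2.4, 2.6, 2.8]})

def nearest_lookup(df1, df2):
    """Hypothetical helper: value from the nearest df2 row, in the column named by Type."""
    # Pairwise absolute differences, shape (len(df1), len(df2)).
    diff = np.abs(df1['Distance'].to_numpy()[:, None]
                  - df2['Distance'].to_numpy()[None, :])
    # Position of the nearest df2 row for each df1 row.
    row_pos = diff.argmin(axis=1)
    # Position of each row's Type among the value columns.
    # Assumption: every Type exists in df2; an unknown Type would yield -1.
    value_cols = df2.columns.drop('Distance')
    col_pos = value_cols.get_indexer(df1['Type'])
    # Pick one (row, column) pair per df1 row.
    return df2[value_cols].to_numpy()[row_pos, col_pos]

df1['Df2val'] = nearest_lookup(df1, df2)
print(df1)
```

The difference matrix grows as len(df1) × len(df2), which is small for ~500 rows but worth keeping in mind for the full ~25k-row data.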

Answer 1

Score: 1

You need a combination of melt, to reshape df2 to a long format, and merge_asof, to merge on the nearest Distance within each Type:

# merge_asof needs both frames sorted by the key; reset_index/reindex
# restores df1's original row order afterwards.
out = pd.merge_asof(df1.reset_index().sort_values(by='Distance'),
                    df2.melt('Distance', var_name='Type', value_name='Df2val')
                       .sort_values(by='Distance'),
                    on='Distance', by='Type', direction='nearest'
                    ).set_index('index').reindex(df1.index)

Output:

  Client  Distance Type  Df2val
0    ABC     0.137    A     3.0
1    XYZ     0.103    B     1.0
2    KLM     0.205    C     2.6
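
Since the question mentions recalculating the distances 441 times, one possible arrangement is to melt df2 once and reuse the long table for each lookup. The sketch below is an assumption about how that could be organised, not part of the original answer: the names build_long_table and lookup_df2val are hypothetical, and df1 is assumed to have an unnamed index so that reset_index() produces a column called 'index'.

```python
import pandas as pd

df1 = pd.DataFrame({'Client': ['ABC', 'XYZ', 'KLM'],
                    'Distance': [0.137, 0.103, 0.205],
                    'Type': ['A', 'B', 'C']})
df2 = pd.DataFrame({'Distance': [0.05, 0.1, 0.15, 0.20, 0.25],
                    'A': [1, 2, 3, 4, 5],
                    'B': [0.5, 1, 1.5, 2, 2.5],
                    'C': [2, 2.2, 2.4, 2.6, 2.8]})

def build_long_table(df2):
    """Melt the wide lookup table once so it can be reused for every lookup."""
    return (df2.melt('Distance', var_name='Type', value_name='Df2val')
               .sort_values('Distance'))

def lookup_df2val(df1, long_table, direction='nearest'):
    """Return Df2val for each row of df1, aligned to df1's original index."""
    # Assumption: df1 has an unnamed index, so reset_index() adds 'index'.
    left = df1.reset_index().sort_values('Distance')
    merged = pd.merge_asof(left, long_table,
                           on='Distance', by='Type', direction=direction)
    return merged.set_index('index')['Df2val'].reindex(df1.index)

long_table = build_long_table(df2)

# Nearest match, as in the answer above.
nearest = lookup_df2val(df1, long_table)

# 'backward' picks the largest table Distance <= the lookup value.
backward = lookup_df2val(df1, long_table, direction='backward')

print(nearest.tolist())
print(backward.tolist())
```

On the mock data the two directions differ only for client ABC (0.137, Type A): 'nearest' matches 0.15 and 'backward' matches 0.10. The 'backward' direction is the one that mirrors VLOOKUP's approximate match (TRUE), whereas the expected output in the question corresponds to 'nearest'. With 'backward', a distance below the smallest table value has no match and comes back as NaN.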
