2023-06-26 19:26:08

Vectorization of column search by dynamic value in pandas

Question

I am just starting to learn pandas, and for several days I have not been able to find the fastest way to do this calculation. How can I get, for each row, the value of the column whose name is built from that row's values in columns 'a' and 'b'?

Below is an example of the initial data.

index	a	b	a1_b1_name	a1_b1_foo_bar	a2_b1_name	a2_b1_foo_bar	a1_b2_name	a1_b2_foo_bar	a2_b2_name	a2_b2_foo_bar	a1_b3_name	a1_b3_foo_bar	a2_b3_name	a2_b3_foo_bar
0	1	2	value1	value2	value3	value4	value5	value6	value7	value8	value9	value10	value11	value12
1	2	1	value13	value14	value15	value16	value17	value18	value19	value20	value21	value22	value23	value24
2	2	2	value25	value26	value27	value28	value29	value30	value31	value32	value33	value34	value35	value36
3	1	1	value37	value38	value39	value40	value41	value42	value43	value44	value45	value46	value47	value48
4	2	3	value49	value50	value51	value52	value53	value54	value55	value56	value57	value58	value59	value60

The number of columns named like "a1_b1_name" is expected to be much larger, about 40. The number of rows will be in the tens of thousands.

I need to create a new column 'name' based on the data of the table as quickly as possible and preferably without loops, using the power of pandas vectorization.

Like this one:

index	name	foo_bar
0	value5	value6
1	value15	value16
2	value31	value32
3	value37	value38
4	value59	value60

I was only able to do this by looping through the columns. But it takes more time than I'd like:

# assumes df['name'] already exists (e.g. df['name'] = None) before the loop
for col in df.columns:
    df['name'] = np.where(col == 'a' + (df['a'].astype('Int16').astype(str)) + '_b' + (df['b'].astype('Int16').astype(str)) + '_name', df[col].values, df['name'])

Answer 1

Score: 1

original question

Cf. first version of the question

This is a variant of an indexing lookup. First, pre-process the input columns a/b so that they match the column names:

import numpy as np
import pandas as pd

target = 'a' + df['a'].astype(str) + '_b' + df['b'].astype(str) + '_name'

# idx: position of each row's label among the unique labels in cols
idx, cols = pd.factorize(target)

out = pd.DataFrame({'index': df['index'],
                    'values': df.reindex(cols, axis=1).to_numpy()
                              [np.arange(len(df)), idx],
                    })

# or, for a new column in the original DataFrame
# df['new'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]

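Output (apparently computed on the data from the first version of the question, so the values differ from the table above):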

   index   values
0      0   value3
1      1   value8
2      2  value16
3      3  value19
4      4  value30

Intermediate target:

0    a1_b2_name
1    a2_b1_name
2    a2_b2_name
3    a1_b1_name
4    a2_b3_name
dtype: object
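
To check the lookup against the table currently shown in the question, here is a minimal, self-contained sketch. The way the sample frame is rebuilt (the `prefixes` list and the reshaped `value1`..`value60` array) is scaffolding added for illustration, not part of the answer, and the `index` column is left out for brevity.

```python
import numpy as np
import pandas as pd

# Rebuild the question's sample table: value1..value60, 12 value columns per row.
# (Illustrative scaffolding; the column order follows the table in the question.)
prefixes = ["a1_b1", "a2_b1", "a1_b2", "a2_b2", "a1_b3", "a2_b3"]
value_cols = [f"{p}_{s}" for p in prefixes for s in ("name", "foo_bar")]
values = np.array([f"value{i}" for i in range(1, 61)]).reshape(5, 12)

df = pd.DataFrame(values, columns=value_cols)
df.insert(0, "b", [2, 1, 2, 1, 3])
df.insert(0, "a", [1, 2, 2, 1, 2])

# One column label per row, built from that row's a/b values.
target = "a" + df["a"].astype(str) + "_b" + df["b"].astype(str) + "_name"

# idx[i] is the position of row i's label within the unique labels `cols`.
idx, cols = pd.factorize(target)

# Keep only the needed columns, then pick one cell per row by position.
df["name"] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]

print(df["name"].tolist())
# ['value5', 'value15', 'value31', 'value37', 'value59']
```

The result matches the `name` column the question asks for. Because `reindex` keeps only the columns that some row actually needs, the intermediate array stays small even when the frame has many more `a*_b*_name` columns.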

multiple columns:

One option is to reshape and merge:

target = 'a' + df['a'].astype(str) + '_b' + df['b'].astype(str)

tmp = df.drop(columns=['index', 'a', 'b'])
# split 'a1_b2_name' into ('a1_b2', 'name'); this assumes the suffix itself
# contains no underscore (e.g. 'foo', as in the output below)
tmp.columns = tmp.columns.str.rsplit('_', n=1, expand=True)

out = (df
   .reset_index()
   .merge(tmp.stack(level=0), left_on=['index', target], right_index=True)
   .set_index('index')[['name', 'foo']]
)

Output:

          name      foo
index                  
0       value5   value6
1      value15  value16
2      value31  value32
3      value37  value38
4      value59  value60
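
The `rsplit('_', n=1)` step above works when the suffix contains no underscore. For suffixes such as `foo_bar` (as in the question's table), one possible alternative is to repeat the positional lookup once per suffix. This is not the reshape-and-merge shown above, only a sketch under assumptions: the `suffixes` tuple and the sample-frame scaffolding are illustrative names, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Rebuild the question's sample table (illustrative scaffolding).
prefixes = ["a1_b1", "a2_b1", "a1_b2", "a2_b2", "a1_b3", "a2_b3"]
suffixes = ("name", "foo_bar")  # assumed set of suffixes to extract
value_cols = [f"{p}_{s}" for p in prefixes for s in suffixes]
values = np.array([f"value{i}" for i in range(1, 61)]).reshape(5, 12)

df = pd.DataFrame(values, columns=value_cols)
df.insert(0, "b", [2, 1, 2, 1, 3])
df.insert(0, "a", [1, 2, 2, 1, 2])

# Per-row column prefix, e.g. 'a1_b2', without any suffix.
prefix = "a" + df["a"].astype(str) + "_b" + df["b"].astype(str)
idx, uniq = pd.factorize(prefix)
rows = np.arange(len(df))

# One vectorized lookup per suffix; the loop runs over suffixes, not rows.
out = pd.DataFrame(
    {
        s: df.reindex([f"{p}_{s}" for p in uniq], axis=1).to_numpy()[rows, idx]
        for s in suffixes
    },
    index=df.index,
)
print(out)
```

The Python-level loop here runs only over the handful of suffixes, so the per-row work stays inside NumPy indexing. If a built label does not exist in `df`, `reindex` fills that column with NaN instead of raising, so missing combinations show up as NaN in `out`.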
