How to efficiently broadcast the same index across an entire dataframe

Question

Say I have these parts

  • out_df: a pandas DataFrame with a plain RangeIndex as its index and some columns

Could look like this

>> out_df
    Col1  Col2
0    2.3  9.7
1    103  6.5
2    54   3.5

  • queries_tuples: a list of tuples with the same length as out_df

Could look like this

>> queries_tuples
[('something_A', 'Something_B', 'Something_C'), 
('something_A', 'Something_B', 'Something_C'),
('something_A', 'Something_B', 'Something_C')]

  • names: a list of names for the new index levels

Could look like

['index_name_1', 'index_name_2', 'index_name_3']

Now I want to combine these three to end up with a dataframe that has a MultiIndex.
It could look like this:

                                              Col1  Col2
index_name_1   index_name_2   index_name_3
'something_A'  'Something_B'  'Something_C'  0   2.3   9.7
                                             1   103   6.5
                                             2    54   3.5

Now, this is my current solution:

import pandas as pd

original_names = out_df.index.names  # just the unnamed RangeIndex level here
# Append the per-row tuples as three new index levels behind the RangeIndex ...
mi = pd.MultiIndex.from_tuples(queries_tuples, names=names)
out_df.set_index(mi, append=True, inplace=True)
# ... then move them in front of the original level.
names_2 = names + original_names
out_df = out_df.reorder_levels(names_2)

However, I feel like I am really missing out on some performance, since all my entries in queries_tuples are identical. In my concrete case the dataframe is quite large (a few million entries), so my current solution is quite slow. Is there some way of exploiting the fact that all entries in queries_tuples are identical, so that they can be "broadcast" across the entire dataframe?
Or is there an even better way of doing this?
Please help, thanks!

Answer 1

Score: 1

Use pd.concat:

>>> out_df
    Col1  Col2
0    2.3   9.7
1  103.0   6.5
2   54.0   3.5

>>> queries_tuples
[('something_A', 'Something_B', 'Something_C'),
 ('something_A', 'Something_B', 'Something_C'),
 ('something_A', 'Something_B', 'Something_C')]

>>> names
['index_name_1', 'index_name_2', 'index_name_3']

Output:

>>> pd.concat([out_df], keys=[queries_tuples[0]], names=names)

                                           Col1  Col2
index_name_1 index_name_2 index_name_3               
something_A  Something_B  Something_C  0    2.3   9.7
                                       1  103.0   6.5
                                       2   54.0   3.5

Performance for 1,000,000 records:

%timeit pd.concat([out_df], keys=[queries_tuples[0]], names=names)
16.8 ms ± 605 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
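
For reference, here is a minimal, self-contained sketch of how the timing above could be reproduced; the dataframe size and contents are assumptions for illustration, not the answerer's exact setup:

import numpy as np
import pandas as pd

# Assumed setup: a large dataframe with a plain RangeIndex and arbitrary values.
N = 1_000_000
out_df = pd.DataFrame(np.random.random((N, 2)), columns=['Col1', 'Col2'])

names = ['index_name_1', 'index_name_2', 'index_name_3']
queries_tuples = [('something_A', 'Something_B', 'Something_C')] * N

# Because every tuple is identical, a single key is enough: pd.concat prepends
# the three values of that key as new index levels in front of the RangeIndex.
result = pd.concat([out_df], keys=[queries_tuples[0]], names=names)
print(result.index.names)  # the three new names followed by the original, unnamed level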

Answer 2

Score: 0

I found that MultiIndex.from_product() ended up being about 2x-10x faster than MultiIndex.from_tuples(), depending on the number of rows in the dataframe.

import pandas as pd
import numpy as np

# Sample data: a large dataframe with a plain RangeIndex.
names = ['index_name_1', 'index_name_2', 'index_name_3']
N = 1000000
out_df = pd.DataFrame(np.random.random((N, len(names))))

# Every row shares the same tuple, so each new index level has exactly one value.
queries_tuples = ('something_A', 'Something_B', 'Something_C')
product = [[elem] for elem in queries_tuples]
product.append(out_df.index)  # cross the single-value levels with the existing index
mi = pd.MultiIndex.from_product(product, names=names + [out_df.index.name])
out_df.index = mi
print(out_df)

I suspect the speed difference is because from_product() knows that only one possible value can occur in each of the three named levels, so most of the factorization step can be skipped.
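
For completeness, a rough sketch of how the two constructors could be compared head-to-head; the sizes and values below are assumptions, not the exact benchmark behind the 2x-10x figure:

import timeit

import numpy as np
import pandas as pd

# Assumed setup for a rough comparison.
N = 1_000_000
names = ['index_name_1', 'index_name_2', 'index_name_3']
out_df = pd.DataFrame(np.random.random((N, len(names))))
query = ('something_A', 'Something_B', 'Something_C')

def with_tuples():
    # One identical tuple per row: from_tuples has to factorize all N tuples.
    return pd.MultiIndex.from_tuples([query] * N, names=names)

def with_product():
    # One single-element list per level, crossed with the existing RangeIndex.
    levels = [[elem] for elem in query] + [out_df.index]
    return pd.MultiIndex.from_product(levels, names=names + [out_df.index.name])

print('from_tuples :', timeit.timeit(with_tuples, number=3))
print('from_product:', timeit.timeit(with_product, number=3))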
