英文:
How to efficiently broadcast the same index across an entire dataframe
问题
以下是翻译好的部分:
"original_names = out_df.index.names
mi = pd.MultiIndex.from_tuples(queries_tuples, names=names)
out_df.set_index(mi, append=True, inplace=True)
names_2 = names + original_names
out_df = out_df.reorder_levels(names_2)"
"然而,我感觉我在性能方面可能错过了一些东西,因为我所有的查询元组中的条目都是相同的。在我的具体情况下,数据框的大小相当大(几百万个条目),因此,我的当前解决方案速度相当慢。是否有一种方法可以利用所有查询元组中的所有条目都相同这一点,因此可以在整个数据框上“广播”它们?或者有没有更好的方法来做到这一点?请帮忙,谢谢!"
英文:
Say I have these parts
- out_df: a pandas dataframe with a simple rangeindex as index and then some columns
Could look like this
>> out_df
Col1 Col2
0 2.3 9.7
1 103 6.5
2 54 3.5
- queries_tuples: a list of tuples, with same length as out_df
Could look like this
>> queries_tuples
[('something_A', 'Something_B', 'Something_C'),
('something_A', 'Something_B', 'Something_C'),
('something_A', 'Something_B', 'Something_C')]
- names: a list of names for the index names
Could look like
[index_name_1, index_name_2, index_name_3]
Now I want to combine these three to end up with a dataframe, which has a multiindex.
Could look like this
Col1 Col2
index_name_1, index_name_2, index_name_3
'something_A' 'Something_B' 'Something_C' 0 2.3 9.7
1 103 6.5
2 3.5 3.5
Now, this is my current solution
original_names = out_df.index.names
mi = pd.MultiIndex.from_tuples(queries_tuples, names=names)
out_df.set_index(mi, append=True, inplace=True)
names_2 = names + original_names
out_df = out_df.reorder_levels(names_2)
However, I feel like I am really missing out on some performance since all my entries in queries_tuples are identical. In my concrete case the size of the dataframe is quite large (a few million entries) and thus, my current solution is quite slow. Is there some way of exploiting that all entries in queries_tuples are identical and thus, can be "broadcasted" across the entire dataframe?
Or is there an even better way of doing this?
Please help, thanks !
答案1
得分: 1
使用 pd.concat
:
>>> out_df
Col1 Col2
0 2.3 9.7
1 103.0 6.5
2 54.0 3.5
>>> queries_tuple
[('something_A', 'Something_B', 'Something_C'),
('something_A', 'Something_B', 'Something_C'),
('something_A', 'Something_B', 'Something_C')]
>>> names
['index_name_1', 'index_name_2', 'index_name_3']
输出:
>>> pd.concat([out_df], keys=[queries_tuples[0]], names=names)
Col1 Col2
index_name_1 index_name_2 index_name_3
something_A Something_B Something_C 0 2.3 9.7
1 103.0 6.5
2 54.0 3.5
1,000,000 条记录的性能:
%timeit pd.concat([out_df], keys=[queries_tuples[0]], names=names)
16.8 ms ± 605 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
英文:
Use pd.concat
:
>>> out_df
Col1 Col2
0 2.3 9.7
1 103.0 6.5
2 54.0 3.5
>>> queries_tuple
[('something_A', 'Something_B', 'Something_C'),
('something_A', 'Something_B', 'Something_C'),
('something_A', 'Something_B', 'Something_C')]
>>> names
['index_name_1', 'index_name_2', 'index_name_3']
Output:
>>> pd.concat([out_df], keys=[queries_tuples[0]], names=names)
Col1 Col2
index_name_1 index_name_2 index_name_3
something_A Something_B Something_C 0 2.3 9.7
1 103.0 6.5
2 54.0 3.5
Performance for 1,000,000 records:
%timeit pd.concat([out_df], keys=[queries_tuples[0]], names=names)
16.8 ms ± 605 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
答案2
得分: 0
我发现MultiIndex.from_product()
的速度大约是MultiIndex.from_tuples()
的2到10倍,具体取决于数据框中的行数。
import pandas as pd
import numpy as np
N = 1000000
out_df = pd.DataFrame(np.random.random((N, len(names))))
names = ['index_name_1', 'index_name_2', 'index_name_3']
queries_tuples = ('something_A', 'Something_B', 'Something_C')
product = [[elem] for elem in queries_tuples]
product.append(out_df.index)
mi = pd.MultiIndex.from_product(product, names=names + [out_df.index.name])
out_df.index = mi
print(out_df)
我怀疑速度差异是因为from_product()
知道每个命名列中只能出现一个可能值,因此大部分的分解步骤都被跳过了。
英文:
I found that MultiIndex.from_product()
ended up being about 2x-10x faster than MultiIndex.from_tuples()
, depending on the number of rows in the dataframe.
import pandas as pd
import numpy as np
N = 1000000
out_df = pd.DataFrame(np.random.random((N, len(names))))
names = ['index_name_1', 'index_name_2', 'index_name_3']
queries_tuples = ('something_A', 'Something_B', 'Something_C')
product = [[elem] for elem in queries_tuples]
product.append(out_df.index)
mi = pd.MultiIndex.from_product(product, names=names + [out_df.index.name])
out_df.index = mi
print(out_df)
I suspect the speed difference is because from_product()
knows that only one possible value can occur in each of the three named columns, so most of the factorization step is skipped.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论