How to efficiently broadcast the same index across an entire dataframe

Question

Say I have these parts

  • out_df: a pandas DataFrame with a plain RangeIndex as its index and some columns

Could look like this

>> out_df
    Col1  Col2
0    2.3  9.7
1    103  6.5
2    54   3.5

  • queries_tuples: a list of tuples with the same length as out_df

Could look like this

>> queries_tuples
[('something_A', 'Something_B', 'Something_C'), 
('something_A', 'Something_B', 'Something_C'),
('something_A', 'Something_B', 'Something_C')]

  • names: a list of names for the new index levels

Could look like

['index_name_1', 'index_name_2', 'index_name_3']

Now I want to combine these three to end up with a dataframe that has a MultiIndex.
It could look like this:

                                              Col1  Col2
index_name_1   index_name_2   index_name_3
'something_A'  'Something_B'  'Something_C'  0   2.3   9.7
                                             1   103   6.5
                                             2    54   3.5

Now, this is my current solution:

import pandas as pd

original_names = out_df.index.names  # just the unnamed RangeIndex level here
# Append the per-row tuples as three new index levels behind the RangeIndex ...
mi = pd.MultiIndex.from_tuples(queries_tuples, names=names)
out_df.set_index(mi, append=True, inplace=True)
# ... then move them in front of the original level.
names_2 = names + original_names
out_df = out_df.reorder_levels(names_2)

However, I feel like I am really missing out on some performance, since all my entries in queries_tuples are identical. In my concrete case the dataframe is quite large (a few million entries), so my current solution is quite slow. Is there some way of exploiting the fact that all entries in queries_tuples are identical, so that they can be "broadcast" across the entire dataframe?
Or is there an even better way of doing this?
Please help, thanks!

Answer 1

Score: 1

Use pd.concat:

>>> out_df
    Col1  Col2
0    2.3   9.7
1  103.0   6.5
2   54.0   3.5

>>> queries_tuples
[('something_A', 'Something_B', 'Something_C'),
 ('something_A', 'Something_B', 'Something_C'),
 ('something_A', 'Something_B', 'Something_C')]

>>> names
['index_name_1', 'index_name_2', 'index_name_3']

Output:

>>> pd.concat([out_df], keys=[queries_tuples[0]], names=names)

                                           Col1  Col2
index_name_1 index_name_2 index_name_3               
something_A  Something_B  Something_C  0    2.3   9.7
                                       1  103.0   6.5
                                       2   54.0   3.5

Performance for 1,000,000 records:

%timeit pd.concat([out_df], keys=[queries_tuples[0]], names=names)
16.8 ms ± 605 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
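
For reference, here is a minimal, self-contained sketch of how the timing above could be reproduced; the dataframe size and contents are assumptions for illustration, not the answerer's exact setup:

import numpy as np
import pandas as pd

# Assumed setup: a large dataframe with a plain RangeIndex and arbitrary values.
N = 1_000_000
out_df = pd.DataFrame(np.random.random((N, 2)), columns=['Col1', 'Col2'])

names = ['index_name_1', 'index_name_2', 'index_name_3']
queries_tuples = [('something_A', 'Something_B', 'Something_C')] * N

# Because every tuple is identical, a single key is enough: pd.concat prepends
# the three values of that key as new index levels in front of the RangeIndex.
result = pd.concat([out_df], keys=[queries_tuples[0]], names=names)
print(result.index.names)  # the three new names followed by the original, unnamed level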

Answer 2

Score: 0

I found that MultiIndex.from_product() ended up being about 2x-10x faster than MultiIndex.from_tuples(), depending on the number of rows in the dataframe.

import pandas as pd
import numpy as np

# Sample data: a large dataframe with a plain RangeIndex.
names = ['index_name_1', 'index_name_2', 'index_name_3']
N = 1000000
out_df = pd.DataFrame(np.random.random((N, len(names))))

# Every row shares the same tuple, so each new index level has exactly one value.
queries_tuples = ('something_A', 'Something_B', 'Something_C')
product = [[elem] for elem in queries_tuples]
product.append(out_df.index)  # cross the single-value levels with the existing index
mi = pd.MultiIndex.from_product(product, names=names + [out_df.index.name])
out_df.index = mi
print(out_df)

I suspect the speed difference is because from_product() knows that only one possible value can occur in each of the three named levels, so most of the factorization step can be skipped.
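
For completeness, a rough sketch of how the two constructors could be compared head-to-head; the sizes and values below are assumptions, not the exact benchmark behind the 2x-10x figure:

import timeit

import numpy as np
import pandas as pd

# Assumed setup for a rough comparison.
N = 1_000_000
names = ['index_name_1', 'index_name_2', 'index_name_3']
out_df = pd.DataFrame(np.random.random((N, len(names))))
query = ('something_A', 'Something_B', 'Something_C')

def with_tuples():
    # One identical tuple per row: from_tuples has to factorize all N tuples.
    return pd.MultiIndex.from_tuples([query] * N, names=names)

def with_product():
    # One single-element list per level, crossed with the existing RangeIndex.
    levels = [[elem] for elem in query] + [out_df.index]
    return pd.MultiIndex.from_product(levels, names=names + [out_df.index.name])

print('from_tuples :', timeit.timeit(with_tuples, number=3))
print('from_product:', timeit.timeit(with_product, number=3))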
