Python: Fast subsetting of pandas dataframe
Question
I have a large pandas DataFrame with ~500k rows and 40 columns.

>>> data
       ColA  ColB        ColC  ...  ColX  ColY        ColZ
445828    A    10  2020-02-21  ...     6   NaN  2019-08-13
445829    B    12  2020-02-21  ...     8   NaN  2019-08-13
445830    C    13  2020-02-21  ...    10   NaN  2019-08-13
445831    D    15  2020-02-21  ...    12   NaN  2019-08-13
445832    E    17  2020-02-21  ...    15   NaN  2019-08-13

I use this DataFrame inside a class. One of the methods of this class is get_property(self, A, B, C).

    def get_property(self, A, B, C):
        # Boolean mask over the three key columns; scans every row on each call.
        data_subset = self.data[(self.data.ColA == A) &
                                (self.data.ColB == B) &
                                (self.data.ColC == C)]
        return data_subset

I run this query hundreds of times, and it is relatively time-consuming. Is there a way to speed it up?

I have already used data.set_index(['ColA', 'ColB', 'ColC']).
Answer 1
Score: 0
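Setting the index massively improved the performance of the query. Passing drop=False keeps the key columns in the DataFrame for backward compatibility. Note that set_index returns a new DataFrame, so the result has to be assigned back before looking rows up with .loc:

> data = data.set_index(['ColA', 'ColB', 'ColC'], drop=False)
>
> data.loc[(A, B, C)]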
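
For illustration, here is a minimal sketch of how the indexed lookup might fit into the class from the question. The class name `PropertyStore`, the `KEYS` attribute, and the sample data are assumptions made for this example and do not come from the original post. Sorting the index is an extra step not mentioned in the answer; it is included because pandas warns about lookups on unsorted MultiIndexes.

```python
import pandas as pd


class PropertyStore:
    """Hypothetical wrapper class; the name is an assumption, not from the question."""

    KEYS = ["ColA", "ColB", "ColC"]

    def __init__(self, data: pd.DataFrame):
        # set_index returns a new frame, so the result must be kept.
        # drop=False keeps the key columns available as regular columns.
        # Sorting the MultiIndex avoids pandas' "past lexsort depth" warning.
        self.data = data.set_index(self.KEYS, drop=False).sort_index()

    def get_property(self, A, B, C) -> pd.DataFrame:
        try:
            # A list holding one full key tuple always yields a DataFrame,
            # even when only a single row matches.
            return self.data.loc[[(A, B, C)]]
        except KeyError:
            # No row matches: return an empty frame with the same columns.
            return self.data.iloc[0:0]


# Small made-up sample standing in for the ~500k-row frame.
raw = pd.DataFrame({
    "ColA": ["A", "B", "C", "A"],
    "ColB": [10, 12, 13, 10],
    "ColC": pd.to_datetime(["2020-02-21"] * 4),
    "ColX": [6, 8, 10, 7],
})
store = PropertyStore(raw)
subset = store.get_property("A", 10, pd.Timestamp("2020-02-21"))
```

Building the index once in the constructor means each later call only pays for the lookup, which is where the saving over a repeated boolean mask would come from. Whether it is faster on a given dataset is worth confirming with a quick timing run.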