Python: Fast subsetting of pandas dataframe

Question

I have a large pandas dataframe with about 500,000 rows and 40 columns.

>>> data
       ColA  ColB        ColC  ...  ColX  ColY        ColZ
445828    A    10  2020-02-21  ...     6   NaN  2019-08-13
445829    B    12  2020-02-21  ...     8   NaN  2019-08-13
445830    C    13  2020-02-21  ...    10   NaN  2019-08-13
445831    D    15  2020-02-21  ...    12   NaN  2019-08-13
445832    E    17  2020-02-21  ...    15   NaN  2019-08-13

I use this dataframe inside a class. One of the methods of this class is `get_property(self, A, B, C)`.

def get_property(self, A, B, C):
    # Keep only the rows where all three key columns match the given values.
    data_subset = self.data[(self.data.ColA == A) &
                            (self.data.ColB == B) &
                            (self.data.ColC == C)]
    return data_subset

I run this query hundreds of times, and it is relatively time-consuming. Is there a way to speed it up?

I have already used `data.set_index(['ColA', 'ColB', 'ColC'])`.

Answer 1

Score: 0

Setting the index on the three key columns massively improved the performance of the query. Passing drop=False keeps those columns in the dataframe as regular columns, for backward compatibility with code that still reads them:

> data = data.set_index(['ColA', 'ColB', 'ColC'], drop=False)
>
> data.loc[(A, B, C), :]
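
For anyone who wants to try this end to end, here is a minimal sketch of the indexed lookup. The toy rows, the `indexed` variable and the standalone `get_property` helper are assumptions for illustration, not the asker's actual class or data. Sorting the index is an extra step not mentioned above; pandas can generally look up keys faster in a sorted MultiIndex, but it is worth timing on the real data.

```python
import pandas as pd

# Hypothetical toy rows standing in for the ~500,000-row frame in the question.
data = pd.DataFrame({
    "ColA": ["A", "B", "C", "A"],
    "ColB": [10, 12, 13, 10],
    "ColC": ["2020-02-21", "2020-02-21", "2020-02-21", "2020-02-21"],
    "ColX": [6, 8, 10, 12],
})

keys = ["ColA", "ColB", "ColC"]

# set_index returns a new frame, so the result has to be assigned.
# drop=False keeps the key columns available as regular columns too.
# sort_index is an added assumption: it sorts the MultiIndex once up front.
indexed = data.set_index(keys, drop=False).sort_index()


def get_property(frame, a, b, c):
    """Return the rows whose (ColA, ColB, ColC) index entry equals (a, b, c)."""
    return frame.loc[(a, b, c), :]


# Lookup through the index.
subset = get_property(indexed, "A", 10, "2020-02-21")

# The original boolean-mask query, kept here only to compare results.
mask_subset = data[(data.ColA == "A") &
                   (data.ColB == 10) &
                   (data.ColC == "2020-02-21")]
```

Two caveats with this form: a key that is not in the index raises a `KeyError`, and when a key matches exactly one row in a unique index, `.loc` returns a Series rather than a DataFrame, so callers may need to handle both cases.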

huangapple
  • Posted on January 7, 2020 at 00:56:10
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59616048.html