In a pandas MultiIndex, how do you use IndexSlice without knowing the position of the level?

Question

I have a program that works with pandas DataFrames, using a two-level MultiIndex (dates and data) such as:

Date_Time   Data    
date1       a
            b
            c

date2       a
            b
            c

date3       a
            b
            c
...

So all the functions use pandas IndexSlice whenever they need to read or modify the contents of the df, like:

df.loc[pd.IndexSlice[:,'a'],:]

This worked great: it is easy to read, short and efficient, and it made a lot of one-line functions possible.

However, I now have to differentiate the data based on some properties so that they do not get merged when resampling, and I am doing that by adding a third level to the index when necessary:

Date_Time   Property     Data    
date1       1            a
            1            b
            1            c

date2       2            a
            2            b
            2            c

date3       1            a
            1            b
            1            c
...

The goal is to be able to do a groupby with a resample over time and end up with this multiindex:

Date_Time   Property     Data    
Period1     1            a
            1            b
            1            c
            2            a
            2            b
            2            c

Period2     1            a
            1            b
            1            c
...
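
A minimal sketch of that regrouping, assuming the Date_Time level holds real timestamps rather than strings. The 10-minute frequency, the ISO timestamps and the deterministic values are illustrative assumptions, not taken from the actual data:

```python
import numpy as np
import pandas as pd

names = ["Date_Time", "Property", "Data"]
probes = ["Probe1", "Probe2", "Probe3"]

# Hypothetical timestamps: two batches, each tagged with its own Property value.
times1 = pd.to_datetime(["2023-07-03 07:40:00", "2023-07-03 07:50:00"])
times2 = pd.to_datetime(["2023-07-03 07:45:00", "2023-07-03 07:55:00"])
idx1 = pd.MultiIndex.from_product([times1, ["S=0.1"], probes], names=names)
idx2 = pd.MultiIndex.from_product([times2, ["S=0.2"], probes], names=names)

cols = ["X", "Y", "Error"]
df1 = pd.DataFrame(np.arange(18, dtype=float).reshape(6, 3), index=idx1, columns=cols)
df2 = pd.DataFrame(np.arange(18, 36, dtype=float).reshape(6, 3), index=idx2, columns=cols)
df = pd.concat([df1, df2]).sort_index()

# Bin the Date_Time level into 10-minute periods while keeping Property and
# Data as separate group keys, so rows with different properties never merge.
resampled = df.groupby(
    [pd.Grouper(level="Date_Time", freq="10min"), "Property", "Data"]
).mean()
print(resampled)
```

Every level is addressed by name here, so the grouping does not depend on where Property sits in the index.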

So, the problem is that df.loc[pd.IndexSlice[:,'a'],:] no longer works; I would have to change it to

df.loc[pd.IndexSlice[:,:,'a'],:]

But that means changing the code itself every time I use a dataframe with the extra index level.

Isn't there any way to define the slice in a flexible way?

I would like to define the slice using variables, like in a list comprehension, so that it is future-proof against further changes in the number and order of the MultiIndex levels. But as far as I checked, that is not possible, so what should I do?

I could define the slice using try-except blocks at the beginning of each function, inside the block that already makes sure that level and level_value exist; or I could move the property level to the right so that I could still use pd.IndexSlice[:,'a'] (but in the future I might end up with this problem again).


EDIT: Here is some code to generate a dataframe that uses this kind of index:

import numpy as np
import pandas as pd

# Two batches of timestamps, each tagged with its own Property value
iter1 = [["03/07/2023 07:40:00", "03/07/2023 07:50:00"], ["S=0.1"], ["Probe1", "Probe2", "Probe3"]]
iter2 = [["03/07/2023 07:45:00", "03/07/2023 07:55:00"], ["S=0.2"], ["Probe1", "Probe2", "Probe3"]]
idx1 = pd.MultiIndex.from_product(iter1, names=["Date_Time", "Property", "Data"])
idx2 = pd.MultiIndex.from_product(iter2, names=["Date_Time", "Property", "Data"])

# Random values: 6 rows (2 dates x 1 property x 3 probes) per batch
df_aux1 = pd.DataFrame(np.random.randn(6, 3), index=idx1, columns=["X", "Y", "Error"])
df_aux2 = pd.DataFrame(np.random.randn(6, 3), index=idx2, columns=["X", "Y", "Error"])

# Combine both batches and sort the index
df = pd.concat([df_aux1, df_aux2]).sort_index(level="Date_Time")
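
For the "define the slice using variables" idea, one possible sketch is to build the row key from the level names, since pd.IndexSlice[:, 'a'] is just the tuple (slice(None), 'a'). The helper name level_slicer and the small sample frames are hypothetical, introduced only for illustration:

```python
import numpy as np
import pandas as pd


def level_slicer(index, **wanted):
    """Return a .loc row key that selects `wanted` values by level name.

    Levels not mentioned in `wanted` get slice(None), i.e. ':'.
    """
    unknown = set(wanted) - set(index.names)
    if unknown:
        raise KeyError(f"unknown index level(s): {sorted(unknown)}")
    return tuple(wanted.get(name, slice(None)) for name in index.names)


# Hypothetical frames: the same data with and without the Property level.
idx2 = pd.MultiIndex.from_product(
    [["d1", "d2"], ["a", "b", "c"]], names=["Date_Time", "Data"]
)
idx3 = pd.MultiIndex.from_product(
    [["d1", "d2"], [1, 2], ["a", "b", "c"]], names=["Date_Time", "Property", "Data"]
)
df2 = pd.DataFrame(np.zeros((len(idx2), 1)), index=idx2, columns=["X"])
df3 = pd.DataFrame(np.zeros((len(idx3), 1)), index=idx3, columns=["X"])

# The same call works for both layouts, whatever the position of 'Data'.
rows2 = df2.loc[level_slicer(df2.index, Data="a"), :]
rows3 = df3.loc[level_slicer(df3.index, Data="a"), :]
print(len(rows2), len(rows3))
```

This keeps the one-line style of df.loc[...], but like any slicer-based selection it assumes the index is sorted (for example with sort_index()).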

Answer 1

Score: 1

The exact data and logic are unclear, but since you have named levels you could use Index.get_level_values and boolean indexing:

df.loc[df.index.get_level_values('Data') == 'a']

Or by position:

df.loc[df.index.get_level_values(-1) == 'a']
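
As a rough, self-contained illustration of the name-based mask, the sketch below uses a made-up three-level frame (the label values and zero-filled columns are assumptions). It also shows that the same mask can be used for assignment, and that DataFrame.xs with drop_level=False selects the same rows:

```python
import numpy as np
import pandas as pd

names = ["Date_Time", "Property", "Data"]
idx = pd.MultiIndex.from_product(
    [["d1", "d2"], ["S=0.1", "S=0.2"], ["Probe1", "Probe2", "Probe3"]],
    names=names,
)
df = pd.DataFrame(np.zeros((len(idx), 3)), index=idx, columns=["X", "Y", "Error"])

# Boolean mask built from the level *name*, independent of its position.
mask = df.index.get_level_values("Data") == "Probe1"

selected = df.loc[mask]        # read the matching rows
df.loc[mask, "Error"] = 1.0    # modify them in place

# Alternative selection by level name; drop_level=False keeps all index levels.
same_rows = df.xs("Probe1", level="Data", drop_level=False)
print(selected.shape, same_rows.shape)
```

Unlike the slicer form, the mask does not require a sorted index, at the cost of materialising the level values for every row.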

huangapple
  • Posted on July 3, 2023 at 20:04:21
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76604560.html