英文:
How do you sort data with multiindex (columns) dataframe?
问题
请原谅我的词汇有限。我仍在努力学习正确的术语,刚刚发现我创建了一个多索引的数据框,我正在学习如何操作它。
这个多索引数据框有30行和546列,看起来像这样的更大版本:
- 如何获取 'level 0' 列名的列表。
要获取 'level 0' 列名的列表,您可以使用以下代码:
level_0_columns = df.columns.get_level_values(0).unique().tolist()
- 如何获取 'level 1' 列名的列表。
要获取 'level 1' 列名的列表,您可以使用以下代码:
level_1_columns = df.columns.get_level_values(1).unique().tolist()
- 对于一行(日期),提取所有 'level 0' 和一个 'level 1' 索引的数据。例如,提取一天中一家公司的所有财务数据。
要做到这一点,您可以使用以下代码:
date = '2023-01-02' # 指定日期
level_0 = 'A' # 指定 'level 0' 列
level_1 = 'aa' # 指定 'level 1' 列
data_for_one_company = df.loc[date, (level_0, level_1)]
- 对于一行(日期),一个 'level 0' 列与所有 'level 0' 数据。例如,提取一天中所有公司的成交量数据。
要做到这一点,您可以使用以下代码:
date = '2023-01-02' # 指定日期
level_0 = 'A' # 指定 'level 0' 列
volume_data_for_all_companies = df.loc[date, (level_0, slice(None))]
希望这些代码能帮助您进行多级索引数据框的操作。如果您需要更多帮助,可以随时提问。
英文:
First, please forgive my bad vocabulary. Im still struggling with the correct terms, and have just discovered that I have created a multiindexed dataframe, which Im trying to learn how to manipulate.
The multiindex dataframe has 30 rows and 546 columns, and looks like a bigger version of this:
A | B | C | D | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
aa | bb | cc | aa | bb | cc | aa | bb | cc | aa | bb | cc | |
Date | ||||||||||||
2023-01-02 | 1 | 24 | 6 | 3 | 2 | 7 | 3 | 10 | 12 | 5 | 9 | 21 |
2023-01-03 | 1 | 23 | 7 | 3 | 4 | 6 | 3 | 9 | 13 | 6 | 10 | 22 |
2023-01-04 | 2 | 22 | 8 | 4 | 6 | 7 | 3 | 9 | 12 | 8 | 14 | 24 |
2023-01-05 | 3 | 21 | 10 | 3 | 8 | 6 | 4 | 8 | 11 | 10 | 12 | 21 |
The index is a timestamp date, and the top level (level 0?) column indexes A, B, C, D, etc each have the same 91 second level (level 1?) members: aa, bb, cc, etc
Since there are 546 columns in total, and 91 'level 1' columns, there must be 6 'level 0' columns. I cant see them cos the tables so big it just shows the first and last.
In reality, its a table of stock data pulled off yahoo where A, B, C are the (6) financial values like close, volume, high, etc and aa, bb, cc, etc are the (91) company codes.
Id like to learn how to do the following:
-
How to pull off a list of the 'level 0' column names.
-
How to pull off a list of the 'level 1' column names.
-
For 1 row (date), pull out the data for ALL 'level 0' and ONE 'level 1' index. (For example, all financial data for one company on one day).
-
For 1 row (date), ONE 'level 0' with ALL 'level 0' data. For example, volume data for all companies on one day.
Ive been trying things like:
df.loc[:,(['A','B'],['aa,'bb', 'cc'])]
df.loc['2023-01-02', :]
which work, but I cant sort the brackets and colons right to do the above stuff.
Also,
df.loc[:,(['A','D'],['aa,'cc','ff'])]
and
df.loc['2023-01-05':,([A,C],[aa,dd])]
work, but
df.loc['2023-01-05',([A:],[aa,dd])]
and
df.loc['2023-01-05',(A:,[aa,dd])]
give invalid syntax. Can anyone explain, or maybe point me towards a tutorial that will help with the level definitions and round/square brackets and colons?
Thanks.
答案1
得分: 3
要提取level列名称的列表,您可以使用get_level_values
:
df.columns.get_level_values(0)
# Index(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D'], dtype='object')
df.columns.get_level_values(1)
# Index(['aa', 'bb', 'cc', 'aa', 'bb', 'cc', 'aa', 'bb', 'cc', 'aa', 'bb', 'cc'], dtype='object')
df.columns.get_level_values(0).unique()
# Index(['A', 'B', 'C', 'D'], dtype='object')
df.columns.get_level_values(1).unique()
# Index(['aa', 'bb', 'cc'], dtype='object')
对于3和4,使用pd.IndexSlice
会很方便:
# 获取特定level one索引的所有level zero数据
df.loc['2023-01-05', pd.IndexSlice[:, 'aa']]
# A aa 3
# B aa 3
# C aa 4
# D aa 10
# Name: 2023-01-05, dtype: int64
# 获取特定level zero索引的所有level one数据
df.loc['2023-01-05', pd.IndexSlice['A', :]]
# A aa 3
# bb 21
# cc 10
# Name: 2023-01-05, dtype: int64
英文:
To pull a list of level column names, you can use get_level_values
:
df.columns.get_level_values(0)
#Index(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D'], dtype='object')
df.columns.get_level_values(1)
#Index(['aa', 'bb', 'cc', 'aa', 'bb', 'cc', 'aa', 'bb', 'cc', 'aa', 'bb', 'cc'], dtype='object')
df.columns.get_level_values(0).unique()
#Index(['A', 'B', 'C', 'D'], dtype='object')
df.columns.get_level_values(1).unique()
#Index(['aa', 'bb', 'cc'], dtype='object')
For 3 and 4, pd.IndexSlice
would be convenient to use:
# all level zero data for a specific level one index
df.loc['2023-01-05', pd.IndexSlice[:, 'aa']]
#A aa 3
#B aa 3
#C aa 4
#D aa 10
#Name: 2023-01-05, dtype: int64
# all level one data for a specific level zero index
df.loc['2023-01-05', pd.IndexSlice['A', :]]
#A aa 3
# bb 21
# cc 10
#Name: 2023-01-05, dtype: int64
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论