Transforming a data frame with Python
Question
Let's assume that I have the following data frame in pandas:
[image: dataframe]
I want to transform it into the following:
[image: transformeddataframe]
How can I do it?
I have tried transpose, pd.wide_to_long, and pd.melt, but they all throw errors. I am new to this and would appreciate any help.
Answer 1
Score: 1
You can use the following code to recreate the output dataframe from the initial pictured dataframe (the first few lines just recreate your dataframe).
import pandas as pd
# Recreate the starting dataframe
df = pd.read_csv(r"C:\...\bmi.csv", index_col=0)
df = df.iloc[2:, :]  # drop the first 2 rows to match your initial picture
# Reduce the column names to just the year (keep the text before the first ".")
df.columns = pd.Series(df.columns).str.split(".", n=1, expand=True)[0]
# Pair each year with the label in the first row to build MultiIndex columns
df.columns = pd.MultiIndex.from_tuples(
    list(zip(df.columns, df.iloc[0].str.lstrip()))
)
# Then remove that first row
df = df.iloc[1:]
# Stack the first level of the MultiIndex columns (year) and sort the index
df = df.stack(level=0).sort_index(level=[0, 1], ascending=[True, False])
# In each column...
for col in df.columns:
    # ...extract the float at the start of the string and convert it to float
    df[col] = df[col].str.extract(r'(\d+\.\d+)', expand=False).astype(float)
Let me know if you have any questions.
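For readers without the CSV at hand, the same steps can be tried on a tiny in-memory frame. This is only a sketch under assumptions: the layout (year columns suffixed with ".1", ".2", a first row holding the sex labels, cells shaped like "value [interval]") is inferred from the code above, the bracketed intervals and the "Country" row label are invented for illustration, and `tidy_bmi` is a hypothetical helper name.

```python
import pandas as pd


def tidy_bmi(raw: pd.DataFrame) -> pd.DataFrame:
    """Reshape a wide year/sex frame into one row per (Country, Year)."""
    # Column labels look like "2016", "2016.1", "2016.2": keep the year part.
    years = raw.columns.str.split(".").str[0]
    # The first row carries the sex labels, padded with a leading space.
    sexes = raw.iloc[0].str.strip().tolist()

    body = raw.iloc[1:].copy()
    body.index.name = "Country"
    body.columns = pd.MultiIndex.from_arrays([years, sexes], names=["Year", None])

    # Move the year level from the columns into the row index.
    long = body.stack(level=0)

    # Keep only the leading number of each cell and convert it to float.
    for col in long.columns:
        long[col] = (
            long[col].str.extract(r"^(\d+(?:\.\d+)?)", expand=False).astype(float)
        )

    tidy = long.reset_index()
    tidy["Year"] = tidy["Year"].astype(int)
    return tidy.sort_values(
        ["Country", "Year"], ascending=[True, False]
    ).reset_index(drop=True)


# Toy input: the intervals in brackets are made up for illustration.
raw = pd.DataFrame(
    [
        [" Both sexes", " Female", " Male", " Both sexes", " Female", " Male"],
        ["23.0 [21.0-25.0]", "23.7 [21.5-26.0]", "22.3 [20.0-24.5]",
         "22.9 [21.0-24.9]", "23.6 [21.4-25.9]", "22.3 [20.0-24.5]"],
    ],
    index=["Country", "Afghanistan"],
    columns=["2016", "2016.1", "2016.2", "2015", "2015.1", "2015.2"],
)

tidy = tidy_bmi(raw)
print(tidy)
```

The explicit sort at the end is a deliberate choice: the row order that `stack` produces can differ between pandas versions, so the sketch does not rely on it.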
Answer 2
Score: 0
A possible solution:
import pandas as pd
df = (
    pd.read_csv("bmi.csv", index_col=0, header=[0, 3], na_values="No data")
    .replace(r"\s+.+", "", regex=True)  # raw string avoids an invalid-escape warning
    .stack(0, dropna=False)
    .astype(float)
    .reset_index(level=0, names="Country")
    .pipe(lambda x: x.set_axis(x.index.str.split(".").str[0]))
    .groupby([pd.Grouper(level=0), "Country"], sort=False).first()
    .reset_index(names=["Year", "Country"])
    .rename_axis(columns=None)
    .sort_values(by=["Country", "Year"], ascending=[True, False])
    .reset_index(drop=True)
)
Output:
print(df)
      Year      Country  Both sexes  Female  Male
0     2016  Afghanistan        23.0    23.7  22.3
1     2015  Afghanistan        22.9    23.6  22.3
2     2014  Afghanistan        22.8    23.5  22.2
3     2013  Afghanistan        22.8    23.4  22.1
4     2012  Afghanistan        22.7    23.3  22.0
...    ...          ...         ...     ...   ...
8143  1979     Zimbabwe        22.0    23.6  20.3
8144  1978     Zimbabwe        21.9    23.6  20.2
8145  1977     Zimbabwe        21.9    23.5  20.2
8146  1976     Zimbabwe        21.8    23.5  20.1
8147  1975     Zimbabwe        21.8    23.5  20.0
[8148 rows x 5 columns]
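The regex replace step in the chain above may be the least obvious part, so here it is in isolation. This is a minimal sketch that assumes the cells look like "value [interval]", as Answer 1's extraction comment suggests; the bracketed intervals below are invented for illustration.

```python
import pandas as pd

# Hypothetical cells: the bracketed intervals are made up for this example.
cells = pd.DataFrame({"Female": ["23.7 [21.5-26.0]", "23.6 [21.4-25.9]"]})

# r"\s+.+" matches the first run of whitespace and everything after it,
# so replacing it with "" leaves only the leading number as text,
# which astype(float) can then convert.
cleaned = cells.replace(r"\s+.+", "", regex=True).astype(float)
print(cleaned)
```

Without this step, `.astype(float)` would presumably fail on the raw strings, which is why the replace comes first in the chain.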