Pandas: Unpivot Excel Data Where Category and Children Labels Are In Same Column
Question
The easiest way to explain this is that I am trying to expand a multi-indexed table in which both index levels are in the same column.
My data is structured like this:
Row Labels | Sum |
---|---|
Collection1 | 22 |
Data 1 | 10 |
Data 2 | 12 |
Collection2 | 33 |
Data 1 | 33 |
Collection3 | 45 |
Data 1 | 14 |
Data 2 | 31 |
Total | 100 |
What I would like out is a Dataframe like this:
Row Labels | Data1 | Data2 | Sum |
---|---|---|---|
Collection1 | 10 | 12 | 22 |
Collection2 | 33 | 0 | 33 |
Collection3 | 14 | 31 | 45 |
Total | 57 | 43 | 100 |
Is there a built-in pandas method, or a straightforward approach, for this kind of transformation?
I have tried manually breaking the table down and rebuilding it: collecting the repeated row labels, making columns from them, and filling those columns with the data from the rows carrying each label. The tricky part is missing child data; in the example above, Collection2 has no Data 2 row. With this approach I could check, for each row, which collection it belongs to and insert a 0 for Data 2 where the row is missing. That seems very ugly, though, and I suspect there is a much more elegant approach.
Answer 1
Score: 2
Using pivot_table:
# identify the group rows (collections and the grand total)
m = df['Row Labels'].str.match(r'Collection\d+|Total')
# reshape
out = (df
    .assign(index=df['Row Labels'].where(m).ffill(),
            col=df['Row Labels'].mask(m, 'Sum'))
    .pivot_table(index='index', columns='col', values='Sum', fill_value=0)
    .rename_axis(columns=None)
)
# recompute Total
out.loc['Total'] = out.drop('Total').sum()
out = out.reset_index()
Output:
index Data 1 Data 2 Sum
0 Collection1 10 12 22
1 Collection2 33 0 33
2 Collection3 14 31 45
3 Total 57 43 100
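To see why the reshape works, it can help to look at the two helper columns before pivoting. The following is a minimal sketch that rebuilds the question's sample data by hand (the DataFrame literal and the name `helper` are assumptions made for illustration) and prints the intermediate frame:

```python
import pandas as pd

# Sample data retyped from the question (an assumption about the real input).
df = pd.DataFrame({
    "Row Labels": ["Collection1", "Data 1", "Data 2", "Collection2", "Data 1",
                   "Collection3", "Data 1", "Data 2", "Total"],
    "Sum": [22, 10, 12, 33, 33, 45, 14, 31, 100],
})

# True on the parent rows (collections and the grand total).
m = df["Row Labels"].str.match(r"Collection\d+|Total")

helper = df.assign(
    # keep the parent label, blank out the children, then fill downward
    index=df["Row Labels"].where(m).ffill(),
    # children keep their own label; parent rows become the 'Sum' column
    col=df["Row Labels"].mask(m, "Sum"),
)
print(helper)
```

Every row now carries its parent in `index` and its target column in `col`, so `pivot_table` only has to spread the values out, and `fill_value=0` covers the missing Collection2 / Data 2 cell.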
Answer 2
Score: 1
I'm not sure whether a straightforward built-in pandas solution exists, but you can try this example:
# remove the Total row; it is recreated in the last step
# (.copy() avoids a SettingWithCopyWarning when adding "idx" below)
df = df[df["Row Labels"] != "Total"].copy()
# number the groups: each "Collection" row starts a new one
mask = df["Row Labels"].str.startswith("Collection")
df["idx"] = mask.cumsum()
# do the actual transformation here: pivot + merge
df = (
    pd.merge(
        df[mask],
        df[~mask].pivot(index="idx", columns="Row Labels", values="Sum"),
        left_on="idx",
        right_index=True,
    )
    .drop(columns=["idx"])
    .fillna(0)
)
# add the Total row back
df = pd.concat(
    [
        df,
        pd.DataFrame(
            {"Row Labels": ["Total"], **{c: [df[c].sum()] for c in df.loc[:, "Sum":]}}
        ),
    ]
)
print(df)
Prints:
Row Labels Sum Data 1 Data 2
0 Collection1 22 10.0 12.0
3 Collection2 33 33.0 0.0
5 Collection3 45 14.0 31.0
0 Total 100 57.0 43.0
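The grouping step relies on a cumulative sum over a boolean mask: each parent row adds 1, so every child inherits the running number of the parent above it. A small sketch of just that step, using a hand-typed label column (the names `labels` and `group_id` are illustrative only):

```python
import pandas as pd

# Label column from the question, without the Total row.
labels = pd.Series(["Collection1", "Data 1", "Data 2", "Collection2", "Data 1",
                    "Collection3", "Data 1", "Data 2"])

# True marks the start of a new group.
mask = labels.str.startswith("Collection")

# True counts as 1 and False as 0, so the running total is a group number.
group_id = mask.cumsum()
print(group_id.tolist())  # [1, 1, 1, 2, 2, 3, 3, 3]
```

This numbering depends only on row order, so it works even when the parent labels do not contain a number, as long as each parent row comes directly before its children.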
Answer 3
Score: 1
Welcome to SO. In my opinion, this is the simplest way to do it.
With input data:
import pandas as pd

df = pd.DataFrame(columns=['Row Labels', 'Sum'],
                  data=[['Collection1', 22],
                        ['Data 1', 10],
                        ['Data 2', 12],
                        ['Collection2', 33],
                        ['Data 1', 33],
                        ['Collection3', 45],
                        ['Data 1', 14],
                        ['Data 2', 31],
                        ['Total', 100]])
Row Labels Sum
0 Collection1 22
1 Data 1 10
2 Data 2 12
3 Collection2 33
4 Data 1 33
5 Collection3 45
6 Data 1 14
7 Data 2 31
8 Total 100
(1) Reformulate the table so it can be pivoted
(Parse Collection and Data information)
# Extract the Collection number into a new column and fill it down to the child rows
# (raw string avoids invalid-escape warnings; \d+ also handles numbers above 9)
df['Collection'] = df['Row Labels'].str.extract(r"Collection\s*(\d+)", expand=True).ffill()
# Old version, which assumed the collections always came in natural order
# (removed following mozway's comment, thank you!):
# df['Collection'] = df['Row Labels'].str.startswith('Coll').cumsum()
# Keep only the 'Data' rows, because pivoting won't work well with redundant information.
df = df[df['Row Labels'].str.contains('Data')]
Reformulated input table:
Row Labels Sum Collection
1 Data 1 10 1
2 Data 2 12 1
4 Data 1 33 2
6 Data 1 14 3
7 Data 2 31 3
(2) Pivot then complete the table with sums in both directions:
pt = pd.pivot_table(data=df,
                    values='Sum',
                    index='Collection',
                    columns='Row Labels',
                    fill_value=0)
# Recalculate the row sums and the column totals
pt['Sum'] = pt.sum(axis=1)
pt.loc['Total'] = pt.sum(axis=0)
Final output:
Row Labels Data 1 Data 2 Sum
Collection
1 10 12 22
2 33 0 33
3 14 31 45
Total 57 43 100
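The capture group in the extraction pattern decides how much of the collection number is kept. As a quick check, here is a sketch with a made-up label, `Collection12`, that is not part of the question's data:

```python
import pandas as pd

# 'Collection12' is a hypothetical label used only to show the difference.
labels = pd.Series(["Collection1", "Data 1", "Collection12", "Data 1"])

# (\d) captures a single digit, so 'Collection12' collapses into group '1'.
one_digit = labels.str.extract(r"Collection\s*(\d)", expand=False).ffill()

# (\d+) captures the whole number.
all_digits = labels.str.extract(r"Collection\s*(\d+)", expand=False).ffill()

print(one_digit.tolist())   # ['1', '1', '1', '1']
print(all_digits.tolist())  # ['1', '1', '12', '12']
```

With ten or more collections, a single-digit capture would silently merge groups during the pivot, so `(\d+)` is the safer choice.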
Answer 4
Score: 0
Another possible solution:
s = df['Row Labels'].str.startswith('Collection')
(df.assign(aux=s.cumsum())
   .pivot(index='aux', columns='Row Labels', values='Sum')
   # label each pivoted row with its Collection name (the rows selected by s)
   .set_axis(df['Row Labels'].loc[s])
   .filter(like='Data')
   .rename_axis(None, axis=1)
   .fillna(0)
   .pipe(lambda x: pd.concat([x, x.sum().to_frame().T.set_axis(['Total'])]))
   .assign(Sum=lambda x: x.sum(axis=1))
   .reset_index(names='Row Labels'))
Output:
Row Labels Data 1 Data 2 Sum
0 Collection1 10.0 12.0 22.0
1 Collection2 33.0 0.0 33.0
2 Collection3 14.0 31.0 45.0
3 Total 57.0 43.0 100.0
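The values come out as floats because the pivot first produces NaN for the missing cell, and NaN forces a float column; filling with 0 afterwards does not change the dtype back. If integer output is wanted, one option is to cast after filling. A minimal sketch on a hand-built frame (the name `wide` and its contents are assumptions that mimic the pivoted result):

```python
import pandas as pd

# Stand-in for the pivoted frame: Collection2 has no 'Data 2' value.
wide = pd.DataFrame(
    {"Data 1": [10, 33, 14], "Data 2": [12.0, None, 31.0]},
    index=["Collection1", "Collection2", "Collection3"],
)

# Fill the gap, then cast back to integers.
filled = wide.fillna(0).astype(int)
print(filled)
```

The cast is only safe when all values are whole numbers; for fractional data the float output should be kept.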