Pandas: Unpivot Excel Data Where Category and Children Labels Are In Same Column

Question

Basically, I think the easiest way to explain this is that I am trying to expand a multi-indexed table, but both index levels are in the same column.

My data is structured like this:

Row Labels   Sum
Collection1   22
Data 1        10
Data 2        12
Collection2   33
Data 1        33
Collection3   45
Data 1        14
Data 2        31
Total        100

What I would like to get out is a DataFrame like this:

Row Labels   Data1  Data2  Sum
Collection1     10     12   22
Collection2     33      0   33
Collection3     14     31   45
Total           57     43  100

Are there any built-in pandas methods, or a straightforward approach, for handling this type of transformation?

I have tried manually breaking down the table and recreating it: collecting the row labels that are repeated and making columns from them, filled with the data from the rows with that label. The tricky spot is where child data is missing; in the example above, Collection2 has no Data2 row. With this approach I could check each row and, where a child such as Data2 is missing for a collection, add a 0 for it at that index. But that seems super ugly, and I figure there is probably a much more elegant approach.

Answer 1

Score: 2

Using a pivot_table:

# identify groups
m = df['Row Labels'].str.match(r'Collection\d+|Total')

# reshape
out = (df
   .assign(index=df['Row Labels'].where(m).ffill(),
           col=df['Row Labels'].mask(m, 'Sum')
          )
   .pivot_table(index='index', columns='col', values='Sum', fill_value=0)
   .rename_axis(columns=None)
)

# recompute Total
out.loc['Total'] = out.drop('Total').sum()

out = out.reset_index()

Output:

         index  Data 1  Data 2  Sum
0  Collection1      10      12   22
1  Collection2      33       0   33
2  Collection3      14      31   45
3        Total      57      43  100
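
To see why this works, it can help to look at the two helper columns before the pivot. The following is a minimal sketch that rebuilds the question's sample data by hand (the `helper` name is only for illustration): `where(m).ffill()` copies each parent label down onto its child rows, and `mask(m, 'Sum')` relabels the parent rows so that their values end up in the `Sum` column.

```python
import pandas as pd

# Sample data rebuilt from the question (assumed to match the real sheet)
df = pd.DataFrame({
    "Row Labels": ["Collection1", "Data 1", "Data 2", "Collection2", "Data 1",
                   "Collection3", "Data 1", "Data 2", "Total"],
    "Sum": [22, 10, 12, 33, 33, 45, 14, 31, 100],
})

# True for parent rows (a Collection or the Total), False for child rows
m = df["Row Labels"].str.match(r"Collection\d+|Total")

# 'index' holds the parent of every row, 'col' holds the target column name
helper = df.assign(
    index=df["Row Labels"].where(m).ffill(),
    col=df["Row Labels"].mask(m, "Sum"),
)
print(helper)
```

Every row now carries both its future row label and its future column label, which is exactly the long format that `pivot_table` expects.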

Answer 2

Score: 1

I'm not sure whether a straightforward pandas solution exists, but you can try this example:

# remove the Total row - will recreate it as last step
df = df[df["Row Labels"] != "Total"].copy()  # .copy() avoids a SettingWithCopyWarning when adding "idx" below

# find the indices for pivoting
mask = df["Row Labels"].str.startswith("Collection")
df["idx"] = mask.cumsum()

# do the actual transformation here: pivot + merge
df = (
    pd.merge(
        df[mask],
        df[~mask].pivot(index="idx", columns="Row Labels", values="Sum"),
        left_on="idx",
        right_index=True,
    )
    .drop(columns=["idx"])
    .fillna(0)
)

# add Total row back
df = pd.concat(
    [
        df,
        pd.DataFrame(
            {"Row Labels": ["Total"], **{c: [df[c].sum()] for c in df.loc[:, "Sum":]}}
        ),
    ]
)

print(df)

Prints:

    Row Labels  Sum  Data 1  Data 2
0  Collection1   22    10.0    12.0
3  Collection2   33    33.0     0.0
5  Collection3   45    14.0    31.0
0        Total  100    57.0    43.0
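
The key step in this answer is `mask.cumsum()`: each `True` (a parent row) increments the running count, so every child row inherits the number of the most recent parent. A minimal sketch of just that step, assuming the question's labels with the Total row already removed (the `labels` and `group_id` names are only for illustration):

```python
import pandas as pd

# Labels from the question, without the Total row
labels = pd.Series(["Collection1", "Data 1", "Data 2", "Collection2", "Data 1",
                    "Collection3", "Data 1", "Data 2"])

# True marks the start of a new group
mask = labels.str.startswith("Collection")

# The running count of True values gives each row its group number
group_id = mask.cumsum()

print(pd.DataFrame({"label": labels, "is_parent": mask, "group_id": group_id}))
```

Note that this numbering relies on each parent row appearing directly above its children, which is how the sheet in the question is laid out.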

Answer 3

Score: 1

Welcome to SO. In my opinion, this is about as simple as it gets:

With input data:

import pandas as pd

df = pd.DataFrame(columns = ['Row Labels', 'Sum'],
                  data =   [['Collection1', 22],
                            ['Data 1',      10],
                            ['Data 2',      12],
                            ['Collection2', 33],
                            ['Data 1',      33],
                            ['Collection3', 45],
                            ['Data 1',      14],
                            ['Data 2',      31],
                            ['Total',      100]])

    Row Labels  Sum
0  Collection1   22
1       Data 1   10
2       Data 2   12
3  Collection2   33
4       Data 1   33
5  Collection3   45
6       Data 1   14
7       Data 2   31
8        Total  100

(1) Reformulate the table so it can be pivoted

(Parse Collection and Data information)

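# Extract the Collection number into a new column and forward-fill it onto the child rows
# (raw string, and \d+ so that multi-digit numbers such as Collection10 are captured in full)
df['Collection'] = df['Row Labels'].str.extract(r"Collection\s*(\d+)", expand=False).ffill()
# Old version, removed following mozway's comment (thank you!) because it assumed
# that the Collections always come in natural order:
#df['Collection'] = df['Row Labels'].str.startswith('Coll').cumsum()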

# keep only 'Data' rows, because pivoting won't work well with redundant information.
df = df[df['Row Labels'].str.contains('Data')]

Reformulated input table:

  Row Labels  Sum  Collection
1     Data 1   10           1
2     Data 2   12           1
4     Data 1   33           2
6     Data 1   14           3
7     Data 2   31           3

(2) Pivot then complete the table with sums in both directions:

pt = pd.pivot_table(data    = df,
                    values  = 'Sum',
                    index   = 'Collection',
                    columns = 'Row Labels', 
                    fill_value=0)

# Recalculate the sums and totals
pt['Sum']       = pt.sum(axis=1)
pt.loc['Total'] = pt.sum(axis=0)

Final output:

Row Labels  Data 1  Data 2  Sum
Collection                     
1               10      12   22
2               33       0   33
3               14      31   45
Total           57      43  100
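
As a possible alternative to recomputing the totals by hand, `pd.pivot_table` can add them itself through `margins=True`. This is a minimal sketch, assuming the table has already been reduced to the 'Data' rows as in step (1); the `tidy` frame below is rebuilt by hand for illustration. Two things differ from the code above: `aggfunc='sum'` is set explicitly because the default aggregation is the mean, and the margin row and column share one name, so the last column is called `Total` rather than `Sum`.

```python
import pandas as pd

# Reformulated table from step (1), rebuilt by hand for this sketch
tidy = pd.DataFrame({
    "Row Labels": ["Data 1", "Data 2", "Data 1", "Data 1", "Data 2"],
    "Sum":        [10, 12, 33, 14, 31],
    "Collection": ["1", "1", "2", "3", "3"],
})

# margins=True appends a totals row and a totals column, both named by margins_name
pt = pd.pivot_table(
    data=tidy,
    values="Sum",
    index="Collection",
    columns="Row Labels",
    aggfunc="sum",
    fill_value=0,
    margins=True,
    margins_name="Total",
)
print(pt)
```

If the column must be called `Sum`, it can be renamed afterwards with `pt.rename(columns={'Total': 'Sum'})`.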

Answer 4

Score: 0

Another possible solution:

s = df['Row Labels'].str.startswith('Collection')

(df.assign(aux = s.cumsum())
 .pivot(index='aux', columns='Row Labels', values='Sum')
 .set_axis(df['Row Labels'].loc  # [the rest of this argument is truncated in the source]
 )
 .filter(like='Data')
 .rename_axis(None, axis=1)
 .fillna(0)
 .pipe(lambda x: pd.concat([x, x.sum().to_frame().T.set_axis(['Total'])]))
 .assign(Sum = lambda x: x.sum(axis=1))
 .reset_index(names = 'Row Labels'))

Output:

    Row Labels  Data 1  Data 2    Sum
0  Collection1    10.0    12.0   22.0
1  Collection2    33.0     0.0   33.0
2  Collection3    14.0    31.0   45.0
3        Total    57.0    43.0  100.0
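
The snippet in this answer is cut off in the middle of the `.set_axis(...)` call, so it cannot be run as shown. The following is a hedged sketch of what a complete chain could look like, not the answerer's exact code: it assumes the missing argument selects the Collection labels with the boolean mask `s`, which is written here as `df.loc[s, 'Row Labels'].to_list()`. The sample data is rebuilt from the question, and `reset_index(names=...)` needs pandas 1.5 or later.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "Row Labels": ["Collection1", "Data 1", "Data 2", "Collection2", "Data 1",
                   "Collection3", "Data 1", "Data 2", "Total"],
    "Sum": [22, 10, 12, 33, 33, 45, 14, 31, 100],
})

s = df["Row Labels"].str.startswith("Collection")

out = (
    df.assign(aux=s.cumsum())                      # group number per row
    .pivot(index="aux", columns="Row Labels", values="Sum")
    # Assumption: label each pivoted row with its Collection name
    .set_axis(df.loc[s, "Row Labels"].to_list(), axis=0)
    .filter(like="Data")                           # keep only the child columns
    .rename_axis(None, axis=1)
    .fillna(0)                                     # missing children become 0
    # Append a Total row holding the column sums
    .pipe(lambda x: pd.concat([x, x.sum().to_frame().T.set_axis(["Total"], axis=0)]))
    .assign(Sum=lambda x: x.sum(axis=1))           # row sums
    .reset_index(names="Row Labels")
)
print(out)
```

Because the pivot introduces missing values before `fillna(0)`, the numeric columns come out as floats, as in the output above.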

huangapple
  • Published on July 28, 2023 at 04:04:07
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76783069.html