July 28, 2023, 04:04:07

Pandas: Unpivot Excel Data Where Category and Children Labels Are In Same Column

Question

Basically, I think the easiest way to explain this is: I am trying to expand a multi-indexed table, but both index levels are in the same column.

My data is structured like this:

Row Labels	Sum
Collection1	22
Data 1	10
Data 2	12
Collection2	33
Data 1	33
Collection3	45
Data 1	14
Data 2	31
Total	100

What I would like out is a Dataframe like this:

Row Labels	Data1	Data2	Sum
Collection1	10	12	22
Collection2	33	0	33
Collection3	14	31	45
Total	57	43	100

Is there a built-in pandas method, or a straightforward approach, for handling this type of transformation?

I have tried manually breaking down the table and recreating it: collecting the row labels that are repeated and making columns from them, filled with the data from the rows with that label. The tricky spot is where child data is missing; in the example above, Collection2 has no Data 2 row. With this approach I could check, for each collection, whether a given child row exists and, if it does not, add a 0 for that child at that index. But that seems super ugly, and I figured there is probably a much more elegant approach.

Answer 1

Score: 2

Using a pivot_table:

# identify groups
m = df['Row Labels'].str.match(r'Collection\d+|Total')
# reshape
out = (df
   .assign(index=df['Row Labels'].where(m).ffill(),
           col=df['Row Labels'].mask(m, 'Sum')
          )
   .pivot_table(index='index', columns='col', values='Sum', fill_value=0)
   .rename_axis(columns=None)
)
# recompute Total
out.loc['Total'] = out.drop('Total').sum()
out = out.reset_index()

Output:

         index  Data 1  Data 2  Sum
0  Collection1      10      12   22
1  Collection2      33       0   33
2  Collection3      14      31   45
3        Total      57      43  100
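
To see why this reshaping works, it may help to look at the two helper columns before they are pivoted. The sketch below rebuilds the sample table from the question (the name `helper` is only illustrative) and prints the intermediate frame: `where(m).ffill()` carries each parent label down onto its children, while `mask(m, 'Sum')` relabels the parent rows so their values land in the `Sum` column.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Row Labels": ["Collection1", "Data 1", "Data 2", "Collection2", "Data 1",
                   "Collection3", "Data 1", "Data 2", "Total"],
    "Sum": [22, 10, 12, 33, 33, 45, 14, 31, 100],
})

# True for parent rows (collections and the grand total), False for children
m = df["Row Labels"].str.match(r"Collection\d+|Total")

# "helper" is an illustrative name, not part of the answer above
helper = df.assign(
    index=df["Row Labels"].where(m).ffill(),  # parent label repeated on each child row
    col=df["Row Labels"].mask(m, "Sum"),      # child label, or 'Sum' on parent rows
)
print(helper)
```

Every row now carries a (parent, column) pair, which is the tidy shape that `pivot_table` expects; missing pairs such as Collection2 / Data 2 are simply absent and get filled by `fill_value=0`.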

Answer 2

Score: 1

I'm not sure whether a straightforward pandas solution exists, but you can try this example:

# remove the Total row - will recreate it as last step
# (.copy() avoids a SettingWithCopyWarning when the "idx" column is added below)
df = df[df["Row Labels"] != "Total"].copy()
# find the indices for pivoting
mask = df["Row Labels"].str.startswith("Collection")
df["idx"] = mask.cumsum()
# do the actual transformation here: pivot + merge
df = (
    pd.merge(
        df[mask],
        df[~mask].pivot(index="idx", columns="Row Labels", values="Sum"),
        left_on="idx",
        right_index=True,
    )
    .drop(columns=["idx"])
    .fillna(0)
)
# add Total row back
df = pd.concat(
    [
        df,
        pd.DataFrame(
            {"Row Labels": ["Total"], **{c: [df[c].sum()] for c in df.loc[:, "Sum":]}}
        ),
    ]
)
print(df)

Prints:

    Row Labels  Sum  Data 1  Data 2
0  Collection1   22    10.0    12.0
3  Collection2   33    33.0     0.0
5  Collection3   45    14.0    31.0
0        Total  100    57.0    43.0
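
The float columns and the repeated index labels in this output are side effects rather than requirements: the pivot leaves `NaN` where a child is missing, which forces a float dtype, and `pd.concat` keeps the original row labels. A minimal sketch of one way to tidy this up, assuming every value is a whole number (the names `body`, `value_cols`, `total` and `result` are illustrative, and `body` is typed in by hand to mimic the merged frame before filling):

```python
import pandas as pd

# Hand-built stand-in for the merged frame before .fillna(0)
body = pd.DataFrame(
    {
        "Row Labels": ["Collection1", "Collection2", "Collection3"],
        "Sum": [22, 33, 45],
        "Data 1": [10.0, 33.0, 14.0],
        "Data 2": [12.0, float("nan"), 31.0],
    },
    index=[0, 3, 5],
)
value_cols = ["Sum", "Data 1", "Data 2"]

# Fill the gaps first, then cast back to integers (only safe for whole numbers)
body = body.fillna(0).astype({c: int for c in value_cols})

# Build the Total row and renumber the index while concatenating
total = pd.DataFrame({"Row Labels": ["Total"], **{c: [body[c].sum()] for c in value_cols}})
result = pd.concat([body, total], ignore_index=True)
print(result)
```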

Answer 3

Score: 1

Welcome to SO. Simplest lines possible, in my opinion:

With input data:

df = pd.DataFrame(columns = ['Row Labels', 'Sum'],
                  data =   [['Collection1', 22],
                            ['Data 1',      10],
                            ['Data 2',      12],
                            ['Collection2', 33],
                            ['Data 1',      33],
                            ['Collection3', 45],
                            ['Data 1',      14],
                            ['Data 2',      31],
                            ['Total',      100]])

    Row Labels  Sum
0  Collection1   22
1       Data 1   10
2       Data 2   12
3  Collection2   33
4       Data 1   33
5  Collection3   45
6       Data 1   14
7       Data 2   31
8        Total  100

(1) Reformulate the table so it can be pivoted

(Parse Collection and Data information)

# Extract the Collection number into a new column and forward-fill it onto the child rows
# (raw string avoids invalid-escape warnings; \d+ also handles multi-digit collection numbers)
df['Collection'] = df['Row Labels'].str.extract(r"Collection\s*(\d+)", expand=False).ffill()
#df['Collection'] = df['Row Labels'].str.startswith('Coll').cumsum()#old: assumed Collection always came in natural order -removed following mozway's comment, thank you!
# keep only 'Data' rows, because pivoting won't work well with redundant information.
df = df[df['Row Labels'].str.contains('Data')]

Reformulated input table:

  Row Labels  Sum  Collection
1     Data 1   10           1
2     Data 2   12           1
4     Data 1   33           2
6     Data 1   14           3
7     Data 2   31           3

(2) Pivot then complete the table with sums in both directions:

pt = pd.pivot_table(data    = df,
                    values  = 'Sum',
                    index   = 'Collection',
                    columns = 'Row Labels', 
                    fill_value=0)
# Recalculate the sums and totals
pt['Sum']       = pt.sum(axis=1)
pt.loc['Total'] = pt.sum(axis=0)

Final output:

Row Labels  Data 1  Data 2  Sum
Collection                     
1               10      12   22
2               33       0   33
3               14      31   45
Total           57      43  100
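
As a possible variation on step (2), `pivot_table` can add the row and column totals itself through its `margins` option. This is a sketch rather than part of the answer above: the frame `tidy` is typed in by hand to match the reformulated table, and `aggfunc="sum"` is set explicitly so that the margins are sums rather than means.

```python
import pandas as pd

# Hand-built copy of the reformulated table (children only, with their collection)
tidy = pd.DataFrame({
    "Collection": ["1", "1", "2", "3", "3"],
    "Row Labels": ["Data 1", "Data 2", "Data 1", "Data 1", "Data 2"],
    "Sum": [10, 12, 33, 14, 31],
})

pt = pd.pivot_table(
    tidy,
    values="Sum",
    index="Collection",
    columns="Row Labels",
    aggfunc="sum",         # sum instead of the default mean
    fill_value=0,          # missing children become 0
    margins=True,          # append a totals row and a totals column
    margins_name="Total",  # label used for both margins
)
print(pt)
```

Note that both margins share the single `margins_name`, so the totals column is called `Total` here; it can be renamed to `Sum` afterwards if the layout from the question is wanted.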

Answer 4

Score: 0

Another possible solution:

s = df['Row Labels'].str.startswith('Collection')
(df.assign(aux = s.cumsum())
 .pivot(index='aux', columns='Row Labels', values='Sum')
 .set_axis(df['Row Labels'].loc[s])
 .filter(like='Data')
 .rename_axis(None, axis=1)
 .fillna(0)
 .pipe(lambda x: pd.concat([x, x.sum().to_frame().T.set_axis(['Total'])]))
 .assign(Sum = lambda x: x.sum(axis=1))
 .reset_index(names = 'Row Labels'))

Output:

    Row Labels  Data 1  Data 2    Sum
0  Collection1    10.0    12.0   22.0
1  Collection2    33.0     0.0   33.0
2  Collection3    14.0    31.0   45.0
3        Total    57.0    43.0  100.0
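
Several of the answers discard the subtotal rows from the sheet and recompute them. Before doing that, it may be worth confirming that the children really add up to the stated subtotals. The following is only a sketch of such a check on the sample data; the helper names (`body`, `stated`, `computed`, `mismatches`) are illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Row Labels": ["Collection1", "Data 1", "Data 2", "Collection2", "Data 1",
                   "Collection3", "Data 1", "Data 2", "Total"],
    "Sum": [22, 10, 12, 33, 33, 45, 14, 31, 100],
})

# Work on everything except the grand-total row
body = df[df["Row Labels"] != "Total"].copy()
body["is_parent"] = body["Row Labels"].str.startswith("Collection")
body["group"] = body["is_parent"].cumsum()  # 1, 2, 3, ... per collection block

# Subtotal stated in the sheet vs. sum of the child rows, aligned by group
stated = body[body["is_parent"]].set_index("group")["Sum"]
computed = (
    body[~body["is_parent"]]
    .groupby("group")["Sum"].sum()
    .reindex(stated.index, fill_value=0)  # collections without children count as 0
)

mismatches = stated[stated != computed]
grand_total = df.loc[df["Row Labels"] == "Total", "Sum"].item()
print(mismatches.empty, grand_total == stated.sum())
```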
