July 18, 2023 04:32:16

Is it even possible to reshape this using Pandas?

Question

I have this dataframe (with all 50 states and more categories, but if we can get this to work, I can apply it to the whole dataset):

                   US     US_bp   US_bp%       AL   AL_bp  AL_bp%
total_pop   324173084  41393176     12.8  4920613  794326    16.1
white       198511109  19510415      9.8  3218517  378269    11.8
black        38526055   8402643     21.8  1260356  327284    26.0

And need this:

    place  total_pop        bp    bp%  white_pop  white_bp  white_bp%  black_pop  black_bp  black_bp%
0      US  324173084  41393176   12.8  198511109  19510415        9.8   38526055   8402643       21.8
1      AL    4920613    794326   16.1    3218517    378269       11.8    1260356    327284       26.0

How can I do this with Pandas? I tried melt and pivot, but can't wrap my head around what might work.
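
To reproduce the problem, the sample frame above can be rebuilt roughly like this. It is a minimal sketch: the values are copied from the table, and the variable name df is an assumption chosen to match what the answers use.

```python
import pandas as pd

# Sample frame copied from the table in the question (US and AL only).
df = pd.DataFrame(
    {
        "US": [324173084, 198511109, 38526055],
        "US_bp": [41393176, 19510415, 8402643],
        "US_bp%": [12.8, 9.8, 21.8],
        "AL": [4920613, 3218517, 1260356],
        "AL_bp": [794326, 378269, 327284],
        "AL_bp%": [16.1, 11.8, 26.0],
    },
    index=["total_pop", "white", "black"],
)
print(df)
```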

Answer 1

Score: 3

Here's a partial solution: Convert the columns into a MultiIndex, then stack and unstack.

I'm using a regex for the first step: Extract two uppercase characters optionally followed by an underscore plus the other part.

col_pairs = df.columns.str.extract(r'^([A-Z]{2})(?:_(.*))?$').fillna('pop')
df.columns = pd.MultiIndex.from_arrays(col_pairs.T.values)
(df
    .stack(level=0)
    .unstack(level=0)
    .swaplevel(axis=1)
    # Fix order
    .reindex(df.index, axis=1, level=0)
    .reindex(df.columns.get_level_values(1).unique(), axis=1, level=1)
    .reindex(df.columns.get_level_values(0).unique())
    )

   total_pop                      white                     black               
         pop        bp   bp%        pop        bp   bp%       pop       bp   bp%
US 324173084  41393176  12.8  198511109  19510415   9.8  38526055  8402643  21.8
AL   4920613    794326  16.1    3218517    378269  11.8   1260356   327284  26.0

I'd prefer it if stacking/unstacking didn't sort, but at least you can fix that by reindexing [1]. I also haven't bothered to get the labels exactly as you want them, but most of that is easy.

Due credit to sammywemmy for the stack/unstack technique [2].
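
The remaining label cleanup could be sketched as follows. This is not part of the original answer: the sample frame is rebuilt from the question, and the helper flat_label and the names wide, result, categories, measures and places are made up for illustration. The order is fixed here with a single explicit MultiIndex.from_product reindex rather than the per-level reindex calls above, so the result does not depend on whether stack/unstack sorts.

```python
import pandas as pd

# Sample frame from the question (US and AL only).
df = pd.DataFrame(
    {
        "US": [324173084, 198511109, 38526055],
        "US_bp": [41393176, 19510415, 8402643],
        "US_bp%": [12.8, 9.8, 21.8],
        "AL": [4920613, 3218517, 1260356],
        "AL_bp": [794326, 378269, 327284],
        "AL_bp%": [16.1, 11.8, 26.0],
    },
    index=["total_pop", "white", "black"],
)

# Split "US_bp%" into ("US", "bp%"); bare state codes get the measure "pop".
col_pairs = df.columns.str.extract(r"^([A-Z]{2})(?:_(.*))?$").fillna("pop")
df.columns = pd.MultiIndex.from_frame(col_pairs, names=["place", "measure"])

# Remember the original orderings before reshaping.
categories = df.index
measures = df.columns.get_level_values("measure").unique()
places = df.columns.get_level_values("place").unique()

wide = (
    df.stack(level="place")   # rows: (category, place), columns: measure
    .unstack(level=0)         # rows: place, columns: (measure, category)
    .swaplevel(axis=1)        # columns: (category, measure)
    .reindex(columns=pd.MultiIndex.from_product([categories, measures]))
    .reindex(places)
)


def flat_label(category, measure):
    """Hypothetical naming rule that reproduces the labels in the question."""
    if category == "total_pop":
        return "total_pop" if measure == "pop" else measure
    return f"{category}_{measure}"


wide.columns = [flat_label(c, m) for c, m in wide.columns]
result = wide.rename_axis("place").reset_index()
print(result)
```

Depending on the pandas version, the stack call may emit a FutureWarning about its newer implementation; the explicit reindex is meant to give the same layout either way.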

[1] https://stackoverflow.com/a/48636483/4518341 (jezrael's answer to "How can i unstack without sorting in pandas?")
[2] https://stackoverflow.com/a/76708707/4518341

Answer 2

Score: 1

Another solution, using pd.wide_to_long:

stubnames = sorted(set(c.split('_')[0] for c in df.columns))
df.columns = (c if '_' in c else c + '_total_pop' for c in df.columns)
df = df.rename(index={'total_pop': ''})
df = pd.wide_to_long(df.reset_index(), stubnames, i='index', j='xxx', sep='_', suffix=r'.*').T
df.columns = (f'{a}_{b}'.strip('_') for a, b in df.columns)
print(df)

Prints:

      total_pop  white_total_pop  black_total_pop          bp    white_bp   black_bp   bp%  white_bp%  black_bp%
AL    4920613.0        3218517.0        1260356.0    794326.0    378269.0   327284.0  16.1       11.8       26.0
US  324173084.0      198511109.0       38526055.0  41393176.0  19510415.0  8402643.0  12.8        9.8       21.8
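
The counts print as floats above, most likely because the long format puts counts and percentages in the same column before the transpose. A small follow-up sketch for casting the counts back to integers and restoring the US/AL row order is below. It is an illustration only: it starts from a hard-coded copy of three columns of the printed result rather than rerunning wide_to_long, and the names out and count_cols are made up.

```python
import pandas as pd

# Hard-coded copy of part of the printed result above (illustration only).
out = pd.DataFrame(
    {
        "total_pop": [4920613.0, 324173084.0],
        "bp": [794326.0, 41393176.0],
        "bp%": [16.1, 12.8],
    },
    index=["AL", "US"],
)

# Every column that is not a percentage holds whole-number counts.
count_cols = [c for c in out.columns if not c.endswith("%")]
out = out.astype({c: "int64" for c in count_cols})

# Restore the row order from the question and turn the index into a column.
out = out.reindex(["US", "AL"]).rename_axis("place").reset_index()
print(out)
```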

Answer 3

Score: 1

Rename the columns that have no _ suffix (the plain population columns) by appending _pop:

df.columns = [f"{col}_pop" if "bp" not in col else col for col in df]

Create a MultiIndex from the columns:

df.columns = df.columns.str.split("_", expand=True)

Reshape the dataframe with a combination of stack, unstack, and swaplevel:

df = df.stack(level=0).unstack(level=0).swaplevel(axis=1)

Run a list comprehension to build the final column labels so they match your expected output:

df.columns = [first if last in first 
              else last if first == "total_pop" 
              else f"{first}_{last}" 
              for first, last in df]
df.index.name = 'place'

       black_bp        bp  white_bp  black_bp%   bp%  white_bp%  black_pop  total_pop  white_pop
place                                                                                           
AL       327284    794326    378269       26.0  16.1       11.8    1260356    4920613    3218517
US      8402643  41393176  19510415       21.8  12.8        9.8   38526055  324173084  198511109
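
The result above comes out sorted alphabetically rather than in the layout from the question. One way to restore that layout is sketched below. It is an assumption-laden illustration, not part of the original answer: the sample frame is rebuilt from the question, the helper label wraps the same naming rule as the list comprehension, and categories, places, wanted and result are made-up names.

```python
import pandas as pd

# Sample frame from the question (US and AL only).
df = pd.DataFrame(
    {
        "US": [324173084, 198511109, 38526055],
        "US_bp": [41393176, 19510415, 8402643],
        "US_bp%": [12.8, 9.8, 21.8],
        "AL": [4920613, 3218517, 1260356],
        "AL_bp": [794326, 378269, 327284],
        "AL_bp%": [16.1, 11.8, 26.0],
    },
    index=["total_pop", "white", "black"],
)
categories = list(df.index)  # original row order: total_pop, white, black

# Steps from the answer: add the _pop suffix, split into a MultiIndex, reshape.
df.columns = [f"{col}_pop" if "bp" not in col else col for col in df]
df.columns = df.columns.str.split("_", expand=True)
places = list(df.columns.get_level_values(0).unique())  # US, AL
out = df.stack(level=0).unstack(level=0).swaplevel(axis=1)


def label(first, last):
    """Same naming rule as the list comprehension in the answer."""
    if last in first:
        return first
    if first == "total_pop":
        return last
    return f"{first}_{last}"


out.columns = [label(first, last) for first, last in out.columns]

# Build the wanted column order from the original row order and measure order.
wanted = [label(cat, m) for cat in categories for m in ("pop", "bp", "bp%")]
result = out[wanted].reindex(places).rename_axis("place").reset_index()
print(result)
```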
