Is it even possible to reshape this using Pandas?

Question

I have this dataframe (with all 50 states and more categories, but if we can get this to work, I can apply it to the whole dataset):

                   US     US_bp   US_bp%       AL   AL_bp  AL_bp%
total_pop   324173084  41393176     12.8  4920613  794326    16.1
white       198511109  19510415      9.8  3218517  378269    11.8
black        38526055   8402643     21.8  1260356  327284    26.0

And need this:

  place  total_pop        bp   bp%  white_pop  white_bp  white_bp%  black_pop  black_bp  black_bp%
0    US  324173084  41393176  12.8  198511109  19510415        9.8   38526055   8402643       21.8
1    AL    4920613    794326  16.1    3218517    378269       11.8    1260356    327284       26.0

How can I do this with Pandas? I tried melt and pivot, but can't wrap my head around what might work.

Answer 1

Score: 3

Here's a partial solution: Convert the columns into a MultiIndex, then stack and unstack.

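I'm using a regex for the first step: it captures the two uppercase letters of the place code, optionally followed by an underscore and a suffix. Columns with no suffix (the plain population counts) are labelled 'pop' via fillna.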

col_pairs = df.columns.str.extract(r'^([A-Z]{2})(?:_(.*))?$').fillna('pop')
df.columns = pd.MultiIndex.from_arrays(col_pairs.T.values)
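To see what the regex step yields before it becomes a MultiIndex, here is a minimal sketch on the two-place sample from the question. The frame is rebuilt by hand from the question's table, and the variable name pairs is only there for illustration.

```python
import pandas as pd

# Sample frame from the question (US and AL only), rebuilt by hand.
df = pd.DataFrame(
    {
        "US": [324173084, 198511109, 38526055],
        "US_bp": [41393176, 19510415, 8402643],
        "US_bp%": [12.8, 9.8, 21.8],
        "AL": [4920613, 3218517, 1260356],
        "AL_bp": [794326, 378269, 327284],
        "AL_bp%": [16.1, 11.8, 26.0],
    },
    index=["total_pop", "white", "black"],
)

# Group 1 captures the two-letter place code; group 2 captures the
# optional suffix after the underscore.
col_pairs = df.columns.str.extract(r"^([A-Z]{2})(?:_(.*))?$")

# Columns with no suffix ("US", "AL") give NaN in group 2; label them "pop".
col_pairs = col_pairs.fillna("pop")

# One (place, measure) tuple per original column, in the original order.
pairs = list(col_pairs.itertuples(index=False, name=None))
print(pairs)
```

Each tuple becomes one entry of the two-level column index: the place on level 0 and the measure on level 1.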

(df
    .stack(level=0)
    .unstack(level=0)
    .swaplevel(axis=1)
    # Fix order
    .reindex(df.index, axis=1, level=0)
    .reindex(df.columns.get_level_values(1).unique(), axis=1, level=1)
    .reindex(df.columns.get_level_values(0).unique())
    )

   total_pop                      white                     black               
         pop        bp   bp%        pop        bp   bp%       pop       bp   bp%
US 324173084  41393176  12.8  198511109  19510415   9.8  38526055  8402643  21.8
AL   4920613    794326  16.1    3218517    378269  11.8   1260356   327284  26.0

I'd prefer if stacking/unstacking didn't sort, but at least you can fix that by reindexing. I'm also not bothering to get the labels exactly as you want them, but most of that's easy.
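For the remaining label work, one possible way to get from the two-level columns to the flat names in the question is sketched below. It restores the order with one explicit reindex against a MultiIndex.from_product, which is a swapped-in alternative to the level-by-level reindex calls above. The helper flat_label and the names wide and result are hypothetical.

```python
import pandas as pd

# Sample frame from the question (US and AL only), rebuilt by hand.
df = pd.DataFrame(
    {
        "US": [324173084, 198511109, 38526055],
        "US_bp": [41393176, 19510415, 8402643],
        "US_bp%": [12.8, 9.8, 21.8],
        "AL": [4920613, 3218517, 1260356],
        "AL_bp": [794326, 378269, 327284],
        "AL_bp%": [16.1, 11.8, 26.0],
    },
    index=["total_pop", "white", "black"],
)

col_pairs = df.columns.str.extract(r"^([A-Z]{2})(?:_(.*))?$").fillna("pop")
df.columns = pd.MultiIndex.from_arrays(col_pairs.T.values)

# Record the original orders up front, since stack/unstack may sort.
places = list(df.columns.get_level_values(0).unique())
measures = list(df.columns.get_level_values(1).unique())
categories = list(df.index)

wide = df.stack(level=0).unstack(level=0).swaplevel(axis=1)

# Restore the row order and the (category, measure) column order explicitly.
wide = wide.reindex(
    index=places,
    columns=pd.MultiIndex.from_product([categories, measures]),
)


def flat_label(category, measure):
    """Hypothetical helper: join both levels the way the question labels them."""
    if category == "total_pop":
        return category if measure == "pop" else measure
    return f"{category}_{measure}"


wide.columns = [flat_label(c, m) for c, m in wide.columns]
result = wide.rename_axis("place").reset_index()
print(result)
```

The special case for total_pop exists only because the question leaves the prefix off bp and bp% for the totals. With more categories the same helper should apply unchanged.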

Due credit to sammywemmy for the stack/unstack technique [2].

1: https://stackoverflow.com/a/48636483/4518341 (jezrael's answer on "How can i unstack without sorting in pandas?", for the reindexing fix)
2: https://stackoverflow.com/a/76708707/4518341

Answer 2

Score: 1

Another solution, using pd.wide_to_long:

stubnames = sorted(set(c.split('_')[0] for c in df.columns))
df.columns = (c if '_' in c else c + '_total_pop' for c in df.columns)
df = df.rename(index={'total_pop': ''})

df = pd.wide_to_long(df.reset_index(), stubnames, i='index', j='xxx', sep='_', suffix=r'.*').T
df.columns = (f'{a}_{b}'.strip('_') for a, b in df.columns)

print(df)

Prints:

      total_pop  white_total_pop  black_total_pop          bp    white_bp   black_bp   bp%  white_bp%  black_bp%
AL    4920613.0        3218517.0        1260356.0    794326.0    378269.0   327284.0  16.1       11.8       26.0
US  324173084.0      198511109.0       38526055.0  41393176.0  19510415.0  8402643.0  12.8        9.8       21.8
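The printed counts come back as floats (for example 4920613.0) because the reshape mixes integer and percentage columns. Here is a sketch of one way to tidy that up, assuming there are no missing values: cast every non-percentage column back to int64 and shorten the _total_pop suffix to _pop so the labels match the question. The names out and count_cols are illustrative only.

```python
import pandas as pd

# Sample frame from the question (US and AL only), rebuilt by hand.
df = pd.DataFrame(
    {
        "US": [324173084, 198511109, 38526055],
        "US_bp": [41393176, 19510415, 8402643],
        "US_bp%": [12.8, 9.8, 21.8],
        "AL": [4920613, 3218517, 1260356],
        "AL_bp": [794326, 378269, 327284],
        "AL_bp%": [16.1, 11.8, 26.0],
    },
    index=["total_pop", "white", "black"],
)

# Same reshape as above, with lists instead of generators for the labels.
stubnames = sorted({c.split("_")[0] for c in df.columns})
df.columns = [c if "_" in c else f"{c}_total_pop" for c in df.columns]
df = df.rename(index={"total_pop": ""})

out = pd.wide_to_long(
    df.reset_index(), stubnames, i="index", j="xxx", sep="_", suffix=r".*"
).T
out.columns = [f"{a}_{b}".strip("_") for a, b in out.columns]

# "white_total_pop" -> "white_pop"; plain "total_pop" has no leading
# underscore, so it is left untouched.
out.columns = out.columns.str.replace("_total_pop", "_pop", regex=False)

# Cast the count columns back to integers; percentages stay as floats.
count_cols = [c for c in out.columns if not c.endswith("%")]
out = out.astype({c: "int64" for c in count_cols})
print(out.dtypes)
```

If the full dataset has gaps, the plain int64 cast would fail, and pandas' nullable "Int64" dtype may be a better fit.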

Answer 3

Score: 1

  • Rename the columns that have no _ suffix (those without bp in the name) by appending _pop:
df.columns = [f"{col}_pop" if "bp" not in col else col for col in df]
  • Create a MultiIndex from the columns:
df.columns = df.columns.str.split("_", expand=True)
  • Reshape the dataframe with a combination of stack, unstack, and swaplevel:
df = df.stack(level=0).unstack(level=0).swaplevel(axis=1)
  • Use a list comprehension to build the final column names to match your expected output:
df.columns = [first if last in first
              else last if first == "total_pop"
              else f"{first}_{last}"
              for first, last in df]

df.index.name = 'place'

       black_bp        bp  white_bp  black_bp%   bp%  white_bp%  black_pop  total_pop  white_pop
place                                                                                           
AL       327284    794326    378269       26.0  16.1       11.8    1260356    4920613    3218517
US      8402643  41393176  19510415       21.8  12.8        9.8   38526055  324173084  198511109
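The result above comes out in sorted order rather than the order requested in the question. As a sketch, the original orders can be recorded before reshaping and applied at the end with .loc. The names categories, places, wanted, and result are illustrative and not part of the answer.

```python
import pandas as pd

# Sample frame from the question (US and AL only), rebuilt by hand.
df = pd.DataFrame(
    {
        "US": [324173084, 198511109, 38526055],
        "US_bp": [41393176, 19510415, 8402643],
        "US_bp%": [12.8, 9.8, 21.8],
        "AL": [4920613, 3218517, 1260356],
        "AL_bp": [794326, 378269, 327284],
        "AL_bp%": [16.1, 11.8, 26.0],
    },
    index=["total_pop", "white", "black"],
)

# Record the original orders before reshaping.
categories = list(df.index)
places = list(dict.fromkeys(c.split("_")[0] for c in df.columns))

# The steps from the answer above.
df.columns = [f"{col}_pop" if "bp" not in col else col for col in df]
df.columns = df.columns.str.split("_", expand=True)
df = df.stack(level=0).unstack(level=0).swaplevel(axis=1)
df.columns = [first if last in first
              else last if first == "total_pop"
              else f"{first}_{last}"
              for first, last in df]
df.index.name = "place"

# Build the requested column order: pop, bp, bp% for each category in turn.
wanted = []
for cat in categories:
    prefix = "" if cat == "total_pop" else f"{cat}_"
    pop_col = cat if cat == "total_pop" else f"{cat}_pop"
    wanted += [pop_col, f"{prefix}bp", f"{prefix}bp%"]

# Select rows and columns in the original order, then turn place into a column.
result = df.loc[places, wanted].reset_index()
print(result)
```

Selecting by explicit labels keeps the final order independent of whether the installed pandas version sorts during stack and unstack.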
