Is it even possible to reshape this using Pandas?
Question
I have this dataframe (with all 50 states and more categories, but if we can get this to work, I can apply it to the whole dataset):
                  US     US_bp  US_bp%       AL   AL_bp  AL_bp%
total_pop  324173084  41393176    12.8  4920613  794326    16.1
white      198511109  19510415     9.8  3218517  378269    11.8
black       38526055   8402643    21.8  1260356  327284    26.0
And need this:
   place  total_pop        bp   bp%  white_pop  white_bp  white_bp%  black_pop  black_bp  black_bp%
0     US  324173084  41393176  12.8  198511109  19510415        9.8   38526055   8402643       21.8
1     AL    4920613    794326  16.1    3218517    378269       11.8    1260356    327284       26.0
How can I do this with Pandas? I tried melt and pivot, but can't wrap my head around what might work.
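For anyone who wants to try the answers below, here is a minimal sketch that rebuilds the sample frame from the values printed above. The variable name df is an assumption, and the real data has more places and categories.

```python
import pandas as pd

# Sample data copied from the question; only US and AL are included here.
df = pd.DataFrame(
    {
        "US": [324173084, 198511109, 38526055],
        "US_bp": [41393176, 19510415, 8402643],
        "US_bp%": [12.8, 9.8, 21.8],
        "AL": [4920613, 3218517, 1260356],
        "AL_bp": [794326, 378269, 327284],
        "AL_bp%": [16.1, 11.8, 26.0],
    },
    index=["total_pop", "white", "black"],
)

print(df)
```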
Answer 1
Score: 3
Here's a partial solution: Convert the columns into a MultiIndex, then stack and unstack.
I'm using a regex for the first step: Extract two uppercase characters optionally followed by an underscore plus the other part.
col_pairs = df.columns.str.extract(r'^([A-Z]{2})(?:_(.*))?$').fillna('pop')
df.columns = pd.MultiIndex.from_arrays(col_pairs.T.values)
(df
    .stack(level=0)
    .unstack(level=0)
    .swaplevel(axis=1)
    # Fix order
    .reindex(df.index, axis=1, level=0)
    .reindex(df.columns.get_level_values(1).unique(), axis=1, level=1)
    .reindex(df.columns.get_level_values(0).unique())
)
    total_pop                      white                     black
          pop        bp   bp%        pop        bp   bp%       pop       bp   bp%
US  324173084  41393176  12.8  198511109  19510415   9.8  38526055  8402643  21.8
AL    4920613    794326  16.1    3218517    378269  11.8   1260356   327284  26.0
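To check the regex step in isolation, the sketch below applies the same pattern to the six sample column names and lists the resulting (place, measure) pairs. The names columns and pairs are made up for illustration.

```python
import pandas as pd

# The six column names from the sample frame in the question.
columns = pd.Index(["US", "US_bp", "US_bp%", "AL", "AL_bp", "AL_bp%"])

# Group 1 is the two-letter place code; group 2 is whatever follows the
# underscore. Bare names such as "US" have no group 2, so it is filled
# with "pop".
col_pairs = columns.str.extract(r"^([A-Z]{2})(?:_(.*))?$").fillna("pop")

pairs = list(col_pairs.itertuples(index=False, name=None))
print(pairs)
```

Note that the pattern assumes every place code is exactly two uppercase letters, which holds for US and the state abbreviations.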
I'd prefer it if stacking/unstacking didn't sort, but at least you can fix that by reindexing [1]. I haven't bothered to get the labels exactly as you want them, but most of that is easy.
Due credit to sammywemmy for the stack/unstack technique [2].
[1]: https://stackoverflow.com/a/48636483/4518341 (jezrael's answer on "How can i unstack without sorting in pandas?")
[2]: https://stackoverflow.com/a/76708707/4518341
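As a rough end-to-end sketch of this approach, the code below also flattens the labels into the names the question asks for. It is one possible way to finish the job, not necessarily what the answer's author had in mind: the helper flat_name and the variables wide, reshaped, and result are hypothetical, and the ordering is restored with an explicit MultiIndex.from_product instead of the level-wise reindex calls shown above. Depending on the pandas version, stack may emit a FutureWarning.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "US": [324173084, 198511109, 38526055],
        "US_bp": [41393176, 19510415, 8402643],
        "US_bp%": [12.8, 9.8, 21.8],
        "AL": [4920613, 3218517, 1260356],
        "AL_bp": [794326, 378269, 327284],
        "AL_bp%": [16.1, 11.8, 26.0],
    },
    index=["total_pop", "white", "black"],
)

# Split each column name into (place, measure); bare names get "pop".
col_pairs = df.columns.str.extract(r"^([A-Z]{2})(?:_(.*))?$").fillna("pop")
places = col_pairs[0].unique()    # order of first appearance: US, AL
measures = col_pairs[1].unique()  # order of first appearance: pop, bp, bp%

wide = df.copy()
wide.columns = pd.MultiIndex.from_arrays(
    [col_pairs[0], col_pairs[1]], names=["place", "measure"]
)

reshaped = (
    wide.stack(level=0)    # place moves from the columns into the rows
    .unstack(level=0)      # category (total_pop/white/black) moves to columns
    .swaplevel(axis=1)     # category on top, measure below
)

# Restore the original ordering explicitly, since stack/unstack may sort.
ordered = pd.MultiIndex.from_product([df.index, measures])
reshaped = reshaped.reindex(index=places, columns=ordered)


def flat_name(category, measure):
    """Hypothetical helper: build the flat labels used in the question."""
    if category == "total_pop":
        return "total_pop" if measure == "pop" else measure
    return f"{category}_{measure}"


reshaped.columns = [flat_name(c, m) for c, m in reshaped.columns]
result = reshaped.rename_axis("place").reset_index()
print(result)
```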
Answer 2
Score: 1
Another solution, using pd.wide_to_long:
stubnames = sorted(set(c.split('_')[0] for c in df.columns))
df.columns = (c if '_' in c else c + '_total_pop' for c in df.columns)
df = df.rename(index={'total_pop': ''})
df = pd.wide_to_long(df.reset_index(), stubnames, i='index', j='xxx', sep='_', suffix=r'.*').T
df.columns = (f'{a}_{b}'.strip('_') for a, b in df.columns)
print(df)
Prints:
      total_pop  white_total_pop  black_total_pop          bp    white_bp   black_bp   bp%  white_bp%  black_bp%
AL    4920613.0        3218517.0        1260356.0    794326.0    378269.0   327284.0  16.1       11.8       26.0
US  324173084.0      198511109.0       38526055.0  41393176.0  19510415.0  8402643.0  12.8        9.8       21.8
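To make the wide_to_long step easier to follow, here is a small sketch on a cut-down frame. It assumes every column already has the form place_measure, which is why the answer first renames the bare US and AL columns. The column name category and the j name measure are chosen only for illustration.

```python
import pandas as pd

# Cut-down version of the sample data: two categories, two measures.
wide = pd.DataFrame(
    {
        "category": ["total_pop", "white"],
        "US_pop": [324173084, 198511109],
        "US_bp": [41393176, 19510415],
        "AL_pop": [4920613, 3218517],
        "AL_bp": [794326, 378269],
    }
)

# Each stub ("US", "AL") becomes one output column; the text after the
# separator ("pop", "bp") moves into a new index level called "measure".
long = pd.wide_to_long(
    wide,
    stubnames=["US", "AL"],
    i="category",
    j="measure",
    sep="_",
    suffix=r".+",  # the default suffix only matches digits
)

print(long)
```

The result is indexed by (category, measure) with one column per place, so transposing it, as the answer does, puts the places in the rows.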
Answer 3
Score: 1
- rename the columns that don't end in a "_" suffix (the bare place names) so that they do:
df.columns = [f"{col}_pop" if "bp" not in col else col for col in df]
- create a MultiIndex from the columns:
df.columns = df.columns.str.split("_", expand=True)
- reshape the dataframe with a combination of stack, unstack, swaplevel:
df = df.stack(level=0).unstack(level=0).swaplevel(axis=1)
- run a list comprehension to build the final column names to match your expected output:
df.columns = [first if last in first
              else last if first == "total_pop"
              else f"{first}_{last}"
              for first, last in df]
df.index.name = 'place'
       black_bp        bp  white_bp  black_bp%   bp%  white_bp%  black_pop  total_pop  white_pop
place
AL       327284    794326    378269       26.0  16.1       11.8    1260356    4920613    3218517
US      8402643  41393176  19510415       21.8  12.8        9.8   38526055  324173084  198511109
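The output above comes back sorted alphabetically. One way to restore the row and column order from the question is to select both explicitly. The sketch below starts from a hand-typed copy of the output above; the names flat, wanted, and result are assumptions.

```python
import pandas as pd

# Hand-typed copy of the sorted result shown above.
flat = pd.DataFrame(
    {
        "black_bp": [327284, 8402643],
        "bp": [794326, 41393176],
        "white_bp": [378269, 19510415],
        "black_bp%": [26.0, 21.8],
        "bp%": [16.1, 12.8],
        "white_bp%": [11.8, 9.8],
        "black_pop": [1260356, 38526055],
        "total_pop": [4920613, 324173084],
        "white_pop": [3218517, 198511109],
    },
    index=pd.Index(["AL", "US"], name="place"),
)

# Column order requested in the question.
wanted = [
    "total_pop", "bp", "bp%",
    "white_pop", "white_bp", "white_bp%",
    "black_pop", "black_bp", "black_bp%",
]

# Select rows and columns in the wanted order, then turn the index into
# an ordinary "place" column with a fresh 0..n-1 index.
result = flat.loc[["US", "AL"], wanted].rename_axis("place").reset_index()
print(result)
```

For the full dataset, the row order could instead be taken from the original column names rather than typed out by hand.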