英文:
Pandas new set of columns from present set of column sliced by a categorical column
问题
| Bins | Color | Size | 
|---|---|---|
| A | 'Red' | 50 | 
| B | 'Blue' | 60 | 
英文:
I have table of form
| Bins | A_Color | B_Color | A_Size | B_Size. | 
|---|---|---|---|---|
| A | 'Red' | 50 | ||
| B | 'Blue' | 60 | 
and I wanted to have a common color and size columns instead of a set of columns for each category like this
| Bins | Color | Size | 
|---|---|---|
| A | 'Red' | 50 | 
| B | 'Blue' | 60 | 
I tried the below however got NaN values for ['Color', 'Size'] columns
bins = ['A', 'B', 'C', 'D', 'E']
    for b in bins:
        df.loc[df['Bins'] == b, ['Color', 'Size']] = \
            df.loc[df['Bins'] == b, [f'{b}_Color', f'{b}_Size']]
This is just an example, and the real data contains roughly 100K rows and more than 300+ columns.
答案1
得分: 2
你可以使用 pd.wide_to_long。只需将列重命名以匹配格式(A_Color -> Color_A)
>>> (pd.wide_to_long(df.rename(columns=lambda x: '_'.join(x.split('_')[::-1])), 
                     stubnames=['Color', 'Size'], i='Category', j='Cat', 
                     sep='_', suffix='\w+')
       .query('Category == Cat').droplevel('Cat').reset_index())
  Category Color  Size
0        A   Red  50.0
1        B  Blue 60.0
详细信息:
# 重命名列
>>> df1 = df.rename(columns=lambda x: '_'.join(x.split('_')[::-1]))
  Category Color_A Color_B  Size_A  Size_B  # <- 在这里
0        A     Red     NaN    50.0     NaN
1        B     NaN    Blue     NaN    60.0
# 重塑数据框
>>> out = pd.wide_to_long(df1, stubnames=['Color', 'Size'], i='Category', j='Cat', sep='_', suffix='\w+')
             Color  Size
Category Cat            
A        A     Red  50.0  # 保留
B        A     NaN   NaN  # 删除
A        B     NaN   NaN  # 删除
B        B    Blue  60.0  # 保留
# 过滤行
>>> out = out.query('Category == Cat')
             Color  Size
Category Cat            
A        A     Red  50.0
B        B    Blue  60.0
# 获取最终数据框
>>> out = out.droplevel('Cat').reset_index()
  Category Color  Size
0        A   Red  50.0
1        B  Blue  60.0
英文:
You can use pd.wide_to_long. You just have to rename your columns to match the format (A_Color -> Color_A)
>>> (pd.wide_to_long(df.rename(columns=lambda x: '_'.join(x.split('_')[::-1])), 
                     stubnames=['Color', 'Size'], i='Category', j='Cat', 
                     sep='_', suffix='\w+')
       .query('Category == Cat').droplevel('Cat').reset_index())
  Category Color  Size
0        A   Red  50.0
1        B  Blue  60.0
Details:
# Rename columns
>>> df1 = df.rename(columns=lambda x: '_'.join(x.split('_')[::-1]))
  Category Color_A Color_B  Size_A  Size_B  # <- HERE
0        A     Red     NaN    50.0     NaN
1        B     NaN    Blue     NaN    60.0
# Reshape dataframe
>>> out = pd.wide_to_long(df1, stubnames=['Color', 'Size'], i='Category', j='Cat', sep='_', suffix='\w+')
             Color  Size
Category Cat            
A        A     Red  50.0  # Keep
B        A     NaN   NaN  # Drop
A        B     NaN   NaN  # Drop
B        B    Blue  60.0  # Keep
# Filter rows
>>> out = out.query('Category == Cat')
             Color  Size
Category Cat            
A        A     Red  50.0
B        B    Blue  60.0
# Get final dataframe
>>> out = out.droplevel('Cat').reset_index()
  Category Color  Size
0        A   Red  50.0
1        B  Blue  60.0
答案2
得分: 1
代码部分不需要翻译,以下是已翻译的内容:
One idea is get first non missing value by splitted columns with _:
一个想法是通过使用下划线 _ 分割列来获取第一个非缺失值:
df1 = (df.set_index('Category')
         .groupby(lambda x: x.split('_')[-1], axis=1)
         .first()
         .reset_index())
print(df1)
  Category Color  Size
0        A   Red  50.0
1        B  Blue  60.0
Solution with lookup:
使用 lookup 的解决方案:
categories = ['Size','Color']
for c in categories:
    idx, cols = pd.factorize(df['Category'].add(f'_{c}'))
    df[c] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
Your solution should be changed by converting to numpy array:
您的解决方案应通过转换为 NumPy 数组来进行更改:
for b in bins:
    df.loc[df['Category'] == b, ['Color', 'Size']] = \
        df.loc[df['Category'] == b, [f'{b}_Color', f'{b}_Size']].to_numpy()
英文:
One idea is get first non missing value by splitted columns with _:
df1 = (df.set_index('Category')
         .groupby(lambda x: x.split('_')[-1], axis=1)
         .first()
         .reset_index())
print (df1)
  Category Color  Size
0        A   Red  50.0
1        B  Blue  60.0
Solution with lookup:
categories = ['Size','Color']
for c in categories:
    idx, cols = pd.factorize(df['Category'].add(f'_{c}'))
    df[c] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
Your solution should be changed by converting to numpy array:
for b in bins:
    df.loc[df['Category'] == b, ['Color', 'Size']] = \
        df.loc[df['Category'] == b, [f'{b}_Color', f'{b}_Size']].to_numpy()
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。


评论