Pandas: new set of columns from the present set of columns, sliced by a categorical column

Question

I have a table of the form

Bins  A_Color  B_Color  A_Size  B_Size
A     'Red'    NaN      50      NaN
B     NaN      'Blue'   NaN     60

and I want common Color and Size columns instead of a separate set of columns for each category, like this:

Bins Color Size
A 'Red' 50
B 'Blue' 60

I tried the code below, but got NaN values in the ['Color', 'Size'] columns:

bins = ['A', 'B', 'C', 'D', 'E']
for b in bins:
    df.loc[df['Bins'] == b, ['Color', 'Size']] = \
        df.loc[df['Bins'] == b, [f'{b}_Color', f'{b}_Size']]

This is just an example; the real data contains roughly 100K rows and more than 300 columns.

Answer 1

Score: 2

You can use pd.wide_to_long. You just have to rename your columns to match the format it expects (A_Color -> Color_A):

>>> (pd.wide_to_long(df.rename(columns=lambda x: '_'.join(x.split('_')[::-1])),
                     stubnames=['Color', 'Size'], i='Category', j='Cat',
                     sep='_', suffix=r'\w+')
       .query('Category == Cat').droplevel('Cat').reset_index())

  Category Color  Size
0        A   Red  50.0
1        B  Blue  60.0

Details:

# Rename columns
>>> df1 = df.rename(columns=lambda x: '_'.join(x.split('_')[::-1]))
  Category Color_A Color_B  Size_A  Size_B  # <- HERE
0        A     Red     NaN    50.0     NaN
1        B     NaN    Blue     NaN    60.0

# Reshape dataframe
>>> out = pd.wide_to_long(df1, stubnames=['Color', 'Size'], i='Category', j='Cat', sep='_', suffix=r'\w+')
             Color  Size
Category Cat            
A        A     Red  50.0  # Keep
B        A     NaN   NaN  # Drop
A        B     NaN   NaN  # Drop
B        B    Blue  60.0  # Keep

# Filter rows
>>> out = out.query('Category == Cat')
             Color  Size
Category Cat            
A        A     Red  50.0
B        B    Blue  60.0

# Get final dataframe
>>> out = out.droplevel('Cat').reset_index()
  Category Color  Size
0        A   Red  50.0
1        B  Blue  60.0
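
The steps above can be collected into one runnable sketch. It assumes a two-row sample frame rebuilt from the printed outputs, and it adds a helper column called row (a hypothetical name, not part of the answer) because pd.wide_to_long requires the i columns to identify each row uniquely, which a Category column repeated across many rows would not do on its own.

```python
import numpy as np
import pandas as pd

# Sample frame reconstructed from the outputs shown above (an assumption).
df = pd.DataFrame({
    "Category": ["A", "B"],
    "A_Color": ["Red", np.nan],
    "B_Color": [np.nan, "Blue"],
    "A_Size": [50.0, np.nan],
    "B_Size": [np.nan, 60.0],
})


def collapse_by_category(frame, key="Category", stubs=("Color", "Size")):
    # Flip "A_Color" to "Color_A" so every stub name becomes a prefix.
    flipped = frame.rename(columns=lambda c: "_".join(c.split("_")[::-1]))
    # Hypothetical helper column: gives wide_to_long a unique row identifier.
    flipped.insert(0, "row", range(len(flipped)))
    long = pd.wide_to_long(flipped, stubnames=list(stubs), i=["row", key],
                           j="Cat", sep="_", suffix=r"\w+")
    # Keep only the rows whose column suffix equals the row's own category.
    idx = long.index
    same = idx.get_level_values(key) == idx.get_level_values("Cat")
    out = long[same].droplevel("Cat").reset_index()
    # Restore the original row order and drop the helper column.
    return out.sort_values("row").drop(columns="row").reset_index(drop=True)


out = collapse_by_category(df)
print(out)
```

Filtering on the index levels does the same job as .query('Category == Cat'); it is used here only so that the key name can be passed in as a parameter.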

Answer 2

Score: 1

One idea is to take the first non-missing value within each group of columns, grouping by the part of the column name after the _:

# Note: groupby(..., axis=1) is deprecated since pandas 2.1
df1 = (df.set_index('Category')
         .groupby(lambda x: x.split('_')[-1], axis=1)
         .first()
         .reset_index())
print(df1)
  Category Color  Size
0        A   Red  50.0
1        B  Blue  60.0
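
Because grouping along axis=1 is deprecated in recent pandas releases, the same "first non-missing value per suffix" idea can be sketched without it. This is an alternative under stated assumptions, not the answerer's code: it assumes every source column ends in _Color or _Size, it uses a sample frame rebuilt from the output above, and, like the groupby version, it never looks at the Category value, so it only gives the intended result when each row has values for a single category.

```python
import numpy as np
import pandas as pd

# Sample frame reconstructed from the output shown above (an assumption).
df = pd.DataFrame({
    "Category": ["A", "B"],
    "A_Color": ["Red", np.nan],
    "B_Color": [np.nan, "Blue"],
    "A_Size": [50.0, np.nan],
    "B_Size": [np.nan, 60.0],
})


def first_non_missing(frame, key="Category", fields=("Color", "Size")):
    out = frame[[key]].copy()
    for field in fields:
        # All columns that belong to this field, e.g. A_Size, B_Size, ...
        cols = [c for c in frame.columns if c.endswith(f"_{field}")]
        # Back-filling along the columns moves each row's first
        # non-missing value into the leftmost column.
        out[field] = frame[cols].bfill(axis=1).iloc[:, 0]
    return out


df1 = first_non_missing(df)
print(df1)
```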

Solution with lookup:

import numpy as np
import pandas as pd

categories = ['Size', 'Color']

for c in categories:
    # idx[i] is the position within cols of row i's target column, e.g. 'A_Size'
    idx, cols = pd.factorize(df['Category'].add(f'_{c}'))
    df[c] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
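
To make the lookup easier to follow, here is the same idea wrapped in a small function that does not modify df. The function name pick_by_category and the three-row sample (with a made-up third row to show a repeated category) are illustrative assumptions, not part of the answer.

```python
import numpy as np
import pandas as pd

# Illustrative sample; the third row is invented to show a repeated category.
df = pd.DataFrame({
    "Category": ["A", "B", "A"],
    "A_Color": ["Red", np.nan, "Green"],
    "B_Color": [np.nan, "Blue", np.nan],
    "A_Size": [50.0, np.nan, 70.0],
    "B_Size": [np.nan, 60.0, np.nan],
})


def pick_by_category(frame, field, key="Category"):
    # Target column label for every row, e.g. "A" -> "A_Size".
    labels = frame[key] + f"_{field}"
    # codes[i] is the position of row i's label within uniques.
    codes, uniques = pd.factorize(labels)
    # Columns reordered to match uniques, as a plain 2-D array.
    values = frame.reindex(columns=uniques).to_numpy()
    # One cell per row: row i, column codes[i].
    return values[np.arange(len(frame)), codes]


result = df[["Category"]].assign(
    Color=pick_by_category(df, "Color"),
    Size=pick_by_category(df, "Size"),
)
print(result)
```

This does one pass per field rather than one per category, which may matter when there are only a few fields but many distinct categories.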

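Your solution works if you convert the right-hand side to a NumPy array, so that the assignment is positional instead of being aligned on the mismatched column labels: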

for b in bins:
    df.loc[df['Category'] == b, ['Color', 'Size']] = \
        df.loc[df['Category'] == b, [f'{b}_Color', f'{b}_Size']].to_numpy()
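
The reason for the NaN values can be reproduced on a small scale. The sketch below is a simplified illustration under assumptions: it uses only the numeric Size columns of a two-row sample, and the helper fill_size is a hypothetical name. When the right-hand side is a DataFrame, the .loc assignment matches column labels, and since A_Size is not Size, nothing lines up.

```python
import numpy as np
import pandas as pd

# Reduced sample with only the numeric columns (an assumption for brevity).
df = pd.DataFrame({
    "Category": ["A", "B"],
    "A_Size": [50.0, np.nan],
    "B_Size": [np.nan, 60.0],
})


def fill_size(frame, positional):
    out = frame.copy()
    out["Size"] = np.nan  # target column created up front
    for b in ["A", "B"]:
        mask = out["Category"] == b
        # One-column DataFrame whose column is labelled e.g. "A_Size".
        rhs = out.loc[mask, [f"{b}_Size"]]
        # A DataFrame on the right is aligned on labels ("A_Size" vs "Size"),
        # while a NumPy array is written by position.
        out.loc[mask, ["Size"]] = rhs.to_numpy() if positional else rhs
    return out["Size"]


by_label = fill_size(df, positional=False)
by_position = fill_size(df, positional=True)
print(by_label.tolist(), by_position.tolist())
```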

huangapple
  • Posted on 2023-04-04 13:56:44
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75925922.html