Pandas通过分类列从当前列集创建一组新列的切片。

huangapple go评论76阅读模式
英文:

Pandas new set of columns from present set of column sliced by a categorical column

问题

Bins Color Size
A 'Red' 50
B 'Blue' 60
英文:

I have table of form

Bins A_Color B_Color A_Size B_Size.
A 'Red' 50
B 'Blue' 60

and I wanted to have a common color and size columns instead of a set of columns for each category like this

Bins Color Size
A 'Red' 50
B 'Blue' 60

I tried the below however got NaN values for ['Color', 'Size'] columns

  1. bins = ['A', 'B', 'C', 'D', 'E']
  2. for b in bins:
  3. df.loc[df['Bins'] == b, ['Color', 'Size']] = \
  4. df.loc[df['Bins'] == b, [f'{b}_Color', f'{b}_Size']]

This is just an example, and the real data contains roughly 100K rows and more than 300+ columns.

答案1

得分: 2

你可以使用 pd.wide_to_long。只需将列重命名以匹配格式(A_Color -> Color_A)

  1. >>> (pd.wide_to_long(df.rename(columns=lambda x: '_'.join(x.split('_')[::-1])),
  2. stubnames=['Color', 'Size'], i='Category', j='Cat',
  3. sep='_', suffix='\w+')
  4. .query('Category == Cat').droplevel('Cat').reset_index())
  5. Category Color Size
  6. 0 A Red 50.0
  7. 1 B Blue 60.0

详细信息:

  1. # 重命名列
  2. >>> df1 = df.rename(columns=lambda x: '_'.join(x.split('_')[::-1]))
  3. Category Color_A Color_B Size_A Size_B # <- 在这里
  4. 0 A Red NaN 50.0 NaN
  5. 1 B NaN Blue NaN 60.0
  6. # 重塑数据框
  7. >>> out = pd.wide_to_long(df1, stubnames=['Color', 'Size'], i='Category', j='Cat', sep='_', suffix='\w+')
  8. Color Size
  9. Category Cat
  10. A A Red 50.0 # 保留
  11. B A NaN NaN # 删除
  12. A B NaN NaN # 删除
  13. B B Blue 60.0 # 保留
  14. # 过滤行
  15. >>> out = out.query('Category == Cat')
  16. Color Size
  17. Category Cat
  18. A A Red 50.0
  19. B B Blue 60.0
  20. # 获取最终数据框
  21. >>> out = out.droplevel('Cat').reset_index()
  22. Category Color Size
  23. 0 A Red 50.0
  24. 1 B Blue 60.0
英文:

You can use pd.wide_to_long. You just have to rename your columns to match the format (A_Color -> Color_A)

  1. &gt;&gt;&gt; (pd.wide_to_long(df.rename(columns=lambda x: &#39;_&#39;.join(x.split(&#39;_&#39;)[::-1])),
  2. stubnames=[&#39;Color&#39;, &#39;Size&#39;], i=&#39;Category&#39;, j=&#39;Cat&#39;,
  3. sep=&#39;_&#39;, suffix=&#39;\w+&#39;)
  4. .query(&#39;Category == Cat&#39;).droplevel(&#39;Cat&#39;).reset_index())
  5. Category Color Size
  6. 0 A Red 50.0
  7. 1 B Blue 60.0

Details:

  1. # Rename columns
  2. &gt;&gt;&gt; df1 = df.rename(columns=lambda x: &#39;_&#39;.join(x.split(&#39;_&#39;)[::-1]))
  3. Category Color_A Color_B Size_A Size_B # &lt;- HERE
  4. 0 A Red NaN 50.0 NaN
  5. 1 B NaN Blue NaN 60.0
  6. # Reshape dataframe
  7. &gt;&gt;&gt; out = pd.wide_to_long(df1, stubnames=[&#39;Color&#39;, &#39;Size&#39;], i=&#39;Category&#39;, j=&#39;Cat&#39;, sep=&#39;_&#39;, suffix=&#39;\w+&#39;)
  8. Color Size
  9. Category Cat
  10. A A Red 50.0 # Keep
  11. B A NaN NaN # Drop
  12. A B NaN NaN # Drop
  13. B B Blue 60.0 # Keep
  14. # Filter rows
  15. &gt;&gt;&gt; out = out.query(&#39;Category == Cat&#39;)
  16. Color Size
  17. Category Cat
  18. A A Red 50.0
  19. B B Blue 60.0
  20. # Get final dataframe
  21. &gt;&gt;&gt; out = out.droplevel(&#39;Cat&#39;).reset_index()
  22. Category Color Size
  23. 0 A Red 50.0
  24. 1 B Blue 60.0

答案2

得分: 1

代码部分不需要翻译,以下是已翻译的内容:

One idea is get first non missing value by splitted columns with _:

一个想法是通过使用下划线 _ 分割列来获取第一个非缺失值:

  1. df1 = (df.set_index('Category')
  2. .groupby(lambda x: x.split('_')[-1], axis=1)
  3. .first()
  4. .reset_index())
  5. print(df1)
  6. Category Color Size
  7. 0 A Red 50.0
  8. 1 B Blue 60.0

Solution with lookup:

使用 lookup 的解决方案:

  1. categories = ['Size','Color']
  2. for c in categories:
  3. idx, cols = pd.factorize(df['Category'].add(f'_{c}'))
  4. df[c] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]

Your solution should be changed by converting to numpy array:

您的解决方案应通过转换为 NumPy 数组来进行更改:

  1. for b in bins:
  2. df.loc[df['Category'] == b, ['Color', 'Size']] = \
  3. df.loc[df['Category'] == b, [f'{b}_Color', f'{b}_Size']].to_numpy()
英文:

One idea is get first non missing value by splitted columns with _:

  1. df1 = (df.set_index(&#39;Category&#39;)
  2. .groupby(lambda x: x.split(&#39;_&#39;)[-1], axis=1)
  3. .first()
  4. .reset_index())
  5. print (df1)
  6. Category Color Size
  7. 0 A Red 50.0
  8. 1 B Blue 60.0

Solution with lookup:

  1. categories = [&#39;Size&#39;,&#39;Color&#39;]
  2. for c in categories:
  3. idx, cols = pd.factorize(df[&#39;Category&#39;].add(f&#39;_{c}&#39;))
  4. df[c] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]

Your solution should be changed by converting to numpy array:

  1. for b in bins:
  2. df.loc[df[&#39;Category&#39;] == b, [&#39;Color&#39;, &#39;Size&#39;]] = \
  3. df.loc[df[&#39;Category&#39;] == b, [f&#39;{b}_Color&#39;, f&#39;{b}_Size&#39;]].to_numpy()

huangapple
  • 本文由 发表于 2023年4月4日 13:56:44
  • 转载请务必保留本文链接:https://go.coder-hub.com/75925922.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定