April 4, 2023 13:56:44

Pandas: new set of columns from an existing set of columns, selected by a categorical column

Question

I have a table of the form

Bins	A_Color	B_Color	A_Size	B_Size
A	'Red'	NaN	50	NaN
B	NaN	'Blue'	NaN	60

and I want a single common pair of Color and Size columns instead of one set of columns per category, like this:

Bins	Color	Size
A	'Red'	50
B	'Blue'	60

I tried the code below, but got NaN values in the ['Color', 'Size'] columns:

bins = ['A', 'B', 'C', 'D', 'E']
for b in bins:
    df.loc[df['Bins'] == b, ['Color', 'Size']] = \
        df.loc[df['Bins'] == b, [f'{b}_Color', f'{b}_Size']]

This is just an example; the real data contains roughly 100K rows and more than 300 columns.

Answer 1

Score: 2

You can use pd.wide_to_long. You just have to rename the columns first so that they match the stub-first format it expects (A_Color -> Color_A):

>>> (pd.wide_to_long(df.rename(columns=lambda x: '_'.join(x.split('_')[::-1])),
...                  stubnames=['Color', 'Size'], i='Category', j='Cat',
...                  sep='_', suffix=r'\w+')
...    .query('Category == Cat').droplevel('Cat').reset_index())
  Category Color  Size
0        A   Red  50.0
1        B  Blue  60.0

Details:

# Rename columns
>>> df1 = df.rename(columns=lambda x: '_'.join(x.split('_')[::-1]))
  Category Color_A Color_B  Size_A  Size_B  # <- HERE
0        A     Red     NaN    50.0     NaN
1        B     NaN    Blue     NaN    60.0
# Reshape dataframe
>>> out = pd.wide_to_long(df1, stubnames=['Color', 'Size'], i='Category', j='Cat', sep='_', suffix=r'\w+')
             Color  Size
Category Cat            
A        A     Red  50.0  # Keep
B        A     NaN   NaN  # Drop
A        B     NaN   NaN  # Drop
B        B    Blue  60.0  # Keep
# Filter rows
>>> out = out.query('Category == Cat')
             Color  Size
Category Cat            
A        A     Red  50.0
B        B    Blue  60.0
# Get final dataframe
>>> out = out.droplevel('Cat').reset_index()
  Category Color  Size
0        A   Red  50.0
1        B  Blue  60.0
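
One caveat when applying this to the full data: pd.wide_to_long requires the i column(s) to identify each row uniquely, and with roughly 100K rows the Category values will repeat. Below is a minimal sketch of one way around that, assuming a temporary row-number column is acceptable. The names collapse_by_category and row, and the three-row sample frame, are made up for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; note that category "A" appears twice.
df = pd.DataFrame({
    "Category": ["A", "B", "A"],
    "A_Color": ["Red", np.nan, "Green"],
    "B_Color": [np.nan, "Blue", np.nan],
    "A_Size": [50, np.nan, 70],
    "B_Size": [np.nan, 60, np.nan],
})


def collapse_by_category(frame, stubs, key="Category"):
    """For each row, keep the <key>_<stub> values that belong to that row's key."""
    # wide_to_long expects stub-first names, so flip A_Color -> Color_A,
    # then add a row number so that (row, key) is unique per row.
    flipped = (frame.rename(columns=lambda c: "_".join(c.split("_")[::-1]))
                    .assign(row=np.arange(len(frame))))
    long = pd.wide_to_long(flipped, stubnames=stubs, i=["row", key], j="Cat",
                           sep="_", suffix=r"\w+")
    # Keep only the entries whose column suffix equals the row's own category.
    same = long.index.get_level_values(key) == long.index.get_level_values("Cat")
    return (long[same]
            .sort_index(level="row")      # restore the original row order
            .droplevel(["row", "Cat"])
            .reset_index())


out = collapse_by_category(df, ["Color", "Size"])
print(out)
```

The intermediate long frame has one row per (original row, category) pair, so on wide data with many categories it can be several times larger than the input before the filter is applied.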

Answer 2

Score: 1

One idea is to group the columns by the part of the name after the _ and take the first non-missing value in each group:

# Note: groupby(..., axis=1) is deprecated in recent pandas (2.1 and later)
df1 = (df.set_index('Category')
         .groupby(lambda x: x.split('_')[-1], axis=1)
         .first()
         .reset_index())
print(df1)
  Category Color  Size
0        A   Red  50.0
1        B  Blue  60.0
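
On pandas versions where groupby(axis=1) warns or is unavailable, the same "first non-missing value per suffix" idea can be sketched with DataFrame.filter and a column-wise back-fill. This is an alternative under the assumption that every relevant column ends in _Color or _Size; the sample frame is illustrative only. Like the groupby version, it takes the first non-missing value in the row regardless of which category that row belongs to.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mirroring the table in the question.
df = pd.DataFrame({
    "Category": ["A", "B"],
    "A_Color": ["Red", np.nan],
    "B_Color": [np.nan, "Blue"],
    "A_Size": [50, np.nan],
    "B_Size": [np.nan, 60],
})

out = df[["Category"]].copy()
for stub in ["Color", "Size"]:
    # Select every column whose name ends in _<stub>, e.g. A_Color, B_Color.
    block = df.filter(regex=f"_{stub}$")
    # Back-fill across the columns so that the first column holds each row's
    # first non-missing value, then take that column.
    out[stub] = block.bfill(axis=1).iloc[:, 0]

print(out)
```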

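A solution using a positional lookup in NumPy:

import numpy as np
import pandas as pd

categories = ['Size', 'Color']
for c in categories:
    # Build each row's column label (e.g. 'A_Size') and factorize it
    idx, cols = pd.factorize(df['Category'].add(f'_{c}'))
    # Pick, for every row, the value in the column that matches its label
    df[c] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]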
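
To make the lookup easier to follow, here is the same technique as a small self-contained sketch that returns the values instead of writing them back. The helper name lookup_by_category and the sample frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; category "A" appears twice.
df = pd.DataFrame({
    "Category": ["A", "B", "A"],
    "A_Color": ["Red", np.nan, "Green"],
    "B_Color": [np.nan, "Blue", np.nan],
    "A_Size": [50, np.nan, 70],
    "B_Size": [np.nan, 60, np.nan],
})


def lookup_by_category(frame, stub, key="Category"):
    """Return, for each row, the value stored in the column '<key value>_<stub>'."""
    # The column label each row should read from, e.g. "A_Size".
    labels = frame[key] + f"_{stub}"
    # cols holds the unique labels; idx[k] is the position of row k's label in cols.
    idx, cols = pd.factorize(labels)
    # Keep only those columns, in the order of cols. A label that has no
    # matching column becomes an all-NaN column instead of raising.
    values = frame.reindex(columns=cols).to_numpy()
    # Fancy indexing: row k, column idx[k].
    return values[np.arange(len(frame)), idx]


sizes = lookup_by_category(df, "Size")
colors = lookup_by_category(df, "Color")
print(sizes, colors)
```

Because the work is one factorize call and one array index per output column, this avoids building a long intermediate frame, which may matter at 100K rows.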

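Your own solution works if you convert the right-hand side to a NumPy array. Assigning a DataFrame through .loc aligns on column labels, and A_Color / A_Size do not match Color / Size, so every value ends up as NaN. An array carries no labels, so it is assigned by position:

for b in bins:
    df.loc[df['Category'] == b, ['Color', 'Size']] = \
        df.loc[df['Category'] == b, [f'{b}_Color', f'{b}_Size']].to_numpy()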
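
The difference between the two assignments can be seen in a short sketch. It assumes the target columns are created up front (Color as an object column, Size as a float column) and loops only over the categories actually present; the sample frame and the names aligned and positional are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mirroring the table in the question.
df = pd.DataFrame({
    "Category": ["A", "B"],
    "A_Color": ["Red", np.nan],
    "B_Color": [np.nan, "Blue"],
    "A_Size": [50, np.nan],
    "B_Size": [np.nan, 60],
})

mask = df["Category"] == "A"
source = df.loc[mask, ["A_Color", "A_Size"]]

# Variant 1: assign a DataFrame. pandas aligns on column labels, and
# "A_Color"/"A_Size" do not match "Color"/"Size", so nothing is copied.
aligned = df.assign(Color=None, Size=np.nan)
aligned.loc[mask, ["Color", "Size"]] = source

# Variant 2: assign a NumPy array. There are no labels to align on,
# so the values are written by position.
positional = df.assign(Color=None, Size=np.nan)
for b in df["Category"].unique():
    rows = df["Category"] == b
    positional.loc[rows, ["Color", "Size"]] = (
        df.loc[rows, [f"{b}_Color", f"{b}_Size"]].to_numpy()
    )

print(aligned[["Color", "Size"]])
print(positional[["Color", "Size"]])
```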
