How to use np.where for creating multiple conditional columns?

Question
I have a dataframe as follows:
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Group1': ['Maintenance', 'Shop', 'Admin', 'Shop'],
    'Hours1': [4, 4, 8, 8],
    'Group2': ['Admin', 'Customer', '0', '0'],
    'Hours2': [4.0, 2.0, 0.0, 0.0],
    'Group3': ['0', 'Admin', '0', '0'],
    'Hours3': [0.0, 2.0, 0.0, 0.0],
})
>>> df
   ID       Group1  Hours1    Group2  Hours2 Group3  Hours3
0   1  Maintenance       4     Admin     4.0      0     0.0
1   2         Shop       4  Customer     2.0  Admin     2.0
2   3        Admin       8         0     0.0      0     0.0
3   4         Shop       8         0     0.0      0     0.0
I would like to create new columns as follows, where each Seg_ column holds that segment's share of the 8 total hours.
Desired output:
   ID       Group1  Hours1    Group2  Hours2 Group3  Hours3  Seg_Maintenance  Seg_Shop  Seg_Admin  Seg_Customer
0   1  Maintenance       4     Admin     4.0      0     0.0              0.5       0.0       0.50          0.00
1   2         Shop       4  Customer     2.0  Admin     2.0              0.0       0.5       0.25          0.25
2   3        Admin       8         0     0.0      0     0.0              0.0       0.0       1.00          0.00
3   4         Shop       8         0     0.0      0     0.0              0.0       1.0       0.00          0.00
This is my code and the current output. I understand why it is not giving me what I want (the condition only tests whether a segment appears anywhere in the row, while the value is always the total of all three Hours columns, so every matching segment gets the full total rather than its own share), but I'm not sure how to modify my code to get the desired output.
Code:
import numpy as np

segment_list = ["Maintenance", "Shop", "Admin", "Customer"]
for i in segment_list:
    df["Seg_" + i] = np.where(
        (df["Group1"] == i) | (df["Group2"] == i) | (df["Group3"] == i),
        (df["Hours1"] + df["Hours2"] + df["Hours3"]) / 8,
        0)
Answer 1 (score: 0)
Probably not the cleanest way, but it does work, and I couldn't come up with a more elegant approach.
import numpy as np
import pandas as pd

# The '0' entries really represent missing data; replacing them with NaN
# first lets dropna() remove them below (the printed frame assumes this step).
df = df.replace({'0': np.nan, 0: np.nan})
print(df)
#    ID       Group1  Hours1    Group2  Hours2 Group3  Hours3
# 0   1  Maintenance       4     Admin     4.0    NaN     NaN
# 1   2         Shop       4  Customer     2.0  Admin     2.0
# 2   3        Admin       8       NaN     NaN    NaN     NaN
# 3   4         Shop       8       NaN     NaN    NaN     NaN

# Reshape the Group and Hours columns separately into long format.
df1 = df.melt(id_vars=['ID'], value_vars=['Group1', 'Group2', 'Group3'], value_name='Group')
df2 = df.melt(id_vars=['ID'], value_vars=['Hours1', 'Hours2', 'Hours3'], value_name='Hours')
# We need only the Hours column, so just add it to df1.
df1['Hours'] = df2['Hours']
# A lot of IDs will have NaN values for empty groups, so remove them.
df1 = df1.sort_values('ID').dropna()
# Now pivot, so that the groups become the columns.
pvt = df1.pivot(index='ID', columns='Group', values='Hours')
# Calculate each group's share of the hours within a row.
pvt = pvt.apply(lambda r: r / r.sum(), axis=1).reset_index()
# Merge the pivot table with the original df on ID.
result = pd.merge(df, pvt, how='inner', on='ID')
print(result)
#    ID       Group1  Hours1    Group2  Hours2 Group3  Hours3  Admin  Customer  \
# 0   1  Maintenance       4     Admin     4.0    NaN     NaN   0.50       NaN
# 1   2         Shop       4  Customer     2.0  Admin     2.0   0.25      0.25
# 2   3        Admin       8       NaN     NaN    NaN     NaN   1.00       NaN
# 3   4         Shop       8       NaN     NaN    NaN     NaN    NaN       NaN
#
#    Maintenance  Shop
# 0          0.5   NaN
# 1          NaN   0.5
# 2          NaN   NaN
# 3          NaN   1.0
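To match the Seg_-prefixed, zero-filled columns from the question exactly, a small finishing step can be appended (a sketch of my own, not part of the original answer; the column list is taken from the question's segment_list):
seg_cols = ['Maintenance', 'Shop', 'Admin', 'Customer']
# Rename the pivoted group columns to the question's Seg_ prefix and
# fill the missing group/ID combinations with 0.
result = result.rename(columns={g: f'Seg_{g}' for g in seg_cols})
result[[f'Seg_{g}' for g in seg_cols]] = result[[f'Seg_{g}' for g in seg_cols]].fillna(0)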
Answer 2 (score: 0)
Here is how I would approach this in a fairly generic way. For a problem like this, I find pandas easier to use (because of groupby and its handling of index and multi-index).
Edit: there is a cleaner way using pd.wide_to_long.
First, some cleaning and reshaping:
z = pd.wide_to_long(df, stubnames=['Group', 'Hours'], i='ID', j='k')
# Drop the placeholder '0' groups, then index by (ID, Group).
z = z.loc[~z['Group'].isin({'0', 0})].droplevel('k').set_index('Group', append=True).sort_index()
>>> z
                Hours
ID Group
1  Admin          4.0
   Maintenance    4.0
2  Admin          2.0
   Customer       2.0
   Shop           4.0
3  Admin          8.0
4  Shop           8.0
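For reference (my own illustration, not part of the original answer), pd.wide_to_long alone returns the long frame before the placeholder rows are filtered out; the extra k level comes from the trailing digit of each column name:
raw = pd.wide_to_long(df, stubnames=['Group', 'Hours'], i='ID', j='k')
# Each (ID, k) pair holds the k-th Group/Hours pair from the wide frame,
# including the '0' placeholders that the code above removes.
print(raw.sort_index())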
In a previous version of this answer, I obtained the same "long" result "by hand", as follows. Convert each column name ending with one or more digits, such as 'Foo726', into a tuple (prefix, digits), e.g. ('Foo', '726'). These tuples (a list of tuples, one per column name) are converted into a MultiIndex, which becomes the new column axis. Then use stack and some index manipulation to get the clean, long dataframe:
import re

# Set ID as index and clean up the '0' entries,
# which really should be NaN (missing data):
df2 = df.set_index('ID').replace({0: np.nan, '0': np.nan})

# Then convert 'Group1', ... into a MultiIndex [('Group', '1'), ('Hours', '1'), ...]:
ix = pd.MultiIndex.from_tuples([
    re.match(r'(.*?)(\d+)', k).groups() for k in df2.columns])
>>> ix
MultiIndex([('Group', '1'),
            ('Hours', '1'),
            ('Group', '2'),
            ('Hours', '2'),
            ('Group', '3'),
            ('Hours', '3')],
           )
# And convert to a long frame with ['ID', 'Group'] as index:
z = df2.set_axis(ix, axis=1).stack(level=1).droplevel(1).set_index('Group', append=True)
Either way, we can now easily calculate the desired summaries (here, just one: each group's fraction of hours relative to the ID's total):
def normalize(g):
    return g / g.sum()

# Add a summary stat: each row's fraction of its ID's total hours.
z = z.assign(Seg=z.groupby('ID')['Hours'].transform(normalize))
>>> z
                Hours   Seg
ID Group
1  Maintenance    4.0  0.50
   Admin          4.0  0.50
2  Shop           4.0  0.50
   Customer       2.0  0.25
   Admin          2.0  0.25
3  Admin          8.0  1.00
4  Shop           8.0  1.00
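The same normalization can also be written without a helper function, using the built-in 'sum' aggregation (an equivalent one-liner, my phrasing rather than the answer's):
# Divide each row's Hours by that ID's total hours.
z['Seg'] = z['Hours'] / z.groupby('ID')['Hours'].transform('sum')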
At this point, one could simply reshape to wide again, with MultiIndex columns:
>>> z.unstack('Group')
      Hours                            Seg
Group Admin Customer Maintenance Shop Admin Customer Maintenance Shop
ID
1       4.0      NaN         4.0  NaN  0.50      NaN         0.5  NaN
2       2.0      2.0         NaN  4.0  0.25     0.25         NaN  0.5
3       8.0      NaN         NaN  NaN  1.00      NaN         NaN  NaN
4       NaN      NaN         NaN  8.0   NaN      NaN         NaN  1.0
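If flat column names are preferred at this stage, the two-level columns can be collapsed (a small sketch of my own; the underscore naming matches the Seg_ convention from the question):
wide = z.unstack('Group')
# Join the two column levels, e.g. ('Seg', 'Admin') -> 'Seg_Admin'.
wide.columns = [f'{a}_{b}' for a, b in wide.columns]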
Or, closer to the original intention, we can concat horizontally just the Seg portion to the original:
df2 = pd.concat([
    df.set_index('ID'),
    z['Seg'].unstack('Group').rename(columns=lambda s: f'Seg_{s}').fillna(0),
], axis=1)
>>> df2
         Group1  Hours1    Group2  Hours2 Group3  Hours3  Seg_Admin  Seg_Customer  Seg_Maintenance  Seg_Shop
ID
1   Maintenance       4     Admin     4.0      0     0.0       0.50          0.00              0.5       0.0
2          Shop       4  Customer     2.0  Admin     2.0       0.25          0.25              0.0       0.5
3         Admin       8         0     0.0      0     0.0       1.00          0.00              0.0       0.0
4          Shop       8         0     0.0      0     0.0       0.00          0.00              0.0       1.0
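As a quick sanity check (my own addition, not from the answer), the Seg_ fractions in each row should sum to 1:
import numpy as np

# Every ID's segment shares should add up to 1.0.
assert np.allclose(df2.filter(like='Seg_').sum(axis=1), 1.0)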