How to use np.where for creating multiple conditional columns?

Question

I have a dataframe as follows:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Group1': ['Maintenance', 'Shop', 'Admin', 'Shop'],
    'Hours1': [4, 4, 8, 8],
    'Group2': ['Admin', 'Customer', '0', '0'],
    'Hours2': [4.0, 2.0, 0.0, 0.0],
    'Group3': ['0', 'Admin', '0', '0'],
    'Hours3': [0.0, 2.0, 0.0, 0.0],
})
>>> df
   ID       Group1  Hours1    Group2  Hours2 Group3  Hours3
0   1  Maintenance       4     Admin     4.0      0     0.0
1   2         Shop       4  Customer     2.0  Admin     2.0
2   3        Admin       8         0     0.0      0     0.0
3   4         Shop       8         0     0.0      0     0.0

I would like to create new columns as follows:

Desired output:

[screenshot of the desired output]

This is my code and the current output. I understand why it is not giving me what I want, but I'm not sure how to modify my code to get the desired output.

Code:

segment_list = ["Maintenance", "Shop", "Admin", "Customer"]

for i in segment_list:
    df["Seg_" + i] = np.where((df["Group1"] == i) | (df["Group2"] == i) | (df["Group3"] == i),
                              (df["Hours1"] + df["Hours2"] + df["Hours3"]) / 8, 0)

Current output:

[screenshot of the current output]
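
For reference, a variant of the same loop that credits each segment only with the hours of the matching Group column (assuming, based on the answers below, that the goal is each group's own hours as a fraction of the 8-hour total) might look like:

for i in segment_list:
    # Sum only the hours whose Group column actually matches this segment,
    # instead of taking the row's full total for every match.
    seg_hours = np.zeros(len(df))
    for g, h in [("Group1", "Hours1"), ("Group2", "Hours2"), ("Group3", "Hours3")]:
        seg_hours += np.where(df[g] == i, df[h], 0)
    df["Seg_" + i] = seg_hours / 8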

Answer 1

Score: 0

Probably not the cleanest way, but it does work, and I couldn't come up with a more elegant approach.

# Treat the '0' placeholders as missing values first, so that dropna() below
# can remove them (implied by the frame printed here, which shows NaN, not '0'):
df = df.replace({'0': np.nan, 0: np.nan})

print(df)
#    ID       Group1  Hours1    Group2  Hours2 Group3  Hours3
# 0   1  Maintenance       4     Admin     4.0    NaN     NaN
# 1   2         Shop       4  Customer     2.0  Admin     2.0
# 2   3        Admin       8       NaN     NaN    NaN     NaN
# 3   4         Shop       8       NaN     NaN    NaN     NaN

df1 = df.melt(id_vars=['ID'], value_vars=['Group1', 'Group2', 'Group3'], value_name='Group')
df2 = df.melt(id_vars=['ID'], value_vars=['Hours1', 'Hours2', 'Hours3'], value_name='Hours')
# We only need the Hours column, so just add it to df1.
df1['Hours'] = df2['Hours']

# A lot of ID's will have NaN values for empty groups, so let's remove them.
df1 = df1.sort_values('ID').dropna()

# Now we pivot, where the Groups become the columns.
pvt = df1.pivot(index='ID', columns='Group', values='Hours')

# Calculate the percentage share of each group within a row.
pvt = pvt.apply(lambda r: r / r.sum(), axis=1).reset_index()

# Merge the pivot table with the original df on ID.
result = pd.merge(df, pvt, how='inner', on='ID')
print(result)
#   ID       Group1  Hours1    Group2  Hours2 Group3  Hours3  Admin  Customer  \
# 0   1  Maintenance       4     Admin     4.0    NaN     NaN   0.50       NaN   
# 1   2         Shop       4  Customer     2.0  Admin     2.0   0.25      0.25   
# 2   3        Admin       8       NaN     NaN    NaN     NaN   1.00       NaN   
# 3   4         Shop       8       NaN     NaN    NaN     NaN    NaN       NaN   

#    Maintenance  Shop  
# 0          0.5   NaN  
# 1          NaN   0.5  
# 2          NaN   NaN  
# 3          NaN   1.0  
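
If the question's exact Seg_ column names (with 0 instead of NaN) are wanted, a small cosmetic step in the spirit of the second answer could be appended; the column selection below is an assumption based on the pivot built above:

# Fill the missing group shares with 0 and rename to the Seg_<group> scheme.
seg_cols = [c for c in pvt.columns if c != 'ID']
result[seg_cols] = result[seg_cols].fillna(0)
result = result.rename(columns={c: f'Seg_{c}' for c in seg_cols})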

Answer 2

Score: 0

Here is how I would approach this in a fairly generic way. For a problem like this, I find pandas easier to use (because of groupby and its handling of index and multi-index).

Edit: cleaner way using pd.wide_to_long.

First, some cleaning and reshaping:

z = pd.wide_to_long(
    df, stubnames=['Group', 'Hours'], 
    i='ID',
    j='k')
z = z.loc[~z['Group'].isin({'0', 0})].droplevel('k').set_index(
    'Group', append=True).sort_index()

>>> z
                Hours
ID Group             
1  Admin          4.0
   Maintenance    4.0
2  Admin          2.0
   Customer       2.0
   Shop           4.0
3  Admin          8.0
4  Shop           8.0

In a previous version of this answer, I obtained the same "long" result by hand, as follows. Convert each column name ending in one or more digits, such as 'Foo726', into a tuple (prefix, digits), e.g. ('Foo', '726'). These tuples (a list with one tuple per column name) are converted into a MultiIndex, which becomes the new column axis. Then stack and a few index manipulations yield the clean, long dataframe:

import re

# set ID as index and clean up the '0' entries
# which really should be NaN (missing data):
df2 = df.set_index('ID').replace({0: np.nan, '0': np.nan})

# then, convert 'Group1', ... into a MultiIndex [(Group, 1), (Hours, 1), ...]
ix = pd.MultiIndex.from_tuples([
    re.match(r'(.*?)(\d+)', k).groups() for k in df2.columns])

>>> ix
MultiIndex([('Group', '1'),
            ('Hours', '1'),
            ('Group', '2'),
            ('Hours', '2'),
            ('Group', '3'),
            ('Hours', '3')],
           )

# and convert to a long frame with ['ID', 'Group'] as index
z = df2.set_axis(ix, axis=1).stack(level=1).droplevel(1).set_index(
    'Group', append=True)

Either way, we can now easily calculate the desired summaries (here, just one: fraction of hours relative to ID's total):

def normalize(g):
    return g / g.sum()

# add some summary stats (fraction of total)
z = z.assign(Seg=z.groupby('ID')['Hours'].transform(normalize))

>>> z
                Hours   Seg
ID Group                   
1  Maintenance    4.0  0.50
   Admin          4.0  0.50
2  Shop           4.0  0.50
   Customer       2.0  0.25
   Admin          2.0  0.25
3  Admin          8.0  1.00
4  Shop           8.0  1.00

At this point, one could simply reshape to wide again, with MultiIndex columns:

>>> z.unstack('Group')
      Hours                             Seg                          
Group Admin Customer Maintenance Shop Admin Customer Maintenance Shop
ID                                                                   
1       4.0      NaN         4.0  NaN  0.50      NaN         0.5  NaN
2       2.0      2.0         NaN  4.0  0.25     0.25         NaN  0.5
3       8.0      NaN         NaN  NaN  1.00      NaN         NaN  NaN
4       NaN      NaN         NaN  8.0   NaN      NaN         NaN  1.0

Or, closer to the original intention, we can concat horizontally just the Seg portion to the original:

df2 = pd.concat([
    df.set_index('ID'),
    z['Seg'].unstack('Group').rename(columns=lambda s: f'Seg_{s}').fillna(0),
], axis=1)

>>> df2
         Group1  Hours1    Group2  Hours2 Group3  Hours3  Seg_Admin  Seg_Customer  Seg_Maintenance  Seg_Shop
ID                                                                                                          
1   Maintenance       4     Admin     4.0      0     0.0       0.50          0.00              0.5       0.0
2          Shop       4  Customer     2.0  Admin     2.0       0.25          0.25              0.0       0.5
3         Admin       8         0     0.0      0     0.0       1.00          0.00              0.0       0.0
4          Shop       8         0     0.0      0     0.0       0.00          0.00              0.0       1.0
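
As a quick sanity check (a hypothetical verification step, not part of the original answer), the Seg_ fractions should sum to 1 for every ID, since each row accounts for 8 hours in total:

# Every row's group shares should add up to 1.
seg = df2.filter(like='Seg_')
assert np.allclose(seg.sum(axis=1), 1.0)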

Published by huangapple on 2023-02-18 • Source: https://go.coder-hub.com/75490178.html