How to use np.where for creating multiple conditional columns?

Question
I have a dataframe as follows:
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Group1': ['Maintenance', 'Shop', 'Admin', 'Shop'],
    'Hours1': [4, 4, 8, 8],
    'Group2': ['Admin', 'Customer', '0', '0'],
    'Hours2': [4.0, 2.0, 0.0, 0.0],
    'Group3': ['0', 'Admin', '0', '0'],
    'Hours3': [0.0, 2.0, 0.0, 0.0],
})
>>> df
   ID       Group1  Hours1    Group2  Hours2 Group3  Hours3
0   1  Maintenance       4     Admin     4.0      0     0.0
1   2         Shop       4  Customer     2.0  Admin     2.0
2   3        Admin       8         0     0.0      0     0.0
3   4         Shop       8         0     0.0      0     0.0
I would like to create new columns as follows, where each Seg_ column holds that segment's share of the 8 total hours.
Desired output:
   ID       Group1  Hours1    Group2  Hours2 Group3  Hours3  Seg_Maintenance  Seg_Shop  Seg_Admin  Seg_Customer
0   1  Maintenance       4     Admin     4.0      0     0.0              0.5       0.0       0.50          0.00
1   2         Shop       4  Customer     2.0  Admin     2.0              0.0       0.5       0.25          0.25
2   3        Admin       8         0     0.0      0     0.0              0.0       0.0       1.00          0.00
3   4         Shop       8         0     0.0      0     0.0              0.0       1.0       0.00          0.00
This is my code and the current output. I understand why it is not giving me what I want (the condition only tests whether a segment appears anywhere in the row, while the value is always the total of all three Hours columns, so every matching segment gets the full total rather than its own share), but I'm not sure how to modify my code to get the desired output.
Code:
import numpy as np

segment_list = ["Maintenance", "Shop", "Admin", "Customer"]
for i in segment_list:
    df["Seg_" + i] = np.where(
        (df["Group1"] == i) | (df["Group2"] == i) | (df["Group3"] == i),
        (df["Hours1"] + df["Hours2"] + df["Hours3"]) / 8,
        0)
Answer 1 (score: 0)
Probably not the cleanest way, but it does work, and I couldn't come up with a more elegant approach.
import numpy as np
import pandas as pd

# The '0' entries really represent missing data; replacing them with NaN
# first lets dropna() remove them below (the printed frame assumes this step).
df = df.replace({'0': np.nan, 0: np.nan})
print(df)
#    ID       Group1  Hours1    Group2  Hours2 Group3  Hours3
# 0   1  Maintenance       4     Admin     4.0    NaN     NaN
# 1   2         Shop       4  Customer     2.0  Admin     2.0
# 2   3        Admin       8       NaN     NaN    NaN     NaN
# 3   4         Shop       8       NaN     NaN    NaN     NaN

# Reshape the Group and Hours columns separately into long format.
df1 = df.melt(id_vars=['ID'], value_vars=['Group1', 'Group2', 'Group3'], value_name='Group')
df2 = df.melt(id_vars=['ID'], value_vars=['Hours1', 'Hours2', 'Hours3'], value_name='Hours')
# We need only the Hours column, so just add it to df1.
df1['Hours'] = df2['Hours']
# A lot of IDs will have NaN values for empty groups, so remove them.
df1 = df1.sort_values('ID').dropna()
# Now pivot, so that the groups become the columns.
pvt = df1.pivot(index='ID', columns='Group', values='Hours')
# Calculate each group's share of the hours within a row.
pvt = pvt.apply(lambda r: r / r.sum(), axis=1).reset_index()
# Merge the pivot table with the original df on ID.
result = pd.merge(df, pvt, how='inner', on='ID')
print(result)
#    ID       Group1  Hours1    Group2  Hours2 Group3  Hours3  Admin  Customer  \
# 0   1  Maintenance       4     Admin     4.0    NaN     NaN   0.50       NaN
# 1   2         Shop       4  Customer     2.0  Admin     2.0   0.25      0.25
# 2   3        Admin       8       NaN     NaN    NaN     NaN   1.00       NaN
# 3   4         Shop       8       NaN     NaN    NaN     NaN    NaN       NaN
#
#    Maintenance  Shop
# 0          0.5   NaN
# 1          NaN   0.5
# 2          NaN   NaN
# 3          NaN   1.0
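To match the Seg_-prefixed, zero-filled columns from the question exactly, a small finishing step can be appended (a sketch of my own, not part of the original answer; the column list is taken from the question's segment_list):
seg_cols = ['Maintenance', 'Shop', 'Admin', 'Customer']
# Rename the pivoted group columns to the question's Seg_ prefix and
# fill the missing group/ID combinations with 0.
result = result.rename(columns={g: f'Seg_{g}' for g in seg_cols})
result[[f'Seg_{g}' for g in seg_cols]] = result[[f'Seg_{g}' for g in seg_cols]].fillna(0)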
Answer 2 (score: 0)
Here is how I would approach this in a fairly generic way. For a problem like this, I find pandas easier to use (because of groupby and its handling of index and multi-index).
Edit: there is a cleaner way using pd.wide_to_long.
First, some cleaning and reshaping:
z = pd.wide_to_long(df, stubnames=['Group', 'Hours'], i='ID', j='k')
# Drop the placeholder '0' groups, then index by (ID, Group).
z = z.loc[~z['Group'].isin({'0', 0})].droplevel('k').set_index('Group', append=True).sort_index()
>>> z
                Hours
ID Group
1  Admin          4.0
   Maintenance    4.0
2  Admin          2.0
   Customer       2.0
   Shop           4.0
3  Admin          8.0
4  Shop           8.0
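For reference (my own illustration, not part of the original answer), pd.wide_to_long alone returns the long frame before the placeholder rows are filtered out; the extra k level comes from the trailing digit of each column name:
raw = pd.wide_to_long(df, stubnames=['Group', 'Hours'], i='ID', j='k')
# Each (ID, k) pair holds the k-th Group/Hours pair from the wide frame,
# including the '0' placeholders that the code above removes.
print(raw.sort_index())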
In a previous version of this answer, I obtained the same "long" result "by hand", as follows. Convert each column name ending with one or more digits, such as 'Foo726', into a tuple (prefix, digits), e.g. ('Foo', '726'). These tuples (a list of tuples, one per column name) are converted into a MultiIndex, which becomes the new column axis. Then use stack and some index manipulation to get the clean, long dataframe:
import re

# Set ID as index and clean up the '0' entries,
# which really should be NaN (missing data):
df2 = df.set_index('ID').replace({0: np.nan, '0': np.nan})

# Then convert 'Group1', ... into a MultiIndex [('Group', '1'), ('Hours', '1'), ...]:
ix = pd.MultiIndex.from_tuples([
    re.match(r'(.*?)(\d+)', k).groups() for k in df2.columns])
>>> ix
MultiIndex([('Group', '1'),
            ('Hours', '1'),
            ('Group', '2'),
            ('Hours', '2'),
            ('Group', '3'),
            ('Hours', '3')],
           )
# And convert to a long frame with ['ID', 'Group'] as index:
z = df2.set_axis(ix, axis=1).stack(level=1).droplevel(1).set_index('Group', append=True)
Either way, we can now easily calculate the desired summaries (here, just one: each group's fraction of hours relative to the ID's total):
def normalize(g):
    return g / g.sum()

# Add a summary stat: each row's fraction of its ID's total hours.
z = z.assign(Seg=z.groupby('ID')['Hours'].transform(normalize))
>>> z
                Hours   Seg
ID Group
1  Maintenance    4.0  0.50
   Admin          4.0  0.50
2  Shop           4.0  0.50
   Customer       2.0  0.25
   Admin          2.0  0.25
3  Admin          8.0  1.00
4  Shop           8.0  1.00
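The same normalization can also be written without a helper function, using the built-in 'sum' aggregation (an equivalent one-liner, my phrasing rather than the answer's):
# Divide each row's Hours by that ID's total hours.
z['Seg'] = z['Hours'] / z.groupby('ID')['Hours'].transform('sum')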
At this point, one could simply reshape to wide again, with MultiIndex columns:
>>> z.unstack('Group')
      Hours                            Seg
Group Admin Customer Maintenance Shop Admin Customer Maintenance Shop
ID
1       4.0      NaN         4.0  NaN  0.50      NaN         0.5  NaN
2       2.0      2.0         NaN  4.0  0.25     0.25         NaN  0.5
3       8.0      NaN         NaN  NaN  1.00      NaN         NaN  NaN
4       NaN      NaN         NaN  8.0   NaN      NaN         NaN  1.0
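If flat column names are preferred at this stage, the two-level columns can be collapsed (a small sketch of my own; the underscore naming matches the Seg_ convention from the question):
wide = z.unstack('Group')
# Join the two column levels, e.g. ('Seg', 'Admin') -> 'Seg_Admin'.
wide.columns = [f'{a}_{b}' for a, b in wide.columns]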
Or, closer to the original intention, we can concat horizontally just the Seg portion to the original:
df2 = pd.concat([
    df.set_index('ID'),
    z['Seg'].unstack('Group').rename(columns=lambda s: f'Seg_{s}').fillna(0),
], axis=1)
>>> df2
         Group1  Hours1    Group2  Hours2 Group3  Hours3  Seg_Admin  Seg_Customer  Seg_Maintenance  Seg_Shop
ID
1   Maintenance       4     Admin     4.0      0     0.0       0.50          0.00              0.5       0.0
2          Shop       4  Customer     2.0  Admin     2.0       0.25          0.25              0.0       0.5
3         Admin       8         0     0.0      0     0.0       1.00          0.00              0.0       0.0
4          Shop       8         0     0.0      0     0.0       0.00          0.00              0.0       1.0
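As a quick sanity check (my own addition, not from the answer), the Seg_ fractions in each row should sum to 1:
import numpy as np

# Every ID's segment shares should add up to 1.0.
assert np.allclose(df2.filter(like='Seg_').sum(axis=1), 1.0)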