英文:
Add a column based on a condition in Pandas
问题
我有以下的数据框(df)。
data = {
'policyId': ['X','X','X','X', 'Y','Y','Y','Y', 'Z', 'Z','Z','Z'],
'element_id': [242, 243, 241, 257, 242, 243, 241, 257, 242, 243, 241, 257],
'element_selected': [True, True, True, True, False, True, True, True, True, True, True, False]
}
df = pd.DataFrame(data)
所有的 `policyIds` 具有相同的 `element_id`。不同的是,`element_id` 可以被选择或者不被选择(`element_selected` 决定了这一点)。我想要在数据框中添加一个 `contract_type` 列。如果 `element_id` 242 被选择了,`contract_type` 就应该是 "内容"。如果 `element_id` 257 被选择了,那么 `contract_type` 就应该是 "私人责任"。如果 `element_id` 242 和 257 都被选择了,那么 `contract_type` 就应该是 "内容 + 私人责任"。
问题在于每个 `policyId` 应该只有一个 `contract_type` 值,我不知道如何实现这个逻辑。
期望的输出:
policyId element_id element_selected contract_type
X 242 True 内容 + 私人责任
X 243 True 内容 + 私人责任
X 241 True 内容 + 私人责任
X 257 True 内容 + 私人责任
Y 242 False 私人责任
Y 243 True 私人责任
Y 241 True 私人责任
Y 257 True 私人责任
Z 242 True 内容
Z 243 True 内容
Z 241 True 内容
Z 257 False 内容
英文:
I have the following df.
data = {
'policyId': ['X','X','X','X', 'Y','Y','Y','Y', 'Z', 'Z','Z','Z'],
'element_id': [242, 243, 241, 257, 242, 243, 241, 257, 242, 243, 241, 257],
'element_selected': [True, True, True, True, False, True, True, True, True, True, True, False]
}
df = pd.DataFrame(data)
All policyIds
have the same element_id
. What changes is that the element_id
can be select or not (element_selected
assigns that). I want to add a contract_type
column to the df. If element_id
242 has been selected the contract_type
should be "CONTENT". If element_id
257 has been selected then the contract_type
should "PRIVATE LIABILITY". If element_id
242 and 257 have both been selected then the contract_type
should "CONTENT + PRIVATE LIABILITY".
The problem is that each policyId
should have only one value per contract_type
and I do not know how to implement this logic
Expected output
policyId element_id element_selected contract_type
X 242 True CONTENT + PRIVATE LIABILITY
X 243 True CONTENT + PRIVATE LIABILITY
X 241 True CONTENT + PRIVATE LIABILITY
X 257 True CONTENT + PRIVATE LIABILITY
Y 242 False PRIVATE LIABILITY
Y 243 True PRIVATE LIABILITY
Y 241 True PRIVATE LIABILITY
Y 257 True PRIVATE LIABILITY
Z 242 True CONTENT
Z 243 True CONTENT
Z 241 True CONTENT
Z 257 False CONTENT
答案1
得分: 1
解决方案
将数据框进行透视
s = df.pivot(index='policyId', columns='element_id', values='element_selected')
# element_id 241 242 243 257
# policyId
# X True True True True
# Y True False True True
# Z True True True False
使用 np.select 基于预定义条件选择选项
s['contract_type'] = np.select(
展开收缩].all(axis=1), s[242], s[257]],
["CONTENT + PRIVATE LIABILITY", "CONTENT", "PRIVATE LIABILITY"]
)
# element_id 241 242 243 257 contract_type
# policyId
# X True True True True CONTENT + PRIVATE LIABILITY
# Y True False True True PRIVATE LIABILITY
# Z True True True False CONTENT
将选定的选项映射回数据框
df['contract_type'] = df['policyId'].map(s['contract_type'])
# policyId element_id element_selected contract_type
# 0 X 242 True CONTENT + PRIVATE LIABILITY
# 1 X 243 True CONTENT + PRIVATE LIABILITY
# 2 X 241 True CONTENT + PRIVATE LIABILITY
# 3 X 257 True CONTENT + PRIVATE LIABILITY
# 4 Y 242 False PRIVATE LIABILITY
# 5 Y 243 True PRIVATE LIABILITY
# 6 Y 241 True PRIVATE LIABILITY
# 7 Y 257 True PRIVATE LIABILITY
# 8 Z 242 True CONTENT
# 9 Z 243 True CONTENT
# 10 Z 241 True CONTENT
# 11 Z 257 False CONTENT
英文:
Solution
Pivot the dataframe
s = df.pivot(index='policyId', columns='element_id', values='element_selected')
# element_id 241 242 243 257
# policyId
# X True True True True
# Y True False True True
# Z True True True False
Use np.select to select the choices based on predefined conditions
s['contract_type'] = np.select(
展开收缩].all(axis=1), s[242], s[257]],
["CONTENT + PRIVATE LIABILITY", "CONTENT", "PRIVATE LIABILITY"]
)
# element_id 241 242 243 257 contract_type
# policyId
# X True True True True CONTENT + PRIVATE LIABILITY
# Y True False True True PRIVATE LIABILITY
# Z True True True False CONTENT
Map the selected choices back to dataframe
df['contract_type'] = df['policyId'].map(s['contract_type'])
# policyId element_id element_selected contract_type
# 0 X 242 True CONTENT + PRIVATE LIABILITY
# 1 X 243 True CONTENT + PRIVATE LIABILITY
# 2 X 241 True CONTENT + PRIVATE LIABILITY
# 3 X 257 True CONTENT + PRIVATE LIABILITY
# 4 Y 242 False PRIVATE LIABILITY
# 5 Y 243 True PRIVATE LIABILITY
# 6 Y 241 True PRIVATE LIABILITY
# 7 Y 257 True PRIVATE LIABILITY
# 8 Z 242 True CONTENT
# 9 Z 243 True CONTENT
# 10 Z 241 True CONTENT
# 11 Z 257 False CONTENT
答案2
得分: 0
你可以使用一个名为"policyId"的分组,然后实现一个映射函数,用于执行分配"contract_type"的逻辑:
def get_contract_type(group):
types = []
# 获取选定的ID
ids = list(group[group["element_selected"]]["element_id"])
# 根据选定的ID添加类型
if 242 in ids:
types.append("CONTENT")
if 257 in ids:
types.append("PRIVATE LIABILITY")
# 添加新的"contract_type"列
group["contract_type"] = " + ".join(types)
return group
df.groupby("policyId").apply(get_contract_type)
英文:
you could use a groupy "policyId" and then implement a mapping function that does the logic of assigning the contract_type:
def get_contract_type(group):
types = []
# get selected ids
ids = list(group[group["element_selected"]]["element_id"])
# add the types based on the selected ids
if 242 in ids:
types.append("CONTENT")
if 257 in ids:
types.append("PRIVATE LIABILITY")
# add the new contract_type column
group["contract_type"] = " + ".join(types)
return group
df.groupby("policyId").apply(get_contract_type)
答案3
得分: 0
首先,让我们创建一个包含零的临时列:
df["contract_type"] = 0
初始化这一列为零的目的将在后面的答案中变得清晰。
现在,按照 policyId
进行分组:
policy_groups = df.groupby("policyId")
接下来,对于每个组,筛选出其中 element_selected
为 False
的行,并查看 element_id
列中是否存在 242
和 257
。如果存在 242
,让我们为该组中的所有行添加 1
到 contract_type
。如果存在 257
,让我们添加 2
。
for policy_id, group in policy_groups:
selected_ids = group.loc[group["element_selected"], "element_id"]
if (selected_ids == 242).any():
df.loc[group.index, "contract_type"] += 1
if (selected_ids == 257).any():
df.loc[group.index, "contract_type"] += 2
在此阶段,df
看起来如下:
policyId element_id element_selected contract_type
0 X 242 True 3
1 X 243 True 3
2 X 241 True 3
3 X 257 True 3
4 Y 242 False 2
5 Y 243 True 2
6 Y 241 True 2
7 Y 257 True 2
8 Z 242 True 1
9 Z 243 True 1
10 Z 241 True 1
11 Z 257 False 1
现在,对于所有组已经完成,我们可以将 contract_type
列中的整数映射到所需的字符串:
df["contract_type"] = df["contract_type"].map({0: "", 1: "CONTENT", 2: "PRIVATE LIABILITY", 3: "CONTENT + PRIVATE LIABILITY"})
然后我们得到了所需的数据框:
policyId element_id element_selected contract_type
0 X 242 True CONTENT + PRIVATE LIABILITY
1 X 243 True CONTENT + PRIVATE LIABILITY
2 X 241 True CONTENT + PRIVATE LIABILITY
3 X 257 True CONTENT + PRIVATE LIABILITY
4 Y 242 False PRIVATE LIABILITY
5 Y 243 True PRIVATE LIABILITY
6 Y 241 True PRIVATE LIABILITY
7 Y 257 True PRIVATE LIABILITY
8 Z 242 True CONTENT
9 Z 243 True CONTENT
10 Z 241 True CONTENT
11 Z 257 False CONTENT
英文:
First, let's create a temporary column that contains zeros:
df["contract_type"] = 0
The purpose of initializing this column to zeros will become clear later in this answer.
Now, group by policyId
:
policy_groups = df.groupby("policyId")
Next, for each group, filter out the rows where element_selected
is False
, and see if 242
and 257
exist in the element_id
column. If 242
exists, let's add 1
to contract_type
for all rows in the group. If 257
exists, let's add 2
.
for policy_id, group in policy_groups:
selected_ids = group.loc[group["element_selected"], "element_id"]
if (selected_ids == 242).any():
df.loc[group.index, "contract_type"] += 1
if (selected_ids == 257).any():
df.loc[group.index, "contract_type"] += 2
At this stage, df
looks like:
policyId element_id element_selected contract_type
0 X 242 True 3
1 X 243 True 3
2 X 241 True 3
3 X 257 True 3
4 Y 242 False 2
5 Y 243 True 2
6 Y 241 True 2
7 Y 257 True 2
8 Z 242 True 1
9 Z 243 True 1
10 Z 241 True 1
11 Z 257 False 1
Now that this is complete for all groups, we can map the integers in the contract_type
column to the required strings:
df["contract_type"] = df["contract_type"].map({0: "", 1: "CONTENT", 2: "PRIVATE LIABILITY", 3: "CONTENT + PRIVATE LIABILITY"})
And we have the required dataframe:
policyId element_id element_selected contract_type
0 X 242 True CONTENT + PRIVATE LIABILITY
1 X 243 True CONTENT + PRIVATE LIABILITY
2 X 241 True CONTENT + PRIVATE LIABILITY
3 X 257 True CONTENT + PRIVATE LIABILITY
4 Y 242 False PRIVATE LIABILITY
5 Y 243 True PRIVATE LIABILITY
6 Y 241 True PRIVATE LIABILITY
7 Y 257 True PRIVATE LIABILITY
8 Z 242 True CONTENT
9 Z 243 True CONTENT
10 Z 241 True CONTENT
11 Z 257 False CONTENT
答案4
得分: 0
尝试这个:
(df.assign(contract_type = df['policyId']
.map(df['element_id'].map({242:'0',257:'1'})
.where(df['element_selected'])
.str.get_dummies()
.groupby(df['policyId']).any()
.mul([1,2])
.sum(axis=1)
.map(dict(enumerate(['NEITHER','CONTENT','PRIVATE LIABILITY','CONTENT + PRIVATE LIABILITY'])))))
或者
df.assign(
contract_type = (df['policyId'].map(
df['element_id']
.sort_values()
.map(
{242:'CONTENT',257:'PRIVATE LIABILITY'})
.where(df['element_selected'] & ~df.duplicated(subset = ['policyId','element_id']))
.dropna()
.groupby(df['policyId'])
.agg(' + '.join))))
输出:
policyId element_id element_selected contract_type
0 X 242 True CONTENT + PRIVATE LIABILITY
1 X 243 True CONTENT + PRIVATE LIABILITY
2 X 241 True CONTENT + PRIVATE LIABILITY
3 X 257 True CONTENT + PRIVATE LIABILITY
4 Y 242 False PRIVATE LIABILITY
5 Y 243 True PRIVATE LIABILITY
6 Y 241 True PRIVATE LIABILITY
7 Y 257 True PRIVATE LIABILITY
8 Z 242 True CONTENT
9 Z 243 True CONTENT
10 Z 241 True CONTENT
11 Z 257 False CONTENT
英文:
Try this:
(df.assign(contract_type = df['policyId']
.map(df['element_id'].map({242:'0',257:'1'})
.where(df['element_selected'])
.str.get_dummies()
.groupby(df['policyId']).any()
.mul([1,2])
.sum(axis=1)
.map(dict(enumerate(['NEITHER','CONTENT','PRIVATE LIABILITY','CONTENT + PRIVATE LIABILITY']))))))
or
df.assign(
contract_type = (df['policyId'].map(
df['element_id']
.sort_values()
.map(
{242:'CONTENT',257:'PRIVATE LIABILITY'})
.where(df['element_selected'] & ~df.duplicated(subset = ['policyId','element_id']))
.dropna()
.groupby(df['policyId'])
.agg(' + '.join))))
Output:
policyId element_id element_selected contract_type
0 X 242 True CONTENT + PRIVATE LIABILITY
1 X 243 True CONTENT + PRIVATE LIABILITY
2 X 241 True CONTENT + PRIVATE LIABILITY
3 X 257 True CONTENT + PRIVATE LIABILITY
4 Y 242 False PRIVATE LIABILITY
5 Y 243 True PRIVATE LIABILITY
6 Y 241 True PRIVATE LIABILITY
7 Y 257 True PRIVATE LIABILITY
8 Z 242 True CONTENT
9 Z 243 True CONTENT
10 Z 241 True CONTENT
11 Z 257 False CONTENT
答案5
得分: 0
让我们使用CategoricalDtype来实现
ContractTypes = {
242: '内容',
257: '私人责任'
}
contract_type = (
df.set_index('policyId')
.product('columns')
.astype(pd.CategoricalDtype(ContractTypes))
.cat.rename_categories(ContractTypes.values())
.dropna()
.groupby(level=0)
.agg(' + '.join)
.rename('合同类型')
)
df = df.set_index('policyId').join(contract_type).reset_index()
首先,我们创建了一个合同类型代码的字典,其中键是代码,值是相关描述。随着时间的推移,我们可能想要在这里添加一个新的合同类型,而不必与代码的其余部分一起进行修改。
接下来,我们创建了一个系列contract_type
:
- 将
policyId
作为索引以便进一步按索引分组; - 沿着行进行乘法运算,将“false”代码替换为零;
- 将结果转换为分类
ContractTypes
,将未使用的合同类型替换为NaN
,然后可以安全地丢弃它们,因为我们的数据与policyId
索引相关; - 接下来,我们使用相关描述重命名类别;
- 现在,我们通过索引(即
level=0
)对其余数据进行分组,并使用+
连接它们; - 最后,我们为序列命名,这将在与
df
进一步连接后成为一个列的名称。
让我们通过数据透视表来实现
ContractTypes = pd.Series({
242: '内容',
257: '私人责任'
})
selected = df.pivot(
index='policyId',
columns='element_id',
values='element_selected'
).apply(
lambda s: ' + '.join(ContractTypes展开收缩),
axis='columns'
)
df['合同类型'] = df['policyId'].map(selected)
这是Shubham Sharma的答案的升级版本。
唯一的变化发生在中间,其中将数据透视表的每一行用作ContractTypes系列中的索引,并将获取的描述连接成字符串。
英文:
Let's do it with CategoricalDtype
ContractTypes = {
242: 'Content',
257: 'Private liability'
}
contract_type = (
df.set_index('policyId')
.product('columns')
.astype(pd.CategoricalDtype(ContractTypes))
.cat.rename_categories(ContractTypes.values())
.dropna()
.groupby(level=0)
.agg(' + '.join)
.rename('contract_type')
)
df = df.set_index('policyId').join(contract_type).reset_index()
First we create a dictionary of contract type codes as keys and associated descriptions. With time we may want to add a new contract type here without mitigating with the rest of the code.
Next we create a series contract_type
:
- put
policyId
as an index for further grouping by index; - product along the rows returns zero in place of false codes;
- transforming the result to categorical
ContractTypes
will replace unused contract types withNaN
, which in turn we can safely discard, since the data we have is tied to thepolicyId
index; - next we rename categories with their associated descriptions;
- now we group the rest of the data by index (i.e.
level=0
) and join them together with' + '
in between; - finally we give the sequence a name which will become a name of a column after further joining with
df
.
Let's do it by Pivoting
ContractTypes = pd.Series({
242: 'Content',
257: 'Private liability'
})
selected = df.pivot(
index='policyId',
columns='element_id',
values='element_selected'
).apply(
lambda s: ' + '.join(ContractTypes展开收缩),
axis='columns'
)
df['contract_type'] = df['policyId'].map(selected)
This one is an upgraded version of the Shubham Sharma's answer.
The only change is made in the middle, where each row of the pivoted data is used as an index in the ContractTypes series, and obtained descriptions are joined in a string.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论