在Pandas中根据条件添加一列

huangapple go评论64阅读模式
英文:

Add a column based on a condition in Pandas

问题

我有以下的数据框df)。

    data = {
        'policyId': ['X','X','X','X', 'Y','Y','Y','Y', 'Z', 'Z','Z','Z'],
        'element_id': [242, 243, 241, 257, 242, 243, 241, 257, 242, 243, 241, 257],
        'element_selected': [True, True, True, True, False, True, True, True, True, True, True, False]
    }
    df = pd.DataFrame(data)

所有的 `policyIds` 具有相同的 `element_id`。不同的是,`element_id` 可以被选择或者不被选择(`element_selected` 决定了这一点)。我想要在数据框中添加一个 `contract_type`如果 `element_id` 242 被选择了,`contract_type` 就应该是 "内容"如果 `element_id` 257 被选择了那么 `contract_type` 就应该是 "私人责任"如果 `element_id` 242257 都被选择了那么 `contract_type` 就应该是 "内容 + 私人责任"

问题在于每个 `policyId` 应该只有一个 `contract_type`我不知道如何实现这个逻辑

期望的输出

	policyId	element_id	element_selected	contract_type
	X	               242	True	            内容 + 私人责任
	X	               243	True	            内容 + 私人责任
	X	               241	True	            内容 + 私人责任
	X	               257	True	            内容 + 私人责任
	Y	               242	False	            私人责任
	Y	               243	True	            私人责任
	Y	               241	True	            私人责任
	Y	               257	True	            私人责任
	Z	               242	True	            内容
	Z	               243	True	            内容
	Z	               241	True	            内容
	Z	               257	False	            内容
英文:

I have the following df.

data = {
    'policyId': ['X','X','X','X', 'Y','Y','Y','Y', 'Z', 'Z','Z','Z'],
    'element_id': [242, 243, 241, 257, 242, 243, 241, 257, 242, 243, 241, 257],
    'element_selected': [True, True, True, True, False, True, True, True, True, True, True, False]
}
df = pd.DataFrame(data)

All policyIds have the same element_id. What changes is that the element_id can be select or not (element_selected assigns that). I want to add a contract_type column to the df. If element_id 242 has been selected the contract_type should be "CONTENT". If element_id 257 has been selected then the contract_type should "PRIVATE LIABILITY". If element_id 242 and 257 have both been selected then the contract_type should "CONTENT + PRIVATE LIABILITY".

The problem is that each policyId should have only one value per contract_type and I do not know how to implement this logic

Expected output

policyId	element_id	element_selected	contract_type
X	               242	True	            CONTENT + PRIVATE LIABILITY
X	               243	True	            CONTENT + PRIVATE LIABILITY
X	               241	True	            CONTENT + PRIVATE LIABILITY
X	               257	True	            CONTENT + PRIVATE LIABILITY
Y	               242	False	            PRIVATE LIABILITY
Y	               243	True	            PRIVATE LIABILITY
Y	               241	True	            PRIVATE LIABILITY
Y	               257	True	            PRIVATE LIABILITY
Z	               242	True	            CONTENT
Z	               243	True	            CONTENT
Z	               241	True	            CONTENT
Z	               257	False	            CONTENT

答案1

得分: 1

解决方案

将数据框进行透视

s = df.pivot(index='policyId', columns='element_id', values='element_selected')
# element_id   241    242   243    257
# policyId                            
# X           True   True  True   True
# Y           True  False  True   True
# Z           True   True  True  False

使用 np.select 基于预定义条件选择选项

s['contract_type'] = np.select(
    
展开收缩
].all(axis=1), s[242], s[257]],
["CONTENT + PRIVATE LIABILITY", "CONTENT", "PRIVATE LIABILITY"] )
# element_id   241    242   243    257                contract_type
# policyId                                                         
# X           True   True  True   True  CONTENT + PRIVATE LIABILITY
# Y           True  False  True   True            PRIVATE LIABILITY
# Z           True   True  True  False                      CONTENT

将选定的选项映射回数据框

df['contract_type'] = df['policyId'].map(s['contract_type'])
#    policyId  element_id  element_selected                contract_type
# 0         X         242              True  CONTENT + PRIVATE LIABILITY
# 1         X         243              True  CONTENT + PRIVATE LIABILITY
# 2         X         241              True  CONTENT + PRIVATE LIABILITY
# 3         X         257              True  CONTENT + PRIVATE LIABILITY
# 4         Y         242             False            PRIVATE LIABILITY
# 5         Y         243              True            PRIVATE LIABILITY
# 6         Y         241              True            PRIVATE LIABILITY
# 7         Y         257              True            PRIVATE LIABILITY
# 8         Z         242              True                      CONTENT
# 9         Z         243              True                      CONTENT
# 10        Z         241              True                      CONTENT
# 11        Z         257             False                      CONTENT
英文:

Solution

Pivot the dataframe

s = df.pivot(index='policyId', columns='element_id', values='element_selected')

# element_id   241    242   243    257
# policyId                            
# X           True   True  True   True
# Y           True  False  True   True
# Z           True   True  True  False

Use np.select to select the choices based on predefined conditions

s['contract_type'] = np.select(
    
展开收缩
].all(axis=1), s[242], s[257]], ["CONTENT + PRIVATE LIABILITY", "CONTENT", "PRIVATE LIABILITY"] ) # element_id 241 242 243 257 contract_type # policyId # X True True True True CONTENT + PRIVATE LIABILITY # Y True False True True PRIVATE LIABILITY # Z True True True False CONTENT

Map the selected choices back to dataframe

df['contract_type'] = df['policyId'].map(s['contract_type'])

#    policyId  element_id  element_selected                contract_type
# 0         X         242              True  CONTENT + PRIVATE LIABILITY
# 1         X         243              True  CONTENT + PRIVATE LIABILITY
# 2         X         241              True  CONTENT + PRIVATE LIABILITY
# 3         X         257              True  CONTENT + PRIVATE LIABILITY
# 4         Y         242             False            PRIVATE LIABILITY
# 5         Y         243              True            PRIVATE LIABILITY
# 6         Y         241              True            PRIVATE LIABILITY
# 7         Y         257              True            PRIVATE LIABILITY
# 8         Z         242              True                      CONTENT
# 9         Z         243              True                      CONTENT
# 10        Z         241              True                      CONTENT
# 11        Z         257             False                      CONTENT

答案2

得分: 0

你可以使用一个名为"policyId"的分组,然后实现一个映射函数,用于执行分配"contract_type"的逻辑:

def get_contract_type(group):
    types = []
    # 获取选定的ID
    ids = list(group[group["element_selected"]]["element_id"])

    # 根据选定的ID添加类型
    if 242 in ids:
        types.append("CONTENT")

    if 257 in ids:
       types.append("PRIVATE LIABILITY")

    # 添加新的"contract_type"列
    group["contract_type"] = " + ".join(types)
    return group

df.groupby("policyId").apply(get_contract_type)
英文:

you could use a groupy "policyId" and then implement a mapping function that does the logic of assigning the contract_type:

def get_contract_type(group):
    types = []
    # get selected ids
    ids = list(group[group["element_selected"]]["element_id"])

    # add the types based on the selected ids
    if 242 in ids:
        types.append("CONTENT")

    if 257 in ids:
       types.append("PRIVATE LIABILITY")


    # add the new contract_type column
    group["contract_type"] = " + ".join(types)
    return group

df.groupby("policyId").apply(get_contract_type)

答案3

得分: 0

首先,让我们创建一个包含零的临时列:

df["contract_type"] = 0

初始化这一列为零的目的将在后面的答案中变得清晰。

现在,按照 policyId 进行分组:

policy_groups = df.groupby("policyId")

接下来,对于每个组,筛选出其中 element_selectedFalse 的行,并查看 element_id 列中是否存在 242257。如果存在 242,让我们为该组中的所有行添加 1contract_type。如果存在 257,让我们添加 2

for policy_id, group in policy_groups:
    selected_ids = group.loc[group["element_selected"], "element_id"]
    if (selected_ids == 242).any():
        df.loc[group.index, "contract_type"] += 1
    if (selected_ids == 257).any():
        df.loc[group.index, "contract_type"] += 2

在此阶段,df 看起来如下:

   policyId  element_id  element_selected  contract_type
0         X         242              True              3
1         X         243              True              3
2         X         241              True              3
3         X         257              True              3
4         Y         242             False              2
5         Y         243              True              2
6         Y         241              True              2
7         Y         257              True              2
8         Z         242              True              1
9         Z         243              True              1
10        Z         241              True              1
11        Z         257             False              1

现在,对于所有组已经完成,我们可以将 contract_type 列中的整数映射到所需的字符串:

df["contract_type"] = df["contract_type"].map({0: "", 1: "CONTENT", 2: "PRIVATE LIABILITY", 3: "CONTENT + PRIVATE LIABILITY"})

然后我们得到了所需的数据框:

   policyId  element_id  element_selected                contract_type
0         X         242              True  CONTENT + PRIVATE LIABILITY
1         X         243              True  CONTENT + PRIVATE LIABILITY
2         X         241              True  CONTENT + PRIVATE LIABILITY
3         X         257              True  CONTENT + PRIVATE LIABILITY
4         Y         242             False            PRIVATE LIABILITY
5         Y         243              True            PRIVATE LIABILITY
6         Y         241              True            PRIVATE LIABILITY
7         Y         257              True            PRIVATE LIABILITY
8         Z         242              True                      CONTENT
9         Z         243              True                      CONTENT
10        Z         241              True                      CONTENT
11        Z         257             False                      CONTENT
英文:

First, let's create a temporary column that contains zeros:

df["contract_type"] = 0

The purpose of initializing this column to zeros will become clear later in this answer.

Now, group by policyId:

policy_groups = df.groupby("policyId")

Next, for each group, filter out the rows where element_selected is False, and see if 242 and 257 exist in the element_id column. If 242 exists, let's add 1 to contract_type for all rows in the group. If 257 exists, let's add 2.

for policy_id, group in policy_groups:
    selected_ids = group.loc[group["element_selected"], "element_id"]
    if (selected_ids == 242).any():
        df.loc[group.index, "contract_type"] += 1
    if (selected_ids == 257).any():
        df.loc[group.index, "contract_type"] += 2 

At this stage, df looks like:

   policyId  element_id  element_selected  contract_type
0         X         242              True              3
1         X         243              True              3
2         X         241              True              3
3         X         257              True              3
4         Y         242             False              2
5         Y         243              True              2
6         Y         241              True              2
7         Y         257              True              2
8         Z         242              True              1
9         Z         243              True              1
10        Z         241              True              1
11        Z         257             False              1

Now that this is complete for all groups, we can map the integers in the contract_type column to the required strings:

df["contract_type"] = df["contract_type"].map({0: "", 1: "CONTENT", 2: "PRIVATE LIABILITY", 3: "CONTENT + PRIVATE LIABILITY"})

And we have the required dataframe:

   policyId  element_id  element_selected                contract_type
0         X         242              True  CONTENT + PRIVATE LIABILITY
1         X         243              True  CONTENT + PRIVATE LIABILITY
2         X         241              True  CONTENT + PRIVATE LIABILITY
3         X         257              True  CONTENT + PRIVATE LIABILITY
4         Y         242             False            PRIVATE LIABILITY
5         Y         243              True            PRIVATE LIABILITY
6         Y         241              True            PRIVATE LIABILITY
7         Y         257              True            PRIVATE LIABILITY
8         Z         242              True                      CONTENT
9         Z         243              True                      CONTENT
10        Z         241              True                      CONTENT
11        Z         257             False                      CONTENT

答案4

得分: 0

尝试这个:

(df.assign(contract_type = df['policyId']
           .map(df['element_id'].map({242:'0',257:'1'})
                .where(df['element_selected'])
                .str.get_dummies()
                .groupby(df['policyId']).any()
                .mul([1,2])
                .sum(axis=1)
                .map(dict(enumerate(['NEITHER','CONTENT','PRIVATE LIABILITY','CONTENT + PRIVATE LIABILITY'])))))

或者

df.assign(
    contract_type = (df['policyId'].map(
        df['element_id']
        .sort_values()
        .map(
            {242:'CONTENT',257:'PRIVATE LIABILITY'})
            .where(df['element_selected'] & ~df.duplicated(subset = ['policyId','element_id']))
            .dropna()
            .groupby(df['policyId'])
            .agg(' + '.join))))

输出:

       policyId  element_id  element_selected                contract_type
    0         X         242              True  CONTENT + PRIVATE LIABILITY
    1         X         243              True  CONTENT + PRIVATE LIABILITY
    2         X         241              True  CONTENT + PRIVATE LIABILITY
    3         X         257              True  CONTENT + PRIVATE LIABILITY
    4         Y         242             False            PRIVATE LIABILITY
    5         Y         243              True            PRIVATE LIABILITY
    6         Y         241              True            PRIVATE LIABILITY
    7         Y         257              True            PRIVATE LIABILITY
    8         Z         242              True                      CONTENT
    9         Z         243              True                      CONTENT
    10        Z         241              True                      CONTENT
    11        Z         257             False                      CONTENT
英文:

Try this:

(df.assign(contract_type = df['policyId']
           .map(df['element_id'].map({242:'0',257:'1'})
                .where(df['element_selected'])
                .str.get_dummies()
                .groupby(df['policyId']).any()
                .mul([1,2])
                .sum(axis=1)
                .map(dict(enumerate(['NEITHER','CONTENT','PRIVATE LIABILITY','CONTENT + PRIVATE LIABILITY']))))))

or

df.assign(
    contract_type = (df['policyId'].map(
        df['element_id']
        .sort_values()
        .map(
            {242:'CONTENT',257:'PRIVATE LIABILITY'})
            .where(df['element_selected'] & ~df.duplicated(subset = ['policyId','element_id']))
            .dropna()
            .groupby(df['policyId'])
            .agg(' + '.join))))

Output:

   policyId  element_id  element_selected                contract_type
0         X         242              True  CONTENT + PRIVATE LIABILITY
1         X         243              True  CONTENT + PRIVATE LIABILITY
2         X         241              True  CONTENT + PRIVATE LIABILITY
3         X         257              True  CONTENT + PRIVATE LIABILITY
4         Y         242             False            PRIVATE LIABILITY
5         Y         243              True            PRIVATE LIABILITY
6         Y         241              True            PRIVATE LIABILITY
7         Y         257              True            PRIVATE LIABILITY
8         Z         242              True                      CONTENT
9         Z         243              True                      CONTENT
10        Z         241              True                      CONTENT
11        Z         257             False                      CONTENT

答案5

得分: 0

让我们使用CategoricalDtype来实现

ContractTypes = {
    242: '内容',
    257: '私人责任'
}

contract_type = (
    df.set_index('policyId')
    .product('columns')
    .astype(pd.CategoricalDtype(ContractTypes))
    .cat.rename_categories(ContractTypes.values())
    .dropna()
    .groupby(level=0)
    .agg(' + '.join)
    .rename('合同类型')
)

df = df.set_index('policyId').join(contract_type).reset_index()

首先,我们创建了一个合同类型代码的字典,其中键是代码,值是相关描述。随着时间的推移,我们可能想要在这里添加一个新的合同类型,而不必与代码的其余部分一起进行修改。

接下来,我们创建了一个系列contract_type

  • policyId作为索引以便进一步按索引分组;
  • 沿着行进行乘法运算,将“false”代码替换为零;
  • 将结果转换为分类ContractTypes,将未使用的合同类型替换为NaN,然后可以安全地丢弃它们,因为我们的数据与policyId索引相关;
  • 接下来,我们使用相关描述重命名类别;
  • 现在,我们通过索引(即level=0)对其余数据进行分组,并使用+连接它们;
  • 最后,我们为序列命名,这将在与df进一步连接后成为一个列的名称。

让我们通过数据透视表来实现

ContractTypes = pd.Series({
    242: '内容',
    257: '私人责任'
})

selected = df.pivot(
    index='policyId',
    columns='element_id',
    values='element_selected'
).apply(
    lambda s: ' + '.join(ContractTypes
展开收缩
),
axis='columns' ) df['合同类型'] = df['policyId'].map(selected)

这是Shubham Sharma的答案的升级版本。
唯一的变化发生在中间,其中将数据透视表的每一行用作ContractTypes系列中的索引,并将获取的描述连接成字符串。

英文:

Let's do it with CategoricalDtype

ContractTypes = {
    242: 'Content',
    257: 'Private liability'
}

contract_type = (
    df.set_index('policyId')
    .product('columns')
    .astype(pd.CategoricalDtype(ContractTypes))
    .cat.rename_categories(ContractTypes.values())
    .dropna()
    .groupby(level=0)
    .agg(' + '.join)
    .rename('contract_type')
)

df = df.set_index('policyId').join(contract_type).reset_index()

First we create a dictionary of contract type codes as keys and associated descriptions. With time we may want to add a new contract type here without mitigating with the rest of the code.

Next we create a series contract_type:

  • put policyId as an index for further grouping by index;
  • product along the rows returns zero in place of false codes;
  • transforming the result to categorical ContractTypes will replace unused contract types with NaN, which in turn we can safely discard, since the data we have is tied to the policyId index;
  • next we rename categories with their associated descriptions;
  • now we group the rest of the data by index (i.e. level=0) and join them together with ' + ' in between;
  • finally we give the sequence a name which will become a name of a column after further joining with df.

Let's do it by Pivoting

ContractTypes = pd.Series({
    242: 'Content',
    257: 'Private liability'
})

selected = df.pivot(
    index='policyId',
    columns='element_id',
    values='element_selected'
).apply(
    lambda s: ' + '.join(ContractTypes
展开收缩
),
axis='columns' ) df['contract_type'] = df['policyId'].map(selected)

This one is an upgraded version of the Shubham Sharma's answer.
The only change is made in the middle, where each row of the pivoted data is used as an index in the ContractTypes series, and obtained descriptions are joined in a string.

huangapple
  • 本文由 发表于 2023年7月31日 23:44:14
  • 转载请务必保留本文链接:https://go.coder-hub.com/76805184.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定