2023年7月31日 23:44:14go评论80阅读模式

英文:

Add a column based on a condition in Pandas

问题

我有以下的数据框（df）。

    data = {
        'policyId': ['X','X','X','X', 'Y','Y','Y','Y', 'Z', 'Z','Z','Z'],
        'element_id': [242, 243, 241, 257, 242, 243, 241, 257, 242, 243, 241, 257],
        'element_selected': [True, True, True, True, False, True, True, True, True, True, True, False]
    }
    df = pd.DataFrame(data)

所有的 `policyIds` 具有相同的 `element_id`。不同的是，`element_id` 可以被选择或者不被选择（`element_selected` 决定了这一点）。我想要在数据框中添加一个 `contract_type` 列。如果 `element_id` 242 被选择了，`contract_type` 就应该是 "内容"。如果 `element_id` 257 被选择了，那么 `contract_type` 就应该是 "私人责任"。如果 `element_id` 242 和 257 都被选择了，那么 `contract_type` 就应该是 "内容 + 私人责任"。

问题在于每个 `policyId` 应该只有一个 `contract_type` 值，我不知道如何实现这个逻辑。

期望的输出：

	policyId	element_id	element_selected	contract_type
	X	               242	True	            内容 + 私人责任
	X	               243	True	            内容 + 私人责任
	X	               241	True	            内容 + 私人责任
	X	               257	True	            内容 + 私人责任
	Y	               242	False	            私人责任
	Y	               243	True	            私人责任
	Y	               241	True	            私人责任
	Y	               257	True	            私人责任
	Z	               242	True	            内容
	Z	               243	True	            内容
	Z	               241	True	            内容
	Z	               257	False	            内容

英文:

I have the following df.

data = {
    &#39;policyId&#39;: [&#39;X&#39;,&#39;X&#39;,&#39;X&#39;,&#39;X&#39;, &#39;Y&#39;,&#39;Y&#39;,&#39;Y&#39;,&#39;Y&#39;, &#39;Z&#39;, &#39;Z&#39;,&#39;Z&#39;,&#39;Z&#39;],
    &#39;element_id&#39;: [242, 243, 241, 257, 242, 243, 241, 257, 242, 243, 241, 257],
    &#39;element_selected&#39;: [True, True, True, True, False, True, True, True, True, True, True, False]
}
df = pd.DataFrame(data)

All policyIds have the same element_id. What changes is that the element_id can be select or not (element_selected assigns that). I want to add a contract_type column to the df. If element_id 242 has been selected the contract_type should be "CONTENT". If element_id 257 has been selected then the contract_type should "PRIVATE LIABILITY". If element_id 242 and 257 have both been selected then the contract_type should "CONTENT + PRIVATE LIABILITY".

The problem is that each policyId should have only one value per contract_type and I do not know how to implement this logic

Expected output

policyId	element_id	element_selected	contract_type
X	               242	True	            CONTENT + PRIVATE LIABILITY
X	               243	True	            CONTENT + PRIVATE LIABILITY
X	               241	True	            CONTENT + PRIVATE LIABILITY
X	               257	True	            CONTENT + PRIVATE LIABILITY
Y	               242	False	            PRIVATE LIABILITY
Y	               243	True	            PRIVATE LIABILITY
Y	               241	True	            PRIVATE LIABILITY
Y	               257	True	            PRIVATE LIABILITY
Z	               242	True	            CONTENT
Z	               243	True	            CONTENT
Z	               241	True	            CONTENT
Z	               257	False	            CONTENT

答案1

得分: 1

解决方案

将数据框进行透视

s = df.pivot(index='policyId', columns='element_id', values='element_selected')

# element_id   241    242   243    257
# policyId                            
# X           True   True  True   True
# Y           True  False  True   True
# Z           True   True  True  False

使用 np.select 基于预定义条件选择选项

s['contract_type'] = np.select(
    展开收缩
].all(axis=1), s[242], s[257]],
    ["CONTENT + PRIVATE LIABILITY", "CONTENT", "PRIVATE LIABILITY"]
)

# element_id   241    242   243    257                contract_type
# policyId                                                         
# X           True   True  True   True  CONTENT + PRIVATE LIABILITY
# Y           True  False  True   True            PRIVATE LIABILITY
# Z           True   True  True  False                      CONTENT

将选定的选项映射回数据框

df['contract_type'] = df['policyId'].map(s['contract_type'])

#    policyId  element_id  element_selected                contract_type
# 0         X         242              True  CONTENT + PRIVATE LIABILITY
# 1         X         243              True  CONTENT + PRIVATE LIABILITY
# 2         X         241              True  CONTENT + PRIVATE LIABILITY
# 3         X         257              True  CONTENT + PRIVATE LIABILITY
# 4         Y         242             False            PRIVATE LIABILITY
# 5         Y         243              True            PRIVATE LIABILITY
# 6         Y         241              True            PRIVATE LIABILITY
# 7         Y         257              True            PRIVATE LIABILITY
# 8         Z         242              True                      CONTENT
# 9         Z         243              True                      CONTENT
# 10        Z         241              True                      CONTENT
# 11        Z         257             False                      CONTENT

英文:

Solution

Pivot the dataframe

s = df.pivot(index=&#39;policyId&#39;, columns=&#39;element_id&#39;, values=&#39;element_selected&#39;)

# element_id   241    242   243    257
# policyId                            
# X           True   True  True   True
# Y           True  False  True   True
# Z           True   True  True  False

Use np.select to select the choices based on predefined conditions

s[&#39;contract_type&#39;] = np.select(
    展开收缩].all(axis=1), s[242], s[257]],
    [&quot;CONTENT + PRIVATE LIABILITY&quot;, &quot;CONTENT&quot;, &quot;PRIVATE LIABILITY&quot;]
)

# element_id   241    242   243    257                contract_type
# policyId                                                         
# X           True   True  True   True  CONTENT + PRIVATE LIABILITY
# Y           True  False  True   True            PRIVATE LIABILITY
# Z           True   True  True  False                      CONTENT

Map the selected choices back to dataframe

df[&#39;contract_type&#39;] = df[&#39;policyId&#39;].map(s[&#39;contract_type&#39;])

#    policyId  element_id  element_selected                contract_type
# 0         X         242              True  CONTENT + PRIVATE LIABILITY
# 1         X         243              True  CONTENT + PRIVATE LIABILITY
# 2         X         241              True  CONTENT + PRIVATE LIABILITY
# 3         X         257              True  CONTENT + PRIVATE LIABILITY
# 4         Y         242             False            PRIVATE LIABILITY
# 5         Y         243              True            PRIVATE LIABILITY
# 6         Y         241              True            PRIVATE LIABILITY
# 7         Y         257              True            PRIVATE LIABILITY
# 8         Z         242              True                      CONTENT
# 9         Z         243              True                      CONTENT
# 10        Z         241              True                      CONTENT
# 11        Z         257             False                      CONTENT

答案2

得分: 0

你可以使用一个名为"policyId"的分组，然后实现一个映射函数，用于执行分配"contract_type"的逻辑：

def get_contract_type(group):
    types = []
    # 获取选定的ID
    ids = list(group[group["element_selected"]]["element_id"])

    # 根据选定的ID添加类型
    if 242 in ids:
        types.append("CONTENT")

    if 257 in ids:
       types.append("PRIVATE LIABILITY")

    # 添加新的"contract_type"列
    group["contract_type"] = " + ".join(types)
    return group

df.groupby("policyId").apply(get_contract_type)

英文:

you could use a groupy "policyId" and then implement a mapping function that does the logic of assigning the contract_type:

def get_contract_type(group):
    types = []
    # get selected ids
    ids = list(group[group[&quot;element_selected&quot;]][&quot;element_id&quot;])

    # add the types based on the selected ids
    if 242 in ids:
        types.append(&quot;CONTENT&quot;)

    if 257 in ids:
       types.append(&quot;PRIVATE LIABILITY&quot;)


    # add the new contract_type column
    group[&quot;contract_type&quot;] = &quot; + &quot;.join(types)
    return group

df.groupby(&quot;policyId&quot;).apply(get_contract_type)

答案3

得分: 0

首先，让我们创建一个包含零的临时列：

df["contract_type"] = 0

初始化这一列为零的目的将在后面的答案中变得清晰。

现在，按照 policyId 进行分组：

policy_groups = df.groupby("policyId")

接下来，对于每个组，筛选出其中 element_selected 为 False 的行，并查看 element_id 列中是否存在 242 和 257。如果存在 242，让我们为该组中的所有行添加 1 到 contract_type。如果存在 257，让我们添加 2。

for policy_id, group in policy_groups:
    selected_ids = group.loc[group["element_selected"], "element_id"]
    if (selected_ids == 242).any():
        df.loc[group.index, "contract_type"] += 1
    if (selected_ids == 257).any():
        df.loc[group.index, "contract_type"] += 2

在此阶段，df 看起来如下：

   policyId  element_id  element_selected  contract_type
0         X         242              True              3
1         X         243              True              3
2         X         241              True              3
3         X         257              True              3
4         Y         242             False              2
5         Y         243              True              2
6         Y         241              True              2
7         Y         257              True              2
8         Z         242              True              1
9         Z         243              True              1
10        Z         241              True              1
11        Z         257             False              1

现在，对于所有组已经完成，我们可以将 contract_type 列中的整数映射到所需的字符串：

df["contract_type"] = df["contract_type"].map({0: "", 1: "CONTENT", 2: "PRIVATE LIABILITY", 3: "CONTENT + PRIVATE LIABILITY"})

然后我们得到了所需的数据框：

   policyId  element_id  element_selected                contract_type
0         X         242              True  CONTENT + PRIVATE LIABILITY
1         X         243              True  CONTENT + PRIVATE LIABILITY
2         X         241              True  CONTENT + PRIVATE LIABILITY
3         X         257              True  CONTENT + PRIVATE LIABILITY
4         Y         242             False            PRIVATE LIABILITY
5         Y         243              True            PRIVATE LIABILITY
6         Y         241              True            PRIVATE LIABILITY
7         Y         257              True            PRIVATE LIABILITY
8         Z         242              True                      CONTENT
9         Z         243              True                      CONTENT
10        Z         241              True                      CONTENT
11        Z         257             False                      CONTENT

英文:

First, let's create a temporary column that contains zeros:

df[&quot;contract_type&quot;] = 0

The purpose of initializing this column to zeros will become clear later in this answer.

Now, group by policyId:

policy_groups = df.groupby(&quot;policyId&quot;)

Next, for each group, filter out the rows where element_selected is False, and see if 242 and 257 exist in the element_id column. If 242 exists, let's add 1 to contract_type for all rows in the group. If 257 exists, let's add 2.

for policy_id, group in policy_groups:
    selected_ids = group.loc[group[&quot;element_selected&quot;], &quot;element_id&quot;]
    if (selected_ids == 242).any():
        df.loc[group.index, &quot;contract_type&quot;] += 1
    if (selected_ids == 257).any():
        df.loc[group.index, &quot;contract_type&quot;] += 2

At this stage, df looks like:

   policyId  element_id  element_selected  contract_type
0         X         242              True              3
1         X         243              True              3
2         X         241              True              3
3         X         257              True              3
4         Y         242             False              2
5         Y         243              True              2
6         Y         241              True              2
7         Y         257              True              2
8         Z         242              True              1
9         Z         243              True              1
10        Z         241              True              1
11        Z         257             False              1

Now that this is complete for all groups, we can map the integers in the contract_type column to the required strings:

df[&quot;contract_type&quot;] = df[&quot;contract_type&quot;].map({0: &quot;&quot;, 1: &quot;CONTENT&quot;, 2: &quot;PRIVATE LIABILITY&quot;, 3: &quot;CONTENT + PRIVATE LIABILITY&quot;})

And we have the required dataframe:

   policyId  element_id  element_selected                contract_type
0         X         242              True  CONTENT + PRIVATE LIABILITY
1         X         243              True  CONTENT + PRIVATE LIABILITY
2         X         241              True  CONTENT + PRIVATE LIABILITY
3         X         257              True  CONTENT + PRIVATE LIABILITY
4         Y         242             False            PRIVATE LIABILITY
5         Y         243              True            PRIVATE LIABILITY
6         Y         241              True            PRIVATE LIABILITY
7         Y         257              True            PRIVATE LIABILITY
8         Z         242              True                      CONTENT
9         Z         243              True                      CONTENT
10        Z         241              True                      CONTENT
11        Z         257             False                      CONTENT

答案4

得分: 0

尝试这个：

(df.assign(contract_type = df['policyId']
           .map(df['element_id'].map({242:'0',257:'1'})
                .where(df['element_selected'])
                .str.get_dummies()
                .groupby(df['policyId']).any()
                .mul([1,2])
                .sum(axis=1)
                .map(dict(enumerate(['NEITHER','CONTENT','PRIVATE LIABILITY','CONTENT + PRIVATE LIABILITY'])))))

或者

df.assign(
    contract_type = (df['policyId'].map(
        df['element_id']
        .sort_values()
        .map(
            {242:'CONTENT',257:'PRIVATE LIABILITY'})
            .where(df['element_selected'] & ~df.duplicated(subset = ['policyId','element_id']))
            .dropna()
            .groupby(df['policyId'])
            .agg(' + '.join))))

输出：

       policyId  element_id  element_selected                contract_type
    0         X         242              True  CONTENT + PRIVATE LIABILITY
    1         X         243              True  CONTENT + PRIVATE LIABILITY
    2         X         241              True  CONTENT + PRIVATE LIABILITY
    3         X         257              True  CONTENT + PRIVATE LIABILITY
    4         Y         242             False            PRIVATE LIABILITY
    5         Y         243              True            PRIVATE LIABILITY
    6         Y         241              True            PRIVATE LIABILITY
    7         Y         257              True            PRIVATE LIABILITY
    8         Z         242              True                      CONTENT
    9         Z         243              True                      CONTENT
    10        Z         241              True                      CONTENT
    11        Z         257             False                      CONTENT

英文:

Try this:

(df.assign(contract_type = df[&#39;policyId&#39;]
           .map(df[&#39;element_id&#39;].map({242:&#39;0&#39;,257:&#39;1&#39;})
                .where(df[&#39;element_selected&#39;])
                .str.get_dummies()
                .groupby(df[&#39;policyId&#39;]).any()
                .mul([1,2])
                .sum(axis=1)
                .map(dict(enumerate([&#39;NEITHER&#39;,&#39;CONTENT&#39;,&#39;PRIVATE LIABILITY&#39;,&#39;CONTENT + PRIVATE LIABILITY&#39;]))))))

df.assign(
    contract_type = (df[&#39;policyId&#39;].map(
        df[&#39;element_id&#39;]
        .sort_values()
        .map(
            {242:&#39;CONTENT&#39;,257:&#39;PRIVATE LIABILITY&#39;})
            .where(df[&#39;element_selected&#39;] &amp; ~df.duplicated(subset = [&#39;policyId&#39;,&#39;element_id&#39;]))
            .dropna()
            .groupby(df[&#39;policyId&#39;])
            .agg(&#39; + &#39;.join))))

Output:

   policyId  element_id  element_selected                contract_type
0         X         242              True  CONTENT + PRIVATE LIABILITY
1         X         243              True  CONTENT + PRIVATE LIABILITY
2         X         241              True  CONTENT + PRIVATE LIABILITY
3         X         257              True  CONTENT + PRIVATE LIABILITY
4         Y         242             False            PRIVATE LIABILITY
5         Y         243              True            PRIVATE LIABILITY
6         Y         241              True            PRIVATE LIABILITY
7         Y         257              True            PRIVATE LIABILITY
8         Z         242              True                      CONTENT
9         Z         243              True                      CONTENT
10        Z         241              True                      CONTENT
11        Z         257             False                      CONTENT

答案5

得分: 0

让我们使用CategoricalDtype来实现

ContractTypes = {
    242: '内容',
    257: '私人责任'
}

contract_type = (
    df.set_index('policyId')
    .product('columns')
    .astype(pd.CategoricalDtype(ContractTypes))
    .cat.rename_categories(ContractTypes.values())
    .dropna()
    .groupby(level=0)
    .agg(' + '.join)
    .rename('合同类型')
)

df = df.set_index('policyId').join(contract_type).reset_index()

首先，我们创建了一个合同类型代码的字典，其中键是代码，值是相关描述。随着时间的推移，我们可能想要在这里添加一个新的合同类型，而不必与代码的其余部分一起进行修改。

接下来，我们创建了一个系列contract_type：

将policyId作为索引以便进一步按索引分组；
沿着行进行乘法运算，将“false”代码替换为零；
将结果转换为分类ContractTypes，将未使用的合同类型替换为NaN，然后可以安全地丢弃它们，因为我们的数据与policyId索引相关；
接下来，我们使用相关描述重命名类别；
现在，我们通过索引（即level=0）对其余数据进行分组，并使用+连接它们；
最后，我们为序列命名，这将在与df进一步连接后成为一个列的名称。

让我们通过数据透视表来实现

ContractTypes = pd.Series({
    242: '内容',
    257: '私人责任'
})

selected = df.pivot(
    index='policyId',
    columns='element_id',
    values='element_selected'
).apply(
    lambda s: ' + '.join(ContractTypes展开收缩), 
    axis='columns'
)

df['合同类型'] = df['policyId'].map(selected)

这是Shubham Sharma的答案的升级版本。
唯一的变化发生在中间，其中将数据透视表的每一行用作ContractTypes系列中的索引，并将获取的描述连接成字符串。

英文:

Let's do it with CategoricalDtype

ContractTypes = {
    242: &#39;Content&#39;,
    257: &#39;Private liability&#39;
}

contract_type = (
    df.set_index(&#39;policyId&#39;)
    .product(&#39;columns&#39;)
    .astype(pd.CategoricalDtype(ContractTypes))
    .cat.rename_categories(ContractTypes.values())
    .dropna()
    .groupby(level=0)
    .agg(&#39; + &#39;.join)
    .rename(&#39;contract_type&#39;)
)

df = df.set_index(&#39;policyId&#39;).join(contract_type).reset_index()

First we create a dictionary of contract type codes as keys and associated descriptions. With time we may want to add a new contract type here without mitigating with the rest of the code.

Next we create a series contract_type:

put policyId as an index for further grouping by index;
product along the rows returns zero in place of false codes;
transforming the result to categorical ContractTypes will replace unused contract types with NaN, which in turn we can safely discard, since the data we have is tied to the policyId index;
next we rename categories with their associated descriptions;
now we group the rest of the data by index (i.e. level=0) and join them together with ' + ' in between;
finally we give the sequence a name which will become a name of a column after further joining with df.

Let's do it by Pivoting

ContractTypes = pd.Series({
    242: &#39;Content&#39;,
    257: &#39;Private liability&#39;
})

selected = df.pivot(
    index=&#39;policyId&#39;,
    columns=&#39;element_id&#39;,
    values=&#39;element_selected&#39;
).apply(
    lambda s: &#39; + &#39;.join(ContractTypes展开收缩
), 
    axis=&#39;columns&#39;
)

df[&#39;contract_type&#39;] = df[&#39;policyId&#39;].map(selected)

This one is an upgraded version of the Shubham Sharma's answer.
The only change is made in the middle, where each row of the pivoted data is used as an index in the ContractTypes series, and obtained descriptions are joined in a string.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

在Pandas中根据条件添加一列

问题

答案1

解决方案

Solution

答案2

答案3

答案4

答案5

让我们使用CategoricalDtype来实现

让我们通过数据透视表来实现

Let's do it with CategoricalDtype

Let's do it by Pivoting

Python正则表达式带有可选组，至少匹配一次。

NetworkX有一个函数可以找到0度节点的k个最近节点吗？

我的代码执行乘客摘要太早了

解决一个含有一个变量的方程以得到特定的结果

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论