英文:
How to populate a column with a value if condition is met across 2+ other colums
问题
我的数据框类似于下面的表格。我有6列,每一列都有'Yes'或'No',表示是否使用了特定的抗生素。
要求在' TREATED '列中填入'True',如果特定的抗生素列的组合包含'Yes'。如果条件不满足,那么我想在'TREATED'列中填入'False'值。
如果 ['AZITH'] 和 ['CLIN'] == 'Yes' 或
['AZITH'] 和 ['CFTX'] 和 ['CLIN'] == 'Yes' 或
['AZITH'] 和 ['CFTX'] 和 ['METRO'] == 'Yes' 或
['AZITH'] 和 ['CFTN'] == 'Yes' 或
['CFTX'] 和 ['DOXY'] 和 ['METRO'] == 'Yes' 或
['CFTN'] 和 ['DOXY'] == 'Yes' 或
['DOXY'] 和 ['METRO'] == 'Yes',
然后在 'TREATED' 列中返回 'True',
否则返回 'False'。
此外,还包括所有6种药物都被使用的情况。如果是这样,那么应该返回'True',因为已经满足了至少使用2种治疗药物的条件。
期望的输出如下:
AZITH | CLIN | CFTX | METRO | CFTN | DOXY | TREATED |
---|---|---|---|---|---|---|
Yes | Yes | No | No | No | No | Yes |
No | Yes | No | Yes | No | No | No |
Yes | Yes | No | No | No | No | Yes |
No | No | No | No | No | No | No |
Yes | Yes | Yes | Yes | Yes | Yes | Yes |
No | Yes | Yes | Yes | No | Yes | Yes |
No | No | No | No | No | Yes | No |
No | No | No | No | No | No | No |
Yes | Yes | Yes | No | No | No | Yes |
Yes | No | Yes | Yes | No | No | Yes |
Yes | No | No | No | Yes | No | Yes |
No | No | Yes | Yes | No | Yes | Yes |
No | No | No | No | Yes | Yes | Yes |
No | No | No | Yes | No | Yes | Yes |
英文:
My dataframe is similar to the table below. I have 6 columns, each with 'Yes' or 'No' if a specific antibiotic was given.
AZITH | CLIN | CFTX | METRO | CFTN | DOXY | TREATED |
---|---|---|---|---|---|---|
Yes | Yes | No | No | No | No | |
No | Yes | No | Yes | No | No | |
Yes | Yes | No | No | No | No | |
No | No | No | No | No | No | |
Yes | Yes | Yes | Yes | Yes | Yes | |
No | Yes | Yes | Yes | No | Yes | |
No | No | No | No | No | Yes | |
No | No | No | No | No | No | |
Yes | Yes | Yes | No | No | No | |
Yes | No | Yes | Yes | No | No | |
Yes | No | No | No | Yes | No | |
No | No | Yes | Yes | No | Yes | |
No | No | No | No | Yes | Yes | |
No | No | No | Yes | No | Yes |
I want to fill the column 'TREATED' with 'True' if specific combinations of antibiotic columns contain 'Yes.' If the conditions aren't met, then I would like to fill the 'TREATED' column with a 'False' value.
> If ['AZITH'] & ['CLIN'] == 'Yes' |
>
> ['AZITH'] & ['CFTX'] & ['CLIN'] == 'Yes' |
>
> ['AZITH'] & ['CFTX'] & ['METRO']== 'Yes' |
>
> ['AZITH'] & ['CFTN'] == 'Yes' |
>
> ['CFTX'] & ['DOXY'] & ['METRO']== 'Yes' |
>
>['CFTN'] & ['DOXY'] == 'Yes' |
>
> ['DOXY'] & ['METRO']== 'Yes' ,
>
> Then return 'True' in column 'TREATED'
>
> Else 'False'
What I had in mind was some sort of if statement or use of lambda function, however, I am having trouble.
This must not be exclusive to the above combinations but also include for example if all 6 medications were given. If that's the case, then 'True' should be returned because the condition has been met to give at least 2 of the treatment medications.
The desired output is below:
AZITH | CLIN | CFTX | METRO | CFTN | DOXY | TREATED |
---|---|---|---|---|---|---|
Yes | Yes | No | No | No | No | Yes |
No | Yes | No | Yes | No | No | No |
Yes | Yes | No | No | No | No | Yes |
No | No | No | No | No | No | No |
Yes | Yes | Yes | Yes | Yes | Yes | Yes |
No | Yes | Yes | Yes | No | Yes | Yes |
No | No | No | No | No | Yes | No |
No | No | No | No | No | No | No |
Yes | Yes | Yes | No | No | No | Yes |
Yes | No | Yes | Yes | No | No | Yes |
Yes | No | No | No | Yes | No | Yes |
No | No | Yes | Yes | No | Yes | Yes |
No | No | No | No | Yes | Yes | Yes |
No | No | No | Yes | No | Yes | Yes |
答案1
得分: 1
使用您提供的数据帧,以下是一个代码示例,用于创建一个名为"TREATED"的新列,根据一组条件将其设置为"Yes"或"No":
mask = (
((df["AZITH"] == "Yes") & (df["CLIN"] == "Yes"))
| ((df["AZITH"] == "Yes") & (df["CLIN"] == "Yes") & (df["CFTX"] == "Yes"))
| ((df["AZITH"] == "Yes") & (df["CFTX"] == "Yes") & (df["METRO"] == "Yes"))
| ((df["AZITH"] == "Yes") & (df["CFTN"] == "Yes"))
| ((df["CFTX"] == "Yes") & (df["DOXY"] == "Yes") & (df["METRO"] == "Yes"))
| ((df["CFTN"] == "Yes") & (df["DOXY"] == "Yes"))
| ((df["DOXY"] == "Yes") & (df["METRO"] == "Yes"))
)
df.loc[mask, "TREATED"] = "Yes"
df = df.fillna("No")
然后,使用print(df)
可以打印出结果数据帧,其中包含了新的"TREATED"列。
希望这能帮助您。如果您需要进一步的翻译或有其他问题,请随时提出。
英文:
With the dataframe you provided:
import pandas as pd
df = pd.DataFrame(
{
"AZITH": [
"Yes",
"No",
"Yes",
"No",
"Yes",
"No",
"No",
"No",
"Yes",
"Yes",
"Yes",
"No",
"No",
"No",
],
"CLIN": [
"Yes",
"Yes",
"Yes",
"No",
"Yes",
"Yes",
"No",
"No",
"Yes",
"No",
"No",
"No",
"No",
"No",
],
"CFTX": [
"No",
"No",
"No",
"No",
"Yes",
"Yes",
"No",
"No",
"Yes",
"Yes",
"No",
"Yes",
"No",
"No",
],
"METRO": [
"No",
"Yes",
"No",
"No",
"Yes",
"Yes",
"No",
"No",
"No",
"Yes",
"No",
"Yes",
"No",
"Yes",
],
"CFTN": [
"No",
"No",
"No",
"No",
"Yes",
"No",
"No",
"No",
"No",
"No",
"Yes",
"No",
"Yes",
"No",
],
"DOXY": [
"No",
"No",
"No",
"No",
"Yes",
"Yes",
"Yes",
"No",
"No",
"No",
"No",
"Yes",
"Yes",
"Yes",
],
}
)
Here is one way to do it:
mask = (
((df["AZITH"] == "Yes") & (df["CLIN"] == "Yes"))
| ((df["AZITH"] == "Yes") & (df["CLIN"] == "Yes") & (df["CFTX"] == "Yes"))
| ((df["AZITH"] == "Yes") & (df["CFTX"] == "Yes") & (df["METRO"] == "Yes"))
| ((df["AZITH"] == "Yes") & (df["CFTN"] == "Yes"))
| ((df["CFTX"] == "Yes") & (df["DOXY"] == "Yes") & (df["METRO"] == "Yes"))
| ((df["CFTN"] == "Yes") & (df["DOXY"] == "Yes"))
| ((df["DOXY"] == "Yes") & (df["METRO"] == "Yes"))
)
df.loc[mask, "TREATED"] = "Yes"
df = df.fillna("No")
Then:
print(df)
# Output
AZITH CLIN CFTX METRO CFTN DOXY TREATED
0 Yes Yes No No No No Yes
1 No Yes No Yes No No No
2 Yes Yes No No No No Yes
3 No No No No No No No
4 Yes Yes Yes Yes Yes Yes Yes
5 No Yes Yes Yes No Yes Yes
6 No No No No No Yes No
7 No No No No No No No
8 Yes Yes Yes No No No Yes
9 Yes No Yes Yes No No Yes
10 Yes No No No Yes No Yes
11 No No Yes Yes No Yes Yes
12 No No No No Yes Yes Yes
13 No No No Yes No Yes Yes
答案2
得分: 0
这有点抽象,但你可以使用位标志来表示每个“是”(True),为它分配一个二进制值,然后根据 if 语句在线程中进行数学运算。
https://dietertack.medium.com/using-bit-flags-in-c-d39ec6e30f08
英文:
This is a little abstract but you can use bit flag to represent each Yes (True) assign it a binary value and then do the math in threated based on an if statement.
https://dietertack.medium.com/using-bit-flags-in-c-d39ec6e30f08
答案3
得分: 0
你可以使用集合操作,首先将给定的药物作为集合聚合,然后检查所有可能的组合是否有超集:
valid_treatments = [{'AZITH', 'CLIN'}, {'AZITH', 'CFTX', 'CLIN'},
{'AZITH', 'CFTX', 'METRO'}, {'AZITH', 'CFTN'},
{'CFTX', 'DOXY', 'METRO'}, {'CFTN', 'DOXY'},
{'DOXY', 'METRO'},
]
def is_valid(row):
combination = set(df.columns[row])
return 'Yes' if any(
combination.issuperset(v)
for v in valid_treatments
) else 'No'
out = df.assign(TREATED=df.eq('Yes').apply(is_valid, axis=1))
输出:
AZITH CLIN CFTX METRO CFTN DOXY TREATED
0 Yes Yes No No No No Yes
1 No Yes No Yes No No No
2 Yes Yes No No No No Yes
3 No No No No No No No
4 Yes Yes Yes Yes Yes Yes Yes
5 No Yes Yes Yes No Yes Yes
6 No No No No No Yes No
7 No No No No No No No
8 Yes Yes Yes No No No Yes
9 Yes No Yes Yes No No Yes
10 Yes No No No Yes No Yes
11 No No Yes Yes No Yes Yes
12 No No No No Yes Yes Yes
13 No No No Yes No Yes Yes
英文:
You can use set operations, first aggregate as the sets of given medications, then check over all possible combinations if you have a superset:
valid_treatments = [{'AZITH', 'CLIN'}, {'AZITH', 'CFTX', 'CLIN'},
{'AZITH', 'CFTX', 'METRO'}, {'AZITH', 'CFTN'},
{'CFTX', 'DOXY', 'METRO'}, {'CFTN', 'DOXY'},
{'DOXY', 'METRO'},
]
def is_valid(row):
combination = set(df.columns[row])
return 'Yes' if any(
combination.issuperset(v)
for v in valid_treatments
) else 'No'
out = df.assign(TREATED=df.eq('Yes').apply(is_valid, axis=1))
Output:
AZITH CLIN CFTX METRO CFTN DOXY TREATED
0 Yes Yes No No No No Yes
1 No Yes No Yes No No No
2 Yes Yes No No No No Yes
3 No No No No No No No
4 Yes Yes Yes Yes Yes Yes Yes
5 No Yes Yes Yes No Yes Yes
6 No No No No No Yes No
7 No No No No No No No
8 Yes Yes Yes No No No Yes
9 Yes No Yes Yes No No Yes
10 Yes No No No Yes No Yes
11 No No Yes Yes No Yes Yes
12 No No No No Yes Yes Yes
13 No No No Yes No Yes Yes
答案4
得分: 0
我的答案将尝试将此解决方案向量化。但我从Mozway那里得到了关于超集的想法。在那之前,我无法弄清如何处理所有的组合。
import numpy as np
import pandas as pd
import itertools
df = pd.DataFrame({"AZITH": ["Yes", "No", "Yes", "No", "Yes", "No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", ],
"CLIN": ["Yes", "Yes", "Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "No", ],
"CFTX": ["No", "No", "No", "No", "Yes", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "No", ],
"METRO": ["No", "Yes", "No", "No", "Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes", "No", "Yes", ],
"CFTN": ["No", "No", "No", "No", "Yes", "No", "No", "No", "No", "No", "Yes", "No", "Yes", "No", ],
"DOXY": ["No", "No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes", ]})
combos = np.array([[1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0], [1, 0, 1, 1, 0, 0], [1, 0, 0, 0, 1, 0], [0, 0, 1, 1, 0, 1], [0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 0, 1]])
df = df.replace("Yes", 1)
df = df.replace("No", 0)
c = []
for l in range(len(combos)):
c.extend(itertools.combinations(range(len(combos)), l))
all_combos = combos
for combo in c[1:]:
combined = np.sum(combos[combo, :], axis=0)
all_combos = np.vstack([all_combos, combined])
all_combos[all_combos != 0] = 1
all_combos = np.unique(all_combos, axis=0)
combo_sum = all_combos.sum(axis=1)
all_combos[all_combos == 0] = -1
new_df = df.dot(all_combos.transpose())
for i, x in enumerate(combo_sum):
new_df.loc[new_df[i] < x, i] = 0
new_df[new_df > 0] = 1
new_df["res"] = new_df.sum(axis=1)
new_df.loc[new_df.res > 0, "res"] = True
new_df.loc[new_df.res == 0, "res"] = False
df["res"] = new_df["res"]
AZITH CLIN CFTX METRO CFTN DOXY res
0 1 1 0 0 0 0 True
1 0 1 0 1 0 0 False
2 1 1 0 0 0 0 True
3 0 0 0 0 0 0 False
4 1 1 1 1 1 1 True
5 0 1 1 1 0 1 False
6 0 0 0 0 0 1 False
7 0 0 0 0 0 0 False
8 1 1 1 0 0 0 True
9 1 0 1 1 0 0 True
10 1 0 0 0 1 0 True
11 0 0 1 1 0 1 True
12 0 0 0 0 1 1 True
13 0 0 0 1 0 1 True
这段代码的一般解释是,我创建了一个包括所有组合的numpy数组,包括可接受的组合的组合(两个或更多组合的总和)。我删除了重复的组合,剩下的是这些:
np.unique(all_combos, axis=0)
Out[38]:
array([[0, 0, 0, 0, 1, 1],
[0, 0, 0, 1, 0, 1],
[0, 0, 0, 1, 1, 1],
[0, 0, 1, 1, 0, 1],
[0, 0, 1, 1, 1, 1],
[1, 0, 0, 0, 1, 0],
[1, 0, 0, 0, 1, 1],
[1, 0, 0, 1, 1, 1],
[1, 0, 1, 1, 0, 0],
[1, 0, 1, 1, 0, 1],
[1, 0, 1, 1, 1, 0],
[1, 0, 1, 1, 1, 1],
[1, 1, 0, 0, 0, 0],
[1, 1, 0, 0, 1, 0],
[1, 1, 0, 0, 1, 1],
[1, 1, 0, 1, 0, 1],
[1, 1, 0, 1, 1, 1],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 0, 1, 0],
[
<details>
<summary>英文:</summary>
My answer is going to attempt to vectorize this solution. But I got the idea for the superset from Mozway. Before that I couldn't figure out how to handle all combos
import numpy as np
import pandas as pd
import itertools
df = pd.DataFrame({ "AZITH": ["Yes","No","Yes","No","Yes","No","No","No","Yes","Yes","Yes","No","No","No",],
"CLIN": ["Yes","Yes","Yes","No","Yes","Yes","No","No","Yes","No","No","No","No","No",],
"CFTX": ["No","No","No","No","Yes","Yes","No","No","Yes","Yes","No","Yes","No","No",],
"METRO": ["No","Yes","No","No","Yes","Yes","No","No","No","Yes","No","Yes","No","Yes",],
"CFTN": ["No","No","No","No","Yes","No","No","No","No","No","Yes","No","Yes","No",],
"DOXY": ["No","No","No","No","Yes","Yes","Yes","No","No","No","No","Yes","Yes","Yes",]})
combos = np.array([[1,1,0,0,0,0],[1,1,1,0,0,0],[1,0,1,1,0,0],[1,0,0,0,1,0],[0,0,1,1,0,1],[0,0,0,0,1,1],[0,0,0,1,0,1]])
df = df.replace("Yes",1)
df = df.replace("No",0)
c = []
for l in range(len(combos)):
c.extend(itertools.combinations(range(len(combos)),l))
all_combos = combos
for combo in c[1:]:
combined = np.sum(combos[combo,:],axis=0)
all_combos = np.vstack([all_combos,combined])
all_combos[all_combos!=0]=1
all_combos = np.unique(all_combos,axis=0)
combo_sum = all_combos.sum(axis=1)
all_combos[all_combos==0]=-1
new_df = df.dot(all_combos.transpose())
for i,x in enumerate(combo_sum):
new_df.loc[new_df[i]<x,i] = 0
new_df[new_df>0]=1
new_df["res"] = new_df.sum(axis=1)
new_df.loc[new_df.res>0,"res"] = True
new_df.loc[new_df.res==0,"res"] = False
df["res"] = new_df["res"]
AZITH CLIN CFTX METRO CFTN DOXY res
0 1 1 0 0 0 0 True
1 0 1 0 1 0 0 False
2 1 1 0 0 0 0 True
3 0 0 0 0 0 0 False
4 1 1 1 1 1 1 True
5 0 1 1 1 0 1 False
6 0 0 0 0 0 1 False
7 0 0 0 0 0 0 False
8 1 1 1 0 0 0 True
9 1 0 1 1 0 0 True
10 1 0 0 0 1 0 True
11 0 0 1 1 0 1 True
12 0 0 0 0 1 1 True
13 0 0 0 1 0 1 True
The general explanation of the code is that I create a numpy array of all combos including combinations of combos (sum of two or more combos) that are acceptable. I delete duplicate combinations and this is what is left
np.unique(all_combos,axis=0)
Out[38]:
array([[0, 0, 0, 0, 1, 1],
[0, 0, 0, 1, 0, 1],
[0, 0, 0, 1, 1, 1],
[0, 0, 1, 1, 0, 1],
[0, 0, 1, 1, 1, 1],
[1, 0, 0, 0, 1, 0],
[1, 0, 0, 0, 1, 1],
[1, 0, 0, 1, 1, 1],
[1, 0, 1, 1, 0, 0],
[1, 0, 1, 1, 0, 1],
[1, 0, 1, 1, 1, 0],
[1, 0, 1, 1, 1, 1],
[1, 1, 0, 0, 0, 0],
[1, 1, 0, 0, 1, 0],
[1, 1, 0, 0, 1, 1],
[1, 1, 0, 1, 0, 1],
[1, 1, 0, 1, 1, 1],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 0, 1, 0],
[1, 1, 1, 0, 1, 1],
[1, 1, 1, 1, 0, 0],
[1, 1, 1, 1, 0, 1],
[1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 1]])
Any extra medications that are not part of the combo are penalized by setting the value to -1 in the combo list. (If extra medications are not to be penalized then the superset is not needed, you can just compare to a sum of the original combos variable.)
A dot product between the dataset and the set of all combos is then done and the value is compared against the sum of the combo (prior to replacing the 0s with -1s). This means if the value is 3 and the expected outcome of the combo is 3, then it is a valid combo. Below is the sum of the combos as an array
ipdb> combo_sum
array([2, 2, 3, 3, 4, 2, 3, 4, 3, 4, 4, 5, 2, 3, 4, 4, 5, 3, 4, 5, 4, 5,
5, 6])
ipdb> new_df
0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
0 -2 -2 -2 -2 -2 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2
1 -2 0 0 0 0 -2 -2 0 0 0 ... 0 2 2 0 0 0 2 2 2 2
2 -2 -2 -2 -2 -2 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 -2 -2 0 0 2 -2 0 2 0 2 ... 2 2 4 0 2 4 2 4 4 6
5 -2 0 0 2 2 -4 -2 0 0 2 ... 0 2 2 0 0 2 2 4 2 4
6 1 1 1 1 1 -1 1 1 -1 1 ... 1 1 1 -1 -1 1 -1 1 -1 1
7 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
8 -3 -3 -3 -1 -1 -1 -1 -1 1 1 ... 1 1 1 3 3 3 3 3 3 3
9 -3 -1 -1 1 1 -1 -1 1 3 3 ... -1 1 1 1 1 1 3 3 3 3
10 0 -2 0 -2 0 2 2 2 0 0 ... 2 0 2 0 2 2 0 0 2 2
11 -1 1 1 3 3 -3 -1 1 1 3 ... -1 1 1 -1 -1 1 1 3 1 3
12 2 0 2 0 2 0 2 2 -2 0 ... 2 0 2 -2 0 2 -2 0 0 2
13 0 2 2 2 2 -2 0 2 0 2 ... 0 2 2 -2 -2 0 0 2 0 2
After the dot product, we replace the valid values with 1 and invalid (less than the expected sum) with 0. We sum on all the combinations to see if any are valid. If the sum of combinations >= 1, then at least one combo was valid. Else, all were invalid.
ipdb> new_df
0 1 2 3 4 5 6 7 8 9 ... 15 16 17 18 19 20 21 22 23 res
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 1
5 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 1
9 0 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 1
10 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
11 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
12 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
13 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
Replace the final summed column with True or false and apply to original dataframe.
</details>
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论