Identifying and removing duplicates in column 2 when checked against column 1
Question
ID | ab_keywords | bc_keywords |
---|---|---|
ABL345 | ryzen,ryzen 7x,ryzen 5800,ryzen 7x | ryzen,ryzen 71x,ryzen 5900,best |
ABL448 | ryzen 5800 7x,ryzen 8x,cpu,ryzen 5800 | ryzen 5900 71x,ryzen 8x,processor,best |
This is my table. I want to identify and remove the values in bc_keywords that are already present in the ab_keywords column.
For example, "ryzen" is present in both columns for ID ABL345, so I want to identify it and remove it from bc_keywords.
So my expected table would look something like this:
ID | ab_keywords | bc_keywords | duplis | bc_new |
---|---|---|---|---|
ABL345 | ryzen,ryzen 7x,ryzen 5800,ryzen 7x | ryzen,ryzen 71x,ryzen 5900,best | ryzen | ryzen 71x,ryzen 5900,best |
ABL448 | ryzen 5800 7x,ryzen 8x,cpu,ryzen 5800 | ryzen 5900 71x,ryzen 8x,processor,best | ryzen 8x | ryzen 5900 71x,processor,best |
Is there any way I can do this?
The "duplis" column is not strictly needed; my main objective is to remove the duplicates and put the new list of keywords in a new column.
I have tried df.duplicated(), but I seem to be doing something wrong and did not get the answer I was looking for:
df.duplicated() just gave me a boolean Series.
I have also tried the following method:
dof['new'] = list(set(dof['bc']) - set(dof['ab']))
dof['new']
dof.head()
but the output seems weird:
ID | ab_keywords | bc_keywords | bc_new |
---|---|---|---|
ABL345 | ryzen,ryzen 7x,ryzen 5800,ryzen 7x | ryzen,ryzen 71x,ryzen 5900,best | ryzen 5900 71x,ryzen 8x,processor,best |
ABL448 | ryzen 5800 7x,ryzen 8x,cpu,ryzen 5800 | ryzen 5900 71x,ryzen 8x,processor,best | ryzen,ryzen 71x,ryzen 5900,best |
Answer 1
Score: 1
Try:
- Split the bc_keywords column by ","
- explode the column to get one row per keyword
- Identify duplicates
- groupby and aggregate as needed
df["bc_keywords"] = df["bc_keywords"].str.split(",")
df = df.explode("bc_keywords")
duplicates = df.apply(lambda row: row["bc_keywords"] in row["ab_keywords"].split(","), axis=1)
df["bc_new"] = df["bc_keywords"].where(~duplicates)
df["duplis"] = df["bc_keywords"].where(duplicates)
output = df.groupby("ID").agg({"ab_keywords": "first",
"bc_keywords": ",".join,
"duplis": "first",
"bc_new": lambda x: ",".join(x.dropna())})
>>> output
ID | ab_keywords | bc_keywords | duplis | bc_new |
---|---|---|---|---|
ABL345 | ryzen,ryzen 7x,ryzen 5800,ryzen 7x | ryzen,ryzen 71x,ryzen 5900,best | ryzen | ryzen 71x,ryzen 5900,best |
ABL448 | ryzen 5800 7x,ryzen 8x,cpu,ryzen 5800 | ryzen 5900 71x,ryzen 8x,processor,best | ryzen 8x | ryzen 5900 71x,processor,best |
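For reference, the snippet above can be run end to end as a minimal sketch; the DataFrame construction here is an assumption reconstructed from the question's sample table:

```python
import pandas as pd

# Sample data reconstructed from the question's table (assumed layout)
df = pd.DataFrame({
    "ID": ["ABL345", "ABL448"],
    "ab_keywords": ["ryzen,ryzen 7x,ryzen 5800,ryzen 7x",
                    "ryzen 5800 7x,ryzen 8x,cpu,ryzen 5800"],
    "bc_keywords": ["ryzen,ryzen 71x,ryzen 5900,best",
                    "ryzen 5900 71x,ryzen 8x,processor,best"],
})

# One row per bc keyword
df["bc_keywords"] = df["bc_keywords"].str.split(",")
df = df.explode("bc_keywords")

# A bc keyword is a duplicate if it also appears in that row's ab_keywords
duplicates = df.apply(
    lambda row: row["bc_keywords"] in row["ab_keywords"].split(","), axis=1
)
df["bc_new"] = df["bc_keywords"].where(~duplicates)
df["duplis"] = df["bc_keywords"].where(duplicates)

# Collapse back to one row per ID; "first" picks the first non-null value
output = df.groupby("ID").agg({"ab_keywords": "first",
                               "bc_keywords": ",".join,
                               "duplis": "first",
                               "bc_new": lambda x: ",".join(x.dropna())})
print(output)
```

Note that `GroupBy.first` skips nulls, which is why `"duplis": "first"` surfaces the duplicated keyword rather than NaN.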
Answer 2
Score: 0
You should apply this logic to each row in your dataframe.
I've created a mock dataset based on your example:
data = pd.DataFrame({
"ab_keywords": ["aaa,bbb,ccc", "bbb,ccc,dd,eee"],
"bc_keywords": ["bbb,ccc,rrr", "ccc,eee,fff,ggg"]
})
Then prepare a function to apply to each row:
def remove_duplicates(row):
return list(set(row['bc_keywords'].split(",")) - set(row['ab_keywords'].split(",")))
data["bc_new"] = data.apply(remove_duplicates, axis=1)
data
Output:
ab_keywords bc_keywords bc_new
0 aaa,bbb,ccc bbb,ccc,rrr [rrr]
1 bbb,ccc,dd,eee ccc,eee,fff,ggg [fff, ggg]
If your values are stored as strings rather than lists, you should split them into lists before applying the function or inside the function.
Edit: I've updated the code to deal with values stored as strings by splitting them into lists first.
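If a comma-separated string is wanted instead of a list (as in the question's expected bc_new column), a small variation of the same row-wise idea can join the result back; this sketch also preserves the original bc order, which a plain set difference would not:

```python
import pandas as pd

# Same mock data as above
data = pd.DataFrame({
    "ab_keywords": ["aaa,bbb,ccc", "bbb,ccc,dd,eee"],
    "bc_keywords": ["bbb,ccc,rrr", "ccc,eee,fff,ggg"],
})

def remove_duplicates(row):
    # Iterate bc keywords in order instead of taking a set difference,
    # which would return the survivors in arbitrary order
    ab = set(row["ab_keywords"].split(","))
    return ",".join(kw for kw in row["bc_keywords"].split(",") if kw not in ab)

data["bc_new"] = data.apply(remove_duplicates, axis=1)
print(data["bc_new"].tolist())  # ['rrr', 'fff,ggg']
```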
Answer 3
Score: 0
A possible solution would be to use a helper function to dedup the keyword lists obtained with split:
def sjoin(kws, sep=","):
return sep.join(filter(None, kws))
def dedup(lst_ab, lst_bc):
dups, new_bc = zip(
*[(bc, None) if bc in lst_ab else (None, bc) for bc in lst_bc]
)
return sjoin(dups), sjoin(new_bc) # <-- add the sep(s) if needed
keywords = df[["ab_keywords", "bc_keywords"]].apply(lambda x: x.str.split(","), axis=1)
df["duplis"], df["new_bc"] = zip(
*[dedup(lst_ab, lst_bc) for lst_ab, lst_bc in keywords.to_numpy()]
)
Output :
ID | ab_keywords | bc_keywords | duplis | new_bc |
---|---|---|---|---|
ABL345 | ryzen,ryzen 7x,ryzen 5800,ryzen 7x | ryzen,ryzen 71x,ryzen 5900,best | ryzen | ryzen 71x,ryzen 5900,best |
ABL448 | ryzen 5800 7x,ryzen 8x,cpu,ryzen 5800 | ryzen 5900 71x,ryzen 8x,processor,best | ryzen 8x | ryzen 5900 71x,processor,best |
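As a quick sanity check, the helper pair behaves as expected on a single row; the inputs below are taken from the question's ABL345 row:

```python
def sjoin(kws, sep=","):
    # Join the keywords, dropping the None placeholders left by dedup
    return sep.join(filter(None, kws))

def dedup(lst_ab, lst_bc):
    # Route each bc keyword to "dups" if it already appears in ab,
    # otherwise to "new_bc"; the other slot gets a None placeholder
    dups, new_bc = zip(
        *[(bc, None) if bc in lst_ab else (None, bc) for bc in lst_bc]
    )
    return sjoin(dups), sjoin(new_bc)

ab = "ryzen,ryzen 7x,ryzen 5800,ryzen 7x".split(",")
bc = "ryzen,ryzen 71x,ryzen 5900,best".split(",")
duplis, new_bc = dedup(ab, bc)
print(duplis)   # ryzen
print(new_bc)   # ryzen 71x,ryzen 5900,best
```

One edge case worth noting: if every bc keyword falls on the same side, `zip(*...)` still works because each tuple in the comprehension always has two elements, but `dedup` would raise on an empty keyword list.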