2023-06-22 01:11:36

Identifying and removing duplicates present in column 2 when checked with column 1

Question

ID	ab_keywords	bc_keywords
ABL345	ryzen,ryzen 7x,ryzen 5800,ryzen 7x	ryzen,ryzen 71x,ryzen 5900,best
ABL448	ryzen 5800 7x,ryzen 8x,cpu,ryzen 5800	ryzen 5900 71x,ryzen 8x,processor,best

This is my table. I want to identify and remove the values in bc_keywords that are already present in ab_keywords for the same row.

For example, "ryzen" is present in both columns for ID ABL345, so I want to identify it and remove it from bc_keywords.

My expected table would look something like this:

ID	ab_keywords	bc_keywords	duplis	bc_new
ABL345	ryzen,ryzen 7x,ryzen 5800,ryzen 7x	ryzen,ryzen 71x,ryzen 5900,best	ryzen	ryzen 71x,ryzen 5900,best
ABL448	ryzen 5800 7x,ryzen 8x,cpu,ryzen 5800	ryzen 5900 71x,ryzen 8x,processor,best	ryzen 8x	ryzen 5900 71x,processor,best

Is there any way to do this?
The "duplis" column is not strictly needed; my main objective is to remove the duplicates and put the remaining keywords in a new column.

I tried df.duplicated(), but I am clearly doing something wrong, because it did not give me the result I was looking for.

df.duplicated() just gave me a boolean Series.

I have also tried the following:

dof['new'] = list(set(dof['bc']) - set(dof['ab']))
dof['new']
dof.head()

But the output looks wrong:

ID	ab_keywords	bc_keywords	bc_new
ABL345	ryzen,ryzen 7x,ryzen 5800,ryzen 7x	ryzen,ryzen 71x,ryzen 5900,best	ryzen 5900 71x,ryzen 8x,processor,best
ABL448	ryzen 5800 7x,ryzen 8x,cpu,ryzen 5800	ryzen 5900 71x,ryzen 8x,processor,best	ryzen,ryzen 71x,ryzen 5900,best

Answer 1

Score: 1

Try:

Split the bc_keywords column on ","
Explode the column to get one row per keyword
Identify the duplicates
Use groupby and agg to aggregate as needed

df["bc_keywords"] = df["bc_keywords"].str.split(",")
df = df.explode("bc_keywords")  # one row per bc keyword
# True where the bc keyword also appears in that row's ab_keywords
duplicates = df.apply(lambda row: row["bc_keywords"] in row["ab_keywords"].split(","), axis=1)
df["bc_new"] = df["bc_keywords"].where(~duplicates)
df["duplis"] = df["bc_keywords"].where(duplicates)
# "first" would keep only the first duplicate per ID, so join all of them instead
output = df.groupby("ID").agg({"ab_keywords": "first",
                               "bc_keywords": ",".join,
                               "duplis": lambda x: ",".join(x.dropna()),
                               "bc_new": lambda x: ",".join(x.dropna())})

>>> output

ID	ab_keywords	bc_keywords	duplis	bc_new
ABL345	ryzen,ryzen 7x,ryzen 5800,ryzen 7x	ryzen,ryzen 71x,ryzen 5900,best	ryzen	ryzen 71x,ryzen 5900,best
ABL448	ryzen 5800 7x,ryzen 8x,cpu,ryzen 5800	ryzen 5900 71x,ryzen 8x,processor,best	ryzen 8x	ryzen 5900 71x,processor,best
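
For anyone who wants to run the steps above end to end, here is a minimal self-contained sketch. The sample frame is rebuilt from the table in the question. The names `exploded`, `is_dup` and `join_non_null`, the use of `ignore_index=True`, and the named-aggregation style are my own choices for illustration, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question's table
df = pd.DataFrame({
    "ID": ["ABL345", "ABL448"],
    "ab_keywords": ["ryzen,ryzen 7x,ryzen 5800,ryzen 7x",
                    "ryzen 5800 7x,ryzen 8x,cpu,ryzen 5800"],
    "bc_keywords": ["ryzen,ryzen 71x,ryzen 5900,best",
                    "ryzen 5900 71x,ryzen 8x,processor,best"],
})

# One row per bc keyword; ignore_index avoids repeated index labels
exploded = (
    df.assign(bc_keywords=df["bc_keywords"].str.split(","))
      .explode("bc_keywords", ignore_index=True)
)

# True where the bc keyword is also one of that row's ab keywords
is_dup = exploded.apply(
    lambda row: row["bc_keywords"] in row["ab_keywords"].split(","), axis=1
)
exploded["duplis"] = exploded["bc_keywords"].where(is_dup)
exploded["bc_new"] = exploded["bc_keywords"].where(~is_dup)


def join_non_null(values):
    """Join the non-missing keywords of one group with commas."""
    return ",".join(values.dropna())


output = exploded.groupby("ID", sort=False).agg(
    ab_keywords=("ab_keywords", "first"),
    bc_keywords=("bc_keywords", ",".join),
    duplis=("duplis", join_non_null),
    bc_new=("bc_new", join_non_null),
).reset_index()

print(output[["ID", "duplis", "bc_new"]])
```

Working on a copy (`exploded`) leaves the original `df` untouched, which can be convenient if the frame is reused later.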

Answer 2

Score: 0

You should apply this logic to each row in your dataframe.
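
To see why the column-level attempt in the question misbehaves, it may help to look at what the sets actually contain. This is only a small sketch: the frame is rebuilt from the question's table, with `dof`, `ab` and `bc` mirroring the names used there, and `column_diff` is a name I made up.

```python
import pandas as pd

# Sample data copied from the question; column names shortened as in the attempt
dof = pd.DataFrame({
    "ab": ["ryzen,ryzen 7x,ryzen 5800,ryzen 7x",
           "ryzen 5800 7x,ryzen 8x,cpu,ryzen 5800"],
    "bc": ["ryzen,ryzen 71x,ryzen 5900,best",
           "ryzen 5900 71x,ryzen 8x,processor,best"],
})

# Each set element is an entire cell, not a single keyword. No bc cell equals
# an ab cell, so the difference is just every bc cell, in arbitrary set order.
column_diff = set(dof["bc"]) - set(dof["ab"])
print(column_diff == set(dof["bc"]))
```

Because a set has no defined order, turning that difference back into a list can hand the cells back in a different row order, which is consistent with the swapped values shown in the question.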

I've created a mock dataset based on your example:

import pandas as pd

data = pd.DataFrame({
    "ab_keywords": ["aaa,bbb,ccc", "bbb,ccc,dd,eee"],
    "bc_keywords": ["bbb,ccc,rrr", "ccc,eee,fff,ggg"]
})

Then prepare a function to apply to each row:

def remove_duplicates(row):
    return list(set(row['bc_keywords'].split(",")) - set(row['ab_keywords'].split(",")))

data["bc_new"] = data.apply(remove_duplicates, axis=1)
data

Output:

    ab_keywords     bc_keywords      bc_new
0   aaa,bbb,ccc     bbb,ccc,rrr      [rrr]
1   bbb,ccc,dd,eee  ccc,eee,fff,ggg  [fff, ggg]

If your values are stored as strings rather than lists, split them into lists first, either before applying the function or inside it.

Edit: I've updated the code to handle string values by splitting them into lists first.
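
One caveat with the set-based version: sets are unordered, so the order of the remaining keywords is not guaranteed, and the result is a list rather than a comma-separated string like the one in the question. A minimal sketch of an order-preserving variant follows; `remove_duplicates_ordered` is a hypothetical name, and it assumes both columns hold plain comma-separated strings.

```python
import pandas as pd

data = pd.DataFrame({
    "ab_keywords": ["aaa,bbb,ccc", "bbb,ccc,dd,eee"],
    "bc_keywords": ["bbb,ccc,rrr", "ccc,eee,fff,ggg"],
})


def remove_duplicates_ordered(row):
    """Keep bc keywords that are absent from ab, in their original order."""
    seen = set(row["ab_keywords"].split(","))  # set only for fast lookups
    kept = [kw for kw in row["bc_keywords"].split(",") if kw not in seen]
    return ",".join(kept)


data["bc_new"] = data.apply(remove_duplicates_ordered, axis=1)
print(data)
```

The set is used only for membership tests, while the list comprehension walks bc_keywords left to right, so the surviving keywords keep their original order.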

Answer 3

Score: 0

A possible solution is to use a helper function that dedups the keyword lists produced by split:

def sjoin(kws, sep=","):
    # Join the keywords, skipping None placeholders and empty strings
    return sep.join(filter(None, kws))

def dedup(lst_ab, lst_bc):
    # Send each bc keyword to dups if it is in lst_ab, otherwise to new_bc
    dups, new_bc = zip(
        *[(bc, None) if bc in lst_ab else (None, bc) for bc in lst_bc]
    )
    return sjoin(dups), sjoin(new_bc) # <-- add the sep(s) if needed

keywords = df[["ab_keywords", "bc_keywords"]].apply(lambda x: x.str.split(","), axis=1)

df["duplis"], df["new_bc"] = zip(
    *[dedup(lst_ab, lst_bc) for lst_ab, lst_bc in keywords.to_numpy()]
)

Output:

ID	ab_keywords	bc_keywords	duplis	new_bc
ABL345	ryzen,ryzen 7x,ryzen 5800,ryzen 7x	ryzen,ryzen 71x,ryzen 5900,best	ryzen	ryzen 71x,ryzen 5900,best
ABL448	ryzen 5800 7x,ryzen 8x,cpu,ryzen 5800	ryzen 5900 71x,ryzen 8x,processor,best	ryzen 8x	ryzen 5900 71x,processor,best
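
If the real data is less tidy than the sample, for instance with a space after each comma, exact string comparison would miss matches. The following is a rough sketch of a whitespace-tolerant variant, not the answer's own code: `split_keywords` and `dedup_row` are hypothetical helpers, and the messy one-row frame is made up for illustration.

```python
import pandas as pd


def split_keywords(cell, sep=","):
    """Split a keyword cell, strip whitespace, and drop empty pieces."""
    return [kw.strip() for kw in cell.split(sep) if kw.strip()]


def dedup_row(ab_cell, bc_cell, sep=","):
    """Return (duplicates, remaining bc keywords) as joined strings."""
    ab = set(split_keywords(ab_cell, sep))
    bc = split_keywords(bc_cell, sep)
    dups = [kw for kw in bc if kw in ab]
    new_bc = [kw for kw in bc if kw not in ab]
    return sep.join(dups), sep.join(new_bc)


# Made-up row with ", " separators (assumption: not from the question)
df = pd.DataFrame({
    "ID": ["ABL345"],
    "ab_keywords": ["ryzen, ryzen 7x, ryzen 5800"],
    "bc_keywords": ["ryzen, ryzen 71x, best"],
})

pairs = [dedup_row(ab, bc) for ab, bc in zip(df["ab_keywords"], df["bc_keywords"])]
df["duplis"] = [dups for dups, _ in pairs]
df["new_bc"] = [new_bc for _, new_bc in pairs]
print(df[["duplis", "new_bc"]])
```

Iterating over the two columns with `zip` avoids a row-wise `apply`, which tends to be slower on large frames.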
