Identifying and removing duplicates present in column 2 when checked with column 1

Question

ID     | ab_keywords                           | bc_keywords
ABL345 | ryzen,ryzen 7x,ryzen 5800,ryzen 7x    | ryzen,ryzen 71x,ryzen 5900,best
ABL448 | ryzen 5800 7x,ryzen 8x,cpu,ryzen 5800 | ryzen 5900 71x,ryzen 8x,processor,best

This is the table. I want to identify and remove the duplicate values present in bc_keywords if those values are already present in the column ab_keywords.

For example, "ryzen" is present in both columns for ID ABL345, so I want to identify it and remove it from bc_keywords.

So my expected table would look something like this:

ID     | ab_keywords                           | bc_keywords                            | duplis   | bc_new
ABL345 | ryzen,ryzen 7x,ryzen 5800,ryzen 7x    | ryzen,ryzen 71x,ryzen 5900,best        | ryzen    | ryzen 71x,ryzen 5900,best
ABL448 | ryzen 5800 7x,ryzen 8x,cpu,ryzen 5800 | ryzen 5900 71x,ryzen 8x,processor,best | ryzen 8x | ryzen 5900 71x,processor,best

Is there any way that I can do it? The duplis column is not strictly needed either; my main objective is to remove the duplicates and add the new list of keywords in a new column.

I have tried using df.duplicated(), but it definitely seems like I am doing something wrong, and I did not get the answer that I was looking for: df.duplicated() just gave me a boolean Series.
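As far as I understand, df.duplicated() flags whole rows that repeat an earlier row, which is why it only yields a boolean Series and cannot compare two columns against each other. A minimal illustration (toy data, not the table above):

```python
import pandas as pd

# df.duplicated() marks a row True only if the ENTIRE row
# already appeared earlier in the frame.
df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})
print(df.duplicated().tolist())  # [False, True, False]
```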

I have also tried the following method:

dof['new'] = list(set(dof['bc']) - set(dof['ab']))
dof['new']
dof.head()

but the output seems weird:

ID     | ab_keywords                           | bc_keywords                            | bc_new
ABL345 | ryzen,ryzen 7x,ryzen 5800,ryzen 7x    | ryzen,ryzen 71x,ryzen 5900,best        | ryzen 5900 71x,ryzen 8x,processor,best
ABL448 | ryzen 5800 7x,ryzen 8x,cpu,ryzen 5800 | ryzen 5900 71x,ryzen 8x,processor,best | ryzen,ryzen 71x,ryzen 5900,best

Answer 1

Score: 1


Try:

  1. Split the bc_keywords column by ","
  2. explode the column to get one row per keyword
  3. Identify duplicates
  4. groupby and aggregate with agg as needed

df["bc_keywords"] = df["bc_keywords"].str.split(",")
df = df.explode("bc_keywords")
duplicates = df.apply(lambda row: row["bc_keywords"] in row["ab_keywords"].split(","), axis=1)
df["bc_new"] = df["bc_keywords"].where(~duplicates)
df["duplis"] = df["bc_keywords"].where(duplicates)
output = df.groupby("ID").agg({"ab_keywords": "first", 
                               "bc_keywords": ",".join, 
                               "duplis": "first",
                               "bc_new": lambda x: ",".join(x.dropna())})

>>> output
ID     | ab_keywords                           | bc_keywords                            | duplis   | bc_new
ABL345 | ryzen,ryzen 7x,ryzen 5800,ryzen 7x    | ryzen,ryzen 71x,ryzen 5900,best        | ryzen    | ryzen 71x,ryzen 5900,best
ABL448 | ryzen 5800 7x,ryzen 8x,cpu,ryzen 5800 | ryzen 5900 71x,ryzen 8x,processor,best | ryzen 8x | ryzen 5900 71x,processor,best
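For reference, the snippet runs end-to-end on a DataFrame rebuilt from the question's table (values copied from the question; this is just a self-contained sketch of the same steps):

```python
import pandas as pd

# Sample data taken verbatim from the question's table.
df = pd.DataFrame({
    "ID": ["ABL345", "ABL448"],
    "ab_keywords": ["ryzen,ryzen 7x,ryzen 5800,ryzen 7x",
                    "ryzen 5800 7x,ryzen 8x,cpu,ryzen 5800"],
    "bc_keywords": ["ryzen,ryzen 71x,ryzen 5900,best",
                    "ryzen 5900 71x,ryzen 8x,processor,best"],
})

# One row per bc keyword, flag those already present in ab_keywords,
# then collapse back to one row per ID.
df["bc_keywords"] = df["bc_keywords"].str.split(",")
df = df.explode("bc_keywords")
duplicates = df.apply(lambda row: row["bc_keywords"] in row["ab_keywords"].split(","), axis=1)
df["bc_new"] = df["bc_keywords"].where(~duplicates)
df["duplis"] = df["bc_keywords"].where(duplicates)
output = df.groupby("ID").agg({"ab_keywords": "first",
                               "bc_keywords": ",".join,
                               "duplis": "first",
                               "bc_new": lambda x: ",".join(x.dropna())})
print(output["bc_new"].tolist())
# ['ryzen 71x,ryzen 5900,best', 'ryzen 5900 71x,processor,best']
```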

Answer 2

Score: 0


You should apply this logic to each row in your dataframe.

I've created a mock dataset based on your example:

data = pd.DataFrame({
    "ab_keywords": ["aaa,bbb,ccc", "bbb,ccc,dd,eee"],
    "bc_keywords": ["bbb,ccc,rrr", "ccc,eee,fff,ggg"]
})

Then prepare a function to apply to each row:

def remove_duplicates(row):
    return list(set(row['bc_keywords'].split(",")) - set(row['ab_keywords'].split(",")))

data["bc_new"] = data.apply(remove_duplicates, axis=1)
data

Output:

  ab_keywords     bc_keywords      bc_new
0 aaa,bbb,ccc     bbb,ccc,rrr      [rrr]
1 bbb,ccc,dd,eee  ccc,eee,fff,ggg  [fff, ggg]

If your values are stored as strings rather than lists, you should also split them into lists before applying the function or inside the function.

Edit: I've updated the code to deal with values as strings - added splitting them into lists first.
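One caveat: set difference does not preserve the original keyword order. An order-preserving variant of the same idea (a sketch, reusing the mock data above):

```python
import pandas as pd

data = pd.DataFrame({
    "ab_keywords": ["aaa,bbb,ccc", "bbb,ccc,dd,eee"],
    "bc_keywords": ["bbb,ccc,rrr", "ccc,eee,fff,ggg"],
})

def remove_duplicates_ordered(row):
    # Keep bc keywords in their original order, dropping any
    # that also appear in ab_keywords.
    ab = set(row["ab_keywords"].split(","))
    return [kw for kw in row["bc_keywords"].split(",") if kw not in ab]

data["bc_new"] = data.apply(remove_duplicates_ordered, axis=1)
print(data["bc_new"].tolist())  # [['rrr'], ['fff', 'ggg']]
```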

Answer 3

Score: 0


A possible solution would be to use a helper function to dedup the keywords formed with split:

def sjoin(kws, sep=","):
    return sep.join(filter(None, kws))

def dedup(lst_ab, lst_bc):
    dups, new_bc = zip(
        *[(bc, None) if bc in lst_ab else (None, bc) for bc in lst_bc]
    )
    return sjoin(dups), sjoin(new_bc)  # <-- add the sep(s) if needed

keywords = df[["ab_keywords", "bc_keywords"]].apply(lambda x: x.str.split(","), axis=1)

df["duplis"], df["new_bc"] = zip(
    *[dedup(lst_ab, lst_bc) for lst_ab, lst_bc in keywords.to_numpy()]
)

Output:

ID     | ab_keywords                           | bc_keywords                            | duplis   | new_bc
ABL345 | ryzen,ryzen 7x,ryzen 5800,ryzen 7x    | ryzen,ryzen 71x,ryzen 5900,best        | ryzen    | ryzen 71x,ryzen 5900,best
ABL448 | ryzen 5800 7x,ryzen 8x,cpu,ryzen 5800 | ryzen 5900 71x,ryzen 8x,processor,best | ryzen 8x | ryzen 5900 71x,processor,best
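The helper pair can be sanity-checked on a toy input (keyword values borrowed from the question):

```python
# Definitions copied from the answer above, checked on a small input.
def sjoin(kws, sep=","):
    return sep.join(filter(None, kws))

def dedup(lst_ab, lst_bc):
    # Pair each bc keyword with (dup, None) or (None, new),
    # then unzip into the two output strings.
    dups, new_bc = zip(
        *[(bc, None) if bc in lst_ab else (None, bc) for bc in lst_bc]
    )
    return sjoin(dups), sjoin(new_bc)

dups, new_bc = dedup(["ryzen", "ryzen 7x"], ["ryzen", "ryzen 71x", "best"])
print(dups, "|", new_bc)  # ryzen | ryzen 71x,best
```

Note that `dedup` raises on an empty bc list (zip of nothing), so empty cells would need a guard.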

huangapple
  • Posted on 2023-06-22 01:11:36
  • Please keep this link when reposting: https://go.coder-hub.com/76525670.html