Mark one dataframe if a pattern is found in another dataframe

Question

I have two dataframes, and I want to mark rows in the second one when the first one contains their pattern.

A: very large (tens of thousands of rows).

date      | items 
20100605  | apple is red 
20110606  | orange is orange 
20120607  | apple is green

B: shorter with a few hundred rows.

id   |  color
123  |  is Red
234  |  not orange
235  |  is green

The result should flag each row in B whose pattern is found in A, possibly by adding a column to B like this:

B:
id   |  color       | found
123  |  is Red      | true
234  |  not orange  | false
235  |  is green    | true

I was thinking of something like dfB['found'] = dfB['color'].isin(dfA['items']), but I don't see any way to ignore case. Also, this approach would change rows already marked true back to false, and I don't want to change those.

I also believe it is inefficient to loop over large dataframes more than once. Running through A once and marking B would be better, but I'm not sure how to achieve that with isin(). Are there any other ways, especially ones that ignore the case of the pattern?

Answer 1

Score: 1

You can use something like this, where df is A and df2 is B:

# flag a row when its pattern occurs, ignoring case, in any item of df
df2['check'] = df2['color'].apply(lambda x: any(x.casefold() in i.casefold() for i in df['items']))
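
For reference, here is a self-contained sketch of this approach on the sample data from the question. The frame names df and df2 follow the answer. The precomputed list items_folded is an addition made for this sketch, not part of the original answer, so that each item of A is case-folded only once.

```python
import pandas as pd

# Sample data from the question: df is A (large), df2 is B (small).
df = pd.DataFrame({
    "date": [20100605, 20110606, 20120607],
    "items": ["apple is red", "orange is orange", "apple is green"],
})
df2 = pd.DataFrame({
    "id": [123, 234, 235],
    "color": ["is Red", "not orange", "is green"],
})

# Case-fold A's items once, instead of once per row of B (an assumption
# added for this sketch; the one-liner above folds them on every call).
items_folded = [item.casefold() for item in df["items"]]

# A row of B is flagged when its pattern is a substring of any item in A.
df2["check"] = df2["color"].apply(
    lambda pattern: any(pattern.casefold() in item for item in items_folded)
)
print(df2)
```

On the sample data this should flag the rows with id 123 and 235 and leave 234 unflagged, matching the table in the question.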

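Or you can use str.contains. This version takes the second and third words of each item, so it assumes every item has at least three words:

# join the second and third words of every item into one case-insensitive regex
words = df['items'].str.split(" ")
df2['check'] = df2['color'].str.contains('|'.join(words.str[1] + ' ' + words.str[2]), case=False)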
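
The question also asks that rows already marked true stay true, and real patterns may contain regex metacharacters. The following is a minimal sketch under those assumptions, not the answerer's method: it searches A's items for each of B's patterns with str.contains, escapes each pattern with re.escape, and ORs the result into any existing found column. The helper name mark_found and the pre-existing found column are hypothetical, introduced only for illustration.

```python
import re

import pandas as pd


def mark_found(df_a, df_b):
    """Return a copy of df_b with a 'found' flag for patterns present in df_a."""
    items = df_a["items"].astype(str)

    # One vectorized, case-insensitive search over A per pattern in B.
    # re.escape makes characters such as '+' or '.' match literally.
    hits = df_b["color"].apply(
        lambda pattern: items.str.contains(
            re.escape(pattern), case=False, regex=True
        ).any()
    )

    out = df_b.copy()
    # Keep rows that were already marked True (hypothetical earlier pass).
    previous = out["found"] if "found" in out.columns else False
    out["found"] = hits | previous
    return out


df_a = pd.DataFrame({
    "date": [20100605, 20110606, 20120607],
    "items": ["apple is red", "orange is orange", "apple is green"],
})
df_b = pd.DataFrame({
    "id": [123, 234, 235],
    "color": ["is Red", "not orange", "is green"],
})

result = mark_found(df_a, df_b)
print(result)
```

This still scans A once per pattern in B, which is a few hundred vectorized passes for the sizes described in the question. A single pass over A with one combined regex is possible, but overlapping patterns make it harder to get right, so it is left out of this sketch.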
