I'm trying to check if an abbreviation is not part of a word

Question

I'm trying to check whether an abbreviation stored in the abbr column appears in the company name in the same row, excluding the cases where the abbreviation is only part of a longer word.

Here is an example:

name                             abbr
ALCON                            co
MIDTOWN BEEF CO                  co
SOUTH SHORE CO ACTION COUNCIL    co

My code is:

display(df.filter(col('name').contains(col('abbr'))))

I want to get only MIDTOWN BEEF CO and SOUTH SHORE CO ACTION COUNCIL, but my code returns all of the rows.

Answer 1

Score: 0

You want to check whether name contains abbr as a standalone word token, not merely as a substring.

There are two ways to do this:

  1. Split the name on spaces and use the array_contains() Spark SQL function:
>>> from pyspark.sql.functions import array_contains, split
>>> df.filter(array_contains(split(df.name, ' '), df.abbr)).show()
+--------------------+----+
|                name|abbr|
+--------------------+----+
|     MIDTOWN BEEF CO|  CO|
|SOUTH SHORE CO AC...|  CO|
+--------------------+----+
  2. Or use a regex with a word boundary \b matcher:
>>> df.filter("name regexp concat('\\\\b', abbr, '\\\\b')").show()
+--------------------+----+
|                name|abbr|
+--------------------+----+
|     MIDTOWN BEEF CO|  CO|
|SOUTH SHORE CO AC...|  CO|
+--------------------+----+

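Option #2 is better if your names also contain punctuation, for example MIDTOWN BEEF CO. with a trailing period. Option #1 misses that row because splitting on spaces yields the token CO. rather than CO, whereas the \b word boundary in Option #2 still matches.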
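
To make the difference between the two options concrete, here is a minimal sketch in plain Python (not Spark) that mirrors the logic of each one. The helper names matches_by_split and matches_by_regex are hypothetical, the fourth row with a trailing period is an added illustration, and the case-insensitive comparison and re.escape() call are assumptions of this sketch, added because the question's sample stores abbr in lowercase and because an abbreviation could contain regex metacharacters.

```python
import re

# Sample rows from the question, plus one hypothetical row with punctuation.
rows = [
    ("ALCON", "co"),
    ("MIDTOWN BEEF CO", "co"),
    ("SOUTH SHORE CO ACTION COUNCIL", "co"),
    ("MIDTOWN BEEF CO.", "co"),  # assumption: extra row, not in the question
]


def matches_by_split(name, abbr):
    """Option 1 logic: abbr must equal one of the space-separated tokens."""
    # Case folding is an assumption of this sketch, not part of the answer.
    return abbr.upper() in name.upper().split(" ")


def matches_by_regex(name, abbr):
    """Option 2 logic: abbr must appear between two word boundaries."""
    # re.escape keeps characters such as '.' in abbr from acting as regex syntax.
    pattern = r"\b" + re.escape(abbr) + r"\b"
    return re.search(pattern, name, flags=re.IGNORECASE) is not None


split_hits = [name for name, abbr in rows if matches_by_split(name, abbr)]
regex_hits = [name for name, abbr in rows if matches_by_regex(name, abbr)]

print(split_hits)  # ALCON and the "CO." row are both excluded
print(regex_hits)  # only ALCON is excluded
```

Both helpers reject ALCON, since CO sits inside a longer word there, but only the regex version accepts the row ending in "CO.". The same trade-off should carry over to the Spark expressions above, although this sketch does not run against Spark itself.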

  • Published on August 9, 2023 at 05:43:52
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76863381.html