I'm trying to check if an abbreviation is not part of a word
Question
I'm trying to check whether an abbreviation stored in column abbr appears in the company name of the same entity (same row), excluding cases where the abbreviation is only part of a longer word.
Here is an example:
name | abbr |
---|---|
ALCON | co |
MIDTOWN BEEF CO | co |
SOUTH SHORE CO ACTION COUNCIL | co |
My code is:
display(df.filter(col('name').contains(col('abbr'))))
I want to get only MIDTOWN BEEF CO and SOUTH SHORE CO ACTION COUNCIL, but my code returns all of them.
Answer 1
Score: 0
You want to check whether name contains abbr as a standalone word token, not merely as a substring.
There are two ways to do this:
- Split name on spaces and use the array_contains() Spark SQL function:
>>> from pyspark.sql.functions import array_contains, split
>>> df.filter(array_contains(split(df.name, ' '), df.abbr)).show()
+--------------------+----+
| name|abbr|
+--------------------+----+
| MIDTOWN BEEF CO| CO|
|SOUTH SHORE CO AC...| CO|
+--------------------+----+
- Or use a regex with the word-boundary matcher \b:
>>> df.filter("name regexp concat('\\\\b', abbr, '\\\\b')").show()
+--------------------+----+
| name|abbr|
+--------------------+----+
| MIDTOWN BEEF CO| CO|
|SOUTH SHORE CO AC...| CO|
+--------------------+----+
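Option #2 is better if your names also contain punctuation, as in MIDTOWN BEEF CO. with a trailing period. Splitting on spaces yields the token CO. (period included), which does not equal CO, so Option #1 misses that row, whereas \b still matches between the O and the period.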
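The difference between the two options can be checked without a Spark session. The following is a minimal plain-Python sketch, not Spark code: contains_token and contains_word are hypothetical helper names that mimic Option #1 and Option #2 respectively, and the abbreviation is assumed to be upper-case CO, as in the output shown above. The re.escape call is an addition that is not part of the answer.

```python
import re


def contains_token(name: str, abbr: str) -> bool:
    """Analogue of Option #1: split on single spaces, then look for an exact token."""
    return abbr in name.split(" ")


def contains_word(name: str, abbr: str) -> bool:
    """Analogue of Option #2: regex search with \\b on both sides of the abbreviation."""
    # re.escape is an addition (not in the answer): it stops characters such as
    # '.' or '+' inside an abbreviation from being interpreted as regex syntax.
    pattern = r"\b" + re.escape(abbr) + r"\b"
    return re.search(pattern, name) is not None


names = [
    "ALCON",
    "MIDTOWN BEEF CO",
    "SOUTH SHORE CO ACTION COUNCIL",
    "MIDTOWN BEEF CO.",  # trailing period: the case where the two options differ
]

for name in names:
    print(f"{name!r}: token={contains_token(name, 'CO')}, word={contains_word(name, 'CO')}")
```

Only the last name separates the two helpers: the split produces the token CO. with the period attached, so the exact-token check fails, while the word-boundary search still finds CO.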