PySpark: extract everything after the second-to-last period

Question

I want to create a new column that contains all characters after the second-to-last occurrence of the '.' character.

If there are fewer than two '.' characters, keep the entire string.

I need to do this in Spark 2.4.8 without using a UDF. Any ideas?

What I have tried so far:

    from pyspark.sql import functions

    data = [
        ('google.com',),
        ('asdasdasd.google.com',),
        ('a.d.a.google.com',),
        ('www.google.com',)
    ]
    # sc is assumed to be an existing SparkContext (e.g. from the pyspark shell)
    df = sc.parallelize(data).toDF(['host'])
    # Group 0 is the first match of word.word, scanning from the left
    df.withColumn('domain', functions.regexp_extract(df['host'], r'\b\w+\.\w+\b', 0)).show()

This prints:

    +--------------------+----------------+
    |                host|          domain|
    +--------------------+----------------+
    |          google.com|      google.com|
    |asdasdasd.google.com|asdasdasd.google|
    |    a.d.a.google.com|             a.d|
    |      www.google.com|      www.google|
    +--------------------+----------------+

The desired result is the following.

    +--------------------+----------------+
    |                host|          domain|
    +--------------------+----------------+
    |          google.com|      google.com|
    |asdasdasd.google.com|      google.com|
    |    a.d.a.google.com|      google.com|
    |      www.google.com|      google.com|
    +--------------------+----------------+

Answer 1

Score: 1

First use split to break the string into an array on '.', then use slice to take the last two elements, and finally use array_join to join them back together with '.'. Note that split takes a regular expression, so the period has to be escaped. Both slice and array_join are available from Spark 2.4.

    import pyspark.sql.functions as F
    ...
    df = df.withColumn('domain', F.array_join(F.slice(F.split('host', '\\.'), -2, 2), '.'))
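
To see what this expression does to each value, here is a minimal plain-Python sketch of the same three steps (split, take the last two elements, join). It only models the logic and is not Spark code; the helper name last_two_labels is made up for illustration. Python slicing keeps a lone element when there is no '.', and Spark's slice may not behave the same way for arrays shorter than two elements, so that case is worth checking on a real DataFrame.

```python
def last_two_labels(host):
    """Plain-Python model of split -> slice(last two) -> array_join (hypothetical helper)."""
    parts = host.split('.')      # str.split takes a literal separator, so no escaping here
    return '.'.join(parts[-2:])  # keep at most the last two parts


hosts = ['google.com', 'asdasdasd.google.com', 'a.d.a.google.com', 'www.google.com']
for host in hosts:
    print(host, '->', last_two_labels(host))
```

For all four hosts from the question this yields google.com, which matches the desired result.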

Answer 2

Score: 1

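Simply use substring_index. With a negative count, it returns everything to the right of the count-th delimiter counting from the end; if the string contains fewer delimiters than that, the whole string is returned.

    import pyspark.sql.functions as f

    df.withColumn('domain', f.substring_index('host', '.', -2)).show(truncate=False)

Output:

    +--------------------+----------+
    |host                |domain    |
    +--------------------+----------+
    |google.com          |google.com|
    |asdasdasd.google.com|google.com|
    |a.d.a.google.com    |google.com|
    |www.google.com      |google.com|
    +--------------------+----------+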
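
For readers who want to reason about the negative count without a Spark session, the following is a rough plain-Python model of that behavior. It is only a sketch: substring_index_from_right is a hypothetical helper, not part of PySpark, and it covers only the negative-count case used above.

```python
def substring_index_from_right(s, delim, n):
    """Model of substring_index(s, delim, -n) for n >= 1 (hypothetical helper)."""
    # rsplit cuts at most n times, starting from the right
    pieces = s.rsplit(delim, n)
    # More than n pieces means at least n delimiters were found:
    # drop the leftover text to the left of the n-th delimiter from the end
    if len(pieces) > n:
        pieces = pieces[1:]
    return delim.join(pieces)


for host in ['google.com', 'a.d.a.google.com', 'localhost']:
    print(host, '->', substring_index_from_right(host, '.', 2))
```

A host with fewer than two '.' characters comes back unchanged in this model, which is the behavior the question asks for.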

Answer 3

Score: 0

You can match a . character with \. and "not a . character" with [^\.]. Combining these with $, which marks the end of the string, matches the last two dot-separated parts (use the re.MULTILINE flag if you also want $ to match at the end of each line). Since the string may contain only one ., the pattern starts with an "optional lookbehind", (?<=\.)?.

    import re

    data = [
        ('google.com',),
        ('asdasdasd.google.com',),
        ('a.d.a.google.com',),
        ('www.google.com',)
    ]
    # using an optional lookbehind so that a string with only one '.',
    # like the first example, is still accepted
    regex = re.compile(r"(?<=\.)?[^\.]*\.[^\.]*$")
    for item in data:
        string = item[0]
        match = regex.search(string)
        if match:
            start, end = match.span(0)
            # print the text before the match and the match itself, separated by //
            print(string[:start], string[start:end], sep="//")

    # output
    # //google.com
    # asdasdasd.//google.com
    # a.d.a.//google.com
    # www.//google.com

You can also call match.group(0) to get the matched string, which in this example would be "google.com". The print in my example code is mostly there to show where the split occurs.

Note that this regex does not match if the string contains no . at all. To also accept a string without any ., use (?<=\.)?[^\.]*(\.)?[^\.]*$ instead. If you do want to use the re.MULTILINE flag, there is also a variant that takes newline characters into account: (?<=\.)?[^\.\n]*(\.)?[^\.\n]*$. Here's a [regexr link to test it].
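
A lookbehind is a zero-width assertion, so making it optional means it can never reject a match, and the pattern should behave the same without it. The sketch below, in plain Python with the standard re module, uses that shorter form together with match.group(0). The names DOMAIN_RE and last_two_parts are illustrative and not from the answer, and whether the same pattern can be passed to Spark's regexp_extract is not tested here.

```python
import re

# Same idea as the pattern above, without the optional lookbehind;
# \.? makes the dot optional so a string with no '.' matches as a whole.
DOMAIN_RE = re.compile(r"[^.]*\.?[^.]*$")


def last_two_parts(host):
    """Return the last two dot-separated parts, or the whole string if there are fewer."""
    return DOMAIN_RE.search(host).group(0)


for host in ["google.com", "a.d.a.google.com", "localhost"]:
    print(host, "->", last_two_parts(host))
```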

Posted by huangapple on 2023-02-24 09:04:07
When reposting, please keep the link to this article: https://go.coder-hub.com/75551763.html