PySpark: extract everything after the second-to-last period

Question

I want to create a new column that contains all characters after the second-to-last occurrence of the '.' character.

If the string contains fewer than two '.' characters, keep the entire string.

I need to do this in Spark 2.4.8 without using a UDF. Any ideas? My attempt so far, with its output:

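from pyspark.sql import functions

data = [
    ('google.com',),
    ('asdasdasd.google.com',),
    ('a.d.a.google.com',),
    ('www.google.com',),
]

# sc is the SparkContext of an already running session (e.g. the pyspark shell)
df = sc.parallelize(data).toDF(['host'])
# The pattern is not anchored to the end, so it returns the FIRST "word.word" pair
df.withColumn('domain', functions.regexp_extract(df['host'], r'\b\w+\.\w+\b', 0)).show()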
+--------------------+----------------+
|                host|          domain|
+--------------------+----------------+
|          google.com|      google.com|
|asdasdasd.google.com|asdasdasd.google|
|    a.d.a.google.com|             a.d|
|      www.google.com|      www.google|
+--------------------+----------------+

The desired result is the following.

+--------------------+----------------+
|                host|          domain|
+--------------------+----------------+
|          google.com|      google.com|
|asdasdasd.google.com|      google.com|
|    a.d.a.google.com|      google.com|
|      www.google.com|      google.com|
+--------------------+----------------+

Answer 1

Score: 1

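First use the split function to split the string into an array, then use the slice function to keep the last two elements, and finally use array_join to join those two elements back together. Note that split takes a regular expression, so the '.' has to be escaped as '\\.'. Both slice and array_join are available from Spark 2.4.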

import pyspark.sql.functions as F

...
df = df.withColumn('domain', F.array_join(F.slice(F.split('host', '\\.'), -2, 2), '.'))
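
To make the three steps concrete, here is a plain-Python analogue of the split / slice / array_join chain. It only illustrates the logic and is not Spark code; the helper name last_two_labels is made up for this sketch, and Python's list slicing is not guaranteed to treat very short inputs the same way Spark's slice does.

```python
def last_two_labels(host: str) -> str:
    """Plain-Python analogue of: split on '.', keep the last two parts, join with '.'."""
    # str.split takes a literal separator, unlike Spark's split, which takes a regex
    parts = host.split('.')
    # parts[-2:] keeps the last two elements (or everything, if there are fewer than two)
    return '.'.join(parts[-2:])


for host in ['google.com', 'asdasdasd.google.com', 'a.d.a.google.com', 'www.google.com']:
    print(host, '->', last_two_labels(host))
```

Each sample host from the question maps to google.com here, which is the same result the Spark expression is meant to produce.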

Answer 2

Score: 1

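Simply use substring_index. With a negative count it returns everything to the right of the count-th delimiter counted from the end, and it returns the whole string when there are fewer delimiters than that.

import pyspark.sql.functions as f

df.withColumn('domain', f.substring_index('host', '.', -2)).show(truncate=False)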

+--------------------+----------+
|host                |domain    |
+--------------------+----------+
|google.com          |google.com|
|asdasdasd.google.com|google.com|
|a.d.a.google.com    |google.com|
|www.google.com      |google.com|
+--------------------+----------+
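
For readers who want to see the negative-count behaviour spelled out, the following is a small plain-Python model of it. It is a sketch for illustration, not Spark's implementation; the function name keep_last_parts and its parameter n (the absolute value of the count) are assumptions of this example.

```python
def keep_last_parts(text: str, delim: str, n: int) -> str:
    """Model of substring_index(text, delim, -n): keep what follows the n-th delimiter from the right."""
    # rsplit performs at most n splits, working from the right-hand side
    pieces = text.rsplit(delim, n)
    if len(pieces) <= n:
        # Fewer than n delimiters were found, so the whole string is kept
        return text
    # pieces[0] is everything left of the n-th delimiter from the right; drop it
    return delim.join(pieces[1:])


for host in ['google.com', 'a.d.a.google.com', 'localhost']:
    print(host, '->', keep_last_parts(host, '.', 2))
```

The 'localhost' case shows why this approach fits the requirement in the question: a string with fewer than two '.' characters comes back unchanged.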

Answer 3

Score: 0

You can match a literal . character with "\." and "not a . character" with [^\.]. Combining these with $, which marks the end of the string, matches the last two dot-separated parts (use the re.MULTILINE flag if $ should also match at the end of each line). However, since the string may contain only one ., we can add an "optional lookbehind" with (?<=\.)?. Note that this is plain Python using the re module, not a Spark expression.

import re
data = [
    ('google.com',),
    ('asdasdasd.google.com',),
    ('a.d.a.google.com',),
    ('www.google.com',)
]

# using an optional lookbehind so that a string with only one '.', like the first example, is still accepted
regex = re.compile(r"(?<=\.)?[^\.]*\.[^\.]*$")
for item in data:
    string = item[0]
    match = regex.search(string)
    if match:
        start, end = match.span(0)
        print(string[:start], string[start:end], sep="//")

# output
//google.com
asdasdasd.//google.com
a.d.a.//google.com
www.//google.com

You can also call match.group(0) to get the matched string. In this example that would be "google.com". The print in my example code is mostly there to show where the split occurs.

Note that if the string contains no . at all, this regex will not match. To accept a string without any ., use (?<=\.)?[^\.]*(\.)?[^\.]*$ instead. If you do want to use the re.MULTILINE flag, there is also a version that excludes newline characters: (?<=\.)?[^\.\n]*(\.)?[^\.\n]*$. Here's a [regexr link to test it].
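
As a rough illustration of the match.group(0) usage and of the variant that tolerates a missing '.', here is a short plain-Python sketch. It is not the answer's exact pattern: the lookbehind is left out and the group around the dot is dropped, on the assumption that the rest of the pattern already decides what is matched for hosts like these. The names TAIL and domain_of are made up for this example.

```python
import re

# Assumed simplification of the pattern discussed above: no lookbehind,
# and the '.' is optional so that a host without any '.' still matches.
TAIL = re.compile(r'[^.]*\.?[^.]*$')


def domain_of(host: str) -> str:
    """Return the last two dot-separated parts of host, or host itself if there is no '.'."""
    match = TAIL.search(host)
    # group(0) is the full matched text; fall back to the input if nothing matched
    return match.group(0) if match else host


for host in ['google.com', 'a.d.a.google.com', 'www.google.com', 'localhost']:
    print(host, '->', domain_of(host))
```

Because every piece of the pattern before $ can match the empty string, search() finds a match for any input, so the fallback branch is only a safeguard.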

huangapple
  • Published on February 24, 2023, 09:04:07
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75551763.html