PySpark: extract everything after the second-to-last period
Question
I am looking to create a new column that contains all characters after the second-to-last occurrence of the '.' character.
If there are fewer than two '.' characters, then keep the entire string.
I am looking to do this in Spark 2.4.8 without using a UDF. Any ideas?
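Here is my attempt, which returns the first two labels instead of the last two:
from pyspark.sql import functions
data = [
    ('google.com',),
    ('asdasdasd.google.com',),
    ('a.d.a.google.com',),
    ('www.google.com',)
]
# sc is the active SparkContext (for example, in the pyspark shell)
df = sc.parallelize(data).toDF(['host'])
# regexp_extract returns the first match of the pattern, not the last one
df.withColumn('domain', functions.regexp_extract(df['host'], r'\b\w+\.\w+\b', 0)).show()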
+--------------------+----------------+
|                host|          domain|
+--------------------+----------------+
|          google.com|      google.com|
|asdasdasd.google.com|asdasdasd.google|
|    a.d.a.google.com|             a.d|
|      www.google.com|      www.google|
+--------------------+----------------+
The desired result is the following.
+--------------------+----------------+
|                host|          domain|
+--------------------+----------------+
|          google.com|      google.com|
|asdasdasd.google.com|      google.com|
|    a.d.a.google.com|      google.com|
|      www.google.com|      google.com|
+--------------------+----------------+
Answer 1
Score: 1
First use the split function to split the string into an array, then use the slice function to take the last two elements, and finally use array_join to join those two elements back together with a '.'. The split pattern is a regular expression, so the period has to be escaped.
import pyspark.sql.functions as F
...
df = df.withColumn('domain', F.array_join(F.slice(F.split('host', '\\.'), -2, 2), '.'))
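To make the three steps easier to follow, here is a minimal plain-Python sketch of the same logic. It does not use Spark at all, and `last_two_labels` is a hypothetical helper name chosen for illustration. It also shows why the period must be escaped in the split pattern.

```python
import re

def last_two_labels(host: str) -> str:
    """Plain-Python analogue of split -> slice -> array_join (hypothetical helper)."""
    parts = re.split(r'\.', host)  # escaped period, as in F.split('host', '\\.')
    tail = parts[-2:]              # last two elements, as in F.slice(..., -2, 2)
    return '.'.join(tail)          # glue them back together, as in F.array_join(..., '.')

hosts = ['google.com', 'asdasdasd.google.com', 'a.d.a.google.com', 'www.google.com']
results = [last_two_labels(h) for h in hosts]
print(results)

# With an unescaped '.', the regex matches every character, so only empty pieces remain.
unescaped = re.split('.', 'google.com')
print(unescaped)
```

Python's `[-2:]` quietly returns a shorter list when fewer than two elements exist. Whether F.slice does the same for a host with no period at all is not something this sketch verifies, so that edge case is worth checking separately in Spark.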
Answer 2
Score: 1
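Simply use substring_index. With a negative count it returns everything to the right of that delimiter, counted from the end of the string, and it returns the whole string when there are fewer delimiters than requested.
import pyspark.sql.functions as f
df.withColumn('domain', f.substring_index('host', '.', -2)).show(truncate=False)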
+--------------------+----------+
|host                |domain    |
+--------------------+----------+
|google.com          |google.com|
|asdasdasd.google.com|google.com|
|a.d.a.google.com    |google.com|
|www.google.com      |google.com|
+--------------------+----------+
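For readers unfamiliar with substring_index, the following plain-Python sketch approximates its semantics as I understand them from the documentation. It is a hypothetical re-implementation for illustration only, not Spark's actual code, and the handling of a zero count or an empty delimiter is an assumption.

```python
def substring_index(s: str, delim: str, count: int) -> str:
    """Approximate SQL-style substring_index in plain Python (illustrative sketch)."""
    if count == 0 or delim == '':
        # Assumption: these degenerate cases yield an empty string.
        return ''
    parts = s.split(delim)
    if count > 0:
        # Everything to the left of the count-th delimiter, counted from the start.
        return delim.join(parts[:count])
    # Everything to the right of the |count|-th delimiter, counted from the end.
    return delim.join(parts[count:])

hosts = ['google.com', 'asdasdasd.google.com', 'a.d.a.google.com', 'www.google.com']
domains = [substring_index(h, '.', -2) for h in hosts]
print(domains)
```

Because the slice simply runs out of elements when the delimiter occurs fewer than two times, a host such as 'google.com' comes back unchanged, which matches the requirement in the question.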
Answer 3
Score: 0
You can match a . character with "\." and "anything that is not a . character" with [^\.]. Combining that with $, which marks the end of the string, captures the last two period-separated parts (use the re.MULTILINE flag if you also want $ to match at the end of each line). However, since the string may contain only one ., the pattern starts with an "optional lookbehind", (?<=\.)?.
import re
data = [
    ('google.com',),
    ('asdasdasd.google.com',),
    ('a.d.a.google.com',),
    ('www.google.com',)
]
# optional lookbehind so that a string with only one '.', like the first example, is still accepted
regex = re.compile(r"(?<=\.)?[^\.]*\.[^\.]*$")
for item in data:
    string = item[0]
    match = regex.search(string)
    if match:
        start, end = match.span(0)
        # print the part before the match and the match itself, separated by "//"
        print(string[:start], string[start:end], sep="//")
# output:
# //google.com
# asdasdasd.//google.com
# a.d.a.//google.com
# www.//google.com
You can also call match.group(0) to get the matched string. In this example that would be "google.com". The print in my example code is mostly there to show where the split occurs.
Note that this regex does not match if the string contains no . at all. To accept such strings as well, use (?<=\.)?[^\.]*(\.)?[^\.]*$ instead. If you do want to use the re.MULTILINE flag, there is also a variant that excludes newline characters: (?<=\.)?[^\.\n]*(\.)?[^\.\n]*$. Here's a [regexr link to test it].
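As a side note, an optional lookbehind can always match zero times, so as far as I can tell it never constrains the result and the pattern can be written without it. The sketch below uses that shorter form together with an optional period, so strings without any . are returned whole. `LAST_TWO` and `domain_of` are hypothetical names for illustration.

```python
import re

# No lookbehind; the optional period lets strings without any '.' match as a whole.
LAST_TWO = re.compile(r'[^.]*\.?[^.]*$')

def domain_of(host: str) -> str:
    """Return the last two period-separated parts of host (illustrative helper)."""
    match = LAST_TWO.search(host)
    # search() finds the leftmost position from which the pattern reaches the end.
    return match.group(0) if match else host

hosts = ['google.com', 'asdasdasd.google.com', 'a.d.a.google.com', 'www.google.com', 'localhost']
extracted = [domain_of(h) for h in hosts]
print(extracted)
```

Since the question asks for a Spark solution, the same pattern could presumably be passed to regexp_extract, for example functions.regexp_extract(df['host'], r'[^.]*\.?[^.]*$', 0). That call is not tested here and should be treated as an assumption.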