PySpark: extract everything after the second-to-last period
Question
I am looking to create a new column that contains all characters after the second-to-last occurrence of the '.' character.
If there are fewer than two '.' characters, then keep the entire string.
I am looking to do this in Spark 2.4.8 without using a UDF. Any ideas?
from pyspark.sql import functions

data = [
    ('google.com',),
    ('asdasdasd.google.com',),
    ('a.d.a.google.com',),
    ('www.google.com',)
]
# sc is the active SparkContext
df = sc.parallelize(data).toDF(['host'])
# current attempt
df.withColumn('domain', functions.regexp_extract(df['host'], r'\b\w+\.\w+\b', 0)).show()
+--------------------+----------------+
|                host|          domain|
+--------------------+----------------+
|          google.com|      google.com|
|asdasdasd.google.com|asdasdasd.google|
|    a.d.a.google.com|             a.d|
|      www.google.com|      www.google|
+--------------------+----------------+
The desired result is the following.
+--------------------+----------------+
|                host|          domain|
+--------------------+----------------+
|          google.com|      google.com|
|asdasdasd.google.com|      google.com|
|    a.d.a.google.com|      google.com|
|      www.google.com|      google.com|
+--------------------+----------------+
Answer 1
Score: 1
First use the split function to split the string into an array, then use the slice function to take the last two elements, and finally use array_join to join those two elements.
import pyspark.sql.functions as F
...
df = df.withColumn('domain', F.array_join(F.slice(F.split('host', '\\.'), -2, 2), '.'))
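To make the three steps concrete, here is a plain-Python model of the same split, slice, join pipeline. It is only a sketch for checking the logic locally: last_two_labels is a hypothetical helper name, and Python list slicing is assumed to stand in for Spark's slice, whose handling of arrays shorter than two elements may differ and is worth checking on real data.

```python
def last_two_labels(host):
    """Plain-Python model of split -> slice(-2, 2) -> array_join."""
    parts = host.split(".")   # models F.split('host', '\\.')
    tail = parts[-2:]         # last two elements (Python slicing, not Spark's slice)
    return ".".join(tail)     # models F.array_join(..., '.')


hosts = ["google.com", "asdasdasd.google.com", "a.d.a.google.com", "www.google.com"]
for host in hosts:
    print(host, "->", last_two_labels(host))
```

Note that the split pattern in the Spark call is a regular expression, which is why the dot is escaped there, while Python's str.split takes a literal separator.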
Answer 2
Score: 1
Simply use substring_index.
import pyspark.sql.functions as f
df.withColumn('domain', f.substring_index('host', '.', -2)).show(truncate=False)
+--------------------+----------+
|host |domain |
+--------------------+----------+
|google.com |google.com|
|asdasdasd.google.com|google.com|
|a.d.a.google.com |google.com|
|www.google.com |google.com|
+--------------------+----------+
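For readers unfamiliar with the count argument, the following pure-Python sketch models how substring_index treats positive and negative counts, and why a host with fewer than two '.' characters comes back unchanged. It is an illustration, not Spark's implementation: substring_index_model is a hypothetical name, and the handling of an empty delimiter or a zero count is an assumption.

```python
def substring_index_model(s, delim, count):
    """Pure-Python model of substring_index(str, delim, count)."""
    # Assumption: an empty delimiter or a zero count yields an empty string.
    if delim == "" or count == 0:
        return ""
    parts = s.split(delim)
    if count > 0:
        # Everything to the left of the count-th delimiter, counting from the left.
        return delim.join(parts[:count])
    # Everything to the right of the |count|-th delimiter, counting from the right.
    return delim.join(parts[count:])


for host in ["google.com", "a.d.a.google.com", "localhost"]:
    print(host, "->", substring_index_model(host, ".", -2))
```

With count = -2 the function keeps whatever follows the second-to-last delimiter, and when there are not enough delimiters the slice simply covers the whole string.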
Answer 3
Score: 0
You can check for a . character with "\." and for "not a . character" with [^\.]. Combining that with $, which marks the end of the string, we can match the last . together with the dot-free text on either side of it, i.e. the last two dot-separated parts (use the re.MULTILINE flag if you want to accept end of line too). However, since it is possible that there is only one . in the string, we can specify an "optional lookbehind" with (?<=\.)?.
import re

data = [
    ('google.com',),
    ('asdasdasd.google.com',),
    ('a.d.a.google.com',),
    ('www.google.com',)
]
# using an optional lookbehind so that a string with only one '.', like the first example, is still accepted
regex = re.compile(r"(?<=\.)?[^\.]*\.[^\.]*$")
for item in data:
    string = item[0]
    match = regex.search(string)
    if match:
        start, end = match.span(0)
        # print the text before the match and the match itself, separated by "//"
        print(string[:start], string[start:end], sep="//")
# output
# //google.com
# asdasdasd.//google.com
# a.d.a.//google.com
# www.//google.com
You can also use match.group(0) to get the matched string; in this example that would be "google.com". The print in my example code is mostly there to show where the split occurs.
Note that if there is no . at all, this regex won't match. To also accept a string without any ., use (?<=\.)?[^\.]*(\.)?[^\.]*$ instead. If you do want to use the re.MULTILINE flag, there is also a variant that excludes newline characters: (?<=\.)?[^\.\n]*(\.)?[^\.\n]*$. Here's a [regexr link to test it].
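As a small sketch of the match.group(0) usage and the no-dot variant described above, the helper below returns the matched tail directly. domain_of is a hypothetical name, and dropping the optional lookbehind is a simplification made here, on the reasoning that an optional zero-width assertion consumes nothing and so cannot change what matches.

```python
import re

# Simplified form of the no-dot variant: the optional lookbehind is omitted
# (assumption: it is a no-op), and the '.' itself is made optional.
LAST_TWO = re.compile(r"[^.]*\.?[^.]*$")


def domain_of(host):
    """Return the last two dot-separated parts, or the whole string if there is no second-to-last '.'."""
    match = LAST_TWO.search(host)
    # search() scans left to right, so the first position from which the
    # pattern can reach the end of the string starts the last two parts.
    return match.group(0) if match else host


for host in ["google.com", "asdasdasd.google.com", "a.d.a.google.com", "localhost"]:
    print(host, "->", domain_of(host))
```

This runs in plain Python on the driver; applying the same idea inside Spark without a UDF would mean passing a pattern to regexp_extract, which uses Java regular expressions and should be tested separately.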