February 24, 2023 09:04:07

PySpark: extract everything that comes after the second-to-last period

Question

I am looking to create a new column that contains all characters after the second-to-last occurrence of the '.' character.

If there are fewer than two '.' characters, then keep the entire string.

I am looking to do this in Spark 2.4.8 without using a UDF. Any ideas? My current attempt and its output:

from pyspark.sql import functions

# Assumes an existing SparkContext named `sc`, as in the pyspark shell.
data = [
    ('google.com',),
    ('asdasdasd.google.com',),
    ('a.d.a.google.com',),
    ('www.google.com',),
]
df = sc.parallelize(data).toDF(['host'])
# regexp_extract returns the first match, so the leading labels are picked up.
df.withColumn('domain', functions.regexp_extract(df['host'], r'\b\w+\.\w+\b', 0)).show()
+--------------------+----------------+
|                host|          domain|
+--------------------+----------------+
|          google.com|      google.com|
|asdasdasd.google.com|asdasdasd.google|
|    a.d.a.google.com|             a.d|
|      www.google.com|      www.google|
+--------------------+----------------+

The desired result is the following.

+--------------------+----------------+
|                host|          domain|
+--------------------+----------------+
|          google.com|      google.com|
|asdasdasd.google.com|      google.com|
|    a.d.a.google.com|      google.com|
|      www.google.com|      google.com|
+--------------------+----------------+

Answer 1

Score: 1

First use the split function to split the string into an array, then use the slice function to slice the last two elements, and finally use array_join to connect the two elements.

import pyspark.sql.functions as F
...
# split takes a regular expression, so the dot has to be escaped
df = df.withColumn('domain', F.array_join(F.slice(F.split('host', r'\.'), -2, 2), '.'))
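
The three steps can be checked without a Spark session using a plain-Python sketch. The helper name last_two_labels is hypothetical, and Python list slicing only stands in for Spark's split, slice and array_join. I have not verified how Spark's slice treats arrays with fewer than two elements, so a host with no dot is worth testing separately in Spark.

```python
def last_two_labels(host: str) -> str:
    """Plain-Python stand-in for split -> slice(-2, 2) -> array_join."""
    parts = host.split(".")        # split on the literal dot
    return ".".join(parts[-2:])    # keep at most the last two pieces


for host in ["google.com", "asdasdasd.google.com", "a.d.a.google.com", "www.google.com"]:
    print(host, "->", last_two_labels(host))
```

In this Python version a host with a single dot, or none, comes back unchanged, because slicing a short list simply returns what is there.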

Answer 2

Score: 1

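Simply use substring_index. With a negative count it returns everything to the right of the count-th delimiter counted from the end, and the whole string if there are fewer delimiters than that.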

import pyspark.sql.functions as f

df.withColumn('domain', f.substring_index('host', '.', -2)).show(truncate=False)
+--------------------+----------+
|host                |domain    |
+--------------------+----------+
|google.com          |google.com|
|asdasdasd.google.com|google.com|
|a.d.a.google.com    |google.com|
|www.google.com      |google.com|
+--------------------+----------+
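
To make the counting rule concrete, here is a rough plain-Python imitation of substring_index. The function substring_index_py is a hypothetical helper written for illustration, not part of PySpark, and it only aims to mirror the documented behaviour for simple string delimiters.

```python
def substring_index_py(text: str, delim: str, count: int) -> str:
    """Rough imitation of Spark's substring_index for illustration only."""
    if not delim or count == 0:
        return ""                          # nothing to return in these cases
    pieces = text.split(delim)
    if count > 0:
        return delim.join(pieces[:count])  # left of the count-th delimiter
    return delim.join(pieces[count:])      # right of the count-th delimiter from the end


for host in ["google.com", "asdasdasd.google.com", "a.d.a.google.com", "www.google.com"]:
    print(host, "->", substring_index_py(host, ".", -2))
```

Because the slice is taken from the end, a host with fewer than two dots is returned whole, which is the fallback the question asks for.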

Answer 3

Score: 0

You can match a literal . character with "\." and "anything that is not a ." with [^\.]. Anchoring the pattern with $, which marks the end of the string, picks out the part that starts right after the second-to-last . (use the re.MULTILINE flag if you also want $ to match at the end of each line). Since the string may contain only one ., the pattern starts with an "optional lookbehind", (?<=\.)?.

import re

data = [
    ('google.com',),
    ('asdasdasd.google.com',),
    ('a.d.a.google.com',),
    ('www.google.com',),
]
# Using an optional lookbehind so that a string with only one '.', like the first example, is still accepted
regex = re.compile(r"(?<=\.)?[^\.]*\.[^\.]*$")
for item in data:
    string = item[0]
    match = regex.search(string)
    if match:
        start, end = match.span(0)
        print(string[:start], string[start:end], sep="//")
# Output:
# //google.com
# asdasdasd.//google.com
# a.d.a.//google.com
# www.//google.com

You can also do match.group(0) to get the matched string. In this example that would be "google.com". The print in my example code is mostly to show where the split occurs.

Note that this regex does not match if the string contains no . at all. To accept such strings as well, use (?<=\.)?[^\.]*(\.)?[^\.]*$ instead. If you do want to use the re.MULTILINE flag, a variant that also excludes newline characters is (?<=\.)?[^\.\n]*(\.)?[^\.\n]*$. Here's a [regexr link to test it].
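
The difference between the two patterns can be seen in a short sketch. It assumes the leading lookbehind can be left out, since an optional zero-width assertion should not change what is matched; the names STRICT, LENIENT and tail are hypothetical and exist only for this illustration.

```python
import re

# Simplified forms of the two patterns, without the optional lookbehind (an assumption).
STRICT = re.compile(r"[^.]*\.[^.]*$")    # requires at least one '.'
LENIENT = re.compile(r"[^.]*\.?[^.]*$")  # the '.' is optional


def tail(pattern, host):
    """Return the matched tail of host, or None when the pattern does not match."""
    match = pattern.search(host)
    return match.group(0) if match else None


for host in ["a.d.a.google.com", "google.com", "localhost"]:
    print(host, "| strict:", tail(STRICT, host), "| lenient:", tail(LENIENT, host))
```

Only the lenient form returns a dot-free string such as localhost unchanged; the strict form finds no match there.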
