April 13, 2023 20:16:48

PySpark : regexp_extract 5 next words after a match

Question

I have a dataset like this:

column1	column2
First	a a a a b c d e f c d s
Second	d f g r b d s z e r a e
Thirs	d f g v c x w b c x s d f e

I want to extract the next 5 words after the "b" value,
using regexp_extract, to obtain this:

column1	column2
First	c d e f c
Second	d s z e r
Thirs	c x s d f

Is it possible? Thanks

Answer 1

Score: 3

You can use this:

from pyspark.sql import functions as F
df.withColumn("column2", F.regexp_extract(F.col("column2"), r"(?<=b )(\w\W){4}\w", 0))

Output:

+-------+---------+
|column1|  column2|
+-------+---------+
|  First|c d e f c|
| Second|d s z e r|
|  Thirs|c x s d f|
+-------+---------+
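
The pattern above can be sanity-checked without a Spark session. The following is a minimal sketch that uses Python's built-in re module instead of Spark's Java regex engine, on the assumption that both treat these particular constructs (fixed-width lookbehind, \w, \W, \b) the same way. The extract helper, the longer sample string, and the ANY_LENGTH variant are hypothetical additions for illustration and are not part of the answer. The sketch also suggests that the answer's pattern relies on every word being a single character.

```python
import re

# Sample rows copied from the question.
rows = {
    "First": "a a a a b c d e f c d s",
    "Second": "d f g r b d s z e r a e",
    "Thirs": "d f g v c x w b c x s d f e",
}

# Pattern from the answer: each word must be exactly one character.
SINGLE_CHAR = r"(?<=b )(\w\W){4}\w"
# Hypothetical variant (not from the answer): words of any length,
# and "b" must be a standalone word.
ANY_LENGTH = r"(?<=\bb )(\w+\W+){4}\w+"


def extract(pattern, text):
    """Mimic regexp_extract(..., 0): the full match, or "" if nothing matches."""
    match = re.search(pattern, text)
    return match.group(0) if match else ""


results = {name: extract(SINGLE_CHAR, text) for name, text in rows.items()}
print(results)

# A made-up row with multi-letter words.
longer = "x y b one two three four five six"
print(repr(extract(SINGLE_CHAR, longer)))
print(extract(ANY_LENGTH, longer))
```

If the variant behaves as expected here, the same pattern string could be passed to F.regexp_extract, but it is worth re-checking on real data because the two regex engines are not identical in general.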

Answer 2

Score: 1

You can use this regex to extract the next 5 words after b:

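from pyspark.sql.functions import col, regexp_extract

# Group 1 captures the five words that follow a standalone "b" (case-insensitive).
pattern = "(?i)\\b(?:b\\W+)(\\w+\\W+\\w+\\W+\\w+\\W+\\w+\\W+\\w+)\\b"
df = df.withColumn("column2", regexp_extract(col("column2"), pattern, 1))
df.show(truncate=False)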

Result:

+-------+---------+
|column1|column2  |
+-------+---------+
|First  |c d e f c|
|Second |d s z e r|
|Thirs  |c x s d f|
+-------+---------+
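
The repeated \w+\W+ groups in this pattern can also be generated rather than typed out. Below is a minimal sketch, again using Python's re module as a stand-in for Spark's regex engine (an assumption that holds for these simple constructs). The words_after helper is a hypothetical name introduced here; for "b" and 5 it builds the same pattern string as the answer.

```python
import re


def words_after(marker, n):
    """Build a pattern whose group 1 is the n words that follow `marker`."""
    words = r"\W+".join([r"\w+"] * n)
    return rf"(?i)\b(?:{re.escape(marker)}\W+)({words})\b"


pattern = words_after("b", 5)
print(pattern)

# Sample rows copied from the question.
rows = {
    "First": "a a a a b c d e f c d s",
    "Second": "d f g r b d s z e r a e",
    "Thirs": "d f g v c x w b c x s d f e",
}

extracted = {}
for name, text in rows.items():
    match = re.search(pattern, text)
    # regexp_extract returns "" when there is no match; mirror that here.
    extracted[name] = match.group(1) if match else ""
print(extracted)

# Because of (?i), an uppercase "B" is treated as the marker too.
upper = re.search(pattern, "A B c d e f g")
print(upper.group(1))
```

The generated string can be passed to regexp_extract with group index 1, as in the answer. Drop the (?i) prefix if only a lowercase "b" should count as the marker.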
