PySpark: regexp_extract the next 5 words after a match

Question

I have a dataset like this:

column1  column2
First    a a a a b c d e f c d s
Second   d f g r b d s z e r a e
Thirs    d f g v c x w b c x s d f e

I want to use regexp_extract to pull out the next 5 words after the value "b", to obtain this:

column1  column2
First    c d e f c
Second   d s z e r
Thirs    c x s d f

Is this possible? Thanks.

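Answer 1

Score: 3

You can use this, which looks behind for "b " and then matches five single-character words separated by single non-word characters: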

from pyspark.sql import functions as F
df.withColumn("column2", F.regexp_extract(F.col("column2"), r"(?<=b )(\w\W){4}\w", 0)).show()

Output:

+-------+---------+
|column1|  column2|
+-------+---------+
|  First|c d e f c|
| Second|d s z e r|
|  Thirs|c x s d f|
+-------+---------+
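
To see why the lookbehind pattern works on the sample rows, and where it stops working, here is a minimal sketch using Python's `re` module instead of Spark. It assumes that Python and Spark's Java regex engine treat these simple constructs the same way. The `extract` helper and the `MULTI_CHAR` variant are illustrative names that do not appear in the answer.

```python
import re

# Pattern from the answer: lookbehind for "b ", then four
# (word char, non-word char) pairs and one final word character.
SINGLE_CHAR = r"(?<=b )(\w\W){4}\w"

# Hypothetical variant (not from the answer): allow words of any length
# and require "b" to start at a word boundary.
MULTI_CHAR = r"(?<=\bb )\w+(?:\W+\w+){4}"

def extract(pattern, text):
    """Mimic regexp_extract(col, pattern, 0): whole match, or "" if none."""
    m = re.search(pattern, text)
    return m.group(0) if m else ""

rows = {
    "First": "a a a a b c d e f c d s",
    "Second": "d f g r b d s z e r a e",
    "Thirs": "d f g v c x w b c x s d f e",
}
results = {name: extract(SINGLE_CHAR, text) for name, text in rows.items()}
print(results)

# With multi-character words the original pattern finds nothing,
# while the variant still returns the five words after "b".
long_text = "x b one two three four five six"
print(repr(extract(SINGLE_CHAR, long_text)))
print(repr(extract(MULTI_CHAR, long_text)))
```

If the real data contains words longer than one character, something like the variant (or the pattern in the next answer) is probably the safer choice.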

Answer 2

Score: 1

You can use this regex to extract the next 5 words after b:

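from pyspark.sql.functions import col, regexp_extract

# Group 1 captures the five words that follow a standalone "b" (case-insensitive).
pattern = "(?i)\\b(?:b\\W+)(\\w+\\W+\\w+\\W+\\w+\\W+\\w+\\W+\\w+)\\b"
df = df.withColumn("column2", regexp_extract(col("column2"), pattern, 1))
df.show(truncate=False)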

Result:

+-------+---------+
|column1|column2  |
+-------+---------+
|First  |c d e f c|
|Second |d s z e r|
|Thirs  |c x s d f|
+-------+---------+
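
A few properties of this pattern are easier to see outside Spark. The sketch below uses Python's `re` module, assuming it behaves like Spark's Java regex engine for these constructs. It shows that the match is case-insensitive, that the original separators are kept inside the captured group, and that a row with fewer than five words after "b" yields an empty string. `COMPACT` and `extract_group1` are illustrative names, not part of the answer.

```python
import re

# Pattern from the answer, written as a raw string.
PATTERN = r"(?i)\b(?:b\W+)(\w+\W+\w+\W+\w+\W+\w+\W+\w+)\b"

# Shorter spelling of the same idea (an assumption of this sketch).
COMPACT = r"(?i)\bb\W+(\w+(?:\W+\w+){4})\b"

def extract_group1(pattern, text):
    """Mimic regexp_extract(col, pattern, 1): group 1, or "" if no match."""
    m = re.search(pattern, text)
    return m.group(1) if m else ""

samples = [
    "a a a a b c d e f c d s",                       # row from the question
    "intro B alpha, beta gamma delta epsilon zeta",  # upper-case B, comma separator
    "a b c d",                                       # fewer than five words after b
]
for s in samples:
    print(repr(extract_group1(PATTERN, s)), repr(extract_group1(COMPACT, s)))
```

Because of `(?i)`, an upper-case "B" also counts as the marker. Drop that flag if only a lower-case "b" should match.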

  • Posted on April 13, 2023, 20:16:48