PySpark: regexp_extract the next 5 words after a match
Question
I have a dataset like this:
column1 | column2 |
---|---|
First | a a a a b c d e f c d s |
Second | d f g r b d s z e r a e |
Thirs | d f g v c x w b c x s d f e |
I want to extract the next 5 words after the "b" value,
using regexp_extract, to obtain this:
column1 | column2 |
---|---|
First | c d e f c |
Second | d s z e r |
Thirs | c x s d f |
Is it possible? Thanks.
Answer 1
Score: 3
You can use this:
from pyspark.sql import functions as F
df.withColumn("column2", F.regexp_extract(F.col("column2"), r"(?<=b )(\w\W){4}\w", 0)).show()
Output:
+-------+---------+
|column1|  column2|
+-------+---------+
| First|c d e f c|
| Second|d s z e r|
| Thirs|c x s d f|
+-------+---------+
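To sanity-check this pattern without a Spark session, it can be run through Python's `re` module on the sample rows. This is only a sketch: Spark evaluates the pattern with Java's regex engine, and the assumption here is that both engines agree on these constructs (fixed-width lookbehind, `\w`, `\W`). The helper name `next_five_after_b` is made up for illustration.

```python
import re

# Same pattern as in the answer above: a lookbehind for "b ", then four
# (word char + non-word char) pairs and one final word char.
PATTERN = re.compile(r"(?<=b )(\w\W){4}\w")


def next_five_after_b(text: str) -> str:
    """Return the matched span, or "" when nothing matches (like regexp_extract)."""
    m = PATTERN.search(text)
    return m.group(0) if m else ""


# Sample rows copied from the question.
rows = {
    "First": "a a a a b c d e f c d s",
    "Second": "d f g r b d s z e r a e",
    "Thirs": "d f g v c x w b c x s d f e",
}

results = {name: next_five_after_b(text) for name, text in rows.items()}
for name, value in results.items():
    print(name, "|", value)
```

Two limits of the pattern are worth keeping in mind: `\w\W` assumes one-character words with a single separator between them, and the lookbehind only checks the two characters "b ", so it also fires after a longer word that ends in b.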
Answer 2
Score: 1
You can use this regex to extract the 5 next words after b:
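from pyspark.sql.functions import col, regexp_extract

# (?i) makes the match case-insensitive; group 1 captures the five words after a standalone "b".
pattern = "(?i)\\b(?:b\\W+)(\\w+\\W+\\w+\\W+\\w+\\W+\\w+\\W+\\w+)\\b"
df = df.withColumn("column2", regexp_extract(col("column2"), pattern, 1))
df.show(truncate=False)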
Result:
+-------+---------+
|column1|column2 |
+-------+---------+
|First |c d e f c|
|Second |d s z e r|
|Thirs |c x s d f|
+-------+---------+
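The five repeated `\w+` groups in this pattern can be generated for any word count instead of being written out by hand. Below is a minimal sketch, again checked with Python's `re` rather than Spark; `build_pattern` and `extract_after` are hypothetical helper names, and the assumption is that Java's regex engine treats the generated string the same way, so it could be passed to regexp_extract with group index 1.

```python
import re


def build_pattern(keyword: str, n_words: int) -> str:
    """Build a regex whose group 1 holds the n_words words after `keyword`."""
    if n_words < 1:
        raise ValueError("n_words must be at least 1")
    # One word, then (n_words - 1) repetitions of "separator + word".
    words = r"\w+" + r"(?:\W+\w+)" + "{%d}" % (n_words - 1)
    # \b before the keyword and \W+ after it require a standalone keyword.
    return r"(?i)\b" + re.escape(keyword) + r"\W+(" + words + r")\b"


def extract_after(text: str, keyword: str = "b", n_words: int = 5) -> str:
    """Return the words after `keyword`, or "" when there is no match."""
    m = re.search(build_pattern(keyword, n_words), text)
    return m.group(1) if m else ""


pattern = build_pattern("b", 5)
print(pattern)
print(extract_after("a a a a b c d e f c d s"))
print(extract_after("foo b alpha beta gamma delta epsilon zeta"))
```

Unlike the single-character pattern from the first answer, this form handles words of any length, and it yields an empty string when fewer than `n_words` words follow the keyword.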