February 7, 2023 00:48:24

PySpark: extracting exactly 4 consecutive digits from a column and returning them in a new column

Question

I am very new to PySpark and have no idea how to do this.

I am trying to extract the year from a title column.

Some values in the title column are:

Under Ground2(1990)
Waterword(1995)
Incredible
Skate (1991) board
That girl 2002”

I am trying to get:

1990
1995
1991
2002

This is what I have tried:

import pyspark.sql.functions as F
from pyspark.sql.functions import split
from pyspark.sql.functions import regexp_replace

# Strip the literal parentheses; they must be escaped because the pattern is a regex.
movies_DF = movies_DF.withColumn('title', regexp_replace(movies_DF.title, r"\(", ""))
movies_DF = movies_DF.withColumn('title', regexp_replace(movies_DF.title, r"\)", ""))
# Take the trailing characters of the title as the year.
movies_DF = movies_DF.withColumn('yearOfRelease', F.expr('substring(title,-4)'))

My output column has:

1990

1995

board

2002”

dible

Answer 1

Score: 1

Use the regexp_extract function:

from pyspark.sql.functions import regexp_extract, col

df = df.withColumn('Year', regexp_extract(col('Title'), r'\((\d{4})\)$', 1))
df.show()

+-------------------+----+
|              Title|Year|
+-------------------+----+
|Under Ground2(1990)|1990|
|    Waterword(1995)|1995|
+-------------------+----+
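
The pattern above is anchored to a closing parenthesis at the end of the string, so it would not pick up titles from the question such as `Skate (1991) board` or `That girl 2002”`. A minimal sketch of a looser pattern follows, assuming that a "year" means any run of exactly four digits that is not part of a longer number. The names `YEAR_PATTERN`, `extract_year`, and `add_year_column` are hypothetical and not part of the original answer; `extract_year` uses Python's `re` module only so the pattern can be checked without a Spark session.

```python
import re

# Assumption: a "year" is a run of exactly four digits. The lookarounds
# reject digit runs of five or more, and the parentheses are optional.
YEAR_PATTERN = r"(?<!\d)(\d{4})(?!\d)"


def extract_year(title):
    """Plain-Python check of the pattern; returns '' when nothing matches."""
    match = re.search(YEAR_PATTERN, title)
    return match.group(1) if match else ""


def add_year_column(df, source="title", target="yearOfRelease"):
    """Hypothetical helper: applies the same pattern to a Spark DataFrame."""
    # Imported lazily so the pattern can be tried without PySpark installed.
    from pyspark.sql import functions as F

    return df.withColumn(target, F.regexp_extract(F.col(source), YEAR_PATTERN, 1))


# Sample titles taken from the question.
samples = [
    "Under Ground2(1990)",
    "Waterword(1995)",
    "Incredible",
    "Skate (1991) board",
    "That girl 2002”",
]
for sample in samples:
    print(repr(sample), "->", repr(extract_year(sample)))
```

With the question's DataFrame this could be called as `movies_DF = add_year_column(movies_DF)`. Rows without a match, such as `Incredible`, get an empty string, because `regexp_extract` returns an empty string when the pattern does not match.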

