PySpark: extract exactly 4 consecutive numeric digits from a column and return them in a new column

Question

I am very new to PySpark and have no idea how to do this.

I am trying to extract the year from a title column.

Some values in the title column are:

Under Ground2(1990)
Waterword(1995)
Incredible
Skate (1991) board
That girl 2002”

I am trying to get:

1990
1995
1991
2002

This is what I have tried:

import pyspark.sql.functions as F
from pyspark.sql.functions import split
from pyspark.sql.functions import regexp_replace

# Strip the parentheses, then take the last 4 characters of the title
movies_DF = movies_DF.withColumn('title', regexp_replace(movies_DF.title, r"\(", ""))
movies_DF = movies_DF.withColumn('title', regexp_replace(movies_DF.title, r"\)", ""))
movies_DF = movies_DF.withColumn('yearOfRelease', F.expr('substring(title, -4)'))

My output column contains:

1990

1995

board

2002”

dible

Answer 1

Score: 1

Use the regexp_extract function:

from pyspark.sql.functions import regexp_extract, col

df = df.withColumn('Year', regexp_extract(col('Title'), r'\((\d{4})\)$', 1))
df.show()

+-------------------+----+
|              Title|Year|
+-------------------+----+
|Under Ground2(1990)|1990|
|    Waterword(1995)|1995|
+-------------------+----+
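
The pattern above is anchored to a parenthesized year at the very end of the title, so rows such as "Skate (1991) board" or "That girl 2002”" would come back as empty strings. Here is a minimal sketch of one possible alternative, assuming that "exactly 4 consecutive digits" means a run of four digits with no digit directly before or after it. The names YEAR_PATTERN, add_year_column and extract_year are hypothetical and not part of the original answer. The plain-Python helper is only there to check the pattern against the sample titles without a Spark session.

```python
import re

# Hypothetical pattern (not from the original answer): exactly four digits,
# not preceded or followed by another digit.
YEAR_PATTERN = r"(?<!\d)(\d{4})(?!\d)"


def add_year_column(df, source_col="title", target_col="yearOfRelease"):
    """Return df with target_col holding the first standalone 4-digit run.

    regexp_extract yields an empty string when nothing matches.
    """
    # Imported here so the pattern check below runs without a Spark install.
    from pyspark.sql.functions import col, regexp_extract

    return df.withColumn(target_col, regexp_extract(col(source_col), YEAR_PATTERN, 1))


def extract_year(title):
    """Plain-Python check of the same pattern on a single string."""
    match = re.search(YEAR_PATTERN, title)
    return match.group(1) if match else ""


titles = ["Under Ground2(1990)", "Waterword(1995)", "Incredible",
          "Skate (1991) board", "That girl 2002”"]
years = [extract_year(t) for t in titles]
print(years)  # ['1990', '1995', '', '1991', '2002']
```

With a DataFrame like the one in the question, the call would look like movies_DF = add_year_column(movies_DF). Rows without a standalone 4-digit run, such as "Incredible", get an empty string rather than null, so they may need to be filtered or converted afterwards.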

huangapple
  • Posted on 2023-02-07 00:48:24
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75364235.html