如何在pyspark数组中获取第一个带有数字的字符串值。

huangapple go评论60阅读模式
英文:

How to get first string value with numbers in pyspark array

问题

I want to extract just the first numeric instance from the languages field as a date in another column of the pyspark dataframe.

Sample data

data = [
 ("James","Java_Scala_C++_20230510_2023051345"),
 ("Mindy", "Spark_Java_20211014_20211014255_C++"),
 ("Julia", "CSharp_20200115_VB")
]

from pyspark.sql.types import StringType, ArrayType,StructType,StructField
schema = StructType([ 
    StructField("name",StringType(),True), 
    StructField("languages",StringType(),True)
  ])

df = spark.createDataFrame(data=data,schema=schema)
df.display()

By using split on the column, I can split the field into an array with what I'm looking for. I can use to_date to convert the string to a date, but would like help selecting the first instance of the numeric field without hardcoding an index which wouldn't work since the number values are in different indexes. I tried regexp_extract which doesn't work with arrays.

from pyspark.sql.functions import *
df = df\
.withColumn('languages_split', split(col('languages'), '_'))
df.display()

Desired output two columns with the following values. String names and dates.

James: 20230510
Mindy: 20211014
Julia: 20200115
英文:

I want to extract just the first numeric instance from the languages field as a date in another column of the pyspark dataframe.

Sample data

data = [
 ("James","Java_Scala_C++_20230510_2023051345"),
 ("Mindy", "Spark_Java_20211014_20211014255_C++"),
 ("Julia", "CSharp_20200115_VB")
]

from pyspark.sql.types import StringType, ArrayType,StructType,StructField
schema = StructType([ 
    StructField("name",StringType(),True), 
    StructField("languages",StringType(),True)
  ])

df = spark.createDataFrame(data=data,schema=schema)
df.display()

By using split on the column, I can split the field into an array with what I'm looking for. I can use to_date to convert the string to a date, but would like help selecting the first instance of the numeric field without hardcoding an index which wouldn't work since the number values are in different indexes. I tried regexp_extract which doesn't work with arrays.

from pyspark.sql.functions import *
df = df\
.withColumn('languages_split', split(col('languages'), '_'))
df.display()

Desired output two columns with the following values. String names and dates.

James: 20230510
Mindy: 20211014
Julia: 20200115

答案1

得分: 1

尝试使用**regexp_extract**和第1捕获组。

_(\d{1,8})_ -> 捕获以_开头,包含8位数字,以_结尾的组。

示例:

data = [
 ("James","Java_Scala_C++_20230510_2023051345"),
 ("Mindy", "Spark_Java_20211014_20211014255_C++"),
 ("Julia", "CSharp_20200115_VB")
]

from pyspark.sql.types import StringType, ArrayType,StructType,StructField
schema = StructType([ 
    StructField("name",StringType(),True), 
    StructField("languages",StringType(),True)
  ])

df = spark.createDataFrame(data=data,schema=schema)

from pyspark.sql.functions import *
df = df\
.withColumn('languages_split', regexp_extract(col("languages"),"_(\d{1,8})_",1))
df.show(10,False)
#+-----+-----------------------------------+---------------+
#|name |languages                          |languages_split|
#+-----+-----------------------------------+---------------+
#|James|Java_Scala_C++_20230510_2023051345 |20230510       |
#|Mindy|Spark_Java_20211014_20211014255_C++|20211014       | 
#|Julia|CSharp_20200115_VB                 |20200115       |
#+-----+-----------------------------------+---------------+
英文:

Try with regexp_extract with the 1 capture group.

_(\d{1,8})_ -> capture the group starts with _ and get 8 digits and end with _.

Example:

data = [
 ("James","Java_Scala_C++_20230510_2023051345"),
 ("Mindy", "Spark_Java_20211014_20211014255_C++"),
 ("Julia", "CSharp_20200115_VB")
]

from pyspark.sql.types import StringType, ArrayType,StructType,StructField
schema = StructType([ 
    StructField("name",StringType(),True), 
    StructField("languages",StringType(),True)
  ])

df = spark.createDataFrame(data=data,schema=schema)

from pyspark.sql.functions import *
df = df\
.withColumn('languages_split', regexp_extract(col("languages"),"_(\d{1,8})_",1))
df.show(10,False)
#+-----+-----------------------------------+---------------+
#|name |languages                          |languages_split|
#+-----+-----------------------------------+---------------+
#|James|Java_Scala_C++_20230510_2023051345 |20230510       |
#|Mindy|Spark_Java_20211014_20211014255_C++|20211014       | 
#|Julia|CSharp_20200115_VB                 |20200115       |
#+-----+-----------------------------------+---------------+

答案2

得分: 1

如果您需要提取和解析日期可以使用以下代码
```python
import datetime as dt
from pyspark.sql.functions import *
from pyspark.sql.types import DateType

def parse_date(x):
    try:
        return dt.datetime.strptime(x, '%Y%m%d')
    except ValueError as err:
        return None

parse_date_udf = udf(lambda arr: next(parse_date(x) for x in arr if parse_date(x) is not None), DateType())

df = df.withColumn('date', parse_date_udf(split(col('languages'), '_')))

df.show(100, False)
+-----+-----------------------------------+----------+
|name |languages                          |date      |
+-----+-----------------------------------+----------+
|James|Java_Scala_C++_20230510_2023051345 |2023-05-10|
|Mindy|Spark_Java_20211014_20211014255_C++|2021-10-14|
|Julia|CSharp_20200115_VB                 |2020-01-15|
+-----+-----------------------------------+----------+

<details>
<summary>英文:</summary>

If you need to extract and parse the date you can do:
```python
import datetime as dt
from pyspark.sql.functions import *
from pyspark.sql.types import DateType

def parse_date(x):
    try:
        return dt.datetime.strptime(x, &#39;%Y%m%d&#39;)
    except ValueError as err:
        return None

parse_date_udf = udf(lambda arr: next(parse_date(x) for x in arr if parse_date(x) is not None), DateType())

df = df.withColumn(&#39;date&#39;, parse_date_udf(split(col(&#39;languages&#39;), &#39;_&#39;)))

df.show(100, False)
+-----+-----------------------------------+----------+
|name |languages                          |date      |
+-----+-----------------------------------+----------+
|James|Java_Scala_C++_20230510_2023051345 |2023-05-10|
|Mindy|Spark_Java_20211014_20211014255_C++|2021-10-14|
|Julia|CSharp_20200115_VB                 |2020-01-15|
+-----+-----------------------------------+----------+

答案3

得分: 0

以下是您要翻译的内容:

"for folks not wanting a regex solution, you can split the string and use filter higher order function.

see example

data_sdf. \
    withColumn('lang_split_arr', func.split('lang', '_')). \
    withColumn('lang_arr_nums_only', 
               func.filter('lang_split_arr', lambda x: x.cast('int').isNotNull())
               ). \
    withColumn('lang_dt_only', 
               func.to_date(func.col('lang_arr_nums_only').getItem(0), 'yyyyMMdd')
               ). \
    show(truncate=False)

# +-----+-----------------------------------+-----------------------------------------+----------------------+------------+
# |name |lang                               |lang_split_arr                           |lang_arr_nums_only    |lang_dt_only|
# +-----+-----------------------------------+-----------------------------------------+----------------------+------------+
# |James|Java_Scala_C++_20230510_2023051345 |[Java, Scala, C++, 20230510, 2023051345] |[20230510, 2023051345]|2023-05-10  |
# |Mindy|Spark_Java_20211014_20211014255_C++|[Spark, Java, 20211014, 20211014255, C++]|[20211014]            |2021-10-14  |
# |Julia|CSharp_20200115_VB                 |[CSharp, 20200115, VB]                   |[20200115]            |2020-01-15  |
# +-----+-----------------------------------+-----------------------------------------+----------------------+------------+
英文:

for folks not wanting a regex solution, you can split the string and use filter higher order function.

see example

data_sdf. \
    withColumn(&#39;lang_split_arr&#39;, func.split(&#39;lang&#39;, &#39;_&#39;)). \
    withColumn(&#39;lang_arr_nums_only&#39;, 
               func.filter(&#39;lang_split_arr&#39;, lambda x: x.cast(&#39;int&#39;).isNotNull())
               ). \
    withColumn(&#39;lang_dt_only&#39;, 
               func.to_date(func.col(&#39;lang_arr_nums_only&#39;).getItem(0), &#39;yyyyMMdd&#39;)
               ). \
    show(truncate=False)

# +-----+-----------------------------------+-----------------------------------------+----------------------+------------+
# |name |lang                               |lang_split_arr                           |lang_arr_nums_only    |lang_dt_only|
# +-----+-----------------------------------+-----------------------------------------+----------------------+------------+
# |James|Java_Scala_C++_20230510_2023051345 |[Java, Scala, C++, 20230510, 2023051345] |[20230510, 2023051345]|2023-05-10  |
# |Mindy|Spark_Java_20211014_20211014255_C++|[Spark, Java, 20211014, 20211014255, C++]|[20211014]            |2021-10-14  |
# |Julia|CSharp_20200115_VB                 |[CSharp, 20200115, VB]                   |[20200115]            |2020-01-15  |
# +-----+-----------------------------------+-----------------------------------------+----------------------+------------+

huangapple
  • 本文由 发表于 2023年6月1日 05:09:45
  • 转载请务必保留本文链接:https://go.coder-hub.com/76377324.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定