How to parse a list (represented as a string) in PySpark?

Question


Using PySpark, I am loading a Parquet file that has a very simple structure:

root
 |-- mydata: string (nullable = true)
 |-- timestamps: string (nullable = true)
 |-- mylist: string (nullable = true)

The problem is that mylist was originally a Python list that has been stored as a string, so the first element looks like this:

['first', 'second', 'third']

How can I turn this column into a proper list column (a struct?) so that I can access its nth element and explode it?

Thanks!
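
To make the problem concrete, here is a small plain-Python sketch (no Spark involved) of what such a value is. It assumes the string is the repr of a Python list, as in the sample above; the variable names are made up for the illustration. The single quotes mean the value is a Python literal rather than strict JSON, which matters when picking a parser.

```python
import ast
import json

# The sample value from the question: the repr of a Python list, stored as text.
raw = "['first', 'second', 'third']"

# ast.literal_eval safely evaluates Python literals (lists, strings, numbers, ...).
parsed = ast.literal_eval(raw)

# Strict JSON requires double-quoted strings, so the standard json module rejects it.
try:
    json.loads(raw)
    is_strict_json = True
except json.JSONDecodeError:
    is_strict_json = False

print(parsed[1])        # second element of the recovered list
print(is_strict_json)   # whether json.loads accepted the raw string
```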

Answer 1

Score: 4


You can use from_json and pass the target type as the parsing schema.

Here's an example:

from pyspark.sql import functions as func

# Parse the string column into array<string> (assumes an active SparkSession named `spark`).
spark.sparkContext.parallelize([(1, '["first", "second", "third"]')]).toDF(['id', 'mylist']). \
    withColumn('parsed_mylist', func.from_json('mylist', 'array<string>')). \
    show(truncate=False)

# +---+----------------------------+----------------------+
# |id |mylist                      |parsed_mylist         |
# +---+----------------------------+----------------------+
# |1  |["first", "second", "third"]|[first, second, third]|
# +---+----------------------------+----------------------+

# Schema of the resulting DataFrame (as printed by printSchema()):
# root
#  |-- id: long (nullable = true)
#  |-- mylist: string (nullable = true)
#  |-- parsed_mylist: array (nullable = true)
#  |    |-- element: string (containsNull = true)

huangapple
  • Posted on March 15, 2023, 19:31:26
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75744112.html