How to parse a list (represented as a string) in PySpark?
Question
Using PySpark, I am loading a Parquet file that has a very simple structure:
root
|-- mydata: string (nullable = true)
|-- timestamps: string (nullable = true)
|-- mylist: string (nullable = true)
The problem is that mylist was originally a Python list that has been serialized as a string, so the first value looks like:
['first', 'second', 'third']
How can I represent this column as a proper list-column (a struct?) so that I can access its nth element and explode it?
Thanks!
Answer 1
Score: 4
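You can use from_json and pass the target schema as the parsing format, here 'array<string>'.
Here's an example (it assumes an active SparkSession named spark):
from pyspark.sql import functions as func

df = spark.sparkContext.parallelize([(1, '["first", "second", "third"]')]).toDF(['id', 'mylist']). \
    withColumn('parsed_mylist', func.from_json('mylist', 'array<string>'))
df.show(truncate=False)
df.printSchema()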
# +---+----------------------------+----------------------+
# |id |mylist |parsed_mylist |
# +---+----------------------------+----------------------+
# |1 |["first", "second", "third"]|[first, second, third]|
# +---+----------------------------+----------------------+
# root
# |-- id: long (nullable = true)
# |-- mylist: string (nullable = true)
# |-- parsed_mylist: array (nullable = true)
# | |-- element: string (containsNull = true)
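The question also asks how to reach the nth element and explode the list, and its sample value uses Python-style single quotes rather than JSON double quotes. As a rough sketch only, and not part of the original answer, the string can also be parsed with Python's ast.literal_eval inside a UDF, after which getItem and explode work on the resulting array column. The names parse_list_literal, add_parsed_columns, parsed, first_item and item are hypothetical, and a UDF will generally be slower than the built-in from_json shown above.

```python
import ast


def parse_list_literal(text):
    """Parse a string such as "['first', 'second']" into a list of strings.

    Returns None for null input or for anything that is not a list literal.
    """
    if text is None:
        return None
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return None
    if not isinstance(value, list):
        return None
    return [str(item) for item in value]


def add_parsed_columns(df, column="mylist"):
    """Hypothetical helper: add an array column, its first element, and one row per element."""
    # Imported lazily so parse_list_literal stays usable without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    parse_udf = F.udf(parse_list_literal, ArrayType(StringType()))
    parsed = df.withColumn("parsed", parse_udf(F.col(column)))
    return (
        parsed
        .withColumn("first_item", F.col("parsed").getItem(0))  # array index is 0-based
        .withColumn("item", F.explode("parsed"))  # rows with a null or empty array are dropped
    )
```

With the DataFrame from the question, add_parsed_columns(df) would keep the original columns and add parsed, first_item and item. If some rows must survive even when the list is null or empty, explode_outer is the variant to look at.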