How to convert single column to multiple columns using spark dataframe
Question
I have a dataframe like the one below, and I need to use its first row as the header for the corresponding values:
+------------------------+
|cleaned |
+------------------------+
|id,name,salary,dept |
|1,John,10000,IT |
|2,Mindhack Diva,20000,IT|
|3,Michel,30000,IT |
|4,Ryan,40000,IT |
|5,Sahoo,10000,IT |
+------------------------+
And I need the output to look like the dataframe below:
+---+-------------+------+----+
| id| name|salary|dept|
+---+-------------+------+----+
| 1| John| 10000| IT|
| 2|Mindhack Diva| 20000| IT|
| 3| Michel| 30000| IT|
| 4| Ryan| 40000| IT|
| 5| Sahoo| 10000| IT|
+---+-------------+------+----+
Thanks!
Answer 1
Score: 0
I'd advise cleaning the input before reading it so you don't have to deal with the first row at all. In particular, if you read a CSV, the header and delimiter options let you achieve exactly what you're asking for without any extra code. That said, if you still want to handle your specific case, it's possible.
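As a side note, here is a minimal sketch of that CSV route; the path "data/employees.csv" is just a placeholder for wherever your raw file lives:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Sketch only: read the raw file as a CSV so Spark builds the columns for you.
val df = spark.read
  .option("header", "true")      // use the first line as column names
  .option("delimiter", ",")      // field separator (comma is also the default)
  .option("inferSchema", "true") // optional: infer numeric types such as salary
  .csv("data/employees.csv")     // hypothetical path, adjust to your input

df.show() // yields the id / name / salary / dept columns directly, no splitting needed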
It's hard to determine the first row of a dataset because the processing is distributed and your data is partitioned, so you can't really rely on the first() method unless everything sits in a single partition.
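If you do know the data fits in a single partition (for example, one small file), a sketch of that idea might look like the following; it only works under that single-partition assumption:

import org.apache.spark.sql.functions.col

// Sketch only: first() returns the header line only if the data is in one partition.
val header = df.first().getString(0)             // "id,name,salary,dept"
val rows   = df.where(col("cleaned") =!= header) // everything except the header row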
If you're sure the "header" row is always the same, you can simply filter it out first and then split your data into columns, something like this using the Spark SQL split function:
import org.apache.spark.sql.functions.{col, split}

df.where(col("cleaned") =!= "id,name,salary,dept") // filter out the "header" row first
  .select(
    split(col("cleaned"), ",").getItem(0).as("id"),     // first field  -> id
    split(col("cleaned"), ",").getItem(1).as("name"),   // second field -> name
    split(col("cleaned"), ",").getItem(2).as("salary"), // third field  -> salary
    split(col("cleaned"), ",").getItem(3).as("dept"))   // fourth field -> dept
Of course, you can do it in two steps, and / or more intelligently but this is the principle.
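For instance, a sketch of the two-step variant: split once into an intermediate array column (named "parts" here purely for illustration), then select the items out of it so the split is not repeated four times:

import org.apache.spark.sql.functions.{col, split}

// Step 1: drop the header row and split each line once into an array column.
val withParts = df
  .where(col("cleaned") =!= "id,name,salary,dept")
  .withColumn("parts", split(col("cleaned"), ","))

// Step 2: pull the individual fields out of the array.
val result = withParts.select(
  col("parts").getItem(0).as("id"),
  col("parts").getItem(1).as("name"),
  col("parts").getItem(2).as("salary"),
  col("parts").getItem(3).as("dept"))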
See more: Spark SQL split function documentation: https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.functions.split.html
Comments