如何使用Spark DataFrame将单列转换为多列

huangapple go评论72阅读模式
英文:

How to convert single column to multiple columns using spark dataframe

问题

+---+-------------+------+----+
| id|         name|salary|dept|
+---+-------------+------+----+
|  1|         John| 10000|  IT|
|  2|Mindhack Diva| 20000|  IT|
|  3|       Michel| 30000|  IT|
|  4|         Ryan| 40000|  IT|
|  5|        Sahoo| 10000|  IT|
+---+-------------+------+----+
英文:

I have a dataframe like below, I need to show first row as header with corresponding values:

+------------------------+
|cleaned                 |
+------------------------+
|id,name,salary,dept     |
|1,John,10000,IT         |
|2,Mindhack Diva,20000,IT|
|3,Michel,30000,IT       |
|4,Ryan,40000,IT         |
|5,Sahoo,10000,IT        |
+------------------------+

And I need output like the dataframe below:

+---+-------------+------+----+
| id|         name|salary|dept|
+---+-------------+------+----+
|  1|         John| 10000|  IT|
|  2|Mindhack Diva| 20000|  IT|
|  3|       Michel| 30000|  IT|
|  4|         Ryan| 40000|  IT|
|  5|        Sahoo| 10000|  IT|
+---+-------------+------+----+

Thanks!

答案1

得分: 0

I'd advice to clean the input before reading so you don't bother with the first row, especially if you read a csv, there are header and delimiter options to help you achieve exactly what you're asking without extra code. Now if you still want to handle your specific case it's possible.

It's hard to determine the first row of a dataset as the processing is distributed and your data partitioned... So you can't really rely on first() method unless you're in one partition.

If you're sure that the "header" is always the same you could just filter it out first, then split your data into columns, so something like that using Spark SQL split function:

df.where(col("cleaned") !== "id,name,salary,dept") // filter the "header" row first
  .select(split(col("cleaned"), ",").getItem(0).as("id"), \
          split(col("cleaned"), ",").getItem(1).as("name"), \
          split(col("cleaned"), ",").getItem(2).as("salary"), \
          split(col("cleaned"), ",").getItem(3).as("dept"))

Of course, you can do it in two steps, and / or more intelligently but this is the principle.

See more: Spark SQL split function documentation

英文:

I'd advice to clean the input before reading so you don't bother with the first row, especially if you read a csv, there are header and delimiter options to help you achieve exactly what you're asking without extra code. Now if you still want to handle your specific case it's possible.

It's hard to determine the first row of a dataset as the processing is distributed and your data partitioned... So you can't really rely on first() method unless you're in one partition.

If you're sure that the "header" is always the same you could just filter it out first, then split your data into columns, so something like that using Spark SQL split function:

df.where(col("cleaned") !== "id,name,salary,dept") // filter the "header" row first
  .select(split(col("cleaned"),",").getItem(0).as("id"), \ 
          split(col("cleaned"),",").getItem(1).as("name"), \     
          split(col("cleaned"),",").getItem(2).as("salary"), \ 
          split(col("cleaned"),",").getItem(3).as("dept"))

Of course, you can do it in two steps, and / or more intelligently but this is the principle.

See more: https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.functions.split.html

huangapple
  • 本文由 发表于 2023年7月6日 17:39:53
  • 转载请务必保留本文链接:https://go.coder-hub.com/76627482.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定