How to convert a single column to multiple columns using Spark DataFrame


Question


I have a dataframe like the one below, and I need to use its first row as the header, with the remaining rows as the corresponding values:

+------------------------+
|cleaned                 |
+------------------------+
|id,name,salary,dept     |
|1,John,10000,IT         |
|2,Mindhack Diva,20000,IT|
|3,Michel,30000,IT       |
|4,Ryan,40000,IT         |
|5,Sahoo,10000,IT        |
+------------------------+

And I need the output to look like the dataframe below:

+---+-------------+------+----+
| id|         name|salary|dept|
+---+-------------+------+----+
|  1|         John| 10000|  IT|
|  2|Mindhack Diva| 20000|  IT|
|  3|       Michel| 30000|  IT|
|  4|         Ryan| 40000|  IT|
|  5|        Sahoo| 10000|  IT|
+---+-------------+------+----+

Thanks!

Answer 1

Score: 0

I'd advise cleaning the input before reading it, so you don't have to bother with the first row at all. In particular, if you're reading a CSV, the reader's header and delimiter options achieve exactly what you're asking for without any extra code. That said, if you still want to handle your specific case, it's possible.
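
For example, here is a minimal sketch of that reader-based route, assuming the raw data lives at a hypothetical path data/employees.csv:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input path; the CSV reader handles the header row
# and the delimiter itself, so no post-processing is needed.
df = (spark.read
           .option("header", True)     # treat the first line as column names
           .option("delimiter", ",")   # "," is the default, shown for clarity
           .csv("data/employees.csv"))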

It's hard to pin down the first row of a dataset, because processing is distributed and your data is partitioned, so you can't really rely on the first() method unless everything sits in a single partition.

If you're sure the "header" row is always the same, you can just filter it out first and then split your data into columns, along these lines using the Spark SQL split function:

from pyspark.sql.functions import col, split

# filter the "header" row first, then split the remaining rows into columns
result = df.where(col("cleaned") != "id,name,salary,dept") \
           .select(split(col("cleaned"), ",").getItem(0).alias("id"),
                   split(col("cleaned"), ",").getItem(1).alias("name"),
                   split(col("cleaned"), ",").getItem(2).alias("salary"),
                   split(col("cleaned"), ",").getItem(3).alias("dept"))

Of course, you can do it in two steps and/or more elegantly, but this is the principle; a two-step variant is sketched below.
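
For instance, a minimal two-step sketch under the same assumptions: split each line once into an intermediate array column, then pick the fields out of that array, casting the numeric ones on the way.

from pyspark.sql.functions import col, split

# Step 1: drop the "header" row and split each line once into an array column.
parts = df.where(col("cleaned") != "id,name,salary,dept") \
          .withColumn("parts", split(col("cleaned"), ","))

# Step 2: select the fields out of the array; the cast to int is an
# optional refinement, since split always yields strings.
result = parts.select(col("parts").getItem(0).cast("int").alias("id"),
                      col("parts").getItem(1).alias("name"),
                      col("parts").getItem(2).cast("int").alias("salary"),
                      col("parts").getItem(3).alias("dept"))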

See more: the Spark SQL split function documentation at https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.functions.split.html

