June 22, 2023 13:20:51

How can I unpivot two sets of columns in Spark?

Question

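I'm looking to unpivot two sets of columns, so that a table with the columns date, sales_day, sales_night, customers_day and customers_night is transformed into one with the columns date, timeOfDay, sales and customers.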

I have tried using the stack function to unpivot the two sets of columns, but I can only unpivot one set. Whenever I use two stack functions, an exception is raised saying I can only use one generator function:

unpivotedDf = df.selectExpr(
    "date",
    "stack(2, 'day', sales_day, 'night', sales_night) as (timeOfDay, sales)",
    "stack(2, 'day', customers_day, 'night', customers_night) as (timeOfDay, customers)"
)

I can also unpivot both sets of columns separately and then join the separate tables on date and on timeOfDay, but this is really inefficient for large datasets. Is there an alternative?

Answer 1

Score: 1

One way is to unpivot all of the value columns into key/value rows, and then pivot the required columns back out.

Here's an example:

from pyspark.sql import functions as func

# data_sdf is the answerer's sample DataFrame (customer columns are named cust_day / cust_night here)
(
    data_sdf
    # unpivot all four value columns into (key, val) rows
    .selectExpr(
        'date',
        'stack(4, '
        '"sales_day", sales_day, '
        '"sales_night", sales_night, '
        '"cust_day", cust_day, '
        '"cust_night", cust_night'
        ') as (key, val)'
    )
    # split e.g. "sales_day" into attr = "sales" and time_of_day = "day"
    .withColumn('key_split', func.split('key', '_'))
    .selectExpr('date', 'key_split[0] as attr', 'key_split[1] as time_of_day', 'val')
    # pivot the attribute names back into columns, one row per (date, time_of_day)
    .groupBy('date', 'time_of_day')
    .pivot('attr', values=['cust', 'sales'])
    .agg(func.first('val'))
    .orderBy('date', 'time_of_day')
    .show()
)
# +-------+-----------+----+-----+
# |   date|time_of_day|cust|sales|
# +-------+-----------+----+-----+
# |june 15|        day|  11|    1|
# |june 15|      night|  12|    2|
# |june 16|        day|  13|    3|
# |june 16|      night|  14|    4|
# |june 17|        day|  15|    5|
# |june 17|      night|  15|    5|
# +-------+-----------+----+-----+
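
To make the stages of this approach easier to follow, here is a minimal plain-Python sketch (no Spark involved) that applies the same unpivot, key split and pivot steps to a list of dicts. The function name `unpivot_then_pivot` and the single sample row are illustrative assumptions, not part of the answer.

```python
from collections import defaultdict

def unpivot_then_pivot(rows, id_col="date"):
    """Plain-Python analogue of stack -> split -> groupBy/pivot/first."""
    # Stage 1: unpivot every non-id column into (id, key, val) records.
    long_rows = [
        (row[id_col], key, val)
        for row in rows
        for key, val in row.items()
        if key != id_col
    ]
    # Stage 2: split "attr_time" keys and regroup by (id, time_of_day).
    grouped = defaultdict(dict)
    for ident, key, val in long_rows:
        attr, time_of_day = key.split("_", 1)
        # setdefault keeps the first value seen for each attribute
        grouped[(ident, time_of_day)].setdefault(attr, val)
    # Stage 3: emit one wide record per (id, time_of_day), in sorted order.
    return [
        {id_col: ident, "time_of_day": tod, **attrs}
        for (ident, tod), attrs in sorted(grouped.items())
    ]

# Hypothetical sample row using the answer's column names
sample = [{"date": "june 15", "sales_day": 1, "sales_night": 2,
           "cust_day": 11, "cust_night": 12}]
wide = unpivot_then_pivot(sample)
```

The intermediate `long_rows` list corresponds to the (key, val) rows produced by stack, which is why this approach temporarily multiplies the row count by the number of value columns before the pivot shrinks it again.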

Answer 2

Score: 1

Let's create a map from time of day to the corresponding columns. Then we can explode the map and select the required columns.

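from itertools import chain
from pyspark.sql import functions as F

# Build alternating key/value arguments for create_map:
# 'day' -> [sales_day, customers_day], 'night' -> [sales_night, customers_night]
map_args = chain(*[(F.lit(t), F.array(f'sales_{t}', f'customers_{t}')) for t in ('day', 'night')])
result = (
    df
    # explode turns each map entry into its own row with `key` and `value` columns
    .select('date', F.explode(F.create_map(*map_args)))
    .selectExpr('date', 'key as time_of_day', 'value[0] as sales', 'value[1] as customers')
)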

Result

+-------+-----------+-----+---------+
|   date|time_of_day|sales|customers|
+-------+-----------+-----+---------+
|June 15|        day|    1|       11|
|June 15|      night|    2|       12|
|June 16|        day|    3|       13|
|June 16|      night|    4|       14|
|June 17|        day|    5|       15|
|June 17|      night|    6|       16|
+-------+-----------+-----+---------+
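
Neither answer uses it, but a related option worth noting is that a single stack call can emit more than one value column per row, which avoids the second generator that caused the exception in the question. The sketch below only builds the SQL expression string in plain Python. The helper name `build_stack_expr` and its parameters are hypothetical, and it assumes the source columns follow the `<measure>_<time>` naming used in the question and that the columns stacked into the same output column share a type.

```python
def build_stack_expr(measures, times, label="timeOfDay"):
    """Build one stack() expression that emits a row per time with all measures."""
    row_exprs = []
    for t in times:
        # Each output row: the time label followed by that time's measure columns.
        cols = ", ".join(f"{m}_{t}" for m in measures)
        row_exprs.append(f"'{t}', {cols}")
    aliases = ", ".join([label, *measures])
    return f"stack({len(times)}, {', '.join(row_exprs)}) as ({aliases})"

expr = build_stack_expr(["sales", "customers"], ["day", "night"])
# Assumed usage with the question's DataFrame: df.selectExpr("date", expr)
```

For the question's columns this produces `stack(2, 'day', sales_day, customers_day, 'night', sales_night, customers_night) as (timeOfDay, sales, customers)`, so each input row yields two output rows without a join or a pivot.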
