Calculating cumulative sum over non-unique list elements in PySpark

Question

I have a PySpark dataframe with a column containing lists. The list items might overlap across rows. I need the cumulative sum of unique list elements down through the rows, ordered by the 'orderCol' column. In my application there might be millions of rows and hundreds of items in each list. I can't seem to wrap my brain around how to do this in PySpark so that it scales, and I would be grateful for any ideas, big or small, on how to solve it.

I have posted the input and desired output below to give an idea of what I'm trying to achieve.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("myApp") \
    .getOrCreate()

data = [{"node": 'r1', "items": ['a','b','c','d'], "orderCol": 1},
        {"node": 'r2', "items": ['e','f','g','a'], "orderCol": 2},
        {"node": 'r3', "items": ['h','i','g','b'], "orderCol": 3},
        {"node": 'r4', "items": ['j','i','f','c'], "orderCol": 4},
        ]

df = spark.createDataFrame(data)
df.show()

data_out = [{"node": 'r1', "items": ['a','b','c','d'], "orderCol": 1, "cumulative_item_count": 4},
            {"node": 'r2', "items": ['e','f','g','a'], "orderCol": 2, "cumulative_item_count": 7},
            {"node": 'r3', "items": ['h','i','g','b'], "orderCol": 3, "cumulative_item_count": 9},
            {"node": 'r4', "items": ['j','i','f','c'], "orderCol": 4, "cumulative_item_count": 10},
            ]

df_out = spark.createDataFrame(data_out)
df_out.show()
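
For reference, here is a minimal plain-Python sketch of the logic I'm after (single-machine only, just to pin down the expected numbers; the seen set is purely illustrative):

# Running count of distinct items, visiting rows in orderCol order.
seen = set()
for row in sorted(data, key=lambda r: r["orderCol"]):
    seen.update(row["items"])
    print(row["node"], len(seen))  # r1 4, r2 7, r3 9, r4 10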

Answer 1

Score: 2

Try a window function running from unboundedPreceding to currentRow.

Then flatten the nested array.

Finally, use the array_distinct and size functions to count the distinct elements in the flattened array.

Example:

from pyspark.sql import Window
from pyspark.sql.functions import array_distinct, col, collect_list, flatten, lit, size

data = [{"node": 'r1', "items": ['a','b','c','d'], "orderCol": 1},
        {"node": 'r2', "items": ['e','f','g','a'], "orderCol": 2},
        {"node": 'r3', "items": ['h','i','g','b'], "orderCol": 3},
        {"node": 'r4', "items": ['j','i','f','c'], "orderCol": 4},
        ]

# Running frame over every row up to and including the current one, ordered by orderCol.
w = Window.partitionBy(lit(1)).orderBy("orderCol").rowsBetween(Window.unboundedPreceding, Window.currentRow)

# Reuses the spark session created in the question.
df = spark.createDataFrame(data).\
  withColumn("temp_col", collect_list(col("items")).over(w)).\
  withColumn("cumulative_item_count", size(array_distinct(flatten(col("temp_col")))))
df.show(20, False)

#+------------+----+--------+--------------------------------------------------------+---------------------+
#|items       |node|orderCol|temp_col                                                |cumulative_item_count|
#+------------+----+--------+--------------------------------------------------------+---------------------+
#|[a, b, c, d]|r1  |1       |[[a, b, c, d]]                                          |4                    |
#|[e, f, g, a]|r2  |2       |[[a, b, c, d], [e, f, g, a]]                            |7                    |
#|[h, i, g, b]|r3  |3       |[[a, b, c, d], [e, f, g, a], [h, i, g, b]]              |9                    |
#|[j, i, f, c]|r4  |4       |[[a, b, c, d], [e, f, g, a], [h, i, g, b], [j, i, f, c]]|10                   |
#+------------+----+--------+--------------------------------------------------------+---------------------+
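
One caveat on scale: partitionBy(lit(1)) moves every row into a single partition, and collect_list carries the full accumulated array on every row, so memory use grows with the number of rows times the list size. A possible alternative sketch (my own, untested at scale; the names w_item, is_new and new_items are made up for illustration) explodes the lists, flags only the first occurrence of each item, and takes a running sum of those flags:

from pyspark.sql import Window
from pyspark.sql.functions import col, explode, row_number, sum as sum_

# One record per (row, item) pair; df is the input dataframe with node, items, orderCol.
exploded = df.select("node", "orderCol", explode("items").alias("item"))

# Flag the first time each item appears, ordered by orderCol.
w_item = Window.partitionBy("item").orderBy("orderCol")
flagged = exploded.withColumn("rn", row_number().over(w_item)).\
  withColumn("is_new", (col("rn") == 1).cast("int"))

# Count the new items contributed by each row, then take a running total down the rows.
per_row = flagged.groupBy("node", "orderCol").agg(sum_("is_new").alias("new_items"))
w_cum = Window.orderBy("orderCol").rowsBetween(Window.unboundedPreceding, Window.currentRow)
per_row.withColumn("cumulative_item_count", sum_("new_items").over(w_cum)).\
  orderBy("orderCol").show()

The final running sum still needs a global ordering (Spark will warn about the single-partition window), but each row now carries only a small integer instead of the entire history of items.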
