How to assess previous row values for the current row iteratively in PySpark

Question


I am trying to take the previous row's value of column B, reduce it by 1, and assign it to the current row, repeating this until I reach 0 or hit a non-null row.

col_a  col_b
1      null
2      3
3      null
4      null
5      null
6      null
7      6
8      null
9      null

Here's what I'm hoping to get.

col_a  col_b
1      null
2      3
3      2
4      1
5      0
6      null
7      6
8      5
9      4

My code so far

from pyspark.sql import functions as F
from pyspark.sql.functions import when
from pyspark.sql.window import Window

data = [(1, None), (2, 3), (3, None), (4, None), (5, None), (6, None), (7, 6), (8, None), (9, None)]
df = spark.createDataFrame(data, ["col_a", "col_b"])

window_spec = Window.orderBy("col_a")

df = df.withColumn('col_b',
                   when(
                       (F.col('col_b').isNull()) & (F.lag(F.col('col_b')).over(window_spec) != 0),
                       (F.lag(F.col('col_b')).over(window_spec) - 1)
                   ).otherwise(F.col('col_b'))
                   )

df.show()

My code only reads the previous row's original value, not the values that were just assigned during the same transformation. How do I get around this?

I know I can collect this column, process it, and add it back to the df, but that is currently too computationally expensive for me since the dataset is very large.

Answer 1

Score: 1

See the following example, which can help. Note that I've created a separate column for each calculation, but you can merge a few of them to make it more concise.

from pyspark.sql.window import Window as wd
import pyspark.sql.functions as func
import sys

wSpec = wd.partitionBy('id').orderBy('c1')

# data_sdf is the question's input DataFrame; c1 corresponds to col_a and c2 to col_b
data_sdf. \
    withColumn('id', func.lit('dummy_id')). \
    withColumn('reptval',
               func.last('c2', ignorenulls=True).over(wSpec.rowsBetween(-sys.maxsize, 0))
               ). \
    withColumn('changeflag',
               func.coalesce(func.col('reptval') != func.lag('reptval').over(wSpec), func.lit(True)).cast('int')
               ). \
    withColumn('cflag_csum',
               func.sum('changeflag').over(wSpec.rowsBetween(-sys.maxsize, 0))
               ). \
    withColumn('rn',
               func.row_number().over(wd.partitionBy('id', 'cflag_csum').orderBy('c1')) - 1
               ). \
    withColumn('c_interim', func.col('reptval') - func.col('rn')). \
    withColumn('c_fnl',
               func.when(func.col('c_interim') < 0, func.lit(None)).
               otherwise(func.col('c_interim'))
               ). \
    show()

# +---+----+--------+-------+----------+----------+---+---------+-----+
# | c1|  c2|      id|reptval|changeflag|cflag_csum| rn|c_interim|c_fnl|
# +---+----+--------+-------+----------+----------+---+---------+-----+
# |  1|null|dummy_id|   null|         1|         1|  0|     null| null|
# |  2|   3|dummy_id|      3|         1|         2|  0|        3|    3|
# |  3|null|dummy_id|      3|         0|         2|  1|        2|    2|
# |  4|null|dummy_id|      3|         0|         2|  2|        1|    1|
# |  5|null|dummy_id|      3|         0|         2|  3|        0|    0|
# |  6|null|dummy_id|      3|         0|         2|  4|       -1| null|
# |  7|   6|dummy_id|      6|         1|         3|  0|        6|    6|
# |  8|null|dummy_id|      6|         0|         3|  1|        5|    5|
# |  9|null|dummy_id|      6|         0|         3|  2|        4|    4|
# +---+----+--------+-------+----------+----------+---+---------+-----+

  • The reptval field carries the last non-null column value forward until the next non-null value appears.
  • changeflag flags the rows where the value changes with respect to the previous row.
  • The cflag_csum field is the cumulative sum of the change flag; this is done to create partitions.
  • rn is the row number (starting at 0) within each cflag_csum partition. Please add your actual partition columns within the partitionBy as well.
  • Then, all that's needed is to subtract the generated row number from the repeated value (a condensed sketch using the question's column names follows below).
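
For reference, here is a condensed sketch of the same idea written against the question's own column names (col_a, col_b). It assumes the df built in the question and uses a single implicit partition (so Spark will warn about the unpartitioned window); the helper column names filled, chg, grp, and rn are illustrative.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy("col_a")

result = (
    df
    # carry the last non-null col_b value forward
    .withColumn("filled",
                F.last("col_b", ignorenulls=True)
                 .over(w.rowsBetween(Window.unboundedPreceding, 0)))
    # flag rows where the carried value changes (first row counts as a change)
    .withColumn("chg",
                F.coalesce((F.col("filled") != F.lag("filled").over(w)).cast("int"),
                           F.lit(1)))
    # cumulative sum of the change flag -> grouping key
    .withColumn("grp",
                F.sum("chg").over(w.rowsBetween(Window.unboundedPreceding, 0)))
    # 0-based position within each group
    .withColumn("rn",
                F.row_number().over(Window.partitionBy("grp").orderBy("col_a")) - 1)
    # subtract the position; anything that would drop below 0 becomes null
    .withColumn("col_b",
                F.when(F.col("filled") - F.col("rn") >= 0, F.col("filled") - F.col("rn")))
    .drop("filled", "chg", "grp", "rn")
)

result.show()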

Answer 2

Score: 0

You can use the function row_number() and subtract it (minus 1, because it starts at 1) instead of subtracting 1 from the previous row. However, for this to work you need to partition your window. To get the grouping, add a column with a cumulative sum that increases whenever there's a non-null value in col_b (you can find other Stack Overflow questions that show how to do a cumulative sum). Hope this helps.
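
A minimal sketch of that approach, assuming the question's df (col_a, col_b) is already defined; the helper column names grp and first_val are illustrative, and the final when() step (which nulls out values below 0, per the question's requirement) is added here even though it isn't spelled out above.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy("col_a")

result = (
    df
    # cumulative count of non-null col_b values -> grouping key
    .withColumn("grp",
                F.sum(F.col("col_b").isNotNull().cast("int"))
                 .over(w.rowsBetween(Window.unboundedPreceding, 0)))
    # the non-null value that started each group
    .withColumn("first_val",
                F.first("col_b", ignorenulls=True)
                 .over(Window.partitionBy("grp").orderBy("col_a")))
    # row_number() starts at 1, so subtract an extra 1
    .withColumn("col_b",
                F.col("first_val")
                - (F.row_number().over(Window.partitionBy("grp").orderBy("col_a")) - 1))
    # null out anything that would go below 0 (the question stops at 0)
    .withColumn("col_b", F.when(F.col("col_b") >= 0, F.col("col_b")))
    .drop("grp", "first_val")
)

result.show()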
