Group consecutive rows using Spark Scala, with rows repeating

Question

+----------+----------+-----------+---------------------------+---------------------------+
| space_id | template | frequency | day                       | timestamp                 |
+----------+----------+-----------+---------------------------+---------------------------+
| 321d8    | temp     | 15        | 2023-02-22T00:00:00+05:30 | 2023-02-22T09:00:00+05:30 |
| 321d8    | temp     | 15        | 2023-02-22T00:00:00+05:30 | 2023-02-22T09:15:00+05:30 |
| 321d8    | temp     | 15        | 2023-02-22T00:00:00+05:30 | 2023-02-22T09:30:00+05:30 |
| 321d8    | temp     | 15        | 2023-02-22T00:00:00+05:30 | 2023-02-22T09:15:00+05:30 |
| 321d8    | temp     | 15        | 2023-02-22T00:00:00+05:30 | 2023-02-22T09:30:00+05:30 |
| 321d8    | temp     | 15        | 2023-02-22T00:00:00+05:30 | 2023-02-22T09:45:00+05:30 |
| 321d8    | temp     | 15        | 2023-02-22T00:00:00+05:30 | 2023-02-22T09:30:00+05:30 |
| 321d8    | temp     | 15        | 2023-02-22T00:00:00+05:30 | 2023-02-22T09:45:00+05:30 |
| 321d8    | temp     | 15        | 2023-02-22T00:00:00+05:30 | 2023-02-22T10:00:00+05:30 |
+----------+----------+-----------+---------------------------+---------------------------+
+----------+----------+-----------+---------------------------+---------------------------+
| space_id | template | frequency | day                       | timestamp                 |
+----------+----------+-----------+---------------------------+---------------------------+
| 321d8    | temp     | 15        | 2023-02-22T00:00:00+05:30 | 2023-02-22T09:00:00+05:30 |
| 321d8    | temp     | 15        | 2023-02-22T00:00:00+05:30 | 2023-02-22T09:15:00+05:30 |
| 321d8    | temp     | 15        | 2023-02-22T00:00:00+05:30 | 2023-02-22T09:30:00+05:30 |
| 321d8    | temp     | 15        | 2023-02-22T00:00:00+05:30 | 2023-02-22T09:45:00+05:30 |
| 321d8    | temp     | 15        | 2023-02-22T00:00:00+05:30 | 2023-02-22T10:00:00+05:30 |
+----------+----------+-----------+---------------------------+---------------------------+

Here I have a unique ID column, space_id; a template column (which may be temperature, humidity, or CO2); a frequency column that says how often I receive data from a sensor; a day column; and finally a timestamp column.
I need to group the data into 30-minute batches according to the timestamp.

I am able to build 30-minute batches such as 09:00:00, 09:15:00, 09:30:00 in one batch, then 09:30:00, 09:45:00, 10:00:00 in the next, and so on.
What I need instead is 09:00:00, 09:15:00, 09:30:00; then 09:15:00, 09:30:00, 09:45:00; then 09:30:00, 09:45:00, 10:00:00; and so on.
That is, I need a 30-minute slot starting at each timestamp value.
In simple words, from the table above I need groups of rows (1, 2, 3), rows (2, 3, 4), rows (3, 4, 5), and so on.
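
To pin down the expected output, the grouping described above can be sketched in plain Python (not Spark). For each row, it collects every row whose timestamp falls within the following 30 minutes. The function name sliding_batches and the inclusive upper bound are assumptions made for illustration, not part of the question.

```python
from datetime import datetime, timedelta

def sliding_batches(timestamps, width=timedelta(minutes=30)):
    """For each timestamp t, return the sorted timestamps in [t, t + width]."""
    ordered = sorted(timestamps)
    return [[u for u in ordered if t <= u <= t + width] for t in ordered]

# The five sample timestamps from the table (ISO 8601 with a +05:30 offset).
sample = [
    datetime.fromisoformat(f"2023-02-22T{hm}:00+05:30")
    for hm in ("09:00", "09:15", "09:30", "09:45", "10:00")
]

for batch in sliding_batches(sample):
    # Print only the hour and minute of each member of the batch.
    print([t.strftime("%H:%M") for t in batch])
```

With this sketch the last two batches are shorter than three rows, because fewer than three readings remain within 30 minutes of those timestamps. Whether such partial batches should be kept or dropped is not stated in the question.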

Answer 1

Score: 1

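The window specification you're looking for is shown here in PySpark; the Scala API (org.apache.spark.sql.expressions.Window) exposes the same partitionBy, orderBy, and rowsBetween methods:

from pyspark.sql import Window

# Frame: the current row plus the next two rows, per space_id, ordered by timestamp.
w = Window.partitionBy('space_id').orderBy('timestamp').rowsBetween(Window.currentRow, 2)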
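
To see which rows such a frame covers without running Spark, here is a plain-Python emulation of a frame that spans the current row and the two rows after it, partitioned by space_id and ordered by timestamp. It only illustrates the frame semantics; the helper rows_frame is hypothetical and is not a Spark API.

```python
from collections import defaultdict

def rows_frame(rows, following=2):
    """Emulate a row-based frame from the current row to `following` rows ahead.

    rows: iterable of (space_id, timestamp) tuples.
    Returns a list of (space_id, timestamp, frame_timestamps) tuples.
    """
    partitions = defaultdict(list)
    for space_id, ts in rows:
        partitions[space_id].append(ts)

    result = []
    for space_id, stamps in partitions.items():
        # ISO 8601 strings with the same offset sort chronologically.
        stamps.sort()
        for i, ts in enumerate(stamps):
            # Frame: the current row plus up to `following` rows after it.
            result.append((space_id, ts, stamps[i:i + following + 1]))
    return result

rows = [("321d8", f"2023-02-22T{hm}:00+05:30")
        for hm in ("09:00", "09:15", "09:30", "09:45", "10:00")]

for space_id, ts, frame in rows_frame(rows):
    # Characters 11..15 of each timestamp are the HH:MM part.
    print(space_id, ts[11:16], [f[11:16] for f in frame])
```

A row-based frame counts rows, not minutes, so it matches 30-minute slots only while readings arrive every 15 minutes without gaps. If readings can be missing, a range-based frame (Spark's rangeBetween over a numeric ordering such as epoch seconds) is worth considering instead.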

Posted by huangapple on 2023-02-24 03:53:32
Source: https://go.coder-hub.com/75549695.html