删除或标记,在pyspark中记录具有最大日期的记录。

huangapple go评论53阅读模式
英文:

Delete or mark, record with max date in pyspark

问题

我是新手使用 Databricks,我尝试使用 "drop duplicate" 方法来删除 DF 中 "Patient_id" 列的重复记录。

我想知道是否有一种方法可以根据 DF 中的另一列 "time_stamp" 来删除 "patient_id" 列中的重复记录。

所以,基本上我想保留具有最大时间戳的 "patient_id",当有重复时,然后删除其余记录。

提前感谢。

英文:

I am new to data bricks, and I am trying to get rid of duplicate records for column "Patient_id" in a DF by using the "drop duplicate method.

I'm wondering if there is a way to delete duplicate records in the patient_id column depending on the time_stamp column, which is another column in DF.

So what I basically want is to keep the patient_idrd, which has the maximum time stamp, when Iop duplicates, and then delete the rest.

Thanks in advance

答案1

得分: 1

你需要使用窗口函数。您可以基于每个PatientID的降序日期定义row_number。然后,在row_number = 1的记录中进行筛选,即对于每个患者ID选择最大日期并仅筛选这些记录。

date = ['2022-10-16 17:00:00', '2022-10-16 18:00:00', '2022-10-16 21:00:00', '2022-10-16 22:00:00']
id = [1, 1, 2, 2]
df = spark.createDataFrame(list(zip(id, date)), ['id', 'dt'])

import pyspark.sql.functions as F 
from pyspark.sql.window import Window

df = df.withColumn("rn", F.row_number().over(
    Window.partitionBy("id").orderBy(col("dt").desc())
))
df.where("rn = 1").select("id", "dt").show()

输出 -

+---+-------------------+
| id|                 dt|
+---+-------------------+
|  1|2022-10-16 18:00:00|
|  2|2022-10-16 22:00:00|
+---+-------------------+
英文:

You need to use window functions. You can define row_number based on descending date for each PatientID. Then filtering the records wherever row_number = 1, i.e., for each patient id select the maximum date and filter only those records.

date = ['2022-10-16 17:00:00', '2022-10-16 18:00:00', '2022-10-16 21:00:00', '2022-10-16 22:00:00']
id = [1, 1, 2, 2]
df = spark.createDataFrame(list(zip(id, date)), ['id', 'dt'])

import pyspark.sql.functions as F 
from pyspark.sql.window import Window

df = df.withColumn("rn", F.row_number().over(
    Window.partitionBy("id").orderBy(col("dt").desc())
))
df.where("rn = 1").select("id","dt").show()

Output -

+---+-------------------+
| id|                 dt|
+---+-------------------+
|  1|2022-10-16 18:00:00|
|  2|2022-10-16 22:00:00|
+---+-------------------+

huangapple
  • 本文由 发表于 2023年6月1日 12:15:46
  • 转载请务必保留本文链接:https://go.coder-hub.com/76378621.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定