Delete or mark record with max date in PySpark
Question
I am new to Databricks, and I am trying to get rid of duplicate records in the "Patient_id" column of a DF by using the dropDuplicates method (a sketch of that call is shown below).
I'm wondering if there is a way to delete duplicate records in the "Patient_id" column based on "time_stamp", which is another column in the DF.
So what I basically want is to keep, for each "Patient_id", the record with the maximum time stamp when I drop duplicates, and delete the rest.
Thanks in advance.
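For reference, the dropDuplicates call described above would look roughly like this; a minimal sketch, assuming the DataFrame is named df and the key column is Patient_id (both names taken from the question):

# Sketch only: dropDuplicates keeps an arbitrary row per Patient_id,
# so there is no control over which time_stamp survives
deduped = df.dropDuplicates(["Patient_id"])
deduped.show()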
Answer 1
Score: 1
You need to use a window function. You can define a row_number based on the descending date for each Patient_id, then filter the records where row_number = 1, i.e., for each patient id pick the maximum date and keep only those records.
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Sample data: two patients with two timestamps each
date = ['2022-10-16 17:00:00', '2022-10-16 18:00:00', '2022-10-16 21:00:00', '2022-10-16 22:00:00']
id = [1, 1, 2, 2]
df = spark.createDataFrame(list(zip(id, date)), ['id', 'dt'])

# Number the rows within each id, newest timestamp first
df = df.withColumn("rn", F.row_number().over(
    Window.partitionBy("id").orderBy(F.col("dt").desc())
))

# Keep only the newest row per id
df.where("rn = 1").select("id", "dt").show()
Output -
+---+-------------------+
| id| dt|
+---+-------------------+
| 1|2022-10-16 18:00:00|
| 2|2022-10-16 22:00:00|
+---+-------------------+
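The same pattern can be mapped back to the column names from the question; a minimal sketch, assuming a DataFrame df with columns Patient_id and time_stamp:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Number the rows for each Patient_id, newest time_stamp first,
# then keep only the newest row per patient and drop the helper column
w = Window.partitionBy("Patient_id").orderBy(F.col("time_stamp").desc())
latest = (df.withColumn("rn", F.row_number().over(w))
            .where("rn = 1")
            .drop("rn"))

Using row_number (rather than rank) guarantees exactly one surviving row per Patient_id even if two records share the same maximum time_stamp.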