How to filter a dataframe with range having partitions as year, month, date and hour?
Question
I have to read a DataFrame from a table that is partitioned by year, month, date and hour.
Input :
year | month | date | hour |
---|---|---|---|
2023 | 12 | 31 | 18 |
2024 | 1 | 1 | 10 |
2023 | 12 | 31 | 14 |
2024 | 1 | 1 | 14 |
I need to filter a range of partitions from the table based on my audit range, which is given by a start and an end timestamp.
Start Timestamp : 2023-12-31 15:00:00 (Inclusive)
End Timestamp : 2024-01-01 14:00:00 (Exclusive)
Expected Output :
year | month | date | hour |
---|---|---|---|
2023 | 12 | 31 | 18 |
2024 | 1 | 1 | 10 |
Tried below:
Try 1:
val filteredDf = rawDF.where(($"year" >= startTimeLocal.getYear && $"month" >= startTimeLocal.getMonthValue && $"day" >= startTimeLocal.getDayOfMonth && $"hour" >= startTimeLocal.getHour) && ($"year" <= endTimeLocal.getYear && $"month" <= endTimeLocal.getMonthValue && $"day" <= endTimeLocal.getDayOfMonth && $"hour" < endTimeLocal.getHour))
This condition fails because every field is bounded independently: for example, $"hour" < endTimeLocal.getHour is applied to every day, so hours of 14 or later on the 31st are skipped.
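The failure can be reproduced without Spark. The following is a minimal sketch in plain Scala, assuming `startTimeLocal` and `endTimeLocal` are `java.time.LocalDateTime` values (the getters used in the question suggest this, but it is not stated). The names `RangeCheck`, `Part`, `fieldWise` and `lexicographic` are hypothetical and exist only for this illustration. It contrasts the Try 1 style of bounding each field on its own with a lexicographic comparison, where a later field only matters when all earlier fields are equal.

```scala
import java.time.LocalDateTime

object RangeCheck {
  // One partition, holding the four column values from the question's table.
  final case class Part(year: Int, month: Int, day: Int, hour: Int)

  // Assumed types: the question does not show how these are built.
  val start: LocalDateTime = LocalDateTime.parse("2023-12-31T15:00:00") // inclusive
  val end: LocalDateTime   = LocalDateTime.parse("2024-01-01T14:00:00") // exclusive

  // Try 1 style: every field is bounded independently of the others.
  def fieldWise(p: Part): Boolean =
    p.year >= start.getYear && p.month >= start.getMonthValue &&
      p.day >= start.getDayOfMonth && p.hour >= start.getHour &&
      p.year <= end.getYear && p.month <= end.getMonthValue &&
      p.day <= end.getDayOfMonth && p.hour < end.getHour

  // Lexicographic comparison: tuples are ordered field by field,
  // so month, day and hour only decide when the earlier fields tie.
  def lexicographic(p: Part): Boolean = {
    import scala.math.Ordering.Implicits._
    val key = (p.year, p.month, p.day, p.hour)
    val lo  = (start.getYear, start.getMonthValue, start.getDayOfMonth, start.getHour)
    val hi  = (end.getYear, end.getMonthValue, end.getDayOfMonth, end.getHour)
    key >= lo && key < hi
  }

  // The four rows from the question's input table.
  val input: List[Part] = List(
    Part(2023, 12, 31, 18),
    Part(2024, 1, 1, 10),
    Part(2023, 12, 31, 14),
    Part(2024, 1, 1, 14)
  )
}
```

With this sample input, `RangeCheck.input.filter(RangeCheck.fieldWise)` keeps no rows at all, because no month is both `>= 12` and `<= 1`, while `RangeCheck.input.filter(RangeCheck.lexicographic)` keeps the two rows listed in the expected output.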
Try 2:
val yearDf = rawDF.where($"year" >= startTimeLocal.getYear && $"year" <= endTimeLocal.getYear)
val monthDf = yearDf.where(($"year" === startTimeLocal.getYear && $"month" >= startTimeLocal.getMonthValue) || ($"year" === endTimeLocal.getYear && $"month" <= endTimeLocal.getMonthValue))
val dayDf = monthDf.where(($"day" >= startTimeLocal.getDayOfMonth && $"hour" >= startTimeLocal.getHour) || ($"day" <= endTimeLocal.getDayOfMonth && $"hour" < endTimeLocal.getHour))
Try 3:
val final4Df = rawDF.where(($"year" >= startTimeLocal.getYear && $"day" >= startTimeLocal.getDayOfMonth && $"hour" >= startTimeLocal.getHour && $"day" >= startTimeLocal.getDayOfMonth) || ($"year" <= endTimeLocal.getYear && $"month" <= endTimeLocal.getMonthValue && $"day" <= endTimeLocal.getDayOfMonth && $"hour" < endTimeLocal.getHour))
Answer 1
Score: 0
I think you can start by including all data between the start and end years, and then exclude the rows that fall outside the range at a finer granularity (month, day and hour).
val startYear = startTimeLocal.getYear
val startMonth = startTimeLocal.getMonthValue
val startDay = startTimeLocal.getDayOfMonth
val startHour = startTimeLocal.getHour
val endYear = endTimeLocal.getYear
val endMonth = endTimeLocal.getMonthValue
val endDay = endTimeLocal.getDayOfMonth
val endHour = endTimeLocal.getHour
val filteredDf = rawDF.where(
($"year" >= startYear && $"year" <= endYear)
&& !(
($"year" === startYear && $"month" < startMonth) // All months before start date
|| ($"year" === startYear && $"month" === startMonth && $"day" < startDay) // All days before start day
|| ($"year" === startYear && $"month" === startMonth && $"day" === startDay && $"hour" < startHour) // All hours before start hour
|| ($"year" === endYear && $"month" > endMonth) // All months after end date
|| ($"year" === endYear && $"month" === endMonth && $"day" > endDay) // All days after end day
|| ($"year" === endYear && $"month" === endMonth && $"day" === endDay && $"hour" >= endHour) // All hours after end hour
)
)
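An alternative to spelling out each exclusion is to pack the four columns into one sortable number and compare that against the bounds. The following is only a sketch under stated assumptions, not a tested replacement for the filter above: it assumes the partition columns are numeric and named `year`, `month`, `day` and `hour` as in the code (the table header says `date`), and that the bounds are `java.time.LocalDateTime` values. `PartitionRange`, `hourKey`, `hourKeyCol` and `filterRange` are hypothetical names introduced for the illustration.

```scala
import java.time.LocalDateTime
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, lit}

object PartitionRange {
  // Pack a timestamp into one sortable number, e.g. 2023-12-31 15:00 -> 2023123115.
  // Minutes and seconds are ignored because the partitions are hourly.
  def hourKey(t: LocalDateTime): Long =
    t.getYear * 1000000L + t.getMonthValue * 10000L + t.getDayOfMonth * 100L + t.getHour

  // The same packing expressed over the DataFrame's partition columns.
  // Assumes numeric columns; string-typed partitions would need a cast first.
  def hourKeyCol: Column =
    col("year") * lit(1000000L) + col("month") * lit(10000L) +
      col("day") * lit(100L) + col("hour")

  // Keep rows whose key satisfies start <= key < end (start inclusive, end exclusive).
  def filterRange(df: DataFrame, start: LocalDateTime, end: LocalDateTime): DataFrame =
    df.where(hourKeyCol >= hourKey(start) && hourKeyCol < hourKey(end))
}
```

It could be called as `val filtered = PartitionRange.filterRange(rawDF, startTimeLocal, endTimeLocal)`. Whether Spark can still prune partitions with an arithmetic expression like this is worth checking with `filtered.explain()` before relying on it for a large table.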