如何使用年、月、日期和小时作为分区来筛选数据框?

huangapple go评论63阅读模式
英文:

How to filter a dataframe with range having partitions as year, month, date and hour?

问题

以下是翻译好的部分:

Input:

year month date hour
2023 12 31 18
2024 1 1 10
2023 12 31 14
2024 1 1 14

Start Timestamp: 2023-12-31 15:00:00(包含)
End Timestamp: 2024-01-01 14:00:00(不包含)

Expected Output:

year month date hour
2023 12 31 18
2024 1 1 10

Tried below:
Try 1:

val filteredDf = rawDF.where(($"year" >= startTimeLocal.getYear && $"month" >= startTimeLocal.getMonthValue && $"day" >= startTimeLocal.getDayOfMonth && $"hour" >= startTimeLocal.getHour) && ($"year" <= endTimeLocal.getYear && $"month" <= endTimeLocal.getMonthValue && $"day" <= endTimeLocal.getDayOfMonth && $"hour" < endTimeLocal.getHour))

This condition fails as the hour values greater than 14 for the 31st day would be skipped.

Try 2:

val yearDf = rawDF.where($"year" >= startTimeLocal.getYear && $"year" <= endTimeLocal.getYear)
val monthDf = yearDf.where(($"year" === startTimeLocal.getYear && $"month" >= startTimeLocal.getMonthValue) || ($"year" === endTimeLocal.getYear && $"month" <= endTimeLocal.getMonthValue))
val dayDf = monthDf.where(($"day" >= startTimeLocal.getDayOfMonth && $"hour" >= startTimeLocal.getHour) || ($"day" <= endTimeLocal.getDayOfMonth && $"hour" < endTimeLocal.getHour))

Try 3:

val finalDf = rawDF.where(($"year" >= startTimeLocal.getYear && $"day" >= startTimeLocal.getDayOfMonth && $"hour" >= startTimeLocal.getHour && $"day" >= startTimeLocal.getDayOfMonth) || ($"year" <= endTimeLocal.getYear && $"month" <= endTimeLocal.getMonthValue && $"day" <= endTimeLocal.getDayOfMonth && $"hour" < endTimeLocal.getHour))
英文:

I have to read a dataframe from a table with partitions as follows - year, month, date and hour.

Input :

year month date hour
2023 12 31 18
2024 1 1 10
2023 12 31 14
2024 1 1 14

Need to filter a range of partitions from the table based on my audit range - start and end timestamps.

Start Timestamp : 2023-12-31 15:00:00 (Inclusive)
End Timestamp : 2024-01-01 14:00:00 (Exclusive)

Expected Output :

year month date hour
2023 12 31 18
2024 1 1 10

Tried below:

Try 1:

val filteredDf = rawDF.where(($&quot;year&quot; &gt;= startTimeLocal.getYear &amp;&amp; $&quot;month&quot; &gt;= startTimeLocal.getMonthValue &amp;&amp; $&quot;day&quot; &gt;= startTimeLocal.getDayOfMonth &amp;&amp; $&quot;hour&quot; &gt;= startTimeLocal.getHour) &amp;&amp; ($&quot;year&quot; &lt;= endTimeLocal.getYear &amp;&amp; $&quot;month&quot; &lt;= endTimeLocal.getMonthValue &amp;&amp; $&quot;day&quot; &lt;= endTimeLocal.getDayOfMonth &amp;&amp; $&quot;hour&quot; &lt; endTimeLocal.getHour))

This condition fails as the hour values greater than 14 for 31st day would be skipped.

Try 2:

val yearDf = rawDF.where($&quot;year&quot; &gt;= startTimeLocal.getYear &amp;&amp; $&quot;year&quot; &lt;= endTimeLocal.getYear)
val monthDf = yearDf.where(($&quot;year&quot; === startTimeLocal.getYear &amp;&amp; $&quot;month&quot; &gt;= startTimeLocal.getMonthValue) || ($&quot;year&quot; === endTimeLocal.getYear &amp;&amp; $&quot;month&quot; &lt;= endTimeLocal.getMonthValue))
val dayDf = monthDf.where(($&quot;day&quot; &gt;= startTimeLocal.getDayOfMonth &amp;&amp; $&quot;hour&quot; &gt;= startTimeLocal.getHour) || ($&quot;day&quot; &lt;= endTimeLocal.getDayOfMonth &amp;&amp; $&quot;hour&quot; &lt; endTimeLocal.getHour))

Try 3:

val final4Df = rawDF.where(($&quot;year&quot; &gt;= startTimeLocal.getYear &amp;&amp; $&quot;day&quot; &gt;= startTimeLocal.getDayOfMonth &amp;&amp; $&quot;hour&quot; &gt;= startTimeLocal.getHour &amp;&amp; $&quot;day&quot; &gt;= startTimeLocal.getDayOfMonth) || ($&quot;year&quot; &lt;= endTimeLocal.getYear &amp;&amp; $&quot;month&quot; &lt;= endTimeLocal.getMonthValue &amp;&amp; $&quot;day&quot; &lt;= endTimeLocal.getDayOfMonth &amp;&amp; $&quot;hour&quot; &lt; endTimeLocal.getHour))

答案1

得分: 0

我认为您可以从起始年份到结束年份之间包含所有数据,然后在细粒度级别进行筛选。

val startYear = startTimeLocal.getYear
val startMonth = startTimeLocal.getMonthValue
val startDay = startTimeLocal.getDayOfMonth
val startHour = startTimeLocal.getHour

val endYear = endTimeLocal.getYear
val endMonth = endTimeLocal.getMonthValue
val endDay = endTimeLocal.getDayOfMonth
val endHour = endTimeLocal.getHour

val filteredDf = rawDF.where(
    ($"year" >= startYear && $"year" <= endYear)
    && !(
        ($"year" === startYear && $"month" < startMonth) // 起始日期前的所有月份
        || ($"year" === startYear && $"month" === startMonth && $"day" < startDay) // 起始日期前的所有天
        || ($"year" === startYear && $"month" === startMonth && $"day" === startDay && $"hour" < startHour) // 起始日期前的所有小时
        || ($"year" === endYear && $"month" > endMonth) // 结束日期后的所有月份
        || ($"year" === endYear && $"month" === endMonth && $"day" > endDay) // 结束日期后的所有天
        || ($"year" === endYear && $"month" === endMonth && $"day" === endDay && $"hour" >= endHour) // 结束日期后的所有小时
    )
)
英文:

I think you can start by including all data between the start and end years and then filter out on granular level

val startYear = startTimeLocal.getYear
val startMonth = startTimeLocal.getMonthValue
val startDay = startTimeLocal.getDayOfMonth
val startHour = startTimeLocal.getHour

val endYear = endTimeLocal.getYear
val endMonth = endTimeLocal.getMonthValue
val endDay = endTimeLocal.getDayOfMonth
val endHour = endTimeLocal.getHour

val filteredDf = rawDF.where(
    ($&quot;year&quot; &gt;= startDay &amp;&amp; $&quot;year&quot; &lt;= endYear)
    &amp;&amp; !(
        ($&quot;year&quot; === startYear &amp;&amp; $&quot;month&quot; &lt; startMonth) // All months before start date
        || ($&quot;year&quot; === startYear &amp;&amp; $&quot;month&quot; === startMonth &amp;&amp; $&quot;day&quot; &lt; startDay) // All days before start day
        || ($&quot;year&quot; === startYear &amp;&amp; $&quot;month&quot; === startMonth &amp;&amp; $&quot;day&quot; === startDay &amp;&amp; $&quot;hour&quot; &lt; startHour) // All hours before start hour
        || ($&quot;year&quot; === endYear &amp;&amp; $&quot;month&quot; &gt; endMonth) // All months after end date
        || ($&quot;year&quot; === endYear &amp;&amp; $&quot;month&quot; === endMonth &amp;&amp; $&quot;day&quot; &gt; endDay) // All days after end day
        || ($&quot;year&quot; === endYear &amp;&amp; $&quot;month&quot; === endMonth &amp;&amp; $&quot;day&quot; === endDay &amp;&amp; $&quot;hour&quot; &gt;= endHour) // All hours after end hour
    )
)

huangapple
  • 本文由 发表于 2023年2月27日 00:44:54
  • 转载请务必保留本文链接:https://go.coder-hub.com/75573502.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定