
Is there a more efficient way to filter previous month's (or X previous months') data using Pyspark?

Question


I am new to PySpark, so I apologize if this is a basic question. I have looked at a few StackOverflow posts (like this one), but I am still not sure whether I'm following the best (most efficient) way of doing things here.

I have a dataframe and a string parameter/variable representing the dataset date. Using the dataset date, I would like to filter data from the dataframe for the last 'X' months (for simplicity, let's just say I'd like to fetch just the last month's data). What I have tried is below.

from pyspark.sql import functions as F

customer_panel_s3_location = f"s3://my-bucket/region_id={region_id}/marketplace_id={marketplace_id}/"
customer_panel_table = spark.read.parquet(customer_panel_s3_location)
customer_panel_table.show(20, False)

# +-----------+---------+--------------+------------+---------+---------+-------------+-----------+
# |customer_id|region_id|marketplace_id|snapshot_day|45d_yield|plan_type|payment_offer|target_date|
# +-----------+---------+--------------+------------+---------+---------+-------------+-----------+
# |493595720  |1        |1             |2021-12-31  |0        |ANNUAL   |PAID         |2023-02-11 |
# |277499930  |1        |1             |2022-01-04  |0        |ANNUAL   |PAID         |2023-02-11 |
# |790961616  |1        |1             |2021-12-28  |0        |ANNUAL   |PAID         |2023-02-11 |
# |2111203    |1        |1             |2022-12-14  |0        |MONTHLY  |TRIAL        |2023-02-11 |
# |4673190    |1        |1             |2022-11-30  |1        |MONTHLY  |TRIAL        |2023-02-11 |
# |2117974    |1        |1             |2022-12-15  |1        |ANNUAL   |PAID         |2023-02-11 |
# +-----------+---------+--------------+------------+---------+---------+-------------+-----------+

dataset_date = '2023-03-16'
df_customer_panel_table = (
    spark.read.parquet(customer_panel_s3_location)
    .withColumn("dataset_date", F.lit(dataset_date))  # withColumn needs a Column, so wrap the string in F.lit()
    .filter(
        F.date_format(F.col("snapshot_day"), "yyyyMM")
        == F.date_format(
            F.add_months(F.to_date(F.col("dataset_date"), "yyyy-MM-dd"), month_offset_for_snapshot_day),
            "yyyyMM",
        )
    )
)

Is there a cleaner (more PySpark-y) way of doing this? It feels clunky that I am converting snapshot_day (a date-type column) to yyyyMM, which might introduce conversion inefficiencies.

Thank you in advance for your answers!

Answer 1

Score: 1


Your DataFrame

+-----------+---------+--------------+------------+---------+---------+-------------+-----------+
|customer_id|region_id|marketplace_id|snapshot_day|45d_yield|plan_type|payment_offer|target_date|
+-----------+---------+--------------+------------+---------+---------+-------------+-----------+
|  493595720|        1|             1|  2021-12-31|        0|   ANNUAL|         PAID| 2023-02-11|
|  277499930|        1|             1|  2022-01-04|        0|   ANNUAL|         PAID| 2023-02-11|
|  790961616|        1|             1|  2021-12-28|        0|   ANNUAL|         PAID| 2023-02-11|
|    2111203|        1|             1|  2022-12-14|        0|  MONTHLY|        TRIAL| 2023-02-11|
|    4673190|        1|             1|  2022-11-30|        1|  MONTHLY|        TRIAL| 2023-02-11|
|    2117974|        1|             1|  2022-12-15|        1|   ANNUAL|         PAID| 2023-02-11|
+-----------+---------+--------------+------------+---------+---------+-------------+-----------+

Import the necessary functions:

from pyspark.sql.functions import lit, floor, months_between

Add a column with the given date:

dataset_day = "2023-03-16"
customer_panel_df = customer_panel_df.withColumn("dataset_day", lit(dataset_day).cast("DATE"))

Calculate the number of months between the snapshot date and the dataset day using the months_between() function:

customer_panel_df = customer_panel_df.withColumn("month_gap", floor(months_between("dataset_day", "snapshot_day")))

Filter the last 10 months' records:

customer_panel_df.filter("month_gap <= 10").drop("dataset_day", "month_gap").show()

Output

+-----------+---------+--------------+------------+---------+---------+-------------+-----------+
|customer_id|region_id|marketplace_id|snapshot_day|45d_yield|plan_type|payment_offer|target_date|
+-----------+---------+--------------+------------+---------+---------+-------------+-----------+
|    2111203|        1|             1|  2022-12-14|        0|  MONTHLY|        TRIAL| 2023-02-11|
|    4673190|        1|             1|  2022-11-30|        1|  MONTHLY|        TRIAL| 2023-02-11|
|    2117974|        1|             1|  2022-12-15|        1|   ANNUAL|         PAID| 2023-02-11|
+-----------+---------+--------------+------------+---------+---------+-------------+-----------+
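As a sanity check on the answer's boundary behaviour (an editorial aside, not part of the original answer): Spark's months_between() returns a whole number of months when both dates fall on the same day of month, or both on the last day of their months, and otherwise adds a fractional (day1 - day2) / 31 term before the floor() is applied. That rule can be modelled without Spark in a few lines of plain Python, using dates from the sample data; the helper name month_gap is ours:

```python
import calendar
import math
from datetime import date

def month_gap(dataset_day: date, snapshot_day: date) -> int:
    """Pure-Python model of floor(months_between(dataset_day, snapshot_day)).

    Whole months when both dates share a day of month, or both sit on their
    month's last day; otherwise a (day1 - day2) / 31 fraction is added
    before flooring, mirroring Spark's convention.
    """
    whole = (dataset_day.year - snapshot_day.year) * 12 + (dataset_day.month - snapshot_day.month)
    d1_last = calendar.monthrange(dataset_day.year, dataset_day.month)[1]
    d2_last = calendar.monthrange(snapshot_day.year, snapshot_day.month)[1]
    same_day = dataset_day.day == snapshot_day.day
    both_month_ends = dataset_day.day == d1_last and snapshot_day.day == d2_last
    frac = 0.0 if (same_day or both_month_ends) else (dataset_day.day - snapshot_day.day) / 31.0
    return math.floor(whole + frac)

dataset_day = date(2023, 3, 16)
print(month_gap(dataset_day, date(2022, 12, 14)))  # 3  -> passes month_gap <= 10
print(month_gap(dataset_day, date(2021, 12, 31)))  # 14 -> filtered out
```

This matches the answer's output: the 2021 rows have a month gap above 10 and are dropped, while the late-2022 rows are kept.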
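A variant the asker might also weigh (our suggestion, with slightly different boundary semantics than the month_gap filter): compare snapshot_day against a single cutoff date, e.g. F.col("snapshot_day") >= F.add_months(F.to_date(F.lit(dataset_date)), -10) in Spark, so the filter is one plain date comparison per row that Spark can push down to the Parquet scan. The stdlib sketch below models the day-clamping arithmetic add_months() performs; the helper name shift_months is ours:

```python
import calendar
from datetime import date

def shift_months(d: date, n: int) -> date:
    """Shift a date by n months, clamping the day of month to the target
    month's length -- the same convention Spark's add_months() follows."""
    total = d.year * 12 + (d.month - 1) + n
    year, month = divmod(total, 12)
    month += 1
    return date(year, month, min(d.day, calendar.monthrange(year, month)[1]))

cutoff = shift_months(date(2023, 3, 16), -10)       # ten months before the dataset date
snapshot_days = [date(2021, 12, 31), date(2022, 12, 14), date(2022, 11, 30)]
kept = [d for d in snapshot_days if d >= cutoff]    # only the two late-2022 rows survive
```

Note the boundary difference: floor(months_between(...)) <= 10 keeps every row from the tenth calendar month back, while the cutoff comparison trims rows earlier in that month, so pick whichever rule matches the reporting requirement.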

huangapple
  • Published on 2023-03-21 00:19:26
  • Please retain this link when reposting: https://go.coder-hub.com/75792796.html