PySpark多条件筛选

huangapple go评论76阅读模式
英文:

PySpark filtering on multiple criteria

问题

以下是您要求的代码部分的中文翻译:

from pyspark.sql import functions as f
from pyspark.sql import Window

df = spark.createDataFrame(data).toDF('Name', 'ID', 'ContractDate', 'LoanSum', 'ClosingDate')
df.show()

cols = df.columns
w = Window.partitionBy('ID').orderBy('ContractDate')

df.withColumn('PreviousContractDate', f.lag('ContractDate').over(w)) \
  .withColumn('Target', f.expr('datediff(ContractDate, PreviousContractDate) == 0')) \
  .withColumn('Target', f.col('Target') | f.lead('Target').over(w)) \
  .filter('Target == True')

请注意,这是您提供的代码的翻译部分。如果您需要进一步的帮助或有其他问题,请随时提出。

英文:

My example (dataframe) is like this:

Name    ID     ContractDate LoanSum ClosingDate
A       ID1    2022-10-10   10      2022-10-16 
A       ID1    2022-10-10   15      2022-10-18
A       ID1    2022-10-20   20      2022-10-31
A       ID1    2022-10-20   20      2022-10-30
A       ID1    2022-11-10   14      2022-11-22
A       ID1    2022-11-10   15      2022-11-22

B       ID2    2022-11-11   15      2022-11-15
B       ID2    2022-11-11   30      2022-11-18
B       ID2    2022-11-17   35      2022-11-22
B       ID2    2022-11-17   35      2022-11-24

C       ID3    2022-12-19   19      2022-11-10

My goal is to create a new dataframe that contains all loans issued to specific borrowers (group by unique ID) given the following conditions:

  • two loans should be granted at the same day;
  • next two loans should be also granted at the same day but the time difference between ClosingDate of any of the previously issued loans should not exceed 5 days from the newly issued two loans.

In other words my desirable outcome is like this:

Name    ID     ContractDate LoanSum ClosingDate
A       ID1    2022-10-10   10      2022-10-16 
A       ID1    2022-10-10   15      2022-10-18
A       ID1    2022-10-20   20      2022-10-31
A       ID1    2022-10-20   20      2022-10-30

(the difference between first ClosingDate and next ContractDate in this example is 4 (2020-10-20 minus 2022-10-16))

I've performed the following code that gives me an opportunity to get two or more loans issued to a specific borrower at the same date (grouped by ID):

from pyspark.sql import functions as f
from pyspark.sql import Window

df = spark.createDataFrame(data).toDF('Name','ID','ContractDate','LoanSum','ClosingDate')
df.show()

cols = df.columns
w = Window.partitionBy('ID').orderBy('ContractDate')

df.withColumn('PreviousContractDate', f.lag('ContractDate').over(w)) \
  .withColumn('Target', f.expr('datediff(ContractDate, PreviousContractDate) == 0')) \
  .withColumn('Target', f.col('Target') | f.lead('Target').over(w)) \
  .filter('Target == True')

But I am stuck to filter.

Any help is highly appreciated!

答案1

得分: 1

使用groupBy来获取一天的贷款信息,并使用Window应用条件。

w = Window.partitionBy('ID').orderBy('MaxClosingDate')

df.groupBy('Name', 'ID', 'ContractDate') \
  .agg(
      f.collect_list(f.struct('LoanSum', 'ClosingDate')).alias('LoanInfo'),
      f.max('ClosingDate').alias('MaxClosingDate')
  ) \
  .withColumn('is2LoanTarget', f.size('LoanInfo') == 2) \
  .withColumn('is5DaysTarget', f.datediff(f.col('ContractDate'), f.lag('MaxClosingDate').over(w)).between(0, 5)) \
  .withColumn('is5DaysTarget', f.col('is5DaysTarget') | f.lead('is5DaysTarget').over(w)) \
  .filter('is2LoanTarget and is5DaysTarget') \
  .withColumn('LoanInfo', f.explode('LoanInfo')) \
  .select('Name', 'ID', 'ContractDate', 'LoanInfo.*') \
  .show()

+----+---+------------+-------+-----------+
|Name| ID|ContractDate|LoanSum|ClosingDate|
+----+---+------------+-------+-----------+
| A|ID1| 2022-10-10| 10| 2022-10-16|
| A|ID1| 2022-10-10| 15| 2022-10-18|
| A|ID1| 2022-10-20| 20| 2022-10-31|
| A|ID1| 2022-10-20| 20| 2022-10-30|
+----+---+------------+-------+-----------+


<details>
<summary>英文:</summary>

Use `groupBy` to get the loans info for a day and apply the conditions by using the `Window`.

w = Window.partitionBy('ID').orderBy('MaxClosingDate')

df.groupBy('Name', 'ID', 'ContractDate')
.agg(
f.collect_list(f.struct('LoanSum', 'ClosingDate')).alias('LoanInfo'),
f.max('ClosingDate').alias('MaxClosingDate')
)
.withColumn('is2LoanTarget', f.size('LoanInfo') == 2)
.withColumn('is5DaysTarget', f.datediff(f.col('ContractDate'), f.lag('MaxClosingDate').over(w)).between(0, 5))
.withColumn('is5DaysTarget', f.col('is5DaysTarget') | f.lead('is5DaysTarget').over(w))
.filter('is2LoanTarget and is5DaysTarget')
.withColumn('LoanInfo', f.explode('LoanInfo'))
.select('Name', 'ID', 'ContractDate', 'LoanInfo.*')
.show()

+----+---+------------+-------+-----------+
|Name| ID|ContractDate|LoanSum|ClosingDate|
+----+---+------------+-------+-----------+
| A|ID1| 2022-10-10| 10| 2022-10-16|
| A|ID1| 2022-10-10| 15| 2022-10-18|
| A|ID1| 2022-10-20| 20| 2022-10-31|
| A|ID1| 2022-10-20| 20| 2022-10-30|
+----+---+------------+-------+-----------+


</details>



huangapple
  • 本文由 发表于 2023年7月28日 05:24:17
  • 转载请务必保留本文链接:https://go.coder-hub.com/76783485.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定