PySpark filtering on multiple criteria
Question
My example DataFrame looks like this:
Name  ID   ContractDate  LoanSum  ClosingDate
A     ID1  2022-10-10    10       2022-10-16
A     ID1  2022-10-10    15       2022-10-18
A     ID1  2022-10-20    20       2022-10-31
A     ID1  2022-10-20    20       2022-10-30
A     ID1  2022-11-10    14       2022-11-22
A     ID1  2022-11-10    15       2022-11-22
B     ID2  2022-11-11    15       2022-11-15
B     ID2  2022-11-11    30       2022-11-18
B     ID2  2022-11-17    35       2022-11-22
B     ID2  2022-11-17    35       2022-11-24
C     ID3  2022-12-19    19       2022-11-10
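For anyone reproducing this, here is a minimal sketch of how the `data` variable used in the code further down could be defined. The representation is an assumption: dates are kept as ISO-formatted strings and `LoanSum` as an integer, since the question does not show the actual types. `COLUMNS` is a name introduced here for illustration.

```python
# Sample rows from the table above.
# Assumption: dates are ISO strings and LoanSum is an int; the real types are not shown.
data = [
    ("A", "ID1", "2022-10-10", 10, "2022-10-16"),
    ("A", "ID1", "2022-10-10", 15, "2022-10-18"),
    ("A", "ID1", "2022-10-20", 20, "2022-10-31"),
    ("A", "ID1", "2022-10-20", 20, "2022-10-30"),
    ("A", "ID1", "2022-11-10", 14, "2022-11-22"),
    ("A", "ID1", "2022-11-10", 15, "2022-11-22"),
    ("B", "ID2", "2022-11-11", 15, "2022-11-15"),
    ("B", "ID2", "2022-11-11", 30, "2022-11-18"),
    ("B", "ID2", "2022-11-17", 35, "2022-11-22"),
    ("B", "ID2", "2022-11-17", 35, "2022-11-24"),
    ("C", "ID3", "2022-12-19", 19, "2022-11-10"),
]

# Hypothetical helper constant holding the column names in order.
COLUMNS = ["Name", "ID", "ContractDate", "LoanSum", "ClosingDate"]

# With an active SparkSession named `spark`, the DataFrame would be built as:
# df = spark.createDataFrame(data).toDF(*COLUMNS)
```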
My goal is to create a new DataFrame that contains all loans issued to specific borrowers (grouped by unique ID) under the following conditions:
- two loans should be granted on the same day;
- the next two loans should also be granted on the same day, and their ContractDate should be no more than 5 days after the ClosingDate of any of the previously issued loans.
In other words, my desired output looks like this:
Name  ID   ContractDate  LoanSum  ClosingDate
A     ID1  2022-10-10    10       2022-10-16
A     ID1  2022-10-10    15       2022-10-18
A     ID1  2022-10-20    20       2022-10-31
A     ID1  2022-10-20    20       2022-10-30
(the difference between the first ClosingDate and the next ContractDate in this example is 4 days: 2022-10-20 minus 2022-10-16)
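The gap in that example can be checked with plain date arithmetic. This is only an illustrative sketch using Python's standard `datetime` module, not part of the PySpark pipeline; the variable names are made up for the example.

```python
from datetime import date

# ContractDate of the second pair of loans and ClosingDate of the first loan.
contract_date = date(2022, 10, 20)
first_closing_date = date(2022, 10, 16)

# Subtracting two dates yields a timedelta; .days gives the whole-day gap.
gap_days = (contract_date - first_closing_date).days
print(gap_days)

# The 5-day rule from the second condition.
print(0 <= gap_days <= 5)
```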
I've written the following code, which finds two or more loans issued to a specific borrower on the same date (partitioned by ID):
from pyspark.sql import functions as f
from pyspark.sql import Window

# `data` holds the rows shown above.
df = spark.createDataFrame(data).toDF('Name', 'ID', 'ContractDate', 'LoanSum', 'ClosingDate')
df.show()
cols = df.columns

# Order each borrower's loans by contract date.
w = Window.partitionBy('ID').orderBy('ContractDate')

# Keep rows that share a ContractDate with the previous or next loan of the same borrower.
df.withColumn('PreviousContractDate', f.lag('ContractDate').over(w)) \
    .withColumn('Target', f.expr('datediff(ContractDate, PreviousContractDate) == 0')) \
    .withColumn('Target', f.col('Target') | f.lead('Target').over(w)) \
    .filter('Target == True')
But I'm stuck on how to filter by the second condition.
Any help is highly appreciated!
Answer 1
Score: 1
Use `groupBy` to get the loan info for each day, then apply the conditions with a `Window`.
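# Order each borrower's same-day loan groups by their latest closing date.
w = Window.partitionBy('ID').orderBy('MaxClosingDate')

(
    df.groupBy('Name', 'ID', 'ContractDate')
    # Collapse the loans issued on one day into a single row per borrower and day.
    .agg(
        f.collect_list(f.struct('LoanSum', 'ClosingDate')).alias('LoanInfo'),
        f.max('ClosingDate').alias('MaxClosingDate')
    )
    # Condition 1: exactly two loans were granted on that day.
    .withColumn('is2LoanTarget', f.size('LoanInfo') == 2)
    # Condition 2: this day falls 0 to 5 days after the previous group's latest closing date.
    .withColumn('is5DaysTarget', f.datediff(f.col('ContractDate'), f.lag('MaxClosingDate').over(w)).between(0, 5))
    # Also keep the earlier group of each qualifying pair.
    .withColumn('is5DaysTarget', f.col('is5DaysTarget') | f.lead('is5DaysTarget').over(w))
    .filter('is2LoanTarget and is5DaysTarget')
    # Expand back to one row per loan.
    .withColumn('LoanInfo', f.explode('LoanInfo'))
    .select('Name', 'ID', 'ContractDate', 'LoanInfo.*')
    .show()
)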
+----+---+------------+-------+-----------+
|Name| ID|ContractDate|LoanSum|ClosingDate|
+----+---+------------+-------+-----------+
|   A|ID1|  2022-10-10|     10| 2022-10-16|
|   A|ID1|  2022-10-10|     15| 2022-10-18|
|   A|ID1|  2022-10-20|     20| 2022-10-31|
|   A|ID1|  2022-10-20|     20| 2022-10-30|
+----+---+------------+-------+-----------+
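To sanity-check the result on a small sample without a Spark session, the same logic can be sketched in plain Python. This is a rough reference under stated assumptions, not the answer's implementation: it mirrors the answer's reading of the 5-day rule (ContractDate compared with the latest ClosingDate of the previous same-day group), dates are assumed to be ISO strings, and `filter_loans`, `max_gap_days`, and `sample` are hypothetical names introduced here.

```python
from collections import defaultdict
from datetime import date


def filter_loans(rows, max_gap_days=5):
    """Plain-Python sketch of the groupBy + Window logic.

    rows: tuples (Name, ID, ContractDate, LoanSum, ClosingDate), dates as ISO strings.
    """
    # groupBy('Name', 'ID', 'ContractDate'): collect the loans issued on the same day.
    groups = defaultdict(list)
    for name, id_, contract, loan_sum, closing in rows:
        groups[(name, id_, contract)].append((loan_sum, closing))

    # Window.partitionBy('ID'): gather each borrower's same-day groups.
    by_id = defaultdict(list)
    for (name, id_, contract), loans in groups.items():
        max_closing = max(closing for _, closing in loans)  # ISO strings sort by date
        by_id[id_].append((max_closing, name, contract, loans))

    result = []
    for id_, day_groups in by_id.items():
        day_groups.sort(key=lambda g: g[0])  # orderBy('MaxClosingDate')
        n = len(day_groups)

        # near_prev[i]: group i starts 0..max_gap_days days after the
        # previous group's latest closing date (the lag step).
        near_prev = [False] * n
        for i in range(1, n):
            gap = (date.fromisoformat(day_groups[i][2])
                   - date.fromisoformat(day_groups[i - 1][0])).days
            near_prev[i] = 0 <= gap <= max_gap_days

        for i, (_, name, contract, loans) in enumerate(day_groups):
            near_next = i + 1 < n and near_prev[i + 1]  # the lead step
            if len(loans) == 2 and (near_prev[i] or near_next):
                for loan_sum, closing in loans:  # the explode step
                    result.append((name, id_, contract, loan_sum, closing))
    return result


sample = [
    ("A", "ID1", "2022-10-10", 10, "2022-10-16"),
    ("A", "ID1", "2022-10-10", 15, "2022-10-18"),
    ("A", "ID1", "2022-10-20", 20, "2022-10-31"),
    ("A", "ID1", "2022-10-20", 20, "2022-10-30"),
    ("A", "ID1", "2022-11-10", 14, "2022-11-22"),
    ("A", "ID1", "2022-11-10", 15, "2022-11-22"),
    ("B", "ID2", "2022-11-11", 15, "2022-11-15"),
    ("B", "ID2", "2022-11-11", 30, "2022-11-18"),
    ("B", "ID2", "2022-11-17", 35, "2022-11-22"),
    ("B", "ID2", "2022-11-17", 35, "2022-11-24"),
    ("C", "ID3", "2022-12-19", 19, "2022-11-10"),
]

kept = filter_loans(sample)
for row in kept:
    print(row)
```

Note that ID2 is dropped because its second pair starts on 2022-11-17, one day before the latest ClosingDate of the first pair (2022-11-18), so the gap is negative. If the rule should instead compare against the earliest ClosingDate, `max` in the sketch would need to change accordingly.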