Append Value in rows with existing value in Databricks
Question
I am new to Databricks, so bear with me if I sound stupid.

I have a requirement where I run validations on a DataFrame. I have defined one function per validation, for example one for a null check and one for a date range. Whenever a row trips a validation rule, the function should mark 1 in the **Validation remarks** column, like below:

| Name | ID | Date_of_Birth | position | Validation remarks |
|---|---|---|---|---|
| dam | 1 | 02-04-1992 | Manager | |
| dana | | 02-04-1992 | Associate | 1 |
| rich | 3 | 02-04-1992 | VP | |
| danial | 4 | 02-04-1992 | CEO | |
| mathew | | 02-04-1910 | Manager | 1 |

The problem is that I cannot tell why a row was marked 1: whether the ID column is empty, whether Date_of_Birth is more than 100 years ago, or both.

So I want to know if I can append the reason instead, like below:

| Name | ID | Date_of_Birth | position | Validation remarks |
|---|---|---|---|---|
| dam | 1 | 02-04-1992 | Manager | |
| dana | | 02-04-1992 | Associate | ID is null |
| rich | 3 | 02-04-1992 | VP | |
| danial | 4 | 02-04-1992 | CEO | |
| mathew | | 02-04-1910 | Manager | ['ID is null', 'Date_of_Birth is > 100 years'] |

That is, if a row has a blank ID, note that; if it later also turns out to have a Date_of_Birth more than 100 years ago, append that reason as well, as shown above.

I just want to know how to append values to Validation remarks.
Answer 1
Score: 0
You could use a custom UDF in PySpark to do this.

```python
from datetime import date

from pyspark.sql import SparkSession
from pyspark.sql.functions import struct, udf
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()

# Create example DataFrame (blank IDs are stored as empty strings)
df = spark.createDataFrame(
    [("dam", "1", "02-04-1992", "Manager"),
     ("dana", "", "02-04-1992", "Associate"),
     ("rich", "3", "02-04-1992", "VP"),
     ("danial", "4", "02-04-1992", "CEO"),
     ("mathew", "", "02-04-1910", "Manager")],
    ["Name", "ID", "Date_of_birth", "position"])
df.show()
# +------+---+-------------+---------+
# |  Name| ID|Date_of_birth| position|
# +------+---+-------------+---------+
# |   dam|  1|   02-04-1992|  Manager|
# |  dana|   |   02-04-1992|Associate|
# |  rich|  3|   02-04-1992|       VP|
# |danial|  4|   02-04-1992|      CEO|
# |mathew|   |   02-04-1910|  Manager|
# +------+---+-------------+---------+

# Python func to check ID and date of birth; returns a list of remarks
def validate_row(row):
    remarks = []
    # Treat both NULL and empty-string IDs as missing
    if row.ID is None or row.ID.strip() == "":
        remarks.append("ID is null")
    # Compare the birth year (last part of the date string) with the current year
    if (row.Date_of_birth is not None and
            int(row.Date_of_birth.split("-")[-1]) <= date.today().year - 100):
        remarks.append("Date_of_Birth is > 100 years")
    return remarks

# Create a UDF from the Python func above
validate_udf = udf(validate_row, ArrayType(StringType()))

# Apply the UDF to each row, passing all columns as one struct
df = df.withColumn("Validation remarks", validate_udf(struct(*df.columns)))
df.show(truncate=False)
# +------+---+-------------+---------+------------------------------------------+
# |Name  |ID |Date_of_birth|position |Validation remarks                        |
# +------+---+-------------+---------+------------------------------------------+
# |dam   |1  |02-04-1992   |Manager  |[]                                        |
# |dana  |   |02-04-1992   |Associate|[ID is null]                              |
# |rich  |3  |02-04-1992   |VP       |[]                                        |
# |danial|4  |02-04-1992   |CEO      |[]                                        |
# |mathew|   |02-04-1910   |Manager  |[ID is null, Date_of_Birth is > 100 years]|
# +------+---+-------------+---------+------------------------------------------+
```
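
Since the question mentions one function per validation, the UDF idea can also be organized as a list of small checks, each paired with the remark it should contribute. The following is a minimal pure-Python sketch of that pattern, not the answer author's code. The names `RULES`, `is_blank_id`, `is_older_than_100` and `collect_remarks` are hypothetical, rows are modeled as plain dicts, and the date format is assumed to be `%d-%m-%Y`.

```python
from datetime import date, datetime


def is_blank_id(row, today):
    """True when the ID is missing (None or only whitespace)."""
    value = row.get("ID")
    return value is None or value.strip() == ""


def is_older_than_100(row, today):
    """True when more than 100 years have passed since the birth date."""
    raw = row.get("Date_of_birth")
    if not raw:
        return False
    born = datetime.strptime(raw, "%d-%m-%Y").date()  # assumed date format
    # Tuple comparison avoids building a date that may not exist (Feb 29)
    return (born.year, born.month, born.day) < (today.year - 100, today.month, today.day)


# Hypothetical registry: each check is paired with the remark it contributes
RULES = [
    (is_blank_id, "ID is null"),
    (is_older_than_100, "Date_of_Birth is > 100 years"),
]


def collect_remarks(row, today=None):
    """Run every rule against one row and return the remarks that apply."""
    today = today or date.today()
    return [remark for check, remark in RULES if check(row, today)]


# Usage with a fixed reference date so the result does not depend on the clock
sample = {"Name": "mathew", "ID": "", "Date_of_birth": "02-04-1910"}
print(collect_remarks(sample, today=date(2023, 6, 1)))
```

Adding a new validation then means adding one function and one entry in `RULES`. To use this from Spark, `collect_remarks` could be wrapped in a UDF returning `ArrayType(StringType())`, with the struct converted through `row.asDict()`.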
Answer 2
Score: 0
Here are my 2 cents:

- Create a DataFrame as follows:

```
+------+----+-------------+---------+
|  Name|  ID|Date_of_Birth| position|
+------+----+-------------+---------+
|   dam|   1|   1992-04-02|  Manager|
|  dana|null|   1992-04-02|Associate|
|  rich|   3|   1992-04-02|       VP|
|danial|   4|   1992-04-02|      CEO|
|mathew|null|   1910-04-02|  Manager|
+------+----+-------------+---------+
```

- Put all your conditions in a list (`cond_result_list`), append a remark for every condition that holds, then split the result into an array:

```python
from pyspark.sql.functions import array_remove, col, concat, expr, lit, split

# Start with an empty remarks string on every row
df_input = df_input.withColumn("Validation remarks", lit(""))

# Each entry pairs a SQL condition with the remark to record when it holds
cond_result_list = [
    ["ID is null", "Id is Null"],
    ["DATEDIFF(CURRENT_DATE(), Date_of_Birth) / 365.25 > 100", "Date_of_Birth is > 100 years"],
]

# Append "|<remark>" to the remarks string for every condition a row meets
for condition, remark in cond_result_list:
    df_input = df_input.withColumn(
        "Validation remarks",
        concat(
            col("Validation remarks"),
            expr("case when {} then '{}{}' else '' end".format(condition, "|", remark)),
        ),
    )

# Split on the "|" separator, then drop the empty leading element
df_input = df_input.withColumn("Validation remarks", split(col("Validation remarks"), r"\|"))
df_input = df_input.withColumn("Validation remarks", array_remove(col("Validation remarks"), ""))
```

- Print the DataFrame:

```python
df_input.show(truncate=False)
```

```
+------+----+-------------+---------+------------------------------------------+
|Name  |ID  |Date_of_Birth|position |Validation remarks                        |
+------+----+-------------+---------+------------------------------------------+
|dam   |1   |1992-04-02   |Manager  |[]                                        |
|dana  |null|1992-04-02   |Associate|[Id is Null]                              |
|rich  |3   |1992-04-02   |VP       |[]                                        |
|danial|4   |1992-04-02   |CEO      |[]                                        |
|mathew|null|1910-04-02   |Manager  |[Id is Null, Date_of_Birth is > 100 years]|
+------+----+-------------+---------+------------------------------------------+
```
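
To make the string-building step in this answer easier to follow, here is a small pure-Python sketch of what the loop generates and how the concat, split and remove sequence behaves for a single row. It only imitates the Spark functions with ordinary string operations; `case_expression` and `simulate_remarks` are hypothetical helper names introduced for illustration.

```python
cond_result_list = [
    ("ID is null", "Id is Null"),
    ("DATEDIFF(CURRENT_DATE(), Date_of_Birth) / 365.25 > 100",
     "Date_of_Birth is > 100 years"),
]


def case_expression(condition, remark, sep="|"):
    """Build the SQL fragment that yields '<sep><remark>' when the condition holds."""
    return "case when {} then '{}{}' else '' end".format(condition, sep, remark)


def simulate_remarks(flags, sep="|"):
    """Imitate concat -> split -> array_remove for one row.

    flags[i] says whether the i-th condition holds for that row.
    """
    # concat: every matching condition contributes '<sep><remark>'
    joined = "".join(
        sep + remark if flag else ""
        for flag, (_, remark) in zip(flags, cond_result_list)
    )
    # split on the separator, then drop empty strings (array_remove)
    return [part for part in joined.split(sep) if part != ""]


for condition, remark in cond_result_list:
    print(case_expression(condition, remark))

print(simulate_remarks([True, True]))
print(simulate_remarks([False, False]))
```

The leading separator is why the empty-string removal is needed: a row matching both conditions produces `|Id is Null|Date_of_Birth is > 100 years`, which splits into an empty first element followed by the two remarks.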