在Databricks中追加值到已存在值的行中。

huangapple go评论64阅读模式
英文:

Append Value in rows with existing value in Databricks

问题

我是Databricks的新手,如果我听起来很蠢,请谅解。

我有一个需求,我正在对一个数据框进行验证,我目前为每个验证定义了函数,比如一个用于检查空值,一个用于检查日期范围,每当我的函数符合验证规则时,它应该在Validation remarks列中标记为1,如下所示:

Name	ID	Date_of Birth	position	Validation remarks
dam 	1	02-04-1992	Manager 	
dana		02-04-1992	Associate	1
rich	3	02-04-1992	VP	
danial	4	02-04-1992	CEO	
mathew		02-04-1910	Manager 	1

但问题在于,我无法确定函数标记为1的原因是什么,是因为ID列为空,还是因为出生日期超过100年,或者两者都可能。

所以我想知道是否可以附加原因,如下所示。

Name	ID	Date_of Birth	position	Validation remarks
dam 	1	02-04-1992	Manager 	
dana		02-04-1992	Associate	ID为空
rich	3	02-04-1992	VP	
danial	4	02-04-1992	CEO	
mathew		02-04-1910	Manager 	['ID为空', '出生日期超过100年']

也就是说,如果行的ID为空,那么稍后如果它还具有出生日期超过100年,也附加该值,就像上面一样。

我只想知道如何将值附加到Validation remarks

英文:

I am new to Databricks so bear with me if I sound stupid.
I have a requirement wherein I am doing validations to a data frame and I currently have defined functions for each validation like one for null-check one for Date_range, every time my function meets the validation rules it should mark 1 in **Validation_remark **column like below

Name	ID	Date_of Birth	position	Validation remarks
dam 	1	02-04-1992	Manager 	
dana		02-04-1992	Associate	1
rich	3	02-04-1992	VP	
danial	4	02-04-1992	CEO	
mathew		02-04-1910	Manager 	1

but the problem here here is i am not able to figure out what is the reason for the function to mark it 1 whether it's because ID col is empty or whether it's because Date_of_birth os > 100 years or may be both.

So I want to know if I can append the reason like below.

Name	ID	Date_of Birth	position	Validation remarks
dam 	1	02-04-1992	Manager 	
dana		02-04-1992	Associate	ID id null
rich	3	02-04-1992	VP	
danial	4	02-04-1992	CEO	
mathew		02-04-1910	Manager 	['ID is null', 'Date_of_Birth is > 100 years']

i.e if the row have blank ID then note that late if it also has Date_of Birth > 100 years append that value as well like above.

I just want to know how do I append the values to Validation remarks

答案1

得分: 0

以下是代码的翻译部分:

您可以在PySpark中使用自定义UDF函数来执行此操作

```python
from pyspark.sql.functions import *
from pyspark.sql.types import ArrayType, StringType
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# 创建示例数据框
df = spark.createDataFrame([("dam", "1", "02-04-1992", "Manager"),
                            ("dana", "", "02-04-1992", "Associate"),
                            ("rich", "3", "02-04-1992", "VP"),
                            ("danial", "4", "02-04-1992", "CEO"),
                            ("mathew", "", "02-04-1910", "Manager")],
                           ["Name", "ID", "Date_of_birth", "position"])
df.show()
# +------+---+-------------+---------+
# |  Name| ID|Date_of_birth| position|
# +------+---+-------------+---------+
# |   dam|  1|   02-04-1992|  Manager|
# |  dana|   |   02-04-1992|Associate|
# |  rich|  3|   02-04-1992|       VP|
# |danial|  4|   02-04-1992|      CEO|
# |mathew|   |   02-04-1910|  Manager|
# +------+---+-------------+---------+
# Python函数检查ID和出生日期,返回备注列表
def validate_row(row):
    remarks = []
    if row.ID == "":
        remarks.append("ID为空")
    if (row.Date_of_birth is not None and
        int(row.Date_of_birth.split("-")[-1]) <= 2023 - 100):
        remarks.append("出生日期大于100年") # 我将检查当前年份2023
    return remarks

# 从上面的Python函数创建UDF函数
validate_udf = udf(validate_row, ArrayType(StringType()))

# 将上述函数应用于DF中的每一行
df = df.withColumn("验证备注", validate_udf(struct(df.columns)))
df.show()
# +------+---+-------------+---------+--------------------+
# |  Name| ID|Date_of_birth| position|  验证备注|
# +------+---+-------------+---------+--------------------+
# |   dam|  1|   02-04-1992|  Manager|                  []|
# |  dana|   |   02-04-1992|Associate|        [ID为空]|
# |  rich|  3|   02-04-1992|       VP|                  []|
# |danial|  4|   02-04-1992|      CEO|                  []|
# |mathew|   |   02-04-1910|  Manager|[ID为空, 出生日期大于100年]|
# +------+---+-------------+---------+--------------------+

希望这对您有所帮助。

英文:

You could use a custom udf function in PySpark to do this.

from pyspark.sql.functions import *
from pyspark.sql.types import ArrayType, StringType
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Create example DataFrame
df = spark.createDataFrame([(&quot;dam&quot;, &quot;1&quot;, &quot;02-04-1992&quot;, &quot;Manager&quot;),
(&quot;dana&quot;, &quot;&quot;, &quot;02-04-1992&quot;, &quot;Associate&quot;),
(&quot;rich&quot;, &quot;3&quot;, &quot;02-04-1992&quot;, &quot;VP&quot;),
(&quot;danial&quot;, &quot;4&quot;, &quot;02-04-1992&quot;, &quot;CEO&quot;),
(&quot;mathew&quot;, &quot;&quot;, &quot;02-04-1910&quot;, &quot;Manager&quot;)],
[&quot;Name&quot;, &quot;ID&quot;, &quot;Date_of_birth&quot;, &quot;position&quot;])
df.show()
# +------+---+-------------+---------+
# |  Name| ID|Date_of_birth| position|
# +------+---+-------------+---------+
# |   dam|  1|   02-04-1992|  Manager|
# |  dana|   |   02-04-1992|Associate|
# |  rich|  3|   02-04-1992|       VP|
# |danial|  4|   02-04-1992|      CEO|
# |mathew|   |   02-04-1910|  Manager|
# +------+---+-------------+---------+
# Python func to check ID and date of birth, return list of remarks
def validate_row(row):
remarks = []
if row.ID == &quot;&quot;:
remarks.append(&quot;ID is null&quot;)
if (row.Date_of_birth is not None and
int(row.Date_of_birth.split(&quot;-&quot;)[-1]) &lt;= 2023 - 100):
remarks.append(&quot;Date_of_Birth is &gt; 100 years&quot;) # I will check with current year 2023
return remarks
# Create an UDF func from Python func above
validate_udf = udf(validate_row, ArrayType(StringType()))
# Apply the func above to each row in DF
df = df.withColumn(&quot;Validation remarks&quot;, validate_udf(struct(df.columns)))
df.show()
# +------+---+-------------+---------+--------------------+
# |  Name| ID|Date_of_birth| position|  Validation remarks|
# +------+---+-------------+---------+--------------------+
# |   dam|  1|   02-04-1992|  Manager|                  []|
# |  dana|   |   02-04-1992|Associate|        [ID is null]|
# |  rich|  3|   02-04-1992|       VP|                  []|
# |danial|  4|   02-04-1992|      CEO|                  []|
# |mathew|   |   02-04-1910|  Manager|[ID is null, Date_of_Birth is &gt; 100 years|
# +------+---+-------------+---------+--------------------+

答案2

得分: 0

以下是翻译好的内容:

  1. 创建一个如下所示的数据框:
+------+----+-------------+---------+
|  Name|  ID|Date_of_Birth| position|
+------+----+-------------+---------+
|   dam|   1|   1992-04-02|  Manager|
|  dana|null|   1992-04-02|Associate|
|  rich|   3|   1992-04-02|       VP|
|danial|   4|   1992-04-02|      CEO|
|mathew|null|   1910-04-02|  Manager|
+------+----+-------------+---------+
  1. 将所有条件放入列表 cond_result_list 中:
df_input = df_input.withColumn('Validation remarks', lit(''))
cond_result_list = [
["id is null",'Id is Null'],
["DATEDIFF(CURRENT_DATE(), Date_of_Birth) / 365.25 > 100",'Date_of_Birth is > 100 years']
]
for i in range(len(cond_result_list)):
df_input = df_input.withColumn('Validation remarks',
concat(col('Validation remarks'), 
expr("""case when {} then '{}'{}' else '' end """.format(cond_result_list[i][0],'|',cond_result_list[i][1]))
)
)
df_input = df_input.withColumn("Validation remarks", split(df_input["Validation remarks"], "\|"))
df_input = df_input.withColumn("Validation remarks", array_remove(df_input["Validation remarks"], ""))
  1. 打印数据框:
df_input.show(truncate=False)
+------+----+-------------+---------+------------------------------------------+
|Name  |ID  |Date_of_Birth|position |Validation remarks                        |
+------+----+-------------+---------+------------------------------------------+
|dam   |1   |1992-04-02   |Manager  |[]                                        |
|dana  |null|1992-04-02   |Associate|[Id is Null]                              |
|rich  |3   |1992-04-02   |VP       |[]                                        |
|danial|4   |1992-04-02   |CEO      |[]                                        |
|mathew|null|1910-04-02   |Manager  |[Id is Null, Date_of_Birth is > 100 years]|
+------+----+-------------+---------+------------------------------------------+

请注意,代码部分未进行翻译。

英文:

Here are my 2 cents:

  1. Created a dataframe as follows:

     +------+----+-------------+---------+
    |  Name|  ID|Date_of_Birth| position|
    +------+----+-------------+---------+
    |   dam|   1|   1992-04-02|  Manager|
    |  dana|null|   1992-04-02|Associate|
    |  rich|   3|   1992-04-02|       VP|
    |danial|   4|   1992-04-02|      CEO|
    |mathew|null|   1910-04-02|  Manager|
    +------+----+-------------+---------+
    
  2. Put all your conditions in the list(cond_result_list):

     df_input = df_input.withColumn(&#39;Validation remarks&#39;,lit((&#39;&#39;)))
    cond_result_list = [
    [&quot;id is null&quot;,&#39;Id is Null&#39;],
    [&quot;DATEDIFF(CURRENT_DATE(), Date_of_Birth) / 365.25 &gt; 100&quot;,&#39;Date_of_Birth is &gt; 100 years&#39;]
    ]
    for i in range(len(cond_result_list)):
    df_input = df_input.withColumn(&#39;Validation remarks&#39;,
    concat(col(&#39;Validation remarks&#39;), 
    expr(&quot;&quot;&quot;case when {} then &#39;{}{}&#39; else &#39;&#39; end &quot;&quot;&quot;.format(cond_result_list[i][0],&#39;|&#39;,cond_result_list[i][1]))
    )
    )
    df_input= df_input.withColumn(&quot;Validation remarks&quot;, split(df_input[&quot;Validation remarks&quot;], &quot;\|&quot;))
    df_input = df_input.withColumn(&quot;Validation remarks&quot;, array_remove(df_input[&quot;Validation remarks&quot;], &quot;&quot;))
    
  3. Print the dataframe:

    df_input.show(truncate=False)
    +------+----+-------------+---------+------------------------------------------+
    |Name  |ID  |Date_of_Birth|position |Validation remarks                        |
    +------+----+-------------+---------+------------------------------------------+
    |dam   |1   |1992-04-02   |Manager  |[]                                        |
    |dana  |null|1992-04-02   |Associate|[Id is Null]                              |
    |rich  |3   |1992-04-02   |VP       |[]                                        |
    |danial|4   |1992-04-02   |CEO      |[]                                        |
    |mathew|null|1910-04-02   |Manager  |[Id is Null, Date_of_Birth is &gt; 100 years]|
    +------+----+-------------+---------+------------------------------------------+
    

在Databricks中追加值到已存在值的行中。

huangapple
  • 本文由 发表于 2023年3月7日 15:27:18
  • 转载请务必保留本文链接:https://go.coder-hub.com/75659052.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定