March 7, 2023, 15:27:18

Append Value in rows with existing value in Databricks

Question

I am new to Databricks, so bear with me if this sounds naive.
I have a requirement to run validations on a DataFrame. I currently have one function per validation, for example one for a null check and one for a date range. Whenever a row trips a validation rule, the function should mark 1 in the Validation remarks column, like below:

Name	ID	Date_of Birth	position	Validation remarks
dam 	1	02-04-1992	Manager 	
dana		02-04-1992	Associate	1
rich	3	02-04-1992	VP	
danial	4	02-04-1992	CEO	
mathew		02-04-1910	Manager 	1

The problem is that I cannot tell why a row was marked 1: because the ID column is empty, because Date_of_Birth is more than 100 years ago, or both.

So I want to know if I can append the reason like below.

Name	ID	Date_of Birth	position	Validation remarks
dam 	1	02-04-1992	Manager 	
dana		02-04-1992	Associate	ID is null
rich	3	02-04-1992	VP	
danial	4	02-04-1992	CEO	
mathew		02-04-1910	Manager 	['ID is null', 'Date_of_Birth is > 100 years']

That is, if a row has a blank ID, record that; if it later also turns out to have a Date_of_Birth more than 100 years ago, append that reason as well, as in the last row above.

I just want to know how to append values to the Validation remarks column.
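
One way to picture what is being asked, before bringing Spark into it: each validation returns a reason string instead of a bare 1, and the reasons are gathered into a list. The sketch below is plain Python and only illustrates that idea. The names `check_id`, `check_birth_year`, `CHECKS` and `collect_remarks` are hypothetical, a dict stands in for a row, and the year 2023 is an assumed reference year.

```python
from typing import Callable, List, Optional

Row = dict  # assumption: a plain dict stands in for a DataFrame row


def check_id(row: Row) -> Optional[str]:
    """Return a reason when ID is missing (None or empty), otherwise None."""
    return "ID is null" if not row.get("ID") else None


def check_birth_year(row: Row, current_year: int = 2023) -> Optional[str]:
    """Return a reason when the birth year is more than 100 years back."""
    birth_year = int(row["Date_of_Birth"].split("-")[-1])
    if current_year - birth_year > 100:
        return "Date_of_Birth is > 100 years"
    return None


# Hypothetical registry: one function per validation, as in the question
CHECKS: List[Callable[[Row], Optional[str]]] = [check_id, check_birth_year]


def collect_remarks(row: Row) -> List[str]:
    """Run every check and keep only the reasons that fired."""
    results = (check(row) for check in CHECKS)
    return [reason for reason in results if reason is not None]
```

A row that fails nothing gets an empty list, and a row that fails several checks gets one entry per failed check, which is the shape shown in the last row of the table above.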

Answer 1

Score: 0

You could use a custom UDF in PySpark to do this.

```python
from datetime import date

from pyspark.sql import SparkSession
from pyspark.sql.functions import struct, udf
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()

# Create example DataFrame
df = spark.createDataFrame([("dam", "1", "02-04-1992", "Manager"),
                            ("dana", "", "02-04-1992", "Associate"),
                            ("rich", "3", "02-04-1992", "VP"),
                            ("danial", "4", "02-04-1992", "CEO"),
                            ("mathew", "", "02-04-1910", "Manager")],
                           ["Name", "ID", "Date_of_birth", "position"])
df.show()
# +------+---+-------------+---------+
# |  Name| ID|Date_of_birth| position|
# +------+---+-------------+---------+
# |   dam|  1|   02-04-1992|  Manager|
# |  dana|   |   02-04-1992|Associate|
# |  rich|  3|   02-04-1992|       VP|
# |danial|  4|   02-04-1992|      CEO|
# |mathew|   |   02-04-1910|  Manager|
# +------+---+-------------+---------+

# Python function that checks ID and date of birth and returns a list of remarks
def validate_row(row):
    remarks = []
    if not row.ID:  # covers both None and the empty string
        remarks.append("ID is null")
    # Rough check: compares only the birth year against the current year
    if (row.Date_of_birth is not None and
            int(row.Date_of_birth.split("-")[-1]) <= date.today().year - 100):
        remarks.append("Date_of_Birth is > 100 years")
    return remarks

# Create a UDF from the Python function above
validate_udf = udf(validate_row, ArrayType(StringType()))

# Apply the UDF to each row of the DataFrame
df = df.withColumn("Validation remarks", validate_udf(struct(df.columns)))
df.show(truncate=False)
# +------+---+-------------+---------+------------------------------------------+
# |Name  |ID |Date_of_birth|position |Validation remarks                        |
# +------+---+-------------+---------+------------------------------------------+
# |dam   |1  |02-04-1992   |Manager  |[]                                        |
# |dana  |   |02-04-1992   |Associate|[ID is null]                              |
# |rich  |3  |02-04-1992   |VP       |[]                                        |
# |danial|4  |02-04-1992   |CEO      |[]                                        |
# |mathew|   |02-04-1910   |Manager  |[ID is null, Date_of_Birth is > 100 years]|
# +------+---+-------------+---------+------------------------------------------+
```
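
The year comparison in `validate_row` ignores the month and day, so a row can be flagged up to a year early. If that matters, a date-aware check could look like the sketch below. The helper names `age_in_years` and `is_over_100` are hypothetical, and the `%d-%m-%Y` format is an assumption about how strings such as `02-04-1992` should be read.

```python
from datetime import date, datetime
from typing import Optional


def age_in_years(birth: str, today: Optional[date] = None,
                 fmt: str = "%d-%m-%Y") -> int:
    """Completed years between a birth date string and today."""
    today = today or date.today()
    born = datetime.strptime(birth, fmt).date()
    # Subtract one year if this year's birthday has not happened yet
    before_birthday = (today.month, today.day) < (born.month, born.day)
    return today.year - born.year - int(before_birthday)


def is_over_100(birth: str, today: Optional[date] = None) -> bool:
    """True when more than 100 full years have passed since birth."""
    return age_in_years(birth, today) > 100
```

Inside a row-level function like the one above, the year arithmetic could then be swapped for a call such as `is_over_100(row.Date_of_birth)`.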

Answer 2

Score: 0

Here are my 2 cents:

I created a DataFrame as follows:

+------+----+-------------+---------+
|  Name|  ID|Date_of_Birth| position|
+------+----+-------------+---------+
|   dam|   1|   1992-04-02|  Manager|
|  dana|null|   1992-04-02|Associate|
|  rich|   3|   1992-04-02|       VP|
|danial|   4|   1992-04-02|      CEO|
|mathew|null|   1910-04-02|  Manager|
+------+----+-------------+---------+

Put all your conditions in a list (cond_result_list):

from pyspark.sql.functions import array_remove, col, concat, expr, lit, split

# Start with an empty remarks string
df_input = df_input.withColumn('Validation remarks', lit(''))

# Each entry is [SQL condition, remark to record when the condition is true]
cond_result_list = [
    ["id is null", 'Id is Null'],
    ["DATEDIFF(CURRENT_DATE(), Date_of_Birth) / 365.25 > 100", 'Date_of_Birth is > 100 years'],
]

# Append '|<remark>' to the string for every condition that holds
for condition, remark in cond_result_list:
    df_input = df_input.withColumn(
        'Validation remarks',
        concat(col('Validation remarks'),
               expr("case when {} then '|{}' else '' end".format(condition, remark))))

# Split on the '|' delimiter, then drop the empty leading element
df_input = df_input.withColumn("Validation remarks", split(col("Validation remarks"), r"\|"))
df_input = df_input.withColumn("Validation remarks", array_remove(col("Validation remarks"), ""))

Print the dataframe:

df_input.show(truncate=False)
+------+----+-------------+---------+------------------------------------------+
|Name  |ID  |Date_of_Birth|position |Validation remarks                        |
+------+----+-------------+---------+------------------------------------------+
|dam   |1   |1992-04-02   |Manager  |[]                                        |
|dana  |null|1992-04-02   |Associate|[Id is Null]                              |
|rich  |3   |1992-04-02   |VP       |[]                                        |
|danial|4   |1992-04-02   |CEO      |[]                                        |
|mathew|null|1910-04-02   |Manager  |[Id is Null, Date_of_Birth is > 100 years]|
+------+----+-------------+---------+------------------------------------------+
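
The concatenate-then-split approach depends on no remark containing the `|` delimiter. A possible variation, sketched below, builds a single Spark SQL expression that yields the array directly, using the built-in `array` and `filter` SQL functions (available from Spark 2.4). The helper `build_remarks_expr` is hypothetical and not part of the answer above. It only assembles a string, and it assumes the remarks contain no single quotes.

```python
from typing import List, Sequence


def build_remarks_expr(rules: Sequence[Sequence[str]]) -> str:
    """Build a Spark SQL expression evaluating to the array of remarks
    whose condition holds for the row."""
    cases: List[str] = []
    for condition, remark in rules:
        if "'" in remark:
            # Assumption of this sketch: remarks are simple quote-free text
            raise ValueError("remark must not contain a single quote")
        # CASE without ELSE yields NULL when the condition is not met
        cases.append(f"CASE WHEN {condition} THEN '{remark}' END")
    # Keep only the non-NULL entries
    return f"filter(array({', '.join(cases)}), r -> r IS NOT NULL)"
```

It could be applied with something like `df_input.withColumn('Validation remarks', expr(build_remarks_expr(cond_result_list)))`, which replaces the loop, the `split`, and the `array_remove` steps with one column expression.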
