Validating pyspark dataframe columns with the same data quality rules

Question

I created a dummy pyspark dataframe.

I am trying to enforce the following rules:

rules = [
    {"column": "last_name", "value": "NA", "name": "Percentage of 'NA' Values in Last Name"},
    {"column": "first_name", "value": "NA", "name": "Percentage of 'NA' Values in First Name"},
]

I would like to have one dictionary entry for the NA rule, since it applies to both the first and last name, rather than having to list the same rule twice:

rules = [
    {"columns": ["last_name", "first_name"], "value": "NA", "name": "Percentage of 'NA' Values in Last Name and First Name"},
    {"column": "country", "value": "USA", "name": "Percentage of 'USA' Values in Country"},
]

Below is what I have done so far. Any tips on how to obtain the same results using the second set of rules?

from pyspark.sql.functions import col

percentages = []

for rule in rules:
    column = rule["column"]
    value = rule["value"]
    name = rule["name"]
    # Share of rows in which the column equals the rule's value
    count = df.filter(col(column) == value).count()
    total_count = df.count()
    percentage = (count / total_count) * 100
    percentages.append({"name": name, "percentage": percentage})

for result in percentages:
    print("{}: {:.2f}%".format(result["name"], result["percentage"]))


Answer 1

Score: 1


I think the cleanest way is to alter your rules so that every rule uses a "columns" list, even when it covers a single column:

rules = [
    {"columns": ["last_name", "first_name"], "value": "NA", "name": "Percentage of 'NA' Values in Last Name and First Name"},
    {"columns": ["country"], "value": "USA", "name": "Percentage of 'USA' Values in Country"},
]

Then, you can alter your code to use this standard format:

from pyspark.sql.functions import col

percentages = []
total_count = df.count()  # the row count is the same for every rule, so compute it once

for rule in rules:
    columns = rule["columns"]
    value = rule["value"]
    name = rule["name"]
    # Apply the same rule to each listed column
    for column in columns:
        count = df.filter(col(column) == value).count()
        percentage = (count / total_count) * 100
        percentages.append({"name": name, "percentage": percentage})

for result in percentages:
    print("{}: {:.2f}%".format(result["name"], result["percentage"]))

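Note that with this loop every column of a multi-column rule is reported under the same rule name. You can adjust this further by storing a list of names in each rule, one per column, and printing accordingly.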
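As a rough sketch of the list-of-names idea: the "names" key, the helper name expand_rules, and the length check below are all assumptions for illustration, not part of the original rules. Each rule carries one name per column, and the helper flattens the rules into (column, value, name) entries that the counting loop can consume.

```python
def expand_rules(rules):
    """Flatten rules into one (column, value, name) tuple per column.

    Hypothetical helper: assumes each rule has a "names" list that is
    parallel to its "columns" list.
    """
    expanded = []
    for rule in rules:
        columns, names = rule["columns"], rule["names"]
        if len(columns) != len(names):
            raise ValueError("each column needs exactly one name")
        for column, name in zip(columns, names):
            expanded.append((column, rule["value"], name))
    return expanded


rules = [
    {"columns": ["last_name", "first_name"], "value": "NA",
     "names": ["Percentage of 'NA' Values in Last Name",
               "Percentage of 'NA' Values in First Name"]},
    {"columns": ["country"], "value": "USA",
     "names": ["Percentage of 'USA' Values in Country"]},
]

for column, value, name in expand_rules(rules):
    # With a DataFrame df, the count would be computed as in the loop above:
    # df.filter(col(column) == value).count()
    print(column, value, name)
```

Keeping the names next to the columns means each printed percentage gets its own label, at the cost of having to keep the two lists the same length.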

If you want to stick to your format, you could check:

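for rule in rules:
    if "columns" in rule:
        # process as multiple columns
        columns = rule["columns"]
    else:
        # "column" in rule: process as one column
        columns = [rule["column"]]
    # then compute the percentage for each entry in columns, as above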
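The same check can be folded into a small helper that turns either rule format into a list of columns, so the rest of the loop does not need to care which key was used. This is only a sketch: the name columns_of is hypothetical, and raising KeyError for a rule with neither key is an assumption about how malformed rules should be treated.

```python
def columns_of(rule):
    """Return the rule's target columns as a list, for either rule format."""
    if "columns" in rule:
        return list(rule["columns"])
    if "column" in rule:
        return [rule["column"]]
    raise KeyError("rule needs a 'column' or 'columns' key")


rules = [
    {"columns": ["last_name", "first_name"], "value": "NA",
     "name": "Percentage of 'NA' Values in Last Name and First Name"},
    {"column": "country", "value": "USA",
     "name": "Percentage of 'USA' Values in Country"},
]

for rule in rules:
    for column in columns_of(rule):
        # Each column can now be counted exactly as in the single-format loop.
        print(rule["name"], "->", column)
```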

huangapple
  • Posted on June 13, 2023, 06:07:07
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76460607.html