Validating pyspark dataframe columns with the same data quality rules

Question

I created a dummy pyspark dataframe.

I am trying to enforce the following rules:

    rules = [
        {"column": "last_name", "value": "NA", "name": "Percentage of 'NA' Values in Last Name"},
        {"column": "first_name", "value": "NA", "name": "Percentage of 'NA' Values in First Name"},
    ]

I would like a single dictionary entry for the NA rule, since it applies to both the first and the last name, rather than having to list the same rule twice.

    rules = [
        {"columns": ["last_name", "first_name"], "value": "NA", "name": "Percentage of 'NA' Values in Last Name and First Name"},
        {"column": "country", "value": "USA", "name": "Percentage of 'USA' Values in Country"},
    ]

Below is what I have done so far. Any tips on how to obtain the same results using the second set of rules?

    from pyspark.sql.functions import col

    # df is the DataFrame being validated; count its rows once, not once per rule
    total_count = df.count()

    percentages = []
    for rule in rules:
        column = rule["column"]
        value = rule["value"]
        name = rule["name"]
        # share of rows where this column equals the rule's value
        count = df.filter(col(column) == value).count()
        percentage = (count / total_count) * 100
        percentages.append({"name": name, "percentage": percentage})

    for result in percentages:
        print("{}: {:.2f}%".format(result["name"], result["percentage"]))


Answer 1

Score: 1


I think the cleanest way is to change your rules so that every rule has a "columns" list, even when it covers a single column:

    rules = [
        {"columns": ["last_name", "first_name"], "value": "NA",
         "name": "Percentage of 'NA' Values in Last Name and First Name"},
        {"columns": ["country"], "value": "USA", "name": "Percentage of 'USA' Values in Country"},
    ]

Then, you can alter your code to use this standard format:

    from pyspark.sql.functions import col

    # df is the DataFrame being validated; count its rows once up front
    total_count = df.count()

    percentages = []
    for rule in rules:
        columns = rule["columns"]
        value = rule["value"]
        name = rule["name"]
        # apply the same rule to every column it lists
        for column in columns:
            count = df.filter(col(column) == value).count()
            percentage = (count / total_count) * 100
            percentages.append({"name": name, "percentage": percentage})

    for result in percentages:
        print("{}: {:.2f}%".format(result["name"], result["percentage"]))

Note that every column in a rule is reported under the same rule name. You can adjust this further by giving each rule a list of names, one per column, and printing accordingly.
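One possible reading of the list-of-names idea is sketched below in plain Python, with no Spark calls. The "names" key and the expand_rule helper are assumptions made for illustration; neither appears in the original post. Each rule is expanded into one (column, value, label) triple per column, and the existing loop could then run on those triples.

```python
def expand_rule(rule):
    """Yield one (column, value, label) triple per column in the rule."""
    columns = rule["columns"]
    # "names" is a hypothetical optional key holding one label per column.
    names = rule.get("names")
    if names is None:
        # Fall back to the shared rule name, suffixed with the column name.
        names = ["{} [{}]".format(rule["name"], column) for column in columns]
    if len(names) != len(columns):
        raise ValueError("'names' must have one entry per column")
    for column, label in zip(columns, names):
        yield column, rule["value"], label


rule = {
    "columns": ["last_name", "first_name"],
    "value": "NA",
    "names": [
        "Percentage of 'NA' Values in Last Name",
        "Percentage of 'NA' Values in First Name",
    ],
}
checks = list(expand_rule(rule))
```

With this shape, the inner loop only needs to iterate over the triples and store the label next to the percentage, so each printed line identifies its own column.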

If you want to stick to your format, you could check:

    for rule in rules:
        if "columns" in rule:
            columns = rule["columns"]    # process as multiple columns
        else:                            # "column" in rule
            columns = [rule["column"]]   # process as one column
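The key check above could also be folded into a small helper that converts either rule shape into the standard "columns" form before the main loop runs. This is a minimal sketch in plain Python; the name normalize_rule is hypothetical, and raising KeyError for a rule with neither key is an assumption about the desired behavior.

```python
def normalize_rule(rule):
    """Return a copy of the rule that always carries a "columns" list."""
    if "columns" in rule:
        columns = list(rule["columns"])
    elif "column" in rule:
        # Wrap the single column so later code can always loop over a list.
        columns = [rule["column"]]
    else:
        raise KeyError('rule needs a "column" or "columns" key')
    return {"columns": columns, "value": rule["value"], "name": rule["name"]}


rules = [
    {"columns": ["last_name", "first_name"], "value": "NA",
     "name": "Percentage of 'NA' Values in Last Name and First Name"},
    {"column": "country", "value": "USA",
     "name": "Percentage of 'USA' Values in Country"},
]

normalized = [normalize_rule(rule) for rule in rules]
```

Normalizing once up front keeps the percentage loop free of branching, and a malformed rule fails early instead of partway through the Spark jobs.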

Posted by huangapple on 2023-06-13 06:07:07
When reposting, please keep the link to this article: https://go.coder-hub.com/76460607.html