June 13, 2023 06:07:07

Validating pyspark dataframe columns with the same data quality rules

Question

I created a dummy PySpark DataFrame.

I am trying to enforce the following rules:

rules = [{"column": "last_name", "value": "NA", "name": "Percentage of 'NA' Values in Last Name"},
         {"column": "first_name", "value": "NA", "name": "Percentage of 'NA' Values in First Name"}]

I would like to have a single dictionary for the NA rule, since it applies to both first and last name, rather than having to list the same rule twice.

rules = [{"columns": ["last_name", "first_name"], "value": "NA", "name": "Percentage of 'NA' Values in Last Name and First Name"},
         {"column": "country", "value": "USA", "name": "Percentage of 'USA' Values in Country"}]

Below is what I have done so far. Any good tips on obtaining the same results using the second set of rules?

percentages = []
for rule in rules:
    column = rule["column"]
    value = rule["value"]
    name = rule["name"]
    count = df.filter(col(column) == value).count()
    total_count = df.count()
    percentage = (count / total_count) * 100
    percentages.append({"name": name, "percentage": percentage})
for result in percentages:
    print("{}: {:.2f}%".format(result["name"], result["percentage"]))

Answer 1

Score: 1

I think the cleanest way is to alter your rules so that every rule uses a "columns" list, even when it covers a single column:

rules = [{"columns": ["last_name", "first_name"], "value": "NA", "name": "Percentage of 'NA' Values in Last Name and First Name"},
         {"columns": ["country"], "value": "USA", "name": "Percentage of 'USA' Values in Country"}]

Then, you can alter your code to use this standard format:

from pyspark.sql.functions import col

percentages = []
total_count = df.count()  # the total is the same for every rule, so count it once
for rule in rules:
    columns = rule["columns"]
    value = rule["value"]
    name = rule["name"]
    # Apply the same rule to each column it lists
    for column in columns:
        count = df.filter(col(column) == value).count()
        percentage = (count / total_count) * 100
        percentages.append({"name": name, "percentage": percentage})
for result in percentages:
    print("{}: {:.2f}%".format(result["name"], result["percentage"]))

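Note that every column of a multi-column rule is appended under the same rule name, so the printed lines for last_name and first_name look identical. You can make further adjustments, such as keeping a list of per-column names, and print accordingly.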
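One possible way to keep the per-column results distinguishable is sketched below. It assumes the standard "columns" format from above. The helper names `expand_rules` and `compute_percentages` and the "name [column]" label style are made up for this illustration and are not part of the original answer. As an optional variation, the sketch also computes all percentages in one aggregation rather than running one `count()` per column.

```python
def expand_rules(rules):
    """Flatten rules into one (label, column, value) check per column."""
    checks = []
    for rule in rules:
        for column in rule["columns"]:
            # Append the column name so results of the same rule stay distinct.
            label = "{} [{}]".format(rule["name"], column)
            checks.append((label, column, rule["value"]))
    return checks


def compute_percentages(df, rules):
    """Compute every percentage with a single aggregation over df."""
    # Imported here so expand_rules can be used without Spark installed.
    from pyspark.sql import functions as F

    checks = expand_rules(rules)
    # The average of a 0/1 indicator is the fraction of matching rows.
    aggs = [
        (F.avg(F.when(F.col(column) == value, 1).otherwise(0)) * 100)
        .alias("check_{}".format(i))
        for i, (_, column, value) in enumerate(checks)
    ]
    row = df.agg(*aggs).first()
    return [
        {"name": label, "percentage": row[i]}
        for i, (label, _, _) in enumerate(checks)
    ]
```

With the `df` and `rules` from above, `compute_percentages(df, rules)` returns a list of dictionaries that can be printed with the same loop as before. On an empty DataFrame the averages come back as `None`, so that case would need separate handling.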

If you want to stick to your format, you could check:

for rule in rules:
    if "columns" in rule:
        pass  # process as multiple columns
    else:  # "column" in rule
        pass  # process as one column
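If both formats need to be supported, the branching can be folded into a small helper that always returns a list of columns, so the rest of the loop stays the same for either kind of rule. This is only a sketch: `rule_columns` is a hypothetical name, and raising `KeyError` for a rule with neither key is an assumption about how malformed rules should be handled.

```python
def rule_columns(rule):
    """Return the columns a rule applies to, always as a list."""
    if "columns" in rule:
        return list(rule["columns"])
    if "column" in rule:
        return [rule["column"]]
    raise KeyError("rule needs a 'column' or 'columns' key: {!r}".format(rule))


rules = [
    {"columns": ["last_name", "first_name"], "value": "NA",
     "name": "Percentage of 'NA' Values in Last Name and First Name"},
    {"column": "country", "value": "USA",
     "name": "Percentage of 'USA' Values in Country"},
]

for rule in rules:
    for column in rule_columns(rule):
        # Each (rule, column) pair can now be processed the same way.
        print(rule["name"], "->", column)
```

Inside the inner loop, the filter-and-count logic from the standard-format version can be reused unchanged.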
