Validating PySpark DataFrame columns with the same data quality rules
Question
I created a dummy PySpark DataFrame.
I am trying to enforce the following rules:
rules = [
    {"column": "last_name", "value": "NA", "name": "Percentage of 'NA' Values in Last Name"},
    {"column": "first_name", "value": "NA", "name": "Percentage of 'NA' Values in First Name"},
]
I would like a single dictionary entry for the NA rule, since it applies to both first and last name, rather than listing the same rule twice:
rules = [
    {"columns": ["last_name", "first_name"], "value": "NA", "name": "Percentage of 'NA' Values in Last Name and First Name"},
    {"column": "country", "value": "USA", "name": "Percentage of 'USA' Values in Country"},
]
Below is what I have done so far. Any tips on how to get the same results using the second set of rules?
from pyspark.sql.functions import col

percentages = []
for rule in rules:
    column = rule["column"]
    value = rule["value"]
    name = rule["name"]
    count = df.filter(col(column) == value).count()
    total_count = df.count()
    percentage = (count / total_count) * 100
    percentages.append({"name": name, "percentage": percentage})

for result in percentages:
    print("{}: {:.2f}%".format(result["name"], result["percentage"]))
Answer 1
Score: 1
I think the cleanest way is to give every rule a "columns" list, even when the rule covers a single column:
rules = [
    {"columns": ["last_name", "first_name"], "value": "NA", "name": "Percentage of 'NA' Values in Last Name and First Name"},
    {"columns": ["country"], "value": "USA", "name": "Percentage of 'USA' Values in Country"},
]
Then, you can alter your code to use this standard format:
from pyspark.sql.functions import col

percentages = []
total_count = df.count()  # count the rows once rather than once per column
for rule in rules:
    columns = rule["columns"]
    value = rule["value"]
    name = rule["name"]
    for column in columns:
        # Share of rows in which this column equals the rule's value
        count = df.filter(col(column) == value).count()
        percentage = (count / total_count) * 100
        percentages.append({"name": name, "percentage": percentage})

for result in percentages:
    print("{}: {:.2f}%".format(result["name"], result["percentage"]))
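Note that every column in a rule is reported under the same rule name, so a multi-column rule prints several lines with identical labels. You can adjust this further, for example by giving each rule a list of names (one per column) and printing accordingly.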
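As a rough sketch of that adjustment (not part of the original answer): the helper names expand_rules and percentages_single_pass, and the "[column]" suffix added to each label, are assumptions made for illustration. The first helper flattens each rule into one check per column with its own label; the second computes all counts with a single df.agg call instead of one count() per column. df is assumed to be the DataFrame from the question.

```python
def expand_rules(rules):
    """Flatten rules into one (column, value, label) check per column."""
    checks = []
    for rule in rules:
        for column in rule["columns"]:
            # Hypothetical label format: the rule name plus the column it was applied to.
            label = "{} [{}]".format(rule["name"], column)
            checks.append((column, rule["value"], label))
    return checks


def percentages_single_pass(df, checks):
    """Compute every percentage in one Spark aggregation."""
    # Imported lazily so that expand_rules can be used without Spark installed.
    from pyspark.sql import functions as F

    aggs = [F.count(F.lit(1)).alias("total")]
    for i, (column, value, _label) in enumerate(checks):
        # 1 for rows where the column equals the value, 0 otherwise (including nulls).
        matched = F.when(F.col(column) == value, 1).otherwise(0)
        aggs.append(F.sum(matched).alias("match_{}".format(i)))

    row = df.agg(*aggs).first()
    total = row["total"]
    results = []
    for i, (_column, _value, label) in enumerate(checks):
        # Guard against an empty DataFrame, where total is 0.
        pct = (row["match_{}".format(i)] / total) * 100 if total else 0.0
        results.append({"name": label, "percentage": pct})
    return results


rules = [
    {"columns": ["last_name", "first_name"], "value": "NA",
     "name": "Percentage of 'NA' Values in Last Name and First Name"},
    {"columns": ["country"], "value": "USA",
     "name": "Percentage of 'USA' Values in Country"},
]
checks = expand_rules(rules)

# With a live DataFrame named df:
# for result in percentages_single_pass(df, checks):
#     print("{}: {:.2f}%".format(result["name"], result["percentage"]))
```

Each filter(...).count() call triggers its own Spark job, so folding the checks into one aggregation may matter on large inputs; on a small dummy DataFrame the loop version is just as good.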
If you want to stick to your format, you could check:
for rule in rules:
    if "columns" in rule:
        # process as multiple columns
        columns = rule["columns"]
    else:  # "column" in rule
        # process as one column
        columns = [rule["column"]]
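If you keep the mixed format, one option is to hide the check behind a small helper so the rest of the loop always sees a list. This is only a sketch: the name rule_columns is hypothetical, and raising KeyError for a rule with neither key is an assumption about how you would want malformed rules handled.

```python
def rule_columns(rule):
    """Return the rule's columns as a list, whichever key the rule uses."""
    if "columns" in rule:
        return list(rule["columns"])
    if "column" in rule:
        return [rule["column"]]
    raise KeyError("rule needs a 'column' or 'columns' key: {!r}".format(rule))


rules = [
    {"columns": ["last_name", "first_name"], "value": "NA",
     "name": "Percentage of 'NA' Values in Last Name and First Name"},
    {"column": "country", "value": "USA",
     "name": "Percentage of 'USA' Values in Country"},
]

for rule in rules:
    # Every rule now yields a list, so the inner loop over columns is the same for both shapes.
    print(rule["name"], "->", rule_columns(rule))
```

The inner loop from the code above can then iterate over rule_columns(rule) instead of rule["columns"].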