Validate CSV by checking if enumeration columns contain any invalid coded values

Question
We receive many different CSV files from external labs and centers. When receiving such a file, we first need to do some QA checks before further processing, to make sure the data is correct, at least on a technical level.
We have some Python scripts to check the number of columns, check date values, min/max ranges, etc. But now we also want to check whether the enumerated columns are correct. So, for example, if a column `visit` is a coded value and may only contain `baseline`, `fup_6_m`, `fup_12_m`, then it shouldn't contain anything else, like `fup_36_m`.
We have the metadata specifications, so the column names and the lists of coded values (a.k.a. enumerations) are known beforehand.
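A minimal sketch of how such a spec might be loaded, assuming a hypothetical metadata.json that maps each column name to its list of allowed codes:
import json

# hypothetical spec file, e.g. {"sex": ["M", "F"], "score": [0, 1, 2, 3, 4], ...}
with open("metadata.json") as f:
    allowed_enum = json.load(f)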
This is the Python script I've got so far:
# check if coded values are correct
import pandas as pd
import io
## load data from csv files
##df = pd.read_csv(r'patlist_mcl2017.csv', sep=",", decimal=".")
# TESTING: create data frame from text
str_patients = """patid,dob,sex,height,score,visit
1072,16-01-1981,M,154,1,fup_12_m
1091,20-12-1991,M,168,4,baseline
1126,25-12-1999,M,181,3,fup_6_m
1139,14-04-1980,Y,165,1,baseline
1171,05-11-1984,M,192,2,fup_12_m
1237,17-08-1983,F,170,3,fup_6_m
1334,26-08-1985,F,160,5,fup_6_m
1365,14-09-1976,M,184,3,fup_24_m
1384,28-12-1993,F,152,1,baseline
1456,27-09-1998,F,164,5,fup_12_m
"""
df = pd.read_csv(io.StringIO(str_patients), sep=",", decimal=".")
print(df)

# allowed values for enumeration columns
allowed_enum = {
    'sex': ['M', 'F'],
    'score': [0, 1, 2, 3, 4],
    'visit': ['baseline', 'fup_6_m', 'fup_12_m']
}

# check enumeration
for column_name, allowed_values in allowed_enum.items():
    df_chk = df[~df[column_name].isin(allowed_values)].groupby(column_name).size().reset_index(name='Count')
    if not df_chk.empty:
        print("Found invalid values for column '%s':" % column_name)
        print(df_chk)
It works, and the output is like this:
Found invalid values for column 'sex':
  sex  Count
0   Y      1
Found invalid values for column 'score':
   score  Count
0      5      2
Found invalid values for column 'visit':
      visit  Count
0  fup_24_m      1
But the different files can contain many columns, and for better reporting we'd like to get the output as one dataframe, so something like this:
  Column_name   Invalid  Count
0         Sex         Y      1
1       Score         5      2
2       visit  fup_24_m      1
So my question is:
- What is the best way to collect the invalid values in a dataframe, like above?
- Or, is there maybe a better way to check/validate these kinds of coded values?
Answer 1
Score: 2
You could try
...
dfs = {
    column_name: df[~df[column_name].isin(allowed_values)]
    .value_counts(subset=column_name)
    .to_frame().reset_index(names="Invalid")
    for column_name, allowed_values in allowed_enum.items()
}
out = pd.concat(dfs, names=("Column_name", None)).droplevel(1)
to get
              Invalid  count
Column_name                 
sex                 Y      1
score               5      2
visit        fup_24_m      1
for the sample dataframe (another `.reset_index` would give you the format in the question).
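For instance, a short continuation (a sketch, reusing `out` from above) to match the layout asked for in the question:
report = out.reset_index()
print(report)
#   Column_name   Invalid  count
# 0         sex         Y      1
# 1       score         5      2
# 2       visit  fup_24_m      1
# (column widths approximate)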
Or, similar to Zach Young's proposal, you could do
...
columns = (
    df.loc[~df[column_name].isin(allowed_values), column_name]
    for column_name, allowed_values in allowed_enum.items()
)
out = pd.concat(columns, axis=1, sort=True)
to get a sub-dataframe which contains only the invalid values:
  sex  score     visit
3   Y    NaN       NaN
6 NaN    5.0       NaN
7 NaN    NaN  fup_24_m
9 NaN    5.0       NaN
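If that per-row overview is what should go into the QA report, it could also be written straight to disk (the file name here is just an assumption):
out.to_csv("invalid_values_report.csv", index_label="Row_num")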
Answer 2
Score: 1
Given the input (I added an extra row with all invalid fields):
| patid | dob | sex | height | score | visit |
| ----- | ---------- | --- | ------ | ----- | -------- |
| 1072 | 16-01-1981 | M | 154 | 1 | fup_12_m |
| 1091 | 20-12-1991 | M | 168 | 4 | baseline |
| 1126 | 25-12-1999 | M | 181 | 3 | fup_6_m |
| 1139 | 14-04-1980 | Y | 165 | 1 | baseline |
| 1171 | 05-11-1984 | M | 192 | 2 | fup_12_m |
| 1237 | 17-08-1983 | F | 170 | 3 | fup_6_m |
| 1334 | 26-08-1985 | F | 160 | 5 | fup_6_m |
| 1365 | 14-09-1976 | M | 184 | 3 | fup_24_m |
| 1384 | 28-12-1993 | F | 152 | 1 | baseline |
| 1456 | 27-09-1998 | F | 164 | 5 | fup_12_m |
| a | 1/1/2000 | N | b | -1 | foo |
How about a nice CSV report of all errors, like:
| Row_num | patid | dob | sex | height | score | visit |
| ------- | ----- | -------- | --- | ------ | ----- | -------- |
| 4 | | | Y | | | |
| 7 | | | | | 5 | |
| 8 | | | | | | fup_24_m |
| 10 | | | | | 5 | |
| 11 | a | 1/1/2000 | N | b | -1 | foo |
I don't know Pandas, so if you aren't locked into using it, here's how to validate each column in each row:
import csv
import sys
from datetime import datetime
from typing import Any, Callable

# Map each row number to the column/invalid-value pairs found in that row:
#
# {
#     4: {"sex": "Y"},
#     7: {"score": "5"},
#     8: {"visit": "fup_24_m"},
#     10: {"score": "5"},
#     11: {
#         "patid": "a",
#         "dob": "1/1/2000",
#         "sex": "N",
#         "height": "b",
#         "score": "-1",
#         "visit": "foo",
#     },
# }
#
errors_map: dict[int, dict[str, str]] = {}

def update(row_num: int, col_name: str, value: str):
    """Map col_name/bad-value pair to its row number."""
    if row_num not in errors_map:
        errors_map[row_num] = {}
    errors_map[row_num][col_name] = value

def validate_int(x: str):
    return int(x)

def validate_dt(x: str, format_str: str):
    return datetime.strptime(x, format_str)

def validate_strs(x: str, strs: list[str]):
    if x not in strs:
        raise ValueError
    return x

def validate_ints(x: str, ints: list[int]):
    x_int = validate_int(x)
    if x_int not in ints:
        raise ValueError
    return x_int

v_int = validate_int
v_dt = validate_dt
v_ints = validate_ints
v_strs = validate_strs

validators: dict[str, Callable[[str], Any]] = {
    "patid": lambda x: v_int(x),
    "dob": lambda x: v_dt(x, r"%d-%m-%Y"),
    "sex": lambda x: v_strs(x, ["M", "F"]),
    "height": lambda x: v_int(x),
    "score": lambda x: v_ints(x, [0, 1, 2, 3, 4]),
    "visit": lambda x: v_strs(x, ["baseline", "fup_6_m", "fup_12_m"]),
}

col_names = list(validators.keys())

with open("input.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    if header != col_names:
        print(f"unexpected header: {header}", file=sys.stderr)
        sys.exit(1)
    for i, row in enumerate(reader, start=1):
        for j, col_name in enumerate(col_names):
            val = row[j]
            validator = validators[col_name]
            try:
                validator(val)
            except Exception:
                update(i, col_name, val)

writer = csv.DictWriter(sys.stdout, fieldnames=["Row_num"] + col_names)
writer.writeheader()
for row_num, errors in errors_map.items():
    errors["Row_num"] = str(row_num)
    writer.writerow(errors)
This approach relies on the validator funcs raising some exception, either implicitly (e.g., `int(patid)` if patid isn't an integer) or explicitly (e.g., `raise ValueError` in validate_ints() if the parsed int is out of bounds). As-is, all expected exceptions happen to be ValueErrors.
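As a quick illustration of those two failure modes (a sketch, assuming the validator definitions above):
try:
    validate_int("a")                    # implicit: int("a") raises ValueError
except ValueError:
    print("patid 'a' rejected")

try:
    validate_ints("5", [0, 1, 2, 3, 4])  # explicit: parsed 5 is not in the allowed list
except ValueError:
    print("score '5' rejected")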