Validate CSV by checking if enumeration columns contain any invalid coded values

Question

We receive many different CSV files from external labs and centers. When receiving such a file, we first need to do some QA checks before further processing, to make sure the data is correct, at least on a technical level.

We have some Python scripts to check the number of columns, check date values, min/max ranges, etc. But now we also want to check whether the enumerated columns are correct. So for example, if a column `visit` is a coded value and may only contain `baseline`, `fup_6_m`, `fup_12_m`, then it shouldn't contain anything else like `fup_36_m`.

We have the metadata specifications, so the column names and the lists of coded values (aka enumeration) are known beforehand.
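Since those specs are machine-readable, the `allowed_enum` dict used in the script below could be built from them instead of being hard-coded. A minimal sketch, assuming a hypothetical spec layout with one row per enumeration column and `;`-separated codes (adapt to your actual metadata files):

```python
import io
import pandas as pd

# Hypothetical metadata spec; the real files' layout will differ.
str_spec = """column_name,allowed_values
sex,M;F
score,0;1;2;3;4
visit,baseline;fup_6_m;fup_12_m
"""
spec = pd.read_csv(io.StringIO(str_spec))

# One list of allowed codes per enumeration column.
# Note: codes are read as strings here; numeric columns like 'score'
# would still need a type-conversion step.
allowed_enum = {
    row.column_name: row.allowed_values.split(";")
    for row in spec.itertuples(index=False)
}
print(allowed_enum)
```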

This is the Python script I've got so far:

    # check if coded values are correct
    import pandas as pd
    import io

    ## load data from csv files
    ##df = pd.read_csv(r'patlist_mcl2017.csv', sep=",", decimal=".")

    # TESTING: create data frame from text
    str_patients = """patid,dob,sex,height,score,visit
    1072,16-01-1981,M,154,1,fup_12_m
    1091,20-12-1991,M,168,4,baseline
    1126,25-12-1999,M,181,3,fup_6_m
    1139,14-04-1980,Y,165,1,baseline
    1171,05-11-1984,M,192,2,fup_12_m
    1237,17-08-1983,F,170,3,fup_6_m
    1334,26-08-1985,F,160,5,fup_6_m
    1365,14-09-1976,M,184,3,fup_24_m
    1384,28-12-1993,F,152,1,baseline
    1456,27-09-1998,F,164,5,fup_12_m
    """
    df = pd.read_csv(io.StringIO(str_patients), sep=",", decimal=".")

    print(df)

    # allowed values for enumeration columns
    allowed_enum = {
        'sex': ['M', 'F'],
        'score': [0, 1, 2, 3, 4],
        'visit': ['baseline', 'fup_6_m', 'fup_12_m']
    }

    # check enumeration
    for column_name, allowed_values in allowed_enum.items():
        df_chk = df[~df[column_name].isin(allowed_values)].groupby(column_name).size().reset_index(name='Count')
        if not df_chk.empty:
            print("Found invalid values for column '%s':" % column_name)
            print(df_chk)

It works and the output is like this:

    Found invalid values for column 'sex':
      sex  Count
    0   Y      1
    Found invalid values for column 'score':
       score  Count
    0      5      2
    Found invalid values for column 'visit':
          visit  Count
    0  fup_24_m      1

But the different files can contain many columns, and for better reporting we'd like to get the output as one dataframe, something like this:

      Column_name   Invalid  Count
    0         sex         Y      1
    1       score         5      2
    2       visit  fup_24_m      1

So my question is:

  • What is the best way to collect the invalid values in a dataframe, like above?
  • Or, is there maybe a better way of checking/validating these kinds of coded values?

Answer 1 (score: 2)

You could try

    ...
    dfs = {
        column_name: df[~df[column_name].isin(allowed_values)]
                     .value_counts(subset=column_name)
                     .to_frame().reset_index(names="Invalid")
        for column_name, allowed_values in allowed_enum.items()
    }
    out = pd.concat(dfs, names=("Column_name", None)).droplevel(1)

to get

                  Invalid  count
    Column_name
    sex                 Y      1
    score               5      2
    visit        fup_24_m      1

for the sample dataframe (another `.reset_index()` would give you the format in the question).
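For completeness, a runnable sketch of that extra `.reset_index()`, with the sample data trimmed to a few rows (the flat column layout is an assumption about the desired reporting format):

```python
import io
import pandas as pd

str_patients = """patid,sex,score,visit
1072,M,1,fup_12_m
1139,Y,1,baseline
1334,F,5,fup_6_m
1365,M,3,fup_24_m
"""
df = pd.read_csv(io.StringIO(str_patients))

allowed_enum = {
    "sex": ["M", "F"],
    "score": [0, 1, 2, 3, 4],
    "visit": ["baseline", "fup_6_m", "fup_12_m"],
}

# count the invalid values per enumeration column, then stack the
# per-column frames into one report
dfs = {
    column_name: df[~df[column_name].isin(allowed_values)]
    .value_counts(subset=column_name)
    .to_frame().reset_index(names="Invalid")
    for column_name, allowed_values in allowed_enum.items()
}
out = (
    pd.concat(dfs, names=("Column_name", None))
    .droplevel(1)
    .reset_index()  # turns the Column_name index level into a regular column
)
print(out)
```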

Or, similar to Zach Young's proposal, you could do

    ...
    columns = (
        df.loc[~df[column_name].isin(allowed_values), column_name]
        for column_name, allowed_values in allowed_enum.items()
    )
    out = pd.concat(columns, axis=1, sort=True)

to get a sub-dataframe which contains only the invalid values:

       sex  score     visit
    3    Y    NaN       NaN
    6  NaN    5.0       NaN
    7  NaN    NaN  fup_24_m
    9  NaN    5.0       NaN
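A small usage note on this variant (my addition, not part of the answer): since every non-NaN cell in that sub-dataframe is an invalid value, `.stack()` collapses it into one (row, column) → value entry per problem, which can be handy for reporting:

```python
import io
import pandas as pd

str_patients = """patid,sex,score,visit
1072,M,1,fup_12_m
1139,Y,1,baseline
1334,F,5,fup_6_m
1365,M,3,fup_24_m
"""
df = pd.read_csv(io.StringIO(str_patients))

allowed_enum = {
    "sex": ["M", "F"],
    "score": [0, 1, 2, 3, 4],
    "visit": ["baseline", "fup_6_m", "fup_12_m"],
}

columns = (
    df.loc[~df[column_name].isin(allowed_values), column_name]
    for column_name, allowed_values in allowed_enum.items()
)
out = pd.concat(columns, axis=1, sort=True)

# .stack() drops the NaN placeholders: one entry per invalid cell,
# indexed by (row number, column name)
report = out.stack()
print(report)
```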

Answer 2 (score: 1)


Given the input (I added an extra row with all invalid fields):

| patid | dob        | sex | height | score | visit    |
| ----- | ---------- | --- | ------ | ----- | -------- |
| 1072  | 16-01-1981 | M   | 154    | 1     | fup_12_m |
| 1091  | 20-12-1991 | M   | 168    | 4     | baseline |
| 1126  | 25-12-1999 | M   | 181    | 3     | fup_6_m  |
| 1139  | 14-04-1980 | Y   | 165    | 1     | baseline |
| 1171  | 05-11-1984 | M   | 192    | 2     | fup_12_m |
| 1237  | 17-08-1983 | F   | 170    | 3     | fup_6_m  |
| 1334  | 26-08-1985 | F   | 160    | 5     | fup_6_m  |
| 1365  | 14-09-1976 | M   | 184    | 3     | fup_24_m |
| 1384  | 28-12-1993 | F   | 152    | 1     | baseline |
| 1456  | 27-09-1998 | F   | 164    | 5     | fup_12_m |
| a     | 1/1/2000   | N   | b      | -1    | foo      |

How about a nice CSV report of all errors, like:

| Row_num | patid | dob      | sex | height | score | visit    |
| ------- | ----- | -------- | --- | ------ | ----- | -------- |
|       4 |       |          | Y   |        |       |          |
|       7 |       |          |     |        | 5     |          |
|       8 |       |          |     |        |       | fup_24_m |
|      10 |       |          |     |        | 5     |          |
|      11 | a     | 1/1/2000 | N   | b      | -1    | foo      |

I don't know Pandas, so if you aren't locked into using it, here's how to validate each column in each row:

    import csv
    import sys

    from datetime import datetime
    from typing import Any, Callable

    # Map an invalid value to its column, which is mapped to the number of the
    # row where the errors were found:
    #
    #   {
    #       4: {"sex": "Y"},
    #       7: {"score": "5"},
    #       8: {"visit": "fup_24_m"},
    #       10: {"score": "5"},
    #       11: {
    #           "patid": "a",
    #           "dob": "1/1/2000",
    #           "sex": "N",
    #           "height": "b",
    #           "score": "-1",
    #           "visit": "foo",
    #       },
    #   }
    #
    errors_map: dict[int, dict[str, str]] = {}


    def update(row_num: int, col_name: str, value: str):
        """Map col_name/bad-value pair to its row number."""
        if row_num not in errors_map:
            errors_map[row_num] = {}
        errors_map[row_num][col_name] = value


    def validate_int(x: str):
        return int(x)


    def validate_dt(x: str, format_str: str):
        return datetime.strptime(x, format_str)


    def validate_strs(x: str, strs: list[str]):
        if x not in strs:
            raise ValueError
        return x


    def validate_ints(x: str, ints: list[int]):
        x_int = validate_int(x)
        if x_int not in ints:
            raise ValueError
        return x_int


    v_int = validate_int
    v_dt = validate_dt
    v_ints = validate_ints
    v_strs = validate_strs

    validators: dict[str, Callable[[str], Any]] = {
        "patid": lambda x: v_int(x),
        "dob": lambda x: v_dt(x, r"%d-%m-%Y"),
        "sex": lambda x: v_strs(x, ["M", "F"]),
        "height": lambda x: v_int(x),
        "score": lambda x: v_ints(x, [0, 1, 2, 3, 4]),
        "visit": lambda x: v_strs(x, ["baseline", "fup_6_m", "fup_12_m"]),
    }

    col_names = list(validators.keys())


    with open("input.csv", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)

        if header != col_names:
            print(f"unexpected header: {header}", file=sys.stderr)
            sys.exit(1)

        for i, row in enumerate(reader, start=1):
            for j, col_name in enumerate(col_names):
                val = row[j]
                validator = validators[col_name]
                try:
                    validator(val)
                except:
                    update(i, col_name, val)


    writer = csv.DictWriter(sys.stdout, fieldnames=["Row_num"] + col_names)
    writer.writeheader()
    for row_num, errors in errors_map.items():
        errors["Row_num"] = str(row_num)
        writer.writerow(errors)

This approach relies on the validator functions raising some exception, either implicitly (e.g., `int(patid)` if `patid` isn't an integer) or explicitly (e.g., `raise ValueError` in `validate_ints()` if the parsed int is out of bounds). As-is, all expected exceptions happen to be `ValueError`s.
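To make that contract concrete, here is a tiny standalone demo (the `validate_strs` body is taken from the answer's code; the error messages are my own addition):

```python
# Minimal demo of the exception-driven contract: a validator either returns
# the parsed value or raises, and the caller records whatever raised.
def validate_strs(x: str, strs: list[str]) -> str:
    if x not in strs:
        raise ValueError(f"{x!r} not in {strs}")
    return x

visits = ["baseline", "fup_6_m", "fup_12_m"]

# a valid value passes through unchanged
assert validate_strs("baseline", visits) == "baseline"

caught = []

# an invalid enum value raises explicitly...
try:
    validate_strs("fup_24_m", visits)
except ValueError as e:
    caught.append(str(e))

# ...and a non-numeric patid raises implicitly via int()
try:
    int("a")
except ValueError as e:
    caught.append(str(e))

print(caught)
```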

huangapple
  • Posted on 2023-06-02 05:49:08
  • Please keep this link when reposting: https://go.coder-hub.com/76385931.html