Validate CSV by checking if enumeration columns contain any invalid coded values

Question
We receive many different CSV files from external labs and centers. When receiving such a file, we first need to do some QA checks before further processing, to make sure the data is correct, at least on a technical level.
We have some Python scripts to check the number of columns, check date values, min/max ranges, etc. But now we also want to check whether the enumerated columns are correct. So, for example, if a column `visit` is a coded value and may only contain `baseline`, `fup_6_m`, `fup_12_m`, then it shouldn't contain anything else, like `fup_36_m`.
We have the metadata specifications, so the column names and the lists of coded values (a.k.a. enumerations) are known beforehand.
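A minimal sketch of how such a spec might be loaded, assuming a hypothetical metadata.json that maps each column name to its list of allowed codes:
import json

# hypothetical spec file, e.g. {"sex": ["M", "F"], "score": [0, 1, 2, 3, 4], ...}
with open("metadata.json") as f:
    allowed_enum = json.load(f)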
This is the Python script I've got so far:
# check if coded values are correct
import pandas as pd
import io
## load data from csv files
##df = pd.read_csv(r'patlist_mcl2017.csv', sep=",", decimal=".")
# TESTING: create data frame from text
str_patients = """patid,dob,sex,height,score,visit
1072,16-01-1981,M,154,1,fup_12_m
1091,20-12-1991,M,168,4,baseline
1126,25-12-1999,M,181,3,fup_6_m
1139,14-04-1980,Y,165,1,baseline
1171,05-11-1984,M,192,2,fup_12_m
1237,17-08-1983,F,170,3,fup_6_m
1334,26-08-1985,F,160,5,fup_6_m
1365,14-09-1976,M,184,3,fup_24_m
1384,28-12-1993,F,152,1,baseline
1456,27-09-1998,F,164,5,fup_12_m
"""
df = pd.read_csv(io.StringIO(str_patients), sep=",", decimal=".")
print(df)

# allowed values for enumeration columns
allowed_enum = {
    'sex': ['M', 'F'],
    'score': [0, 1, 2, 3, 4],
    'visit': ['baseline', 'fup_6_m', 'fup_12_m']
}

# check enumeration
for column_name, allowed_values in allowed_enum.items():
    df_chk = df[~df[column_name].isin(allowed_values)].groupby(column_name).size().reset_index(name='Count')
    if not df_chk.empty:
        print("Found invalid values for column '%s':" % column_name)
        print(df_chk)
It works, and the output is like this:
Found invalid values for column 'sex':
  sex  Count
0   Y      1
Found invalid values for column 'score':
   score  Count
0      5      2
Found invalid values for column 'visit':
      visit  Count
0  fup_24_m      1
But the different files can contain many columns, and for better reporting we'd like to get the output as one dataframe, so something like this:
  Column_name   Invalid  Count
0         Sex         Y      1
1       Score         5      2
2       visit  fup_24_m      1
So my question is:
- What is the best way to collect the invalid values in a dataframe, like above?
- Or, is there maybe a better way to check/validate these kinds of coded values?
Answer 1
Score: 2
You could try
...
dfs = {
    column_name: df[~df[column_name].isin(allowed_values)]
    .value_counts(subset=column_name)
    .to_frame().reset_index(names="Invalid")
    for column_name, allowed_values in allowed_enum.items()
}
out = pd.concat(dfs, names=("Column_name", None)).droplevel(1)
to get
              Invalid  count
Column_name                 
sex                 Y      1
score               5      2
visit        fup_24_m      1
for the sample dataframe (another `.reset_index` would give you the format in the question).
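For instance, a short continuation (a sketch, reusing `out` from above) to match the layout asked for in the question:
report = out.reset_index()
print(report)
#   Column_name   Invalid  count
# 0         sex         Y      1
# 1       score         5      2
# 2       visit  fup_24_m      1
# (column widths approximate)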
Or, similar to Zach Young's proposal, you could do
...
columns = (
    df.loc[~df[column_name].isin(allowed_values), column_name]
    for column_name, allowed_values in allowed_enum.items()
)
out = pd.concat(columns, axis=1, sort=True)
to get a sub-dataframe which contains only the invalid values:
  sex  score     visit
3   Y    NaN       NaN
6 NaN    5.0       NaN
7 NaN    NaN  fup_24_m
9 NaN    5.0       NaN
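If that per-row overview is what should go into the QA report, it could also be written straight to disk (the file name here is just an assumption):
out.to_csv("invalid_values_report.csv", index_label="Row_num")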
Answer 2
Score: 1
Given the input (I added an extra row with all invalid fields):
| patid | dob | sex | height | score | visit |
| ----- | ---------- | --- | ------ | ----- | -------- |
| 1072 | 16-01-1981 | M | 154 | 1 | fup_12_m |
| 1091 | 20-12-1991 | M | 168 | 4 | baseline |
| 1126 | 25-12-1999 | M | 181 | 3 | fup_6_m |
| 1139 | 14-04-1980 | Y | 165 | 1 | baseline |
| 1171 | 05-11-1984 | M | 192 | 2 | fup_12_m |
| 1237 | 17-08-1983 | F | 170 | 3 | fup_6_m |
| 1334 | 26-08-1985 | F | 160 | 5 | fup_6_m |
| 1365 | 14-09-1976 | M | 184 | 3 | fup_24_m |
| 1384 | 28-12-1993 | F | 152 | 1 | baseline |
| 1456 | 27-09-1998 | F | 164 | 5 | fup_12_m |
| a | 1/1/2000 | N | b | -1 | foo |
How about a nice CSV report of all errors, like:
| Row_num | patid | dob | sex | height | score | visit |
| ------- | ----- | -------- | --- | ------ | ----- | -------- |
| 4 | | | Y | | | |
| 7 | | | | | 5 | |
| 8 | | | | | | fup_24_m |
| 10 | | | | | 5 | |
| 11 | a | 1/1/2000 | N | b | -1 | foo |
I don't know Pandas, so if you aren't locked into using it, here's how to validate each column in each row:
import csv
import sys
from datetime import datetime
from typing import Any, Callable

# Map each row number to the column/invalid-value pairs found in that row:
#
# {
#     4: {"sex": "Y"},
#     7: {"score": "5"},
#     8: {"visit": "fup_24_m"},
#     10: {"score": "5"},
#     11: {
#         "patid": "a",
#         "dob": "1/1/2000",
#         "sex": "N",
#         "height": "b",
#         "score": "-1",
#         "visit": "foo",
#     },
# }
#
errors_map: dict[int, dict[str, str]] = {}

def update(row_num: int, col_name: str, value: str):
    """Map col_name/bad-value pair to its row number."""
    if row_num not in errors_map:
        errors_map[row_num] = {}
    errors_map[row_num][col_name] = value

def validate_int(x: str):
    return int(x)

def validate_dt(x: str, format_str: str):
    return datetime.strptime(x, format_str)

def validate_strs(x: str, strs: list[str]):
    if x not in strs:
        raise ValueError
    return x

def validate_ints(x: str, ints: list[int]):
    x_int = validate_int(x)
    if x_int not in ints:
        raise ValueError
    return x_int

v_int = validate_int
v_dt = validate_dt
v_ints = validate_ints
v_strs = validate_strs

validators: dict[str, Callable[[str], Any]] = {
    "patid": lambda x: v_int(x),
    "dob": lambda x: v_dt(x, r"%d-%m-%Y"),
    "sex": lambda x: v_strs(x, ["M", "F"]),
    "height": lambda x: v_int(x),
    "score": lambda x: v_ints(x, [0, 1, 2, 3, 4]),
    "visit": lambda x: v_strs(x, ["baseline", "fup_6_m", "fup_12_m"]),
}

col_names = list(validators.keys())

with open("input.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    if header != col_names:
        print(f"unexpected header: {header}", file=sys.stderr)
        sys.exit(1)
    for i, row in enumerate(reader, start=1):
        for j, col_name in enumerate(col_names):
            val = row[j]
            validator = validators[col_name]
            try:
                validator(val)
            except Exception:
                update(i, col_name, val)

writer = csv.DictWriter(sys.stdout, fieldnames=["Row_num"] + col_names)
writer.writeheader()
for row_num, errors in errors_map.items():
    errors["Row_num"] = str(row_num)
    writer.writerow(errors)
This approach relies on the validator funcs raising some exception, either implicitly (e.g., `int(patid)` if patid isn't an integer) or explicitly (e.g., `raise ValueError` in validate_ints() if the parsed int is out of bounds). As-is, all expected exceptions happen to be ValueErrors.
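As a quick illustration of those two failure modes (a sketch, assuming the validator definitions above):
try:
    validate_int("a")                    # implicit: int("a") raises ValueError
except ValueError:
    print("patid 'a' rejected")

try:
    validate_ints("5", [0, 1, 2, 3, 4])  # explicit: parsed 5 is not in the allowed list
except ValueError:
    print("score '5' rejected")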