How can I use Pandera to check a Pandas column that might have floats or ints

Question

I am trying to set up a DataFrameSchema in Pandera. The catch is that one of the columns of data may be a float or an int, depending on what data source was used to create the dataframe. Is there a way to set up a check on such a column? This code failed:

import pandera as pa
from pandera.typing import DataFrame, Series
from datetime import datetime
import pandas as pd

class IngestSchema(pa.SchemaModel):
    column_header: Series[float | int] = pa.Field(alias = 'MY HEADER')

Other things I've tried:

from typing import Union
float_int = Union[float, int]

But pandera does not recognize that union as a datatype. Is there any way to set up such a schema?

Answer 1

Score: 1

Digging into their docs, they have an is_numeric function that checks whether a column is a _Number datatype. But it's private at the moment, so maybe it will be exposed someday down the line. In the meantime, you can go with the suggested workaround:

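from pandas.api.types import is_numeric_dtype
import pandera as pa
import pandas as pd

# Column-level check: passes only when the whole column has a numeric dtype
is_number = pa.Check(is_numeric_dtype, name="is_number")
schema = pa.DataFrameSchema({"column": pa.Column(checks=is_number)})

# Fails validation: the "a" makes this an object-dtype column
schema(pd.DataFrame({"column": [1, 2, "a"]}))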
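
To see why this workaround accepts either ints or floats, here is a minimal sketch of what is_numeric_dtype returns on its own, outside of Pandera. The column contents and variable names are made up for illustration; the point is only that the function looks at the dtype of the whole column, not at individual values.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Illustrative columns (hypothetical data, not from the original question)
int_col = pd.Series([1, 2, 3])            # integer dtype
float_col = pd.Series([1.0, 2.5, 3.0])    # float dtype
mixed_col = pd.Series([1, 2, "a"])        # object dtype because of the string

# is_numeric_dtype inspects the column's dtype, so one call covers the whole column
results = {
    "int": is_numeric_dtype(int_col),
    "float": is_numeric_dtype(float_col),
    "mixed": is_numeric_dtype(mixed_col),
}
print(results)
```

Because the check is dtype-based, a single stray string turns the column into object dtype and makes the whole column fail, which is the behavior the Pandera check above relies on.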

I see you're using the SchemaModel, which I'm not very familiar with. I tested this locally and it worked, though (with the caveat that I'm not sure about the bare Series annotation):

import pandas as pd
import pandera as pa
from pandera.typing import Series
from pandas.api.types import is_numeric_dtype

class IngestSchema(pa.DataFrameModel):
    column_header: Series

    @pa.check("column_header")
    def check_is_number(cls, column_header: Series):
        return is_numeric_dtype(column_header)

# flags it
IngestSchema(pd.DataFrame({"column_header": [1, 2, "a"]}))

# passes
IngestSchema(pd.DataFrame({"column_header": [1, 2, 3]}))

Note that pa.DataFrameModel is the updated syntax and SchemaModel serves as an alias for it. SchemaModel will be deprecated in version 0.20.0 as mentioned in the docs.

huangapple
  • Posted on August 8, 2023, 22:41:58
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76860646.html