How can I use Pandera to check a pandas column that may contain floats or ints?
Question
I am trying to set up a DataFrameSchema in Pandera. The catch is that one of the columns of data may be a float or an int, depending on what data source was used to create the dataframe. Is there a way to set up a check on such a column? This code failed:
import pandera as pa
from pandera.typing import DataFrame, Series
from datetime import datetime
import pandas as pd
class IngestSchema(pa.SchemaModel):
    column_header: Series[float | int] = pa.Field(alias='MY HEADER')
Other things I've tried:
from typing import Union
float_int = Union[float, int]
But pandera does not recognize that union as a datatype. Is there any way to set up such a schema?
Answer 1
Score: 1
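Digging into their docs, they have an is_numeric check that tests whether a dtype is a _Number datatype. It's a private variable at the moment, so it may only become usable somewhere down the line. In the meantime you can go with the suggested workaround, a custom check built on pandas' is_numeric_dtype:
from pandas.api.types import is_numeric_dtype
import pandera as pa
import pandas as pd

# Series-level check: passes when the whole column has a numeric dtype (int or float)
is_number = pa.Check(is_numeric_dtype, name="is_number")
schema = pa.DataFrameSchema({"column": pa.Column(checks=is_number)})
# fails validation: the mixed column is stored with object dtype
schema(pd.DataFrame({"column": [1, 2, "a"]}))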
I see you're using SchemaModel, which I'm not very familiar with. I tested the following locally and it worked, though I'm not certain about the bare Series annotation:
import pandas as pd
import pandera as pa
from pandera.typing import Series
from pandas.api.types import is_numeric_dtype

class IngestSchema(pa.DataFrameModel):
    column_header: Series

    @pa.check("column_header")
    def check_is_number(cls, column_header: Series):
        # True when the column has a numeric dtype (int or float)
        return is_numeric_dtype(column_header)

# flags it
IngestSchema(pd.DataFrame({"column_header": [1, 2, "a"]}))
# passes
IngestSchema(pd.DataFrame({"column_header": [1, 2, 3]}))
Note that pa.DataFrameModel is the updated syntax and SchemaModel serves as an alias for it. SchemaModel will be deprecated in version 0.20.0, as mentioned in the docs.