How can I use Pandera to check a Pandas column that might have floats or ints

Question

I am trying to set up a DataFrameSchema in Pandera. The catch is that one of the columns of data may be a float or an int, depending on what data source was used to create the dataframe. Is there a way to set up a check on such a column? This code failed:

    import pandera as pa
    from pandera.typing import DataFrame, Series
    from datetime import datetime
    import pandas as pd

    class IngestSchema(pa.SchemaModel):
        column_header: Series[float | int] = pa.Field(alias='MY HEADER')

Other things I've tried:

    from typing import Union

    float_int = Union[float, int]

But pandera does not recognize that union as a datatype. Is there any way to set up such a schema?

Answer 1

Score: 1

Digging into pandera's docs, I found is_numeric, which checks whether something is a _Number data type. It is private at the moment, so it may be exposed in a later release. In the meantime, you can go with the suggested workaround:

    from pandas.api.types import is_numeric_dtype
    import pandera as pa
    import pandas as pd

    # Column-level check: passes when the whole column has a numeric dtype
    is_number = pa.Check(is_numeric_dtype, name="is_number")
    schema = pa.DataFrameSchema({"column": pa.Column(checks=is_number)})

    # The "a" makes the column object dtype, so validation fails here
    schema(pd.DataFrame({"column": [1, 2, "a"]}))
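
To see why this workaround covers both cases from the question, here is a small pandas-only sketch of what is_numeric_dtype returns for a few column dtypes. The series names are made up for illustration. One thing to be aware of: bool columns also count as numeric, which may or may not be what you want.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Illustrative columns; the names are hypothetical.
int_col = pd.Series([1, 2, 3])           # int64
float_col = pd.Series([1.0, 2.5, 3.0])   # float64
mixed_col = pd.Series([1, 2, "a"])       # object, because of the string
bool_col = pd.Series([True, False])      # bool

results = {
    "int": is_numeric_dtype(int_col),
    "float": is_numeric_dtype(float_col),
    "mixed": is_numeric_dtype(mixed_col),
    "bool": is_numeric_dtype(bool_col),
}
print(results)
# {'int': True, 'float': True, 'mixed': False, 'bool': True}
```

The check looks at the dtype of the whole column, not at individual values, so a single string anywhere in the column is enough to fail it.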

I see you're using SchemaModel, which I'm not very familiar with. I tested the following locally and it worked, though (with the caveat that I'm unsure about the bare Series annotation):

    import pandas as pd
    import pandera as pa
    from pandera.typing import Series
    from pandas.api.types import is_numeric_dtype

    class IngestSchema(pa.DataFrameModel):
        column_header: Series

        @pa.check("column_header")
        def check_is_number(cls, column_header: Series):
            return is_numeric_dtype(column_header)

    # flags it
    IngestSchema(pd.DataFrame({"column_header": [1, 2, "a"]}))
    # passes
    IngestSchema(pd.DataFrame({"column_header": [1, 2, 3]}))

Note that pa.DataFrameModel is the updated syntax and SchemaModel serves as an alias for it. SchemaModel will be deprecated in version 0.20.0 as mentioned in the docs.

huangapple
  • Posted on August 8, 2023, 22:41:58
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76860646.html