Posted August 8, 2023, 22:41:58

How can I use Pandera to check a Pandas column that might have floats or ints

Question

I am trying to set up a DataFrameSchema in Pandera. The catch is that one of the columns of data may be a float or an int, depending on what data source was used to create the dataframe. Is there a way to set up a check on such a column? This code failed:

import pandera as pa
from pandera.typing import DataFrame, Series
from datetime import datetime
import pandas as pd
class IngestSchema(pa.SchemaModel):
    column_header: Series[float | int] = pa.Field(alias='MY HEADER')

Other things I've tried:

from typing import Union
float_int = Union[float, int]

But pandera does not recognize that union as a datatype. Is there any way to set up such a schema?

Answer 1

Score: 1

Digging into their docs, they have an is_numeric that checks whether something is a _Number datatype. It's private at the moment, though, so maybe it gets exposed someday down the line. In the meantime, you can go with the suggested workaround:

import pandas as pd
import pandera as pa
from pandas.api.types import is_numeric_dtype

# Passes only if the column's dtype is numeric (int or float).
is_number = pa.Check(is_numeric_dtype, name="is_number")
schema = pa.DataFrameSchema({"column": pa.Column(checks=is_number)})

# Fails validation: the "a" gives the column object dtype.
schema(pd.DataFrame({"column": [1, 2, "a"]}))
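
For what it's worth, here is a quick pandas-only sketch of why that check covers both cases: is_numeric_dtype looks at the dtype of the whole column, not at individual values. The variable names are made up for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Three hypothetical columns, one per dtype case.
int_col = pd.Series([1, 2, 3])          # inferred as int64
float_col = pd.Series([1.0, 2.5, 3.0])  # inferred as float64
mixed_col = pd.Series([1, 2, "a"])      # inferred as object

# is_numeric_dtype inspects the column dtype, so int and float both qualify.
results = {
    "int": is_numeric_dtype(int_col),
    "float": is_numeric_dtype(float_col),
    "mixed": is_numeric_dtype(mixed_col),
}
print(results)
```

Because the check is dtype-based, a single stray string is enough to turn the column into object dtype and fail it.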

I see you're using the SchemaModel, which I'm not very familiar with. I tested this locally and it worked, though (with the caveat that I'm unsure about the bare Series annotation):

import pandas as pd
import pandera as pa
from pandas.api.types import is_numeric_dtype
from pandera.typing import Series


class IngestSchema(pa.DataFrameModel):
    column_header: Series

    @pa.check("column_header")
    def check_is_number(cls, column_header: Series) -> bool:
        # True when the column's dtype is int or float
        return is_numeric_dtype(column_header)


# flags it (object dtype)
IngestSchema(pd.DataFrame({"column_header": [1, 2, "a"]}))
# passes
IngestSchema(pd.DataFrame({"column_header": [1, 2, 3]}))

Note that pa.DataFrameModel is the updated syntax and SchemaModel serves as an alias for it. SchemaModel will be deprecated in version 0.20.0 as mentioned in the docs.
