June 8, 2023 22:51:49

A pandera DataFrame Schema with special characters in column names

Question

I received a dataframe from an institute, and some of its column names contain special characters that are not allowed in Python identifiers. I would like to use pandera's DataFrameModel, NOT DataFrameSchema, to define a schema that validates the dataset. The problem is that I cannot use column names with special characters as Python attribute names.

Here is a minimal working example. With the DataFrameSchema class, I would simply do this:

import pandas as pd
import pandera as pa
from pandera import DataFrameSchema, Column

schema = DataFrameSchema(
    columns={
        "Time.Phase": Column(
            dtype=float,
            nullable=False,
            unique=False,
            coerce=True,
            required=True, 
            checks=pa.Check.greater_than_or_equal_to(min_value=0.1),
            description="Time measurement in seconds within the selected phase."
        ),
        "Phase._dynamic": Column(
            dtype=float,
            nullable=False,
            unique=False,
            coerce=True,
            required=True, 
            checks=pa.Check.greater_than_or_equal_to(min_value=0.5),
            description="Measurement of phase dynamics."
        )
    }
)

valid_data = pd.DataFrame.from_records([
    {"Time.Phase": 0.1, "Phase._dynamic": 0.5},
    {"Time.Phase": 0.2, "Phase._dynamic": 0.75}
])

invalid_data = pd.DataFrame.from_records([
    {"Time.Phase": 0.1, "Phase._dynamic": -0.5},
    {"Time.Phase": 0.2, "Phase._dynamic": 0.75}
])

schema.validate(valid_data, lazy=True)

try:
    schema.validate(invalid_data, lazy=True)
except pa.errors.SchemaErrors as exc:
    # display() exists only in Jupyter/IPython; use print() in a plain script.
    display(exc.failure_cases)

Doing the same with the DataFrameModel class would look like this:

import pandera as pa
from pandera.typing import Series

class Schema(pa.DataFrameModel):
    Time.Phase: Series[float] = pa.Field(
        nullable=False,
        unique=False,
        coerce=True,
        description="Time measurement in seconds within the selected phase."
    )
    Phase._dynamic: Series[float] = pa.Field(
        nullable=False,
        unique=False,
        coerce=True,
        description="Measurement of phase dynamics."
    )

However, Time.Phase and Phase._dynamic are NOT valid Python identifiers, so they cannot be used as attribute names for the columns, and running the code raises a NameError.
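One way to see which column names cannot be written directly as class attributes is to test them with str.isidentifier() and the keyword module. The sketch below only illustrates that check; the helper name needs_alias and the sample list are made up for this example and are not part of pandera.

```python
import keyword


def needs_alias(column_name: str) -> bool:
    """Return True when the name cannot be used as a plain class attribute."""
    # A usable attribute name must be an identifier and must not be a reserved keyword.
    return not column_name.isidentifier() or keyword.iskeyword(column_name)


# Sample names: two from the dataset plus one ordinary identifier for contrast.
columns = ["Time.Phase", "Phase._dynamic", "time_phase"]
flagged = [name for name in columns if needs_alias(name)]
print(flagged)  # ['Time.Phase', 'Phase._dynamic']
```

Both dataset columns are flagged because of the dot, while time_phase passes.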

Here is what I tried. I first created the Python file institute_data_columns.py with the code below as content.

class InstituteDataColumns:
    time_phase_1 = "Time.Phase"
    phase_dynamic = "Phase._dynamic"

Next, I created another Python file, institute_data_schema.py with the following content:

from institute_data_columns import InstituteDataColumns
from pandera.typing import Series
import pandera as pa

class Schema(pa.DataFrameModel):
    InstituteDataColumns.time_phase_1: Series[float] = pa.Field(ge=0.1,
        nullable=False,
        coerce=True, 
        description="Time measurement in seconds within the selected phase."
    )
    InstituteDataColumns.phase_dynamic: Series[float] = pa.Field(
        ge=0.5,
        nullable=False,
        coerce=True,
        description="Measurement of phase dynamics."
    )

Schema.validate(valid_data, lazy=True)

try:
    Schema.validate(invalid_data, lazy=True)
except pa.errors.SchemaErrors as exc:
    display(exc.failure_cases)

Note that validation passes for both valid_data and invalid_data, which should not happen. Moreover, I cannot retrieve the column names defined in the Schema class: Schema._collect_fields() returns {}, whereas schema.columns on the DataFrameSchema version is populated. Is there a way to retrieve the column names from the Schema class?
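A likely explanation for the empty result is how Python treats an annotated assignment whose target is an attribute rather than a simple name. The value is assigned to that attribute on the other class, and nothing is recorded in the enclosing class's annotations. Assuming pandera builds its fields from those annotations, the model ends up with no columns to check. The sketch below reproduces the effect with plain classes; ColumnNames and Model are stand-ins invented for this example.

```python
class ColumnNames:
    time_phase = "Time.Phase"


class Model:
    # Attribute target: the value is assigned to ColumnNames.time_phase,
    # and no entry is added to Model.__annotations__.
    ColumnNames.time_phase: float = "field placeholder"

    # Simple-name target: recorded in Model.__annotations__.
    phase_dynamic: float = 0.5


recorded = list(Model.__annotations__)
print(recorded)                # ['phase_dynamic']
print(ColumnNames.time_phase)  # field placeholder (the original string was overwritten)
```

Only the simple-name annotation is recorded, and the column-name constant has been silently replaced.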

I know that I can rename the columns with pandas like this:

import pandas as pd
pd.DataFrame(valid_data).rename(columns={
    "Time.Phase": "time_phase", 
    "Phase._dynamic": "phase_dynamic"
})

But renaming the columns is not allowed in my case.

Answer 1

Score: 0

I found the answer on this pandera documentation page. All I needed to do was use the alias keyword of pandera.Field and set it to the column name containing the unsupported characters. The complete code is below:

from pandera.typing import Series
import pandera as pa

class InstituteDataColumns:
    time_phase_1 = "Time.Phase"
    phase_dynamic = "Phase._dynamic"

class Schema(pa.DataFrameModel):
    time_phase: Series[float] = pa.Field(
        ge=0.1,
        alias=InstituteDataColumns.time_phase_1,
        nullable=False,
        coerce=True,
        description="Time measurement in seconds within the selected phase."
    )
    phase_dynamic: Series[float] = pa.Field(
        ge=0.5,
        alias=InstituteDataColumns.phase_dynamic,
        nullable=False,
        coerce=True,
        description="Measurement of phase dynamics."
    )

Schema.validate(valid_data, lazy=True)

try:
    Schema.validate(invalid_data, lazy=True)
except pa.errors.SchemaErrors as exc:
    display(exc.failure_cases)

Schema._collect_fields() should now return the fields as expected.
