A pandera DataFrame Schema with special characters in column names
Question
I have received a dataframe from an institute, and the column names contain special characters that are not allowed in Python identifiers. I would like to use DataFrameModel, and NOT DataFrameSchema, in pandera to create a schema that validates the dataset. The problem is that I cannot use these column names as Python variable names.

Here is a minimal working example. Suppose I want to use the DataFrameSchema class. In that case, I would just do this:

import pandas as pd
import pandera as pa
from pandera import DataFrameSchema, Column

schema = DataFrameSchema(
    columns={
        "Time.Phase": Column(
            dtype=float,
            nullable=False,
            unique=False,
            coerce=True,
            required=True,
            checks=pa.Check.greater_than_or_equal_to(min_value=0.1),
            description="Time measurement in seconds within the selected phase."
        ),
        "Phase._dynamic": Column(
            dtype=float,
            nullable=False,
            unique=False,
            coerce=True,
            required=True,
            checks=pa.Check.greater_than_or_equal_to(min_value=0.5),
            description="Measurement of phase dynamics."
        )
    }
)

valid_data = pd.DataFrame.from_records([
    {"Time.Phase": 0.1, "Phase._dynamic": 0.5},
    {"Time.Phase": 0.2, "Phase._dynamic": 0.75}
])
invalid_data = pd.DataFrame.from_records([
    {"Time.Phase": 0.1, "Phase._dynamic": -0.5},
    {"Time.Phase": 0.2, "Phase._dynamic": 0.75}
])

schema.validate(valid_data, lazy=True)
try:
    schema.validate(invalid_data, lazy=True)
except pa.errors.SchemaErrors as exc:
    # display() is available in Jupyter/IPython; use print() elsewhere
    display(exc.failure_cases)

If I do something similar using the DataFrameModel class, it is supposed to look like this:

import pandera as pa
from pandera.typing import Series

class Schema(pa.DataFrameModel):
    Time.Phase: Series[float] = pa.Field(
        nullable=False,
        unique=False,
        coerce=True,
        description="Time measurement in seconds within the selected phase."
    )
    Phase._dynamic: Series[float] = pa.Field(
        nullable=False,
        unique=False,
        coerce=True,
        description="Measurement of phase dynamics."
    )

However, Time.Phase and Phase._dynamic are NOT valid Python identifiers, so they cannot be used as attribute names for the columns, and running the code raises a NameError.
Here is what I tried. I first created the Python file institute_data_columns.py with the code below as its content.

class InstituteDataColumns:
    time_phase_1 = "Time.Phase"
    phase_dynamic = "Phase._dynamic"

Next, I created another Python file, institute_data_schema.py, with the following content:

from institute_data_columns import InstituteDataColumns
from pandera.typing import Series
import pandera as pa

class Schema(pa.DataFrameModel):
    InstituteDataColumns.time_phase_1: Series[float] = pa.Field(
        ge=0.1,
        nullable=False,
        coerce=True,
        description="Time measurement in seconds within the selected phase."
    )
    InstituteDataColumns.phase_dynamic: Series[float] = pa.Field(
        ge=0.5,
        nullable=False,
        coerce=True,
        description="Measurement of phase dynamics."
    )

Schema.validate(valid_data, lazy=True)
try:
    Schema.validate(invalid_data, lazy=True)
except pa.errors.SchemaErrors as exc:
    display(exc.failure_cases)

Note that in the try-except block both valid_data and invalid_data pass validation, which should not be the case. Moreover, I cannot retrieve the column names defined in the Schema class: Schema._collect_fields() returns {}, whereas schema.columns does return the columns. Is there a way to retrieve the column names from the Schema class?
I know that I can use pandas to rename the columns like this:

import pandas as pd

pd.DataFrame(valid_data).rename(columns={
    "Time.Phase": "time_phase",
    "Phase._dynamic": "phase_dynamic"
})

But that is not allowed either.
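
The symptoms described in the question (both frames pass, and no fields are collected) are consistent with how Python treats an annotated assignment whose target is an attribute rather than a plain name: the value is stored on the referenced object and nothing is recorded in the enclosing class's annotations. The sketch below shows this with plain Python only. `ColumnNames` and `Model` are hypothetical stand-ins, and it is an assumption here (not something stated in the question) that `DataFrameModel` discovers its fields from class-level annotations.

```python
class ColumnNames:
    time_phase = "Time.Phase"


class Model:
    # Attribute target: the value is written onto ColumnNames, and no
    # annotation is recorded on Model.
    ColumnNames.time_phase: float = 0.1

    # Plain-name target: this is the form that records an annotation.
    phase_dynamic: float = 0.5


# Only the plain-name annotation is visible on the class.
model_annotations = dict(getattr(Model, "__annotations__", {}))

# The original column-name string has been overwritten as a side effect.
overwritten_value = ColumnNames.time_phase

print(model_annotations)
print(overwritten_value)
```

Under that assumption, the `Schema` class in the question ends up with no fields at all, so every dataframe validates, and `InstituteDataColumns.time_phase_1` no longer holds the string `"Time.Phase"` after the class body runs.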
Answer 1
Score: 0
I found the answer in the pandera documentation. All I needed to do was use the alias keyword of pandera.Field and set it to the column name containing the unsupported characters. The complete code is below:

from pandera.typing import Series
import pandera as pa

class InstituteDataColumns:
    time_phase_1 = "Time.Phase"
    phase_dynamic = "Phase._dynamic"

class Schema(pa.DataFrameModel):
    # The attribute name is a valid identifier; alias carries the real column name.
    time_phase: Series[float] = pa.Field(
        ge=0.1,
        alias=InstituteDataColumns.time_phase_1,
        nullable=False,
        coerce=True,
        description="Time measurement in seconds within the selected phase."
    )
    phase_dynamic: Series[float] = pa.Field(
        ge=0.5,
        alias=InstituteDataColumns.phase_dynamic,
        nullable=False,
        coerce=True,
        description="Measurement of phase dynamics."
    )

# valid_data and invalid_data are the dataframes defined in the question.
Schema.validate(valid_data, lazy=True)
try:
    Schema.validate(invalid_data, lazy=True)
except pa.errors.SchemaErrors as exc:
    display(exc.failure_cases)

Schema._collect_fields() should now work as expected.