How do I drop and change dtype in a Pipeline with sklearn?
# Question
I have some scraped data that needs cleaning. After the cleaning, I want to create numerical and categorical pipelines inside a `ColumnTransformer`, such as:
```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# df is the cleaned DataFrame
categorical_cols = df.select_dtypes(include='object').columns
numerical_cols = df.select_dtypes(exclude='object').columns

num_pipeline = Pipeline(
    steps=[
        ('scaler', StandardScaler())
    ]
)
cat_pipeline = Pipeline(
    steps=[
        ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))
    ]
)
preprocessor = ColumnTransformer([
    ('num_pipeline', num_pipeline, numerical_cols),
    ('cat_pipeline', cat_pipeline, categorical_cols)
])
```
My idea was to write a custom transformer, `class Transformer(BaseEstimator, TransformerMixin):`, that includes all the cleaning steps, and to build a pipeline with it. My problem is that some of the steps change the dtype, mostly from object to integer, so I'm thinking of defining `categorical_cols` and `numerical_cols` by column name instead of by dtype.
Would that be the correct approach? The goal is to automate the process so I can retrain the model whenever new data arrives.
# Answer 1
**Score**: 2
Instead of making a list of columns beforehand, you can use scikit-learn's `make_column_selector` to dynamically specify the columns that each transformer is applied to.
In your example:
```python
from sklearn.compose import ColumnTransformer, make_column_selector as selector

# num_pipeline and cat_pipeline are the pipelines defined in the question
preprocessor = ColumnTransformer([
    ('num_pipeline', num_pipeline, selector(dtype_exclude=object)),
    ('cat_pipeline', cat_pipeline, selector(dtype_include=object))
])
```
Under the hood it uses pandas' `select_dtypes` for the type selection. You can also pass a regex through the `pattern` argument to select by column name.
I also recommend checking out `make_column_transformer`, a shorthand for building the `ColumnTransformer` without naming each transformer.
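The remark about selecting by column name can be sketched as follows. This is a minimal illustration, assuming a hypothetical naming convention in which numerical columns start with `num_` and categorical ones with `cat_`; the DataFrame and its column names are invented for the example.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical data: the "num_" / "cat_" prefixes are an assumed naming convention
df = pd.DataFrame({
    "num_rooms": [3, 2],
    "num_area": [70.5, 55.0],
    "cat_city": ["Oslo", "Lima"],
})

# A selector is a callable: given a DataFrame, it returns the matching column names
select_num = selector(pattern=r"^num_")
select_cat = selector(pattern=r"^cat_")

print(select_num(df))
print(select_cat(df))
```

Because the match is on the name, a cleaning step that changes a column's dtype does not change which pipeline the column is routed to.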
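To connect this with the custom cleaning transformer from the question, here is a rough end-to-end sketch. The `Cleaner` class, its parameters, and the sample columns (`url`, `price`, `city`) are all hypothetical. The point it tries to show is that the selectors are evaluated when the `ColumnTransformer` is fit, so they see the dtypes as they are after the cleaning step has run.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class Cleaner(BaseEstimator, TransformerMixin):
    """Hypothetical cleaning step: drops columns and casts string columns to int."""

    def __init__(self, drop_cols, int_cols):
        self.drop_cols = drop_cols
        self.int_cols = int_cols

    def fit(self, X, y=None):
        return self  # nothing to learn in this sketch

    def transform(self, X):
        X = X.drop(columns=self.drop_cols)  # drop() returns a new DataFrame
        for col in self.int_cols:
            # Strip thousands separators, then change the dtype from object to int
            X[col] = X[col].str.replace(",", "", regex=False).astype(int)
        return X


preprocessor = ColumnTransformer([
    ("num_pipeline", Pipeline([("scaler", StandardScaler())]),
     selector(dtype_exclude=object)),
    ("cat_pipeline", Pipeline([("onehotencoder", OneHotEncoder(handle_unknown="ignore"))]),
     selector(dtype_include=object)),
])

full_pipeline = Pipeline([
    ("cleaner", Cleaner(drop_cols=["url"], int_cols=["price"])),
    ("preprocessor", preprocessor),
])

# Invented sample data; dtype=object mimics raw scraped strings
df = pd.DataFrame({
    "url": ["a.example", "b.example", "c.example"],
    "price": ["1,200", "950", "3,100"],
    "city": ["Oslo", "Lima", "Oslo"],
}, dtype=object)

features = full_pipeline.fit_transform(df)
print(features.shape)
```

Here `price` starts out as an object column, but by the time the selectors run it is an integer column, so it goes to the scaler while `city` goes to the one-hot encoder.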
# Answer 2
**Score**: 1
The approach is fine. As you said, the dtypes change, and in many cases you have to encode the data before you can use it. To keep this from causing trouble, label the columns as `categorical` or `numerical` first, then change their types as needed, for example with `LabelEncoder`. In many situations a missing value stops an `integer` column from being stored as integers (pandas falls back to float, or to `object` when the missing value is a placeholder string), which makes reporting results painful.
So forget about total automation in this area: record each column's dtype, save it, and then pass the data to the pipeline.
```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Define numerical and categorical columns by name (the lists are placeholders)
numerical_cols = ['numerical_col_1', 'numerical_col_2', ...]
categorical_cols = ['categorical_col_1', 'categorical_col_2', ...]

num_pipeline = Pipeline(
    steps=[('scaler', StandardScaler())]
)
cat_pipeline = Pipeline(
    steps=[('onehotencoder', OneHotEncoder(handle_unknown='ignore'))]
)
preprocessor = ColumnTransformer([
    ('num_pipeline', num_pipeline, numerical_cols),
    ('cat_pipeline', cat_pipeline, categorical_cols),
])
```
With this modification, you can update the `numerical_cols` and `categorical_cols` lists whenever new data arrives with different columns; once the pipeline is refit, it adapts accordingly.
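One way to act on the advice about recording each column's dtype is to keep a small schema and apply it to every new batch before it reaches the pipeline. The sketch below is only an illustration: the `schema` dictionary and the column names are hypothetical, and pandas' nullable `Int64` dtype is used here as one possible way to keep an integer column integer when values are missing.

```python
import pandas as pd

# Hypothetical schema, saved once after inspecting the cleaned training data
schema = {"rooms": "Int64", "city": "object"}

# A new batch with a missing value: pandas reads "rooms" as float64 because of the NaN
raw = pd.DataFrame({
    "rooms": [5, None, 1],
    "city": ["Lima", "Oslo", "Lima"],
})
print(raw.dtypes)

# Enforce the saved dtypes; Int64 (capital I) can hold integers alongside <NA>
typed = raw.astype(schema)
print(typed.dtypes)
```

Missing values still have to be handled before a step such as `StandardScaler`, for example with an imputer placed earlier in the numerical pipeline.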
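You can use an approach like the following to find out which columns cannot be cast to integers:
```python
non_integer_columns = []
new_data = data.dropna().copy()  # work on an explicit copy so `data` itself is left untouched
for col in new_data.columns:
    try:
        new_data[col] = new_data[col].astype(int)
    except (ValueError, TypeError):
        # The column holds values that cannot be interpreted as integers
        non_integer_columns.append(col)
```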
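A variant of the same check that avoids `try`/`except` and does not drop rows is sketched below. It uses `pd.to_numeric` with `errors="coerce"`; the helper name `integer_like_columns` and the sample data are made up for illustration. Unlike `astype(int)`, it does not treat columns with fractional values as integers.

```python
import pandas as pd


def integer_like_columns(df):
    """Return the names of columns whose non-missing values are all whole numbers."""
    result = []
    for col in df.columns:
        # Values that cannot be parsed as numbers become NaN
        values = pd.to_numeric(df[col], errors="coerce")
        present = df[col].notna()
        # Every value that was present must parse and have no fractional part
        parsed_ok = values[present].notna().all()
        if parsed_ok and (values[present] % 1 == 0).all():
            result.append(col)
    return result


# Invented sample data; dtype=object mimics raw scraped strings
data = pd.DataFrame({
    "rooms": ["3", "2", None],
    "price": ["1200.5", "950", "3100"],
    "city": ["Oslo", "Lima", "Oslo"],
}, dtype=object)

print(integer_like_columns(data))
```

The resulting list could feed the name-based `numerical_cols` definition, with the remaining columns reviewed by hand.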