How do I drop and change dtype in a Pipeline with sklearn?

# Question

I have some scraped data that needs cleaning. After the cleaning, I want to create numerical and categorical pipelines inside a `ColumnTransformer`, such as:

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# df is the cleaned DataFrame; columns are split by dtype
categorical_cols = df.select_dtypes(include='object').columns
numerical_cols = df.select_dtypes(exclude='object').columns

num_pipeline = Pipeline(
    steps=[
        ('scaler', StandardScaler())
    ]
)

cat_pipeline = Pipeline(
    steps=[
        ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))
    ]
)

preprocessor = ColumnTransformer([
    ('num_pipeline', num_pipeline, numerical_cols),
    ('cat_pipeline', cat_pipeline, categorical_cols)
])
```

My idea was to create a transformer, `class Transformer(BaseEstimator, TransformerMixin):`, and build a pipeline with it. That transformer would include all the cleaning steps. My problem is that some of the steps change the dtype, mostly from object to integer, so I'm thinking of defining `categorical_cols` and `numerical_cols` by column name instead of by dtype.

Would that be the correct approach? The idea is to automate the process so I can train the model with new data every time.


# Answer 1
**Score**: 2

Instead of making a list of columns beforehand, you can use scikit-learn's `make_column_selector` to dynamically specify the columns that each transformer will be applied to.

In your example:
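```python
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector

# num_pipeline and cat_pipeline are the pipelines defined in the question
preprocessor = ColumnTransformer([
    ('num_pipeline', num_pipeline, selector(dtype_exclude=object)),
    ('cat_pipeline', cat_pipeline, selector(dtype_include=object))
])
```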

Under the hood it uses pandas' select_dtypes for the type selection. You can pass a regex and select based on column name as well.
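As a small illustration of selecting by name, the sketch below calls the selector directly on a toy DataFrame. The column names (`price_usd`, `price_eur`, `city`) and the data are made up for the example; only the `pattern`, `dtype_include` and `dtype_exclude` parameters come from scikit-learn.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical toy data, used only to show what the selector returns
df = pd.DataFrame({
    "price_usd": [10, 20],
    "price_eur": [9, 18],
    "city": ["Oslo", "Rome"],
})

# Select by column name with a regex
price_cols = selector(pattern=r"^price_")(df)

# Combine a name pattern with a dtype filter
numeric_price_cols = selector(pattern=r"^price_", dtype_include="number")(df)

print(price_cols)
print(numeric_price_cols)
```

Inside a `ColumnTransformer` you would pass the selector object itself rather than calling it, so the columns are resolved when the transformer is fitted.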

I also recommend checking out `make_column_transformer`, a shorthand constructor that names the transformers for you.
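To show how this plays with a cleaning step that changes dtypes, here is a minimal sketch. The `clean` function, the `price` and `city` columns and the toy data are all hypothetical. The selectors are evaluated when the column transformer is fitted, so if the cleaning runs as an earlier pipeline step and returns a DataFrame, they see the dtypes produced by the cleaning rather than the raw ones.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler


def clean(df):
    """Hypothetical cleaning step: strip the currency sign and cast to int."""
    df = df.copy()
    df["price"] = df["price"].str.replace("$", "", regex=False).astype(int)
    return df


# Toy scraped data: every value arrives as text, so both columns are object dtype
raw = pd.DataFrame(
    {"price": ["$10", "$20", "$30"], "city": ["Oslo", "Rome", "Oslo"]},
    dtype=object,
)

preprocessor = make_column_transformer(
    (StandardScaler(), selector(dtype_exclude=object)),
    (OneHotEncoder(handle_unknown="ignore"), selector(dtype_include=object)),
)

full_pipeline = Pipeline([
    ("cleaner", FunctionTransformer(clean)),
    ("preprocessor", preprocessor),
])
features = full_pipeline.fit_transform(raw)

# Columns each selector resolved to when the ColumnTransformer was fitted
fitted = full_pipeline.named_steps["preprocessor"]
resolved = {name: list(cols) for name, _, cols in fitted.transformers_}
print(resolved)
```

Because `price` is an integer column by the time the selectors run, it is routed to the scaler even though it was an object column in the raw data.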


# Answer 2
**Score**: 1

The process is OK. As you said, the types change, and in many cases you have to encode the data before you can use it. To keep this from causing trouble, label the columns as categorical or numerical explicitly, then change their types as you wish, for example with `LabelEncoder` (for feature columns, `OrdinalEncoder` is the scikit-learn counterpart). In many situations a missing value turns an integer column into an object column, which makes reporting results miserable.
So forget about total automation here. Instead, determine each column's dtype, save it, and then pass the data to the pipeline.

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Define numerical and categorical columns
numerical_cols = ['numerical_col_1', 'numerical_col_2', ...]
categorical_cols = ['categorical_col_1', 'categorical_col_2', ...]

num_pipeline = Pipeline(steps=[('scaler', StandardScaler())])

cat_pipeline = Pipeline(
    steps=[('onehotencoder', OneHotEncoder(handle_unknown='ignore'))]
)

preprocessor = ColumnTransformer([
    ('num_pipeline', num_pipeline, numerical_cols),
    ('cat_pipeline', cat_pipeline, categorical_cols)
])
```

With this modification, you can update the `numerical_cols` and `categorical_cols` lists whenever new data arrives with different columns, and the pipeline will adapt accordingly.

You can always use a method like the following to find each column's dtype:

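```python
non_integer_columns = []
# Drop rows with missing values; copy so the assignments below do not touch data
new_data = data.dropna().copy()
for col in new_data.columns:
    try:
        # Columns whose values all cast cleanly are treated as integer columns
        new_data[col] = new_data[col].astype(int)
    except (ValueError, TypeError):
        non_integer_columns.append(col)
```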
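A variant of the same idea, sketched under the assumption that numeric values were scraped as text: `pd.to_numeric` with `errors="coerce"` turns anything unparseable into `NaN`, so a column can be classified without dropping rows or catching exceptions. The `rooms` and `city` columns and the data are hypothetical.

```python
import pandas as pd

# Hypothetical scraped data: numbers stored as text, one value missing
data = pd.DataFrame(
    {"rooms": ["3", "4", None], "city": ["Oslo", "Rome", "Oslo"]},
    dtype=object,
)

numerical_cols, categorical_cols = [], []
for col in data.columns:
    non_null = data[col].dropna()
    converted = pd.to_numeric(non_null, errors="coerce")
    # A column counts as numerical only if every non-missing value converts
    if len(non_null) > 0 and converted.notna().all():
        numerical_cols.append(col)
    else:
        categorical_cols.append(col)

print(numerical_cols, categorical_cols)
```

The resulting lists can be saved and passed to the `ColumnTransformer` by name, as in the snippet above.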

huangapple
  • Posted on May 14, 2023, 17:52:34
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76246837.html