2023-05-14 17:52:34

How do I drop and change dtype in a Pipeline with sklearn?

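# Question

I have some scraped data that needs cleaning. After the cleaning, I want to create numerical and categorical pipelines inside a `ColumnTransformer`, such as:

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# df is the scraped DataFrame after cleaning
categorical_cols = df.select_dtypes(include='object').columns
numerical_cols = df.select_dtypes(exclude='object').columns

num_pipeline = Pipeline(
    steps=[
        ('scaler', StandardScaler())
    ]
)

cat_pipeline = Pipeline(
    steps=[
        ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))
    ]
)

preprocessor = ColumnTransformer([
    ('num_pipeline', num_pipeline, numerical_cols),
    ('cat_pipeline', cat_pipeline, categorical_cols)
])
```

My idea was to create a transformer, `class Transformer(BaseEstimator, TransformerMixin):`, and build a pipeline with it. That transformer would include all the cleaning steps. My problem is that some of the steps change the dtype, mostly from object to integer, so I'm thinking of defining `categorical_cols` and `numerical_cols` by column name instead of by dtype.

Would that be the correct approach? The idea is to automate the process so I can train the model with new data every time.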



# Answer 1
**Score**: 2

Instead of making a list of columns beforehand, you can use scikit-learn's `make_column_selector` to dynamically specify the columns that each transformer will be applied to.

In your example:
```python
from sklearn.compose import ColumnTransformer, make_column_selector as selector

# num_pipeline and cat_pipeline are the pipelines defined in the question
preprocessor = ColumnTransformer([
    ('num_pipeline', num_pipeline, selector(dtype_exclude=object)),
    ('cat_pipeline', cat_pipeline, selector(dtype_include=object))
])
```

Under the hood it uses pandas' `select_dtypes` for the type selection. You can also pass a regex and select based on column name.

I also recommend checking out `make_column_transformer`, a shorthand for building a `ColumnTransformer` without naming each step.
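To connect this with the custom cleaning transformer from the question, here is a minimal sketch, not a definitive implementation. It assumes the cleaning step returns a pandas DataFrame, so the selectors are resolved at fit time on the cleaned data and a column whose dtype changed during cleaning is routed to the right branch. The `PriceCleaner` class, the column names, and the sample data are all hypothetical.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class PriceCleaner(BaseEstimator, TransformerMixin):
    """Hypothetical cleaning step: strips a currency symbol and casts to float."""

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        X = X.copy()
        X["price"] = X["price"].str.replace("$", "", regex=False).astype(float)
        return X  # still a DataFrame, so dtype-based selection keeps working


# Made-up scraped data; explicit object dtype mimics raw text columns.
df = pd.DataFrame({
    "price": pd.Series(["$10", "$25", "$40"], dtype=object),
    "rooms": [1, 2, 3],
    "city": pd.Series(["Oslo", "Rome", "Oslo"], dtype=object),
})

num_pipeline = Pipeline(steps=[("scaler", StandardScaler())])
cat_pipeline = Pipeline(steps=[("onehotencoder", OneHotEncoder(handle_unknown="ignore"))])

preprocessor = ColumnTransformer([
    ("num_pipeline", num_pipeline, selector(dtype_exclude=object)),
    ("cat_pipeline", cat_pipeline, selector(dtype_include=object)),
])

# Cleaning runs first, then the ColumnTransformer sees the cleaned dtypes.
pipeline = Pipeline(steps=[
    ("cleaner", PriceCleaner()),
    ("preprocessor", preprocessor),
])

features = pipeline.fit_transform(df)

# Columns each branch received, as resolved by the selectors during fit.
fitted = pipeline.named_steps["preprocessor"]
num_cols = list(fitted.transformers_[0][2])
cat_cols = list(fitted.transformers_[1][2])
```

Here `price` starts out as an object column but lands in the numerical branch, because the selector only looks at the frame produced by the cleaning step. A name-based variant would pass a regex through the `pattern` argument of `make_column_selector` instead of a dtype.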


# Answer 2
**Score**: 1


Your approach is fine. As you said, the dtypes change, and in many cases you have to encode the data before you can use it. To keep this from causing trouble, label the columns as categorical or numerical yourself, then change their types as you wish, for example with `LabelEncoder`. In many situations a single missing value turns an integer column into a float or object column, which makes reporting results painful. So forget about total automation here: work out each column's dtype, save it, and then give the data to the pipeline.

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Define numerical and categorical columns by name
# (the ... placeholders stand for the rest of your column names)
numerical_cols = ['numerical_col_1', 'numerical_col_2', ...]
categorical_cols = ['categorical_col_1', 'categorical_col_2', ...]

num_pipeline = Pipeline(steps=[('scaler', StandardScaler())])
cat_pipeline = Pipeline(steps=[('onehotencoder', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer([
    ('num_pipeline', num_pipeline, numerical_cols),
    ('cat_pipeline', cat_pipeline, categorical_cols)
])
```

With this modification, you can update the `numerical_cols` and `categorical_cols` lists whenever you have new data with different columns, and the pipeline will adapt accordingly.

You can always use a method like the following to find each column's dtype:

```python
# Collect the columns that cannot be cast to int once rows with missing values are dropped
non_integer_columns = []
new_data = data.dropna().copy()  # copy so the casts below do not touch `data`
for col in new_data.columns:
    try:
        new_data[col] = new_data[col].astype(int)
    except (ValueError, TypeError):
        non_integer_columns.append(col)
```
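The idea of working out each column's kind once and saving it could be sketched as follows. This is a rough illustration under assumptions, not the answerer's exact method: `infer_column_kinds` is a hypothetical helper, the sample data is made up, and the check uses `pd.to_numeric` on the non-missing values so that a gap in a column does not change how it is classified.

```python
import pandas as pd


def infer_column_kinds(df):
    """Hypothetical helper: label each column 'numerical' or 'categorical'."""
    kinds = {}
    for col in df.columns:
        values = df[col].dropna()  # ignore gaps when deciding the kind
        try:
            pd.to_numeric(values)  # raises if any value is not number-like
            kinds[col] = "numerical"
        except (ValueError, TypeError):
            kinds[col] = "categorical"
    return kinds


# Made-up scraped data where every column arrives as object.
raw = pd.DataFrame({
    "rooms": pd.Series([1, 2, None], dtype=object),           # integers with a gap
    "price": pd.Series(["10", "25.5", "40"], dtype=object),   # numbers scraped as text
    "city": pd.Series(["Oslo", None, "Rome"], dtype=object),  # genuine text
})

kinds = infer_column_kinds(raw)
numerical_cols = [col for col, kind in kinds.items() if kind == "numerical"]
categorical_cols = [col for col, kind in kinds.items() if kind == "categorical"]

# A plain integer column with a gap is stored as float64 by pandas;
# the nullable "Int64" dtype keeps the integers and marks the gap as <NA>.
with_gap = pd.Series([1, 2, None])
rooms_int = pd.to_numeric(raw["rooms"]).astype("Int64")
```

The resulting `numerical_cols` and `categorical_cols` lists can be saved and passed to the `ColumnTransformer` by name, so later batches are routed the same way even if a missing value changes a column's dtype.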
