July 24, 2023, 17:07:34

Error while performing TfidfVectorizer() on the training values

Question

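# Model creation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
# Every column except the label column 'v1' is used as a feature
X = df.drop(columns='v1')
y = df['v1']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)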
pipe_gnb = Pipeline([
    ('vect', TfidfVectorizer()),
    ('gnb', GaussianNB())
])
params_gnb = {
        # 'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
        # 'vect__max_df': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
        # 'vect__min_df': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
        # 'vect__max_features': [3000, 4000, 5000, None],
        # 'vect__binary': [True, False],
        'vect__sublinear_tf': [True, False]
}
gs_gnb = GridSearchCV(pipe_gnb, params_gnb, verbose=10, cv=5, n_jobs=-1, scoring='accuracy')
gs_gnb.fit(X_train, y_train)

I have been trying to apply TfidfVectorizer to the 'v2' column (the messages column in the attached dataset), but I keep getting the error below. It says that the data passed to fit (the training values X, y) is a sparse matrix and must be converted to a dense array.

Code link: https://colab.research.google.com/drive/1mZTyTIVDB2oz6Fp9QFghc4wy5g7DNQ4J#scrollTo=WDt_AMiOhGLm

Dataset link: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

Error log:

Fitting 5 folds for each of 2 candidates, totalling 10 fits
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-23-f27f59ee7f63> in <cell line: 45>()
     43 
     44 gs_gnb = GridSearchCV(pipe_gnb, params_gnb, verbose=10, cv=5, n_jobs=-1, scoring='accuracy')
---> 45 gs_gnb.fit(X, y)
     46 
     47 print('Best accuracy: ', gs_gnb.best_score_, end='\n')
3 frames
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py in _warn_or_raise_about_fit_failures(results, error_score)
    365                 f"Below are more details about the failures:\n{fit_errors_summary}"
    366             )
--> 367             raise ValueError(all_fits_failed_message)
    368 
    369         else:
ValueError: 
All the 10 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.
Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/pipeline.py", line 405, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/naive_bayes.py", line 267, in fit
    return self._partial_fit(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/naive_bayes.py", line 428, in _partial_fit
    X, y = self._validate_data(X, y, reset=first_call)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 584, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py", line 1106, in check_X_y
    X = check_array(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py", line 845, in check_array
    array = _ensure_sparse_format(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py", line 522, in _ensure_sparse_format
    raise TypeError(
    raise TypeError(
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

Answer 1

Score: 0

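TfidfVectorizer outputs a SciPy sparse matrix, but GaussianNB only accepts dense input, so you need an extra step between the two. You can create a DenseTransformer from TransformerMixin and add it to the pipeline: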

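from sklearn.base import BaseEstimator, TransformerMixin
class DenseTransformer(TransformerMixin, BaseEstimator):
    """Convert the sparse matrix from TfidfVectorizer into a dense array."""
    def fit(self, X, y=None, **fit_params):
        # Nothing to learn; this step is stateless
        return self
    def transform(self, X, y=None, **fit_params):
        # GaussianNB requires dense input
        return X.toarray()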
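
If you would rather not define a class, the same conversion can probably be done with scikit-learn's built-in FunctionTransformer. The sketch below is only an illustration: the helper name `to_dense` and the four toy messages are made up for this example and are not part of the original dataset or notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def to_dense(X):
    """Convert a SciPy sparse matrix to a dense NumPy array (hypothetical helper)."""
    return X.toarray()


# Hypothetical toy messages and labels, used only for illustration.
messages = [
    "win a free prize now",
    "are we meeting today",
    "free cash offer inside",
    "see you at lunch",
]
labels = ["spam", "ham", "spam", "ham"]

pipe = Pipeline([
    ("vect", TfidfVectorizer()),
    # Stateless step that densifies the TF-IDF matrix before GaussianNB sees it
    ("to_dense", FunctionTransformer(to_dense, accept_sparse=True)),
    ("gnb", GaussianNB()),
])
pipe.fit(messages, labels)
predictions = pipe.predict(messages)
```

Keep in mind that a dense array stores every zero explicitly, so with a large vocabulary either approach can use a lot of memory.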

You need to make two changes to your code. First, select only the "v2" column as the feature:

X = df['v2']
y = df['v1']
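
Why a Series and not a DataFrame matters here: TfidfVectorizer iterates over whatever it is given, and iterating over a DataFrame yields its column labels, not its rows. The following sketch shows the difference on a hypothetical three-row stand-in for the SMS dataset (the messages are invented for illustration).

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-row stand-in for the SMS dataset.
df = pd.DataFrame({
    "v1": ["ham", "spam", "ham"],
    "v2": ["see you at lunch", "win a free prize now", "call me later"],
})

# Iterating over a DataFrame yields its column labels, so the vectorizer
# sees a single "document": the string "v2".
from_frame = TfidfVectorizer().fit_transform(df.drop(columns="v1"))

# A Series yields one message per row, which is what the vectorizer expects.
from_series = TfidfVectorizer().fit_transform(df["v2"])

print(from_frame.shape)      # (1, 1)
print(from_series.shape[0])  # 3
```

With the DataFrame input, the number of rows no longer matches the number of labels, so the pipeline would still fail even after the dense conversion is added.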

Then modify the pipeline:

pipe_gnb = Pipeline([
    ('vect', TfidfVectorizer()),
    ('to_dense', DenseTransformer()), 
    ('gnb', GaussianNB()),
])
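
Putting both changes together, a minimal end-to-end sketch of the grid search might look like the following. It assumes DenseTransformer also inherits from BaseEstimator (so it gets get_params and set_params), and it uses eight invented messages with cv=2 so that it runs on its own; with the real dataset you would keep your own split and cv=5.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline


class DenseTransformer(TransformerMixin, BaseEstimator):
    """Stateless step that turns a sparse matrix into a dense array."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.toarray()


# Hypothetical toy data standing in for df['v2'] (messages) and df['v1'] (labels).
X = [
    "win a free prize now", "free cash offer inside",
    "claim your free reward", "urgent prize waiting for you",
    "are we meeting today", "see you at lunch",
    "call me when you are home", "thanks for the help yesterday",
]
y = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

pipe_gnb = Pipeline([
    ("vect", TfidfVectorizer()),
    ("to_dense", DenseTransformer()),
    ("gnb", GaussianNB()),
])
params_gnb = {"vect__sublinear_tf": [True, False]}

# cv=2 only because the toy data is tiny
gs_gnb = GridSearchCV(pipe_gnb, params_gnb, cv=2, scoring="accuracy")
gs_gnb.fit(X, y)

print("Best accuracy:", gs_gnb.best_score_)
print("Best params:", gs_gnb.best_params_)
```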
