Error while performing TfidfVectorizer() on the training values

Question

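# Model creation

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

X = df.drop(columns='v1', axis=1)
y = df['v1']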


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


pipe_gnb = Pipeline([
    ('vect', TfidfVectorizer()),
    ('gnb', GaussianNB())
])


params_gnb = {
        # 'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
        # 'vect__max_df': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
        # 'vect__min_df': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
        # 'vect__max_features': [3000, 4000, 5000, None],
        # 'vect__binary': [True, False],
        'vect__sublinear_tf': [True, False]
}

gs_gnb = GridSearchCV(pipe_gnb, params_gnb, verbose=10, cv=5, n_jobs=-1, scoring='accuracy')
gs_gnb.fit(X_train, y_train)

I have been trying to apply TfidfVectorizer to the 'v2' column (the messages column in the attached dataset), but I keep getting the error below. It is mainly an error about converting the fitted X and y training values to a dense array.

Code link->https://colab.research.google.com/drive/1mZTyTIVDB2oz6Fp9QFghc4wy5g7DNQ4J#scrollTo=WDt_AMiOhGLm

Dataset link->
https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

Error log->

Fitting 5 folds for each of 2 candidates, totalling 10 fits
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-23-f27f59ee7f63> in <cell line: 45>()
     43 
     44 gs_gnb = GridSearchCV(pipe_gnb, params_gnb, verbose=10, cv=5, n_jobs=-1, scoring='accuracy')
---> 45 gs_gnb.fit(X, y)
     46 
     47 print('Best accuracy: ', gs_gnb.best_score_, end='\n')

3 frames
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py in _warn_or_raise_about_fit_failures(results, error_score)
    365                 f"Below are more details about the failures:\n{fit_errors_summary}"
    366             )
--> 367             raise ValueError(all_fits_failed_message)
    368 
    369         else:

ValueError: 
All the 10 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/pipeline.py", line 405, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/naive_bayes.py", line 267, in fit
    return self._partial_fit(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/naive_bayes.py", line 428, in _partial_fit
    X, y = self._validate_data(X, y, reset=first_call)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 584, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py", line 1106, in check_X_y
    X = check_array(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py", line 845, in check_array
    array = _ensure_sparse_format(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py", line 522, in _ensure_sparse_format
    raise TypeError(
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

Answer 1

Score: 0


You need to add an extra step after TfidfVectorizer because its output is a sparse matrix, while GaussianNB only accepts dense input. You can create a DenseTransformer based on TransformerMixin and add it to the pipeline:

from sklearn.base import BaseEstimator, TransformerMixin

class DenseTransformer(TransformerMixin, BaseEstimator):
    """Convert the sparse TF-IDF matrix into a dense NumPy array."""

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn; this step is stateless.
        return self

    def transform(self, X):
        return X.toarray()
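
To see why the extra step is needed, the short check below may help. It is only a sketch on a made-up three-message corpus (not the SMS dataset): it confirms that TfidfVectorizer returns a SciPy sparse matrix, that GaussianNB rejects it with a TypeError like the one in the traceback, and that the same data is accepted once converted with toarray(). The texts, labels and variable names are illustrative assumptions.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy corpus standing in for the 'v2' message column.
texts = ["free prize call now", "are we meeting today", "win cash prize now"]
labels = ["spam", "ham", "spam"]

# TfidfVectorizer produces a SciPy sparse matrix.
X_sparse = TfidfVectorizer().fit_transform(texts)
print("sparse output:", issparse(X_sparse))

# GaussianNB validates its input as dense and refuses sparse data.
try:
    GaussianNB().fit(X_sparse, labels)
    sparse_rejected = False
except TypeError as exc:
    sparse_rejected = True
    print("GaussianNB refused the sparse matrix:", exc)

# After densifying, the same features are accepted.
model = GaussianNB().fit(X_sparse.toarray(), labels)
predictions = model.predict(X_sparse.toarray())
```

Keep in mind that toarray() materializes every zero entry, so memory use grows with the number of messages times the vocabulary size.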

You need to make two modifications to your code. First, select only the "v2" column as the feature, so that the vectorizer receives one text string per row:

X = df['v2']
y = df['v1']
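
The reason for passing a Series rather than a DataFrame is that TfidfVectorizer iterates over its input and treats each item as one document. Iterating over a DataFrame yields its column names instead of its rows, so the vectorizer would see one "document" per column. The sketch below illustrates this on a small hypothetical DataFrame that reuses the v1/v2 column names; the rows are invented.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the SMS dataset (same column names, invented rows).
df = pd.DataFrame({
    "v1": ["ham", "spam", "ham"],
    "v2": ["see you at lunch", "win a free prize now", "call me later"],
})

frame_features = df.drop(columns="v1")  # DataFrame, as in the question
series_features = df["v2"]              # Series of message strings

# Iterating a DataFrame yields column names, so the only document seen is 'v2'.
shape_from_frame = TfidfVectorizer().fit_transform(frame_features).shape

# Iterating a Series yields the messages themselves: one document per row.
shape_from_series = TfidfVectorizer().fit_transform(series_features).shape

print("rows when given a DataFrame:", shape_from_frame[0])
print("rows when given a Series:", shape_from_series[0])
```

With the DataFrame input, the number of feature rows no longer matches the number of labels, so the classifier could not be fitted even after the sparse matrix issue is resolved.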

Second, add the new step to the pipeline:

pipe_gnb = Pipeline([
    ('vect', TfidfVectorizer()),
    ('to_dense', DenseTransformer()), 
    ('gnb', GaussianNB()),
])
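
Putting both changes together, a minimal end-to-end sketch could look like the following. It assumes a tiny invented DataFrame in place of the Kaggle file, and it uses cv=2 only because the toy data is so small (the question uses cv=5). The rows and the resulting scores are illustrative and say nothing about accuracy on the real dataset.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline


class DenseTransformer(TransformerMixin, BaseEstimator):
    """Convert a SciPy sparse matrix into a dense NumPy array."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.toarray()


# Invented rows; in the real notebook df would be loaded from the Kaggle CSV.
df = pd.DataFrame({
    "v1": ["ham", "spam"] * 4,
    "v2": [
        "are we still meeting for lunch today",
        "win a free prize call now",
        "please call me when you get home",
        "free cash prize claim now",
        "see you at the station later",
        "urgent claim your free reward now",
        "thanks for the lift home today",
        "call now to win free cash",
    ],
})

X = df["v2"]  # a Series of strings, not a DataFrame
y = df["v1"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

pipe_gnb = Pipeline([
    ("vect", TfidfVectorizer()),
    ("to_dense", DenseTransformer()),  # GaussianNB needs dense input
    ("gnb", GaussianNB()),
])

params_gnb = {"vect__sublinear_tf": [True, False]}

# cv=2 is an assumption made for the toy data only.
gs_gnb = GridSearchCV(pipe_gnb, params_gnb, cv=2, scoring="accuracy")
gs_gnb.fit(X_train, y_train)

test_accuracy = gs_gnb.score(X_test, y_test)
print("Best params:", gs_gnb.best_params_)
print("Test accuracy:", test_accuracy)
```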

huangapple
  • Published on July 24, 2023, 17:07:34
  • Please keep this link when reposting: https://go.coder-hub.com/76752935.html