Error while performing TfidfVectorizer() on the training values
Question
# Model creation
X = df.drop(columns='v1', axis=1)
y = df['v1']
from sklearn.feature_extraction.text import TfidfVectorizer
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
pipe_gnb = Pipeline([
    ('vect', TfidfVectorizer()),
    ('gnb', GaussianNB())
])
params_gnb = {
    # 'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
    # 'vect__max_df': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    # 'vect__min_df': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    # 'vect__max_features': [3000, 4000, 5000, None],
    # 'vect__binary': [True, False],
    'vect__sublinear_tf': [True, False]
}
gs_gnb = GridSearchCV(pipe_gnb, params_gnb, verbose=10, cv=5, n_jobs=-1, scoring='accuracy')
gs_gnb.fit(X_train, y_train)
I have been trying to apply TfidfVectorizer to the 'v2' column, which is the messages column in the attached dataset, but I keep getting the error below (mainly an error about converting the fitted X, y training values to a dense array).
Code link: https://colab.research.google.com/drive/1mZTyTIVDB2oz6Fp9QFghc4wy5g7DNQ4J#scrollTo=WDt_AMiOhGLm
Dataset link: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
Error log:
Fitting 5 folds for each of 2 candidates, totalling 10 fits
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-23-f27f59ee7f63> in <cell line: 45>()
43
44 gs_gnb = GridSearchCV(pipe_gnb, params_gnb, verbose=10, cv=5, n_jobs=-1, scoring='accuracy')
---> 45 gs_gnb.fit(X, y)
46
47 print('Best accuracy: ', gs_gnb.best_score_, end='\n')
3 frames
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py in _warn_or_raise_about_fit_failures(results, error_score)
365 f"Below are more details about the failures:\n{fit_errors_summary}"
366 )
--> 367 raise ValueError(all_fits_failed_message)
368
369 else:
ValueError:
All the 10 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.
Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/usr/local/lib/python3.10/dist-packages/sklearn/pipeline.py", line 405, in fit
self._final_estimator.fit(Xt, y, **fit_params_last_step)
File "/usr/local/lib/python3.10/dist-packages/sklearn/naive_bayes.py", line 267, in fit
return self._partial_fit(
File "/usr/local/lib/python3.10/dist-packages/sklearn/naive_bayes.py", line 428, in _partial_fit
X, y = self._validate_data(X, y, reset=first_call)
File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 584, in _validate_data
X, y = check_X_y(X, y, **check_params)
File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py", line 1106, in check_X_y
X = check_array(
File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py", line 845, in check_array
array = _ensure_sparse_format(
File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py", line 522, in _ensure_sparse_format
raise TypeError(
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
Answer 1
Score: 0
TfidfVectorizer outputs a sparse matrix, but GaussianNB requires dense input, so you need to add a conversion step after the vectorizer. You can create a DenseTransformer based on TransformerMixin and add it to the pipeline:
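from sklearn.base import BaseEstimator, TransformerMixin

class DenseTransformer(TransformerMixin, BaseEstimator):
    # BaseEstimator adds get_params/set_params, which GridSearchCV relies on.
    def fit(self, X, y=None, **fit_params):
        # Nothing to learn; this step is stateless.
        return self

    def transform(self, X, y=None, **fit_params):
        # toarray() returns a dense ndarray, as the error message suggests.
        return X.toarray()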
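To see why the conversion step is needed, here is a minimal sketch that reproduces the failure outside the pipeline. The four messages and labels are invented for illustration and are not taken from the SMS dataset; the exact wording of the exception may differ between scikit-learn versions.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# Toy messages and labels, invented for illustration only.
texts = ["free prize now", "see you at lunch", "win cash now", "lunch tomorrow?"]
labels = ["spam", "ham", "spam", "ham"]

# TfidfVectorizer returns a SciPy sparse matrix, not a NumPy array.
X_sparse = TfidfVectorizer().fit_transform(texts)
print(sparse.issparse(X_sparse))

# GaussianNB validates its input as dense and rejects the sparse matrix.
try:
    GaussianNB().fit(X_sparse, labels)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# After converting to a dense array, the same fit succeeds.
model = GaussianNB().fit(X_sparse.toarray(), labels)
```

Inside a Pipeline there is no place to call toarray() by hand between steps, which is why the conversion has to be a step of its own.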
You need to make two changes to your code. First, select only the "v2" column as the feature:
X = df['v2']
y = df['v1']
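The reason for selecting the column as a Series can be illustrated with a small sketch. The vectorizer iterates over whatever it is given, and iterating over a DataFrame yields its column names rather than its rows. The three-row frame below is a made-up stand-in for the real data.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-in for the dataset: "v1" is the label, "v2" the message.
df = pd.DataFrame({
    "v1": ["spam", "ham", "spam"],
    "v2": ["free prize now", "see you at lunch", "win cash now"],
})

X_frame = df.drop(columns="v1")  # still a DataFrame
X_series = df["v2"]              # a Series of message strings

# Iterating a DataFrame yields column names, so the only "document" is "v2".
frame_shape = TfidfVectorizer().fit_transform(X_frame).shape
# Iterating a Series yields its values, so there is one row per message.
series_shape = TfidfVectorizer().fit_transform(X_series).shape

print(frame_shape[0], "row(s) from the DataFrame")
print(series_shape[0], "row(s) from the Series")
```

With the DataFrame, the number of rows coming out of the vectorizer no longer matches the number of labels in y, so the fit would fail even once the sparse-matrix problem is solved.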
Then modify the pipeline:
pipe_gnb = Pipeline([
    ('vect', TfidfVectorizer()),
    ('to_dense', DenseTransformer()),
    ('gnb', GaussianNB()),
])
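Putting the two changes together, a self-contained sketch of the grid search might look like the following. It runs on a small invented data set rather than the real SMS data. Two things here are my own substitutions and not part of the answer above: the dense step uses scikit-learn's FunctionTransformer with a hypothetical helper named to_dense instead of the custom class, and a second pipeline uses MultinomialNB, which accepts sparse input directly and so needs no conversion step.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def to_dense(X):
    """Hypothetical helper: convert a SciPy sparse matrix to a dense array."""
    return X.toarray()


# Invented rows standing in for the SMS data ("v1" = label, "v2" = message).
df = pd.DataFrame({
    "v1": ["spam", "ham"] * 4,
    "v2": [
        "win a free prize now", "see you at lunch",
        "free cash claim now", "are you coming home",
        "claim your free prize", "lunch at noon tomorrow",
        "win cash now", "call me when you are home",
    ],
})
X, y = df["v2"], df["v1"]  # X is a Series of strings, not a DataFrame

# GaussianNB route: vectorize, densify, then classify.
pipe_gnb = Pipeline([
    ("vect", TfidfVectorizer()),
    ("to_dense", FunctionTransformer(to_dense)),
    ("gnb", GaussianNB()),
])
params_gnb = {"vect__sublinear_tf": [True, False]}
# cv=2 only because the toy data is tiny; the question uses cv=5.
gs_gnb = GridSearchCV(pipe_gnb, params_gnb, cv=2, scoring="accuracy")
gs_gnb.fit(X, y)
print("Best params:", gs_gnb.best_params_)

# Alternative route (an assumption on my part, not the answer's method):
# MultinomialNB works on the sparse TF-IDF matrix without densifying.
pipe_mnb = Pipeline([("vect", TfidfVectorizer()), ("mnb", MultinomialNB())])
pipe_mnb.fit(X, y)
```

Densifying a TF-IDF matrix stores every zero explicitly, so on a larger corpus the sparse-friendly classifier may be the more memory-efficient choice. Whether it also scores better on this dataset would need to be checked.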