Error while performing TfidfVectorizer() on the training values

Question

    # Model creation
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.pipeline import Pipeline

    # df is the loaded dataset: 'v1' holds the labels, 'v2' the messages
    X = df.drop(columns='v1', axis=1)
    y = df['v1']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    pipe_gnb = Pipeline([
        ('vect', TfidfVectorizer()),
        ('gnb', GaussianNB())
    ])

    params_gnb = {
        # 'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
        # 'vect__max_df': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
        # 'vect__min_df': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
        # 'vect__max_features': [3000, 4000, 5000, None],
        # 'vect__binary': [True, False],
        'vect__sublinear_tf': [True, False]
    }

    gs_gnb = GridSearchCV(pipe_gnb, params_gnb, verbose=10, cv=5, n_jobs=-1, scoring='accuracy')
    gs_gnb.fit(X_train, y_train)

I have been trying to apply TfidfVectorizer to the 'v2' column (the messages column in the linked dataset), but I keep getting the error below. It is mainly an error about converting the fitted X, y (training values) to a dense array.

Code link: https://colab.research.google.com/drive/1mZTyTIVDB2oz6Fp9QFghc4wy5g7DNQ4J#scrollTo=WDt_AMiOhGLm

Dataset link: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

Error log:

    Fitting 5 folds for each of 2 candidates, totalling 10 fits
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-23-f27f59ee7f63> in <cell line: 45>()
         43
         44 gs_gnb = GridSearchCV(pipe_gnb, params_gnb, verbose=10, cv=5, n_jobs=-1, scoring='accuracy')
    ---> 45 gs_gnb.fit(X, y)
         46
         47 print('Best accuracy: ', gs_gnb.best_score_, end='\n')

    3 frames
    /usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py in _warn_or_raise_about_fit_failures(results, error_score)
        365                 f"Below are more details about the failures:\n{fit_errors_summary}"
        366             )
    --> 367         raise ValueError(all_fits_failed_message)
        368
        369     else:

    ValueError:
    All the 10 fits failed.
    It is very likely that your model is misconfigured.
    You can try to debug the error by setting error_score='raise'.

    Below are more details about the failures:
    --------------------------------------------------------------------------------
    10 fits failed with the following error:
    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
        estimator.fit(X_train, y_train, **fit_params)
      File "/usr/local/lib/python3.10/dist-packages/sklearn/pipeline.py", line 405, in fit
        self._final_estimator.fit(Xt, y, **fit_params_last_step)
      File "/usr/local/lib/python3.10/dist-packages/sklearn/naive_bayes.py", line 267, in fit
        return self._partial_fit(
      File "/usr/local/lib/python3.10/dist-packages/sklearn/naive_bayes.py", line 428, in _partial_fit
        X, y = self._validate_data(X, y, reset=first_call)
      File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 584, in _validate_data
        X, y = check_X_y(X, y, **check_params)
      File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py", line 1106, in check_X_y
        X = check_array(
      File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py", line 845, in check_array
        array = _ensure_sparse_format(
      File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py", line 522, in _ensure_sparse_format
        raise TypeError(
    TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

Answer 1

Score: 0

TfidfVectorizer outputs a sparse matrix, but GaussianNB only accepts dense input, so you need to add a conversion step between the two. You can create a DenseTransformer from TransformerMixin and add it to the pipeline:

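    from sklearn.base import BaseEstimator, TransformerMixin

    class DenseTransformer(TransformerMixin, BaseEstimator):
        """Convert the sparse output of a vectorizer to a dense NumPy array."""

        def fit(self, X, y=None, **fit_params):
            # Nothing to learn; the transformer is stateless
            return self

        def transform(self, X, y=None, **fit_params):
            # toarray() is the conversion the error message itself suggests
            return X.toarray()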
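
As an alternative sketch (not part of the original answer), the same conversion can probably be expressed with scikit-learn's built-in FunctionTransformer instead of a custom class. The helper name `to_dense` and the four toy messages are assumptions made for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def to_dense(X):
    """Convert a SciPy sparse matrix to a dense NumPy array (hypothetical helper)."""
    return X.toarray()


# Toy data standing in for the 'v2' (message) and 'v1' (label) columns.
messages = [
    "win a free prize now",
    "are we meeting today",
    "free cash claim now",
    "see you at lunch",
]
labels = ["spam", "ham", "spam", "ham"]

pipe = Pipeline([
    ("vect", TfidfVectorizer()),
    # validate defaults to False, so the sparse matrix reaches to_dense as-is.
    # accept_sparse=True only matters if validate=True; it documents the intent.
    ("to_dense", FunctionTransformer(to_dense, accept_sparse=True)),
    ("gnb", GaussianNB()),
])

pipe.fit(messages, labels)
predictions = pipe.predict(["free prize now"])
print(predictions)
```

Either way, keep in mind that a dense array can take far more memory than the sparse matrix it came from when the vocabulary is large.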

You need to make two modifications to your code. First, select only the "v2" column as the feature, because TfidfVectorizer expects an iterable of strings rather than a DataFrame:

    X = df['v2']
    y = df['v1']
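
To illustrate why a single column is needed, here is a small sketch on made-up data (the three-row frame is an assumption, not the real dataset). Iterating over a DataFrame yields its column labels, so the vectorizer ends up treating the label 'v2' as the only document instead of reading the messages.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical miniature of the SMS dataset: 'v1' is the label, 'v2' the message.
df = pd.DataFrame({
    "v1": ["ham", "spam", "ham"],
    "v2": ["see you at lunch", "win a free prize now", "call me later"],
})

# Passing a DataFrame: the vectorizer iterates over column labels, not rows.
X_frame = df.drop(columns="v1")
bad = TfidfVectorizer().fit_transform(X_frame)

# Passing a Series: the vectorizer iterates over the message strings.
X_series = df["v2"]
good = TfidfVectorizer().fit_transform(X_series)

print("DataFrame input:", bad.shape)   # one "document": the column label
print("Series input:   ", good.shape)  # one row per message
```

With the DataFrame input the number of rows no longer matches the number of labels, so the classifier could not be fitted even after the sparse-to-dense conversion.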

Second, modify the pipeline to include the new step:

    pipe_gnb = Pipeline([
        ('vect', TfidfVectorizer()),
        ('to_dense', DenseTransformer()),
        ('gnb', GaussianNB()),
    ])
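
Putting the two changes together, a minimal end-to-end sketch might look like the following. The eight-row DataFrame is invented so the example runs on its own, and `cv=2` is chosen only because the toy data is so small; the real dataset would be loaded from the Kaggle CSV and could keep `cv=5`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline


class DenseTransformer(TransformerMixin, BaseEstimator):
    """Stateless step that turns a sparse matrix into a dense array."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.toarray()


# Made-up stand-in for the SMS data: 'v1' is the label, 'v2' the message.
df = pd.DataFrame({
    "v1": ["spam", "ham", "spam", "ham", "spam", "ham", "spam", "ham"],
    "v2": [
        "win a free prize now", "are we meeting today",
        "free cash claim now", "see you at lunch",
        "claim your free prize", "call me after the meeting",
        "cash prize waiting now", "lunch today or tomorrow",
    ],
})

X = df["v2"]  # a single column of strings, not a DataFrame
y = df["v1"]

pipe_gnb = Pipeline([
    ("vect", TfidfVectorizer()),
    ("to_dense", DenseTransformer()),
    ("gnb", GaussianNB()),
])

params_gnb = {"vect__sublinear_tf": [True, False]}

# cv=2 is an assumption for the toy data; each fold needs both classes.
gs_gnb = GridSearchCV(pipe_gnb, params_gnb, cv=2, scoring="accuracy")
gs_gnb.fit(X, y)

print("Best accuracy:", gs_gnb.best_score_)
print("Best params:", gs_gnb.best_params_)
```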

huangapple
  • Posted on July 24, 2023, 17:07:34
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76752935.html