What is responsible for this TypeError: DataUndersampler.transform() missing 1 required positional argument: 'y'?
Question
This is a custom support-vector-based data undersampler, taken from an answer to my previous question.
The main idea is to undersample the majority class in an informed way: fit an SVC to the data, find the support vectors, and then undersample the majority class based on the distances to these support vectors.
Code:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import resample
from sklearn.svm import SVC
import numpy as np
import pandas as pd  # needed for pd.concat below
from sklearn.multiclass import OneVsOneClassifier
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

class DataUndersampler(BaseEstimator, TransformerMixin):
    def __init__(self, random_state=None):
        self.random_state = random_state
        self.svc = SVC(kernel='linear')

    def fit(self, X, y):
        # Fit SVC to data
        self.svc.fit(X, y)
        return self

    def transform(self, X, y):
        # Get support vectors
        support_vectors = self.svc.support_vectors_
        # Get indices of support vectors
        support_vector_indices = self.svc.support_
        # Separate majority and minority classes
        majority_class = y.value_counts().idxmax()
        minority_class = y.value_counts().idxmin()
        X_majority = X[y == majority_class]
        y_majority = y[y == majority_class]
        X_minority = X[y == minority_class]
        y_minority = y[y == minority_class]
        # Calculate distances of majority class samples to nearest support vector
        distances = np.min(np.linalg.norm(X_majority.values[:, np.newaxis] - support_vectors, axis=2), axis=1)
        # Sort the majority class samples by distance and take only as many as there are in minority class
        sorted_indices = np.argsort(distances)
        indices_to_keep = sorted_indices[:len(y_minority)]
        # Combine the undersampled majority class with the minority class
        X_resampled = pd.concat([X_majority.iloc[indices_to_keep], X_minority])
        y_resampled = pd.concat([y_majority.iloc[indices_to_keep], y_minority])
        return X_resampled, y_resampled
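The distance-based selection inside transform can be looked at in isolation. The following is only a toy sketch in plain Python: the helper name `keep_closest` is hypothetical, and the support vectors are written by hand instead of coming from a fitted SVC.

```python
import math

def keep_closest(majority, support_vectors, k):
    """Return the k majority points that lie nearest to any support vector."""
    def nearest_sv_distance(point):
        # Distance from this point to its closest support vector
        return min(math.dist(point, sv) for sv in support_vectors)
    # Sort by that distance and keep the first k points
    return sorted(majority, key=nearest_sv_distance)[:k]

# Hand-written toy data (assumption: 2-D points, Euclidean distance)
majority = [(0.0, 0.0), (5.0, 5.0), (2.0, 1.5), (9.0, 9.0)]
support_vectors = [(1.0, 0.0), (2.0, 2.0)]

kept = keep_closest(majority, support_vectors, k=2)
print(kept)
```

Here `k` plays the role of `len(y_minority)` in the class above: the majority class is cut down to the size of the minority class, keeping the points closest to the decision boundary.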
MWE:
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=10_000, n_classes=5, weights=[22.6, 3.7, 16.4, 51.9],
                           n_informative=4)
rf_clf = model = RandomForestClassifier()
resampler = DataUndersampler(random_state=234)
pipeline = Pipeline([('sampler', resampler), ('clf', rf_clf)])
classifier = OneVsOneClassifier(estimator=pipeline)
classifier.fit(X, y)
Produces the error:
----> 7 classifier.fit(X, y)
18 frames
/usr/local/lib/python3.10/dist-packages/sklearn/utils/_set_output.py in wrapped(self, X, *args, **kwargs)
138 @wraps(f)
139 def wrapped(self, X, *args, **kwargs):
--> 140 data_to_wrap = f(self, X, *args, **kwargs)
141 if isinstance(data_to_wrap, tuple):
142 # only wrap the first output for cross decomposition
TypeError: DataUndersampler.transform() missing 1 required positional argument: 'y'
Answer 1
Score: 1
The problem is the TransformerMixin, as its implementation of fit_transform is:
def fit_transform(self, X, y=None, **fit_params):
    """
    Fits transformer to `X` and `y` with optional parameters `fit_params`
    and returns a transformed version of `X`.
    """
    if y is None:
        # fit method of arity 1 (unsupervised transformation)
        return self.fit(X, **fit_params).transform(X)
    else:
        # fit method of arity 2 (supervised transformation)
        return self.fit(X, y, **fit_params).transform(X)  # <-- here is the problem
Solution: implement fit_transform yourself.
(Providing fit_transform is the main purpose of the TransformerMixin class, besides also inheriting from _SetOutputMixin; see the source.)
class DataUndersampler(BaseEstimator):
    def fit_transform(self, X, y):
        return self.fit(X, y).transform(X, y)
    ...
NOTE:
You might run into problems further down the line if only a single output from transform is expected.
In that case you have to update y in place and only return X:
    y[:] = y_resampled
    return X_resampled
That should do the job.
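The mechanism can be reproduced without scikit-learn. This is a minimal sketch, assuming a stand-in class `ToyTransformerMixin` that copies only the fit_transform logic quoted above; `Broken` and `Fixed` are hypothetical names used for illustration.

```python
class ToyTransformerMixin:
    """Stand-in for the quoted fit_transform logic (not scikit-learn itself)."""
    def fit_transform(self, X, y=None):
        if y is None:
            return self.fit(X).transform(X)
        # y is passed to fit but never forwarded to transform
        return self.fit(X, y).transform(X)

class Broken(ToyTransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y):  # requires y, like DataUndersampler.transform
        return X, y

class Fixed(Broken):
    def fit_transform(self, X, y):
        # Override the mixin so that y reaches transform
        return self.fit(X, y).transform(X, y)

X, y = [[1], [2]], [0, 1]

try:
    Broken().fit_transform(X, y)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

fixed_result = Fixed().fit_transform(X, y)
print(error_message)
print(fixed_result)
```

The inherited fit_transform raises the same kind of TypeError as in the question, while the overriding version passes `y` through and returns both values.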
Answer 2
Score: 0
Let's take this slowly, step by step. First, let's look at the error.
> TypeError: DataUndersampler.transform() missing 1 required positional argument: 'y'
This error happens when a function expects 2 arguments but gets only one. For example:
def func(x, y):
    return x, y  # Dummy function

# This causes the error:
func(3)  # Not enough arguments
Thus, whatever is calling transform() is only expecting transform to accept 1 argument.
Indeed, if you look at the source code for OneVsOneClassifier.fit(), we see this line:
# Note transform is called with only 1 argument!
Y = self.label_binarizer_.fit_transform(y)
I'm not super familiar with Sklearn, but I suspect that you need a classifier that can handle 2 input variables. I looked, but couldn't figure out what that would be.
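One candidate for a step that handles both inputs is a sampler. The imblearn Pipeline used in the question is documented to accept steps that expose `fit_resample(X, y)` and return both `X` and `y`, in addition to ordinary transformers. The sketch below only imitates that dispatch idea in plain Python; `run_steps`, `ToySampler` and `ToyScaler` are hypothetical names, and this is not imblearn's actual code.

```python
class ToySampler:
    """Keeps at most `cap` samples per class, so it changes both X and y."""
    def __init__(self, cap):
        self.cap = cap

    def fit_resample(self, X, y):
        seen, X_out, y_out = {}, [], []
        for row, label in zip(X, y):
            if seen.get(label, 0) < self.cap:
                seen[label] = seen.get(label, 0) + 1
                X_out.append(row)
                y_out.append(label)
        return X_out, y_out

class ToyScaler:
    """Ordinary transformer: returns a new X and leaves y untouched."""
    def fit_transform(self, X, y=None):
        return [[value * 2 for value in row] for row in X]

def run_steps(steps, X, y):
    # Samplers receive and return (X, y); transformers only return X
    for step in steps:
        if hasattr(step, "fit_resample"):
            X, y = step.fit_resample(X, y)
        else:
            X = step.fit_transform(X, y)
    return X, y

X = [[1], [2], [3], [4]]
y = [0, 0, 0, 1]
X_out, y_out = run_steps([ToySampler(cap=1), ToyScaler()], X, y)
print(X_out, y_out)
```

Under that assumption, renaming the resampling logic of DataUndersampler to fit_resample (and dropping transform) would let the pipeline hand it both `X` and `y`; whether this works unchanged inside OneVsOneClassifier would still need to be checked against the installed imblearn version.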