What is responsible for this TypeError: DataUndersampler.transform() missing 1 required positional argument: 'y'?

Question:

This is a custom support-vector-based data undersampler, from an answer to my previous question.

The main idea is to undersample the majority class in an informed way: fit an SVC to the data, find the support vectors, and then undersample the majority class based on the distances of its samples to those support vectors.

Code:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import resample
from sklearn.svm import SVC
import numpy as np
import pandas as pd  # needed for pd.concat in transform
from sklearn.multiclass import OneVsOneClassifier
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

class DataUndersampler(BaseEstimator, TransformerMixin):
    def __init__(self, random_state=None):
        self.random_state = random_state
        self.svc = SVC(kernel='linear')

    def fit(self, X, y):
        # Fit SVC to data
        self.svc.fit(X, y)
        return self

    def transform(self, X, y):
        # Get support vectors
        support_vectors = self.svc.support_vectors_
        # Get indices of support vectors
        support_vector_indices = self.svc.support_

        # Separate majority and minority classes
        majority_class = y.value_counts().idxmax()
        minority_class = y.value_counts().idxmin()
        X_majority = X[y == majority_class]
        y_majority = y[y == majority_class]
        X_minority = X[y == minority_class]
        y_minority = y[y == minority_class]

        # Calculate distances of majority class samples to nearest support vector
        distances = np.min(np.linalg.norm(X_majority.values[:, np.newaxis] - support_vectors, axis=2), axis=1)

        # Sort the majority class samples by distance and take only as many as there are in minority class
        sorted_indices = np.argsort(distances)
        indices_to_keep = sorted_indices[:len(y_minority)]

        # Combine the undersampled majority class with the minority class
        X_resampled = pd.concat([X_majority.iloc[indices_to_keep], X_minority])
        y_resampled = pd.concat([y_majority.iloc[indices_to_keep], y_minority])

        return X_resampled, y_resampled
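
The distance line above broadcasts every majority-class sample against every support vector and keeps the nearest one. A tiny standalone illustration of just that computation, with made-up numbers (not part of the sampler itself):

import numpy as np

X_majority = np.array([[0.0, 0.0], [3.0, 4.0]])       # 2 majority-class samples
support_vectors = np.array([[0.0, 1.0], [3.0, 0.0]])  # 2 support vectors

# shape (n_majority, n_support_vectors): pairwise Euclidean distances
pairwise = np.linalg.norm(X_majority[:, np.newaxis] - support_vectors, axis=2)
distances = np.min(pairwise, axis=1)  # distance to the nearest support vector
print(distances)  # [1. 4.]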

MWE:

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10_000, n_classes=5, weights=[22.6, 3.7, 16.4, 51.9],
                           n_informative=4)

rf_clf = RandomForestClassifier()
resampler = DataUndersampler(random_state=234)

pipeline = Pipeline([('sampler', resampler), ('clf', rf_clf)])
classifier = OneVsOneClassifier(estimator=pipeline)

classifier.fit(X, y)

Produces the error:

----> 7 classifier.fit(X, y)

18 frames
/usr/local/lib/python3.10/dist-packages/sklearn/utils/_set_output.py in wrapped(self, X, *args, **kwargs)
    138     @wraps(f)
    139     def wrapped(self, X, *args, **kwargs):
--> 140         data_to_wrap = f(self, X, *args, **kwargs)
    141         if isinstance(data_to_wrap, tuple):
    142             # only wrap the first output for cross decomposition

TypeError: DataUndersampler.transform() missing 1 required positional argument: 'y'

Answer 1 (score: 1):

The problem is the TransformerMixin, as its implementation of fit_transform is:

def fit_transform(self, X, y=None, **fit_params):
    if y is None:
        # fit method of arity 1 (unsupervised transformation)
        return self.fit(X, **fit_params).transform(X)
    else:
        # fit method of arity 2 (supervised transformation)
        return self.fit(X, y, **fit_params).transform(X) # <-- here is the problem
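
You can reproduce the mismatch without the pipeline at all; a minimal check (assuming the DataUndersampler class and the X, y from the question) fails the same way, because the mixin's fit_transform ends in .transform(X):

sampler = DataUndersampler()
sampler.fit_transform(X, y)
# TypeError: DataUndersampler.transform() missing 1 required positional argument: 'y'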

Solution: implement fit_transform yourself.
(Providing that default fit_transform is the main purpose of the TransformerMixin class, besides also inheriting from _SetOutputMixin; see the source.)

class DataUndersampler(BaseEstimator):

    def fit_transform(self, X, y):
        return self.fit(X, y).transform(X, y)

    ...
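
For completeness, a sketch of the fix applied to the class from the question (the undersampling logic in transform stays exactly as posted; only the base class and the added fit_transform change):

class DataUndersampler(BaseEstimator):  # TransformerMixin dropped, since its fit_transform is the culprit
    def __init__(self, random_state=None):
        self.random_state = random_state
        self.svc = SVC(kernel='linear')

    def fit(self, X, y):
        self.svc.fit(X, y)
        return self

    def fit_transform(self, X, y):
        # forward y to transform explicitly instead of relying on the mixin's transform(X)
        return self.fit(X, y).transform(X, y)

    def transform(self, X, y):
        ...  # unchanged undersampling logic from the question

Whether the whole pipeline then runs end to end still depends on the note below, because downstream steps usually expect transform to return X only.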

NOTE:
You might run into problems further down the line if only a single output from transform is expected.
In that case you have to update y in place and only return X.

y[:] = y_resampled
return X_resampled

Should do the job.


Answer 2 (score: 0):


Let's take this step by step. First, let's look at the error.

>TypeError: DataUndersampler.transform() missing 1 required positional argument: 'y'

This error happens when a function expects 2 arguments but gets only one. For example:

def func(x, y):
    return x, y  # Dummy function

# This causes the error:
func(3)  # Not enough arguments -> TypeError: func() missing 1 required positional argument: 'y'

Thus, whatever is calling transform() is only passing it one argument (X), not two.

Indeed, the traceback shows exactly where this happens: the call that fails is sklearn's _set_output wrapper invoking your transform:

# Note: in this call, only X makes it through; no y is forwarded
data_to_wrap = f(self, X, *args, **kwargs)

So by the time DataUndersampler.transform() is reached, y is simply not being passed along.

I'm not super familiar with sklearn, but I suspect you need a step that can accept both X and y at the transformation stage. I looked but couldn't figure out what that would be, though.
