Scikit-Learn cross validation function not allowing custom folds when indices are not sequential

Question

I am trying to pass custom cross-validation folds to scikit-learn's cross_validate function.

cross_validate seems to trigger an error because it insists on position-based indexing rather than label-based indexing. The indices I pass in my cv_folds argument are labels taken from the original DataFrame's index. This matters because I want to use hash function values to select the subsets for my train-test split as well as for my CV folds. I get the following error: IndexError: indices are out-of-bounds

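import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

model = LinearRegression()  # assumption: the original post does not show which estimator is used
df2 = pd.DataFrame(np.random.rand(8, 3), columns=['feature_1', 'feature_2', 'feature_3'])
train_index_list = [0, 1, 2, 5, 6, 7]
test_index_list = [3, 4]
X_train = df2.loc[train_index_list].drop(columns='feature_3').copy()
y_train = df2.loc[train_index_list, 'feature_3'].copy()
# 2-fold cross validation; these are index labels, and X_train only has positions 0..5
cv_folds = [([0, 1, 2], [5, 6, 7]), ([5, 6, 7], [0, 1, 2])]
cv_output = cross_validate(model, X_train, y_train, scoring=['neg_mean_squared_error'], cv=cv_folds)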

This triggers the error. What puzzles me is that the following lines run just fine:

X_train.loc[train_index_list]
y_train.loc[train_index_list]

How do I resolve this so I can pass in my custom-defined cv folds into Scikit-Learn?
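
The puzzle described above can be reproduced with pandas alone. The following is a minimal sketch, assuming (as the error message suggests) that scikit-learn selects rows by integer position; the DataFrame contents are illustrative and not taken from the original post. Label lookup with .loc succeeds because labels 5, 6, 7 exist, while positional lookup with .iloc fails because a 6-row frame only has positions 0 to 5.

```python
import numpy as np
import pandas as pd

# Illustrative data: 8 rows, then keep 6 of them so the index labels have a gap.
df = pd.DataFrame(np.arange(24).reshape(8, 3),
                  columns=["feature_1", "feature_2", "feature_3"])
X_train = df.loc[[0, 1, 2, 5, 6, 7]]

# Label-based lookup: labels 5, 6, 7 are present in the index.
by_label = X_train.loc[[5, 6, 7]]

# Position-based lookup: positions 6 and 7 do not exist in a 6-row frame.
try:
    X_train.iloc[[5, 6, 7]]
    positional_ok = True
except IndexError:
    positional_ok = False

print(list(by_label.index), positional_ok)
```

Custom folds therefore need to be expressed as positions within X_train, not as labels from the original DataFrame.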

Answer 1

Score: 0

As a workaround, you can use Index.get_indexer to convert the labels to index positions:

def cv_folds(df, labels):
    # Yield (train, test) folds as integer positions within df
    for i, j in labels:
        i = df.index.get_indexer(i)  # train labels -> positions
        j = df.index.get_indexer(j)  # test labels -> positions
        yield (i.tolist(), j.tolist())

labels = [([0, 1, 2], [5, 6, 7]), ([5, 6, 7], [0, 1, 2])]
cv = cv_folds(X_train, labels)
cv_output = cross_validate(model, X_train, y_train, cv=cv,
                           scoring=['neg_mean_squared_error'])

Test:

>>> list(cv_folds(X_train, labels))
    [([0, 1, 2], [3, 4, 5]), ([3, 4, 5], [0, 1, 2])]  # <- positions
#   [([0, 1, 2], [5, 6, 7]), ([5, 6, 7], [0, 1, 2])]  # <- labels
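
One caveat with this approach: Index.get_indexer returns -1 for any label that is not present in the index, and -1 used as a position silently selects the last row. The following is a minimal sketch of a guard against that. The helper name labels_to_positions is hypothetical and not part of the original answer, and the sketch assumes the index has unique labels.

```python
import pandas as pd

def labels_to_positions(index, labels):
    """Convert index labels to integer positions, failing loudly on unknown labels."""
    positions = index.get_indexer(labels)
    # get_indexer marks labels it cannot find with -1
    if (positions == -1).any():
        missing = [label for label, pos in zip(labels, positions) if pos == -1]
        raise KeyError(f"labels not found in index: {missing}")
    return positions.tolist()

# Same labels as X_train.index in the question
index = pd.Index([0, 1, 2, 5, 6, 7])
positions = labels_to_positions(index, [5, 6, 7])
print(positions)
```

Calling the helper with a label from the held-out test set, such as 3, raises a KeyError instead of quietly producing a wrong fold.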

huangapple
  • Posted on March 12, 2023, 14:28:07
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75711417.html