Scikit-Learn cross validation function not allowing custom folds when indices are not sequential
Question
I'm trying to pass custom cross-validation folds to scikit-learn's cross_validate function.
cross_validate seems to raise an error because it uses position-based indexing rather than label-based indexing. The indices in my cv_folds argument are labels from the original dataframe's index. This matters because I want to use a hash function value to select the subsets for both my train-test split and my CV folds. I get the following error: IndexError: indices are out-of-bounds
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_validate

df2 = pd.DataFrame(np.random.rand(8, 3), columns=['feature_1', 'feature_2', 'feature_3'])
train_index_list = [0, 1, 2, 5, 6, 7]
test_index_list = [3, 4]
X_train = df2.loc[train_index_list].drop(columns='feature_3').copy()
y_train = df2.loc[train_index_list]['feature_3'].copy()
# `model` is assumed to be a scikit-learn regressor defined elsewhere
# 2-fold cross validation, with the folds given as index labels
cv_folds = [([0, 1, 2], [5, 6, 7]), ([5, 6, 7], [0, 1, 2])]
cv_output = cross_validate(model, X_train, y_train, scoring=['neg_mean_squared_error'], cv=cv_folds)
This triggers the error. What puzzles me is that the following lines run just fine:
X_train.loc[train_index_list]
y_train.loc[train_index_list]
How do I resolve this so that I can pass my custom CV folds to scikit-learn?
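The mismatch described above can be reproduced without scikit-learn. The sketch below assumes that scikit-learn picks fold rows by position, much like `.iloc`, which is consistent with the IndexError in the question. The variable names `fold`, `by_label`, and `positional_error` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Same shape as the data in the question: 8 rows, then a 6-row training subset.
df2 = pd.DataFrame(np.random.rand(8, 3),
                   columns=['feature_1', 'feature_2', 'feature_3'])
X_train = df2.loc[[0, 1, 2, 5, 6, 7]].drop(columns='feature_3')

fold = [5, 6, 7]

# Label-based lookup succeeds: 5, 6 and 7 are labels in X_train.index.
by_label = X_train.loc[fold]

# Position-based lookup fails: X_train only has positions 0 through 5.
try:
    X_train.iloc[fold]
    positional_error = None
except IndexError as exc:
    positional_error = exc

print(len(by_label))                     # rows found by label
print(type(positional_error).__name__)   # exception type raised by .iloc
```

This is why the `.loc` lines in the question work while the same numbers fail as fold indices: after subsetting, the labels 6 and 7 no longer correspond to valid row positions.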
Answer 1
Score: 0
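cross_validate treats the fold indices as row positions, and X_train has only six rows (positions 0 to 5), so 6 and 7 are out of bounds. As a workaround, you can use Index.get_indexer to convert the labels to index positions: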
def cv_folds(df, labels):
    for i, j in labels:
        i = df.index.get_indexer(i)
        j = df.index.get_indexer(j)
        yield (i.tolist(), j.tolist())

labels = [([0, 1, 2], [5, 6, 7]), ([5, 6, 7], [0, 1, 2])]
cv = cv_folds(X_train, labels)
cv_output = cross_validate(model, X_train, y_train, cv=cv,
                           scoring=['neg_mean_squared_error'])
Test:
>>> list(cv_folds(X_train, labels))
[([0, 1, 2], [3, 4, 5]), ([3, 4, 5], [0, 1, 2])] # <- positions
# [([0, 1, 2], [5, 6, 7]), ([5, 6, 7], [0, 1, 2])] # <- labels
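One caveat worth guarding against: `Index.get_indexer` returns -1 for any label it cannot find, and -1 is a valid position (the last row), so a mistyped label could silently put the wrong row in a fold. Below is a minimal sketch of a stricter variant, assuming the index labels are unique. The helper name `labels_to_positions` is hypothetical and not part of pandas or scikit-learn.

```python
import pandas as pd


def labels_to_positions(index, folds):
    """Convert (train_labels, test_labels) pairs into positional pairs.

    Hypothetical helper for illustration; assumes `index` has unique labels.
    """
    converted = []
    for train_labels, test_labels in folds:
        train_pos = index.get_indexer(train_labels)
        test_pos = index.get_indexer(test_labels)
        # get_indexer marks labels it cannot find with -1.
        if (train_pos == -1).any() or (test_pos == -1).any():
            raise KeyError("some fold labels are not in the index")
        converted.append((train_pos.tolist(), test_pos.tolist()))
    return converted


# Same index labels as X_train in the question.
index = pd.Index([0, 1, 2, 5, 6, 7])
labels = [([0, 1, 2], [5, 6, 7]), ([5, 6, 7], [0, 1, 2])]
folds = labels_to_positions(index, labels)
print(folds)
```

Because this returns a list rather than a generator, the same `folds` object can be passed as `cv=` to more than one call, whereas a generator is exhausted after its first use.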