March 12, 2023 14:28:07

Scikit-Learn cross validation function not allowing custom folds when indices are not sequential

Question

I am trying to pass custom cross-validation folds to scikit-learn's cross_validate function.

cross_validate seems to raise an error because it treats the fold indices as positions rather than as labels. The indices I pass in cv_folds are labels taken from the original dataframe's index. This matters because I want to use the value of a hash function to select subsets for my train-test split as well as for my CV folds. I get the following error: IndexError: indices are out-of-bounds

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_validate

# `model` is any scikit-learn regressor, defined elsewhere
df2 = pd.DataFrame(np.random.rand(8, 3), columns=['feature_1', 'feature_2', 'feature_3'])

train_index_list = [0, 1, 2, 5, 6, 7]
test_index_list = [3, 4]
X_train = df2.loc[train_index_list].drop(columns='feature_3').copy()
y_train = df2.loc[train_index_list]['feature_3'].copy()

# 2-fold cross validation, folds given as index labels
cv_folds = [([0, 1, 2], [5, 6, 7]), ([5, 6, 7], [0, 1, 2])]

cv_output = cross_validate(model, X_train, y_train, scoring=['neg_mean_squared_error'], cv=cv_folds)

This triggers the error. What puzzles me is that the following lines run just fine:

X_train.loc[train_index_list]
y_train.loc[train_index_list]

How do I resolve this so I can pass in my custom-defined cv folds into Scikit-Learn?
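
The mismatch can be reproduced without scikit-learn. The sketch below assumes, as the error message suggests, that the folds are applied by position (the equivalent of .iloc) rather than by label (.loc). The toy frame mirrors the one above, and the variable names by_label, by_position and positional_ok are illustrative only.

```python
import numpy as np
import pandas as pd

# Same shape as the example above: 8 rows, of which 6 are kept for training.
df2 = pd.DataFrame(np.random.rand(8, 3),
                   columns=['feature_1', 'feature_2', 'feature_3'])
X_train = df2.loc[[0, 1, 2, 5, 6, 7]].drop(columns='feature_3')

# Label-based lookup: 5, 6, 7 are index labels that exist in X_train.
by_label = X_train.loc[[5, 6, 7]]

# Position-based lookup: X_train has only 6 rows (positions 0..5),
# so positions 6 and 7 do not exist and pandas raises IndexError.
try:
    X_train.iloc[[5, 6, 7]]
    positional_ok = True
except IndexError:
    positional_ok = False

# The positions that correspond to labels 5, 6, 7 are 3, 4, 5.
by_position = X_train.iloc[[3, 4, 5]]
```

Under that assumption, .loc succeeds because the labels survive the train-test split, while any positional lookup past row 5 fails.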

Answer 1

Score: 0

As a workaround, you can use Index.get_indexer to convert the labels to index positions:

def cv_folds(df, labels):
    # Translate each (train, test) pair of index labels into row positions
    for i, j in labels:
        i = df.index.get_indexer(i)
        j = df.index.get_indexer(j)
        yield (i.tolist(), j.tolist())

labels = [([0, 1, 2], [5, 6, 7]), ([5, 6, 7], [0, 1, 2])]
cv = cv_folds(X_train, labels)
cv_output = cross_validate(model, X_train, y_train, cv=cv,
                           scoring=['neg_mean_squared_error'])

Test:

>>> list(cv_folds(X_train, labels))
    [([0, 1, 2], [3, 4, 5]), ([3, 4, 5], [0, 1, 2])]  # <- positions
#   [([0, 1, 2], [5, 6, 7]), ([5, 6, 7], [0, 1, 2])]  # <- labels
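
One caveat with this approach: Index.get_indexer returns -1 for any label it cannot find, and -1 is also a valid position (the last row), so a mistyped label could silently end up in the wrong fold. Below is a minimal end-to-end sketch that adds a check for this. The helper name labels_to_positions and the choice of LinearRegression as the model are assumptions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate


def labels_to_positions(index, label_folds):
    """Convert (train_labels, test_labels) pairs into positional folds.

    Hypothetical helper: raises KeyError if a label is missing, because
    Index.get_indexer marks missing labels with -1.
    """
    folds = []
    for train_labels, test_labels in label_folds:
        train_pos = index.get_indexer(train_labels)
        test_pos = index.get_indexer(test_labels)
        if (train_pos < 0).any() or (test_pos < 0).any():
            raise KeyError("fold contains labels that are not in the index")
        folds.append((train_pos, test_pos))
    return folds


rng = np.random.default_rng(0)
df2 = pd.DataFrame(rng.random((8, 3)),
                   columns=['feature_1', 'feature_2', 'feature_3'])
train_index_list = [0, 1, 2, 5, 6, 7]
X_train = df2.loc[train_index_list].drop(columns='feature_3')
y_train = df2.loc[train_index_list, 'feature_3']

# Folds expressed as index labels, as in the question.
label_folds = [([0, 1, 2], [5, 6, 7]), ([5, 6, 7], [0, 1, 2])]
cv = labels_to_positions(X_train.index, label_folds)

model = LinearRegression()  # assumed model; any regressor would do
cv_output = cross_validate(model, X_train, y_train, cv=cv,
                           scoring=['neg_mean_squared_error'])
```

Returning a list rather than a generator also means the same cv object can be reused across several cross_validate calls.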
