WEKA's performance on nominal dataset

Question

I used WEKA for classification, with the breast-cancer dataset that ships in WEKA's data folder. The dataset is a nominal dataset; the .arff file can be found at this link.

I performed classification using the Naive Bayes classifier. After classification, I received a report with metrics such as accuracy, precision, recall, ROC area, and others.

I am familiar with scikit-learn, the Python package. I know that when the input features are nominal, we need to convert them into numerical values using a label encoder or some other encoding technique; only after that can we perform classification.

All of these machine learning methods do some kind of mathematics in the background to produce a prediction.

Therefore, I am confused about how any classifier in WEKA can produce prediction results on a nominal dataset.

Answer 1

Score: 1

TL;DR: when designing software, complexity will exist somewhere.

scikit-learn assumes the user can write code to handle complexity; WEKA assumes the user can explain complexity with metadata.


The 2009 WEKA "update" publication describes some of the design motivations behind the software:

> 4.1 Core
>
> .... "Another addition to the core of WEKA is the 'Capabilities' meta-data facility. This framework allows individual learning algorithms and filters to declare what data characteristics they are able to handle. This, in turn, enables WEKA's user interfaces to present this information and provide feedback to the user about the applicability of a scheme for the data at hand."
>
> - Hall et al. (2009). https://doi.org/10.1145/1656274.1656278

In other words, it is assumed that the user can describe the attributes of the data. The breast-cancer dataset includes extensive annotation about the variables (age, menopause, ..., Class) and the values that they may take ('10-19', '20-29', ...):

@relation breast-cancer
@attribute age {'10-19','20-29','30-39','40-49','50-59','60-69','70-79','80-89','90-99'}
@attribute menopause {'lt40','ge40','premeno'}
...
@attribute 'Class' {'no-recurrence-events','recurrence-events'}
@data
'40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
'50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'

This provides enough context about what the inputs look like, which in turn implies which methods are appropriate.
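
That is also where the mathematics comes in: for nominal attributes, Naive Bayes needs no numeric encoding at all, only per-class frequency counts of each attribute value (WEKA's NaiveBayes uses a discrete, count-based estimator for such attributes). Below is a minimal Python sketch of that counting math with Laplace smoothing, on a toy sample in the spirit of the ARFF above; it is an illustration of the idea, not WEKA's implementation.

from collections import Counter, defaultdict

# Toy nominal data in the spirit of breast-cancer.arff: (age, menopause) -> class.
rows = [
    (("40-49", "premeno"), "recurrence-events"),
    (("50-59", "ge40"),    "no-recurrence-events"),
    (("40-49", "ge40"),    "no-recurrence-events"),
]

n_attrs = 2
class_counts = Counter(label for _, label in rows)   # class priors
value_counts = defaultdict(Counter)                  # (attribute, class) -> value counts
domains = [set() for _ in range(n_attrs)]            # distinct values per attribute
for values, label in rows:
    for i, v in enumerate(values):
        value_counts[(i, label)][v] += 1
        domains[i].add(v)

def predict(values):
    """Score each class by prior * product of P(value | class), with Laplace smoothing."""
    scores = {}
    for label, n in class_counts.items():
        score = n / len(rows)
        for i, v in enumerate(values):
            score *= (value_counts[(i, label)][v] + 1) / (n + len(domains[i]))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict(("40-49", "ge40")))   # -> 'no-recurrence-events' on this toy sample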


The 2013 scikit-learn API Design publication does not explicitly rule out ordinal string inputs like this. Nonetheless, the core API design principle of "Consistency" suggests some constraints.

Consider this:

from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
# Fit directly on string-valued (nominal) features -- this raises an error.
clf.fit([["a", "b"], ["b", "a"], ["a", "a"]], [0, 0, 1])

In scikit-learn==1.2.0, this produces an error:

ValueError: dtype='numeric' is not compatible with arrays of bytes/strings.
            Convert your data to numeric values explicitly instead.

One could imagine a version of Multinomial Naive Bayes where this code does not raise an error. However, it would be inconsistent if some estimators allowed these inputs while others did not. What if we were to apply Logistic Regression to this data? Should we one-hot encode the values or ordinal encode them? It's best to leave that detail to the user.
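
In scikit-learn that choice becomes an explicit preprocessing step chosen by the user. Here is a short sketch of the two options just mentioned, using sklearn.preprocessing on the same toy data that failed above:

import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X = np.array([["a", "b"], ["b", "a"], ["a", "a"]])
y = [0, 0, 1]

# Option 1: one-hot encoding -- each nominal value becomes its own binary column.
onehot_clf = make_pipeline(OneHotEncoder(handle_unknown="ignore"), MultinomialNB())
onehot_clf.fit(X, y)

# Option 2: ordinal encoding -- each nominal value becomes an integer code.
ordinal_clf = make_pipeline(OrdinalEncoder(), LogisticRegression())
ordinal_clf.fit(X, y)

print(onehot_clf.predict([["a", "a"]]))
print(ordinal_clf.predict([["a", "a"]]))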

The authors briefly address this as a "data representation" problem, and (perhaps) suggest WEKA is an interesting alternative model:

> 2.2 Data representation
>
> "In scikit-learn, we chose a representation of data that is as close as possible to the matrix representation: datasets are encoded as NumPy multidimensional arrays .... While these may seem rather unsophisticated data representations when compared to more object-oriented constructs, such as the ones used by Weka (Hall et al., 2009), they bring the prime advantage of allowing us to rely on efficient NumPy and SciPy vectorized operations while keeping the code short and readable."
>
> - Buitinck et al. (2013). https://arxiv.org/abs/1309.0238
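
The WEKA side of that contrast can also be driven from Python. Below is a rough sketch using the third-party python-weka-wrapper3 package (an assumption: that it and a JVM are installed, and that the ARFF path is adjusted to your local copy of breast-cancer.arff). Note that no encoding step appears anywhere -- the nominal metadata in the ARFF header is all the classifier needs.

import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.core.classes import Random
from weka.classifiers import Classifier, Evaluation

jvm.start()

# Load the nominal ARFF file; the header's @attribute lines carry the metadata.
loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_file("breast-cancer.arff")   # illustrative path
data.class_is_last()                            # 'Class' is the last attribute

# Naive Bayes consumes the nominal attributes directly.
nb = Classifier(classname="weka.classifiers.bayes.NaiveBayes")
evaluation = Evaluation(data)
evaluation.crossvalidate_model(nb, data, 10, Random(1))
print(evaluation.summary())

jvm.stop()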
