WEKA's performance on a nominal dataset

Question

I used WEKA for classification. I used the breast cancer dataset which is available in the WEKA data folder. The dataset is a nominal dataset. The .arff file can be found at this link.

I did the classification using the Naive Bayes classifier. After classification, I received a report with metrics such as accuracy, precision, recall, ROC, and others.

I am familiar with SkLearn, the Python package. I know that when the input features are nominal, we need to convert those features into numerical values using a label encoder or another encoding technique. Only after that can we perform classification.

All these machine learning methods are doing some kind of mathematics in the background to produce the prediction result.

Therefore, I am confused: how can any classifier in WEKA give us prediction results on a nominal dataset?

Answer 1

Score: 1

TL;DR: when designing software, complexity will exist somewhere.

scikit-learn assumes the user can write code to handle complexity; WEKA assumes the user can explain complexity with metadata.


The 2009 WEKA "update" publication describes some of the design motivations behind the software:

> 4.1 Core
>
> .... "Another addition to the core of WEKA is the 'Capabilities' meta-data facility. This framework allows individual learning algorithms and filters to declare what data characteristics they are able to handle. This, in turn, enables WEKA's user interfaces to present this information and provide feedback to the user about the applicability of a scheme for the data at hand."
>
> - Hall et al. (2009). https://doi.org/10.1145/1656274.1656278

In other words, it is assumed that the user can describe the attributes of the data. The breast-cancer dataset includes extensive annotation about the variables (age, menopause, ..., Class) and the ordinal values that they may take ('10-19', '20-29', ...):

@relation breast-cancer
@attribute age {'10-19','20-29','30-39','40-49','50-59','60-69','70-79','80-89','90-99'}
@attribute menopause {'lt40','ge40','premeno'}
...
@attribute 'Class' {'no-recurrence-events','recurrence-events'}
@data
'40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
'50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'

This provides enough context about what the inputs look like, which in turn implies which methods are appropriate.
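
For intuition about why no numeric conversion is conceptually needed: a Naive Bayes classifier on nominal attributes only has to count how often each value occurs within each class and turn those counts into conditional probabilities. The sketch below is not the WEKA implementation; it expresses the same idea in scikit-learn with CategoricalNB, using toy rows loosely modelled on the breast-cancer attributes and an OrdinalEncoder whose integer codes serve purely as category indices:

from sklearn.naive_bayes import CategoricalNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Toy rows with two nominal attributes (age bucket, menopause status).
X = [["40-49", "premeno"],
     ["50-59", "ge40"],
     ["60-69", "ge40"],
     ["30-39", "premeno"]]
y = ["recurrence-events", "no-recurrence-events",
     "no-recurrence-events", "recurrence-events"]

# OrdinalEncoder only assigns an integer index to each category so that
# CategoricalNB can address it; CategoricalNB then estimates per-class
# frequencies of each category, which is all Naive Bayes needs here.
clf = make_pipeline(OrdinalEncoder(dtype=int), CategoricalNB())
clf.fit(X, y)
print(clf.predict([["40-49", "ge40"]]))

The integer codes are never treated as ordered quantities; they are just labels for the frequency counts, which is essentially what a nominal attribute declaration in an .arff file provides up front.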


The 2013 scikit-learn API Design publication does not explicitly rule out ordinal string inputs like this. Nonetheless, the core API design principle of "Consistency" suggests some constraints.

Consider this:

from sklearn.naive_bayes import MultinomialNB

# Try to fit directly on string-valued (nominal) features.
clf = MultinomialNB()
clf.fit([["a", "b"], ["b", "a"], ["a", "a"]], [0, 0, 1])

In scikit-learn==1.2.0, this produces an error:

ValueError: dtype='numeric' is not compatible with arrays of bytes/strings.
            Convert your data to numeric values explicitly instead.

One could imagine a version of Multinomial Naive Bayes where this code does not raise an error. However, it would be inconsistent if some estimators allowed these inputs while others did not. What if we were to apply Logistic Regression to this data? Should we one-hot encode the values or ordinal encode them? It's best to leave that detail to the user.
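
To make that choice concrete, here is a rough sketch (toy data, not from the dataset above) of the two encodings a scikit-learn user would pick between before fitting, for example, Logistic Regression:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

X = [["a", "b"], ["b", "a"], ["a", "a"]]
y = [0, 0, 1]

# Option 1: one-hot encoding -- every nominal value gets its own binary
# column, so no ordering is implied between the values.
onehot_clf = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                           LogisticRegression())
onehot_clf.fit(X, y)

# Option 2: ordinal encoding -- every nominal value becomes an integer code,
# which does impose an (arbitrary) order on the values.
ordinal_clf = make_pipeline(OrdinalEncoder(), LogisticRegression())
ordinal_clf.fit(X, y)

print(onehot_clf.predict([["a", "b"]]), ordinal_clf.predict([["a", "b"]]))

Either pipeline runs without error; which one is appropriate depends on whether the values carry an order, and that is exactly the decision scikit-learn pushes onto the user rather than encoding in metadata.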

The authors briefly address this as a "data representation" problem, and (perhaps) suggest WEKA is an interesting alternative model:

> 2.2 Data representation
>
> "In scikit-learn, we chose a representation of data that is as close as possible to the matrix representation: datasets are encoded as NumPy multidimensional arrays .... While these may seem rather unsophisticated data representations when compared to more object-oriented constructs, such as the ones used by Weka (Hall et al., 2009), they bring the prime advantage of allowing us to rely on efficient NumPy and SciPy vectorized operations while keeping the code short and readable."
>
> - Buitinck et al. 2013 https://arxiv.org/abs/1309.0238
