Train a classification model using the "rpart" and "caret" libraries in R with four classes: how to define accuracy metric

Question

The train function from the caret package in R uses the following definition of accuracy for multiclass problems:

Accuracy for multiclass classification is the proportion of correctly predicted instances (samples) out of the total number of instances, counted over all classes at once. This is also known as overall accuracy.

Mathematically, the accuracy for multiclass classification can be calculated using the following formula:

Accuracy = (Number of Correctly Predicted Instances) / (Total Number of Instances)

For the model in this question, it is the overall accuracy of the classifier across all four classes ('Q1', 'Q2', 'Q3', 'Q4').

The following code trains a classification model using the "rpart" and "caret" libraries in R. It uses the train() function from the "caret" library to train the model with the "rpart" method, specifically using the Gini index for splitting. The trained model is stored in the variable classifier.

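library(rpart)
library(caret)

# Fit a CART classifier; the predictors are every column except "Target"
classifier <- train(x = training_set[, names(training_set) != "Target"],
                    y = training_set$Target,
                    method = 'rpart',
                    parms = list(split = "gini"),  # split on the Gini index
                    tuneLength = 20)               # evaluate 20 candidate cp values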

The variable classifier is as follows:

> classifier
CART 

7112 samples
  89 predictor
   4 classes: 'Q1', 'Q2', 'Q3', 'Q4' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 7112, 7112, 7112, 7112, 7112, 7112, ... 
Resampling results across tuning parameters:

  cp            Accuracy   Kappa    
  0.0002343457  0.9536618  0.9382023
  0.0002812148  0.9535851  0.9380999
  0.0003749531  0.9535394  0.9380391
  0.0004686914  0.9539980  0.9386511
  0.0005624297  0.9539678  0.9386110
  0.0006561680  0.9543640  0.9391389
  0.0007499063  0.9540123  0.9386694
  0.0008248969  0.9536724  0.9382163
  0.0010311211  0.9536133  0.9381370
  0.0011248594  0.9532129  0.9376029
  0.0014373203  0.9515384  0.9353684
  0.0029058868  0.9470504  0.9293828
  0.0042182227  0.9388870  0.9184975
  0.0052493438  0.9336715  0.9115402
  0.0082489689  0.9247140  0.8995937
  0.0133108361  0.9169616  0.8892603
  0.0221222347  0.9060093  0.8746638
  0.0380577428  0.8739447  0.8319098
  0.2065991751  0.8156983  0.7544120
  0.3101799775  0.4304355  0.2461903

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.000656168.

So it is a predictor with 4 classes, and the optimal model is selected by means of the accuracy metric.
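The selection step reported in the output can be reproduced by hand. The sketch below copies a few rows of the resampling table into a data frame named results (an illustrative name, not an object created by caret) and picks the row with the largest accuracy. With a fitted caret object, the full table should be available as classifier$results and the selected value as classifier$bestTune.

```r
# A few rows copied from the resampling table above
# (`results` is an illustrative name, not an object created by caret).
results <- data.frame(
  cp       = c(0.0005624297, 0.0006561680, 0.0007499063, 0.0029058868),
  Accuracy = c(0.9539678, 0.9543640, 0.9540123, 0.9470504),
  Kappa    = c(0.9386110, 0.9391389, 0.9386694, 0.9293828)
)

# Select the tuning value with the largest resampled accuracy
best <- results[which.max(results$Accuracy), ]
print(best$cp)
```

The selected cp matches the value reported in the last line of the output.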

In binary classification, accuracy is defined as the ratio of the number of correct predictions (true positives and true negatives) to the total number of predictions.

Mathematically, the accuracy can be calculated using the following formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where:

  • TP (True Positives) represents the number of instances correctly predicted as positive.
  • TN (True Negatives) represents the number of instances correctly predicted as negative.
  • FP (False Positives) represents the number of instances predicted as positive that are actually negative (Type I error).
  • FN (False Negatives) represents the number of instances predicted as negative that are actually positive (Type II error).
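As a quick check of the binary formula, here is a minimal sketch in R. The four counts are made-up values chosen only for illustration.

```r
# Hypothetical confusion counts for a binary classifier (illustrative values)
TP <- 40  # predicted positive, actually positive
TN <- 45  # predicted negative, actually negative
FP <- 5   # predicted positive, actually negative
FN <- 10  # predicted negative, actually positive

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy <- (TP + TN) / (TP + TN + FP + FN)
print(accuracy)
```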

What is the definition of accuracy used by train for multiclass problems?

Answer 1

Score: 2

For multiclass problems, you just extend the same definition of accuracy: the number of correctly classified observations divided by the total number of observations. A reputable source that defines a multiclass accuracy equation, in the context of map classification accuracy assessment, is Congalton, 1991. In that article, overall accuracy is computed by "dividing the total correct (i.e., the sum of the major diagonal) by the total number of pixels in the error matrix". For example, consider the following confusion matrix, where the predicted class is shown in the rows and the observed class in the columns:

Class   1      2      ...   q      Total
1       n_11   n_12   ...   n_1q   n_1.
2       n_21   n_22   ...   n_2q   n_2.
...     ...    ...    ...   ...    ...
q       n_q1   n_q2   ...   n_qq   n_q.
Total   n_.1   n_.2   ...   n_.q   n

The overall accuracy is the sum of all the diagonal entries n_kk, where n_kk is the number of correctly classified observations of class k, divided by the total number of observations n:

Overall accuracy = (n_11 + n_22 + ... + n_qq) / n
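The diagonal-sum calculation can be sketched in base R as follows. The observed and predicted labels are invented for illustration; only the class names 'Q1' to 'Q4' come from the question.

```r
# Made-up observed and predicted labels for the four classes in the question
classes   <- c("Q1", "Q2", "Q3", "Q4")
observed  <- factor(c("Q1", "Q1", "Q2", "Q2", "Q3", "Q3", "Q4", "Q4"),
                    levels = classes)
predicted <- factor(c("Q1", "Q2", "Q2", "Q2", "Q3", "Q4", "Q4", "Q4"),
                    levels = classes)

# Error matrix: predicted classes in rows, observed classes in columns
cm <- table(Predicted = predicted, Observed = observed)

# Overall accuracy: sum of the main diagonal divided by the total count n
overall_accuracy <- sum(diag(cm)) / sum(cm)
print(cm)
print(overall_accuracy)
```

The same number results from mean(predicted == observed), since both count the cases where the predicted class matches the observed one.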

Answer 2

Score: 0

In multiclass classification problems, the accuracy is calculated as the total number of correct predictions divided by the total number of predictions, just as in binary classification problems. However, the notion of "correct prediction" now extends beyond just true positives and true negatives, given that there are more than two classes.

That is, in multiclass classification the number of correct predictions is simply the count of instances where the predicted class matches the actual class, irrespective of what that class is. Hence, the accuracy in a multiclass classification problem is just:

Accuracy = (number of correct predictions) / (total number of predictions)

where:

The number of correct predictions represents the number of instances where the predicted class matches the actual class.

The total number of predictions is simply the count of all instances in the dataset.

This is the definition of accuracy used by the train function in the caret package for multiclass problems. In the output you've provided, the accuracy for each value of the complexity parameter (cp) is the proportion of held-out instances for which the model predicted the correct class, averaged over the 25 bootstrap resamples. See e.g. this paper for a nice review.
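To make the resampled accuracy concrete, here is a rough sketch of the idea using rpart directly. It is not caret's internal code: it assumes that each bootstrap repetition is scored on the rows left out of that bootstrap sample, and it uses the built-in iris data (three classes) as a stand-in because the question's training_set is not available.

```r
library(rpart)  # recommended package distributed with R

set.seed(1)
data(iris)      # stand-in data set with three classes
n_reps <- 25    # same number of bootstrap repetitions as in the output above

acc <- numeric(n_reps)
for (i in seq_len(n_reps)) {
  # Draw a bootstrap sample of row indices (with replacement)
  in_bag  <- sample(nrow(iris), replace = TRUE)
  # Rows never drawn form the held-out set for this repetition
  out_bag <- setdiff(seq_len(nrow(iris)), in_bag)

  fit  <- rpart(Species ~ ., data = iris[in_bag, ], method = "class",
                parms = list(split = "gini"))
  pred <- predict(fit, newdata = iris[out_bag, ], type = "class")

  # Multiclass accuracy: share of held-out rows whose predicted class matches
  acc[i] <- mean(pred == iris$Species[out_bag])
}

# One averaged accuracy value, analogous to a single row of the table
print(mean(acc))
```

caret repeats this kind of loop for every candidate cp value, which yields one averaged accuracy per row of the table.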

huangapple
  • Posted on June 14, 2023, 23:40:11
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76475308.html