Reduced dimensions visualization for true vs predicted values
Question
I have a dataframe which looks like this:
label  predicted  F1  F2  F3  ....  F40
major  minor       2   1   4
major  major       1   0  10
minor  patch       4   3  23
major  patch       2   1  11
minor  minor       0   4   8
patch  major       7   3  30
patch  minor       8   0   1
patch  patch       1   7  11
I have a label column, which is the true label for the id (not shown as it is not relevant), a predicted column, and then a set of around 40 features in my df.
The idea is to transform these 40 features into 2 dimensions and visualize them as true vs. predicted. With the three labels major, minor, and patch versus their predictions, there are 9 cases in total.
With PCA, 2 components do not capture much of the variance, and I am not sure how to map the PCA values back to the labels and predictions in the original df as a whole. One way to achieve this would be to split the data into 9 separate dataframes, one per case, and plot each, but this isn't what I am looking for.
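For reference, a minimal sketch of the PCA step being described (the dataframe name df, the F-prefixed feature-column naming, and the pca_x/pca_y column names are assumptions for illustration; index alignment is what keeps the components matched to label and predicted):
from sklearn.decomposition import PCA

# Fit PCA on the 40 feature columns only.
features = [c for c in df.columns if c.startswith('F')]
pca = PCA(n_components=2)
df[['pca_x', 'pca_y']] = pca.fit_transform(df[features])

# Check how much variance two components capture.
print(pca.explained_variance_ratio_.sum())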
Is there any other way I can reduce and visualize the given data? Any suggestions would be highly appreciated.
Answer 1
Score: 2
You may want to consider a small multiple plot with one scatterplot for each cell of the confusion matrix.
If PCA does not work well, t-distributed stochastic neighbor embedding (t-SNE) is often a good alternative in my experience.
For example, with the iris dataset, which also has three prediction classes, it could look like this:
import pandas as pd
import seaborn as sns
from sklearn.manifold import TSNE
iris = sns.load_dataset('iris')
# Mock up some predictions.
iris['species_pred'] = (40 * ['setosa'] + 5 * ['versicolor'] + 5 * ['virginica']
+ 40 * ['versicolor'] + 5 * ['setosa'] + 5 * ['virginica']
+ 40 * ['virginica'] + 5 * ['versicolor'] + 5 * ['setosa'])
# Show confusion matrix.
pd.crosstab(iris.species, iris.species_pred)
species_pred  setosa  versicolor  virginica
species
setosa            40           5          5
versicolor         5          40          5
virginica          5           5         40
# Reduce features to two dimensions.
X = iris.iloc[:, :4].values
X_embedded = TSNE(n_components=2, init='random',
                  learning_rate='auto').fit_transform(X)
iris[['tsne_x', 'tsne_y']] = X_embedded
# Plot small multiples, corresponding to confusion matrix.
sns.set()
g = sns.FacetGrid(iris, row='species', col='species_pred', margin_titles=True)
g.map(sns.scatterplot, 'tsne_x', 'tsne_y');
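Translated to the question's setup, the same pattern would be roughly as follows (a sketch, assuming the dataframe is named df with columns label, predicted, and F-prefixed features; note that t-SNE's default perplexity of 30 requires more rows than the 8 shown in the example table):
import seaborn as sns
from sklearn.manifold import TSNE

# Embed the 40 feature columns into two dimensions.
feature_cols = [c for c in df.columns if c.startswith('F')]
df[['tsne_x', 'tsne_y']] = TSNE(n_components=2, init='random',
                                learning_rate='auto').fit_transform(df[feature_cols])

# One scatterplot per (true label, predicted label) cell,
# mirroring the confusion matrix layout.
g = sns.FacetGrid(df, row='label', col='predicted', margin_titles=True)
g.map(sns.scatterplot, 'tsne_x', 'tsne_y')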