Why are the kmeans centroids far from the data? Python

Question

I'm building a kmeans model on data from Twitter, but when I plot the polarity and subjectivity scores in a scatterplot, the centroids (red x) appear far from the data:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
data = df[['Polarity', 'Subjetivity']].values

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

k = 3
kmeans = KMeans(n_clusters=k).fit(scaled_data)
cluster_labels = kmeans.labels_
df['Cluster'] = cluster_labels
centroides = kmeans.cluster_centers_
df_kmeans_center = pd.DataFrame(
    {
        'x1': centroides[:,0],
        'x2': centroides[:,1]
    }
)

sns.scatterplot(x='Polarity', y='Subjetivity', hue='Cluster', data=df,
                palette="flare")
sns.scatterplot(data=df_kmeans_center, x='x1', y='x2', marker='X', size=500, color='red')
plt.title('Seg. Tweets')
plt.xlabel('Polarity')
plt.ylabel('Subjetividad')
plt.show()

The result is this:

[Image: scatterplot of the clusters, with the centroids (red x) drawn far from the data]

Answer 1

Score: 2

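You can't plot the centroids without rescaling them if you plot your real data (without any transformation). KMeans was fit on the standardized data, so kmeans.cluster_centers_ are in standardized units, while the scatterplot shows the original Polarity and Subjetivity values. You have to invert the transformation: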

centroids = scaler.inverse_transform(kmeans.cluster_centers_)
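
To make the effect of that line concrete, here is a minimal sketch on synthetic data. The random values are only a stand-in for the Polarity and Subjetivity columns (an assumption, not the real tweets), and the names manual_centroids and inside are made up for illustration. It checks that inverse_transform simply undoes z = (x - mean) / std per feature, and that the rescaled centroids land inside the range of the original data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Polarity / Subjetivity columns (not the real tweets)
rng = np.random.default_rng(0)
data = np.column_stack([
    rng.uniform(-1.0, 1.0, size=300),  # polarity-like values in [-1, 1]
    rng.uniform(0.0, 1.0, size=300),   # subjectivity-like values in [0, 1]
])

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled_data)
scaled_centroids = kmeans.cluster_centers_                   # standardized units
real_centroids = scaler.inverse_transform(scaled_centroids)  # original units

# inverse_transform undoes z = (x - mean) / std, feature by feature
manual_centroids = scaled_centroids * scaler.scale_ + scaler.mean_
print(np.allclose(real_centroids, manual_centroids))

# After the inverse transform, the centroids sit inside the range of the data
inside = (real_centroids >= data.min(axis=0)) & (real_centroids <= data.max(axis=0))
print(inside.all())
```

The scaled centroids, by contrast, are expressed around zero with unit variance per feature, which is why they look misplaced on axes that show the raw values.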

Demo:

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from textblob import TextBlob

# Sample
df = pd.read_csv('vaccination_tweets.csv')
df['Polarity'] = df.text.apply(lambda x: TextBlob(x).polarity)
df['Subjetivity'] = df.text.apply(lambda x: TextBlob(x).subjectivity)

data = df[['Polarity', 'Subjetivity']].values
    
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

k = 3 
kmeans = KMeans(n_clusters=k, n_init=10).fit(scaled_data)
cluster_labels = kmeans.labels_
df['Cluster'] = cluster_labels

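# Centroids: KMeans was fit on scaled_data, so cluster_centers_ are in standardized units
scaled_centroids = kmeans.cluster_centers_
# Map them back to the original Polarity/Subjetivity units for plotting
real_centroids = scaler.inverse_transform(scaled_centroids)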

ax = sns.scatterplot(x='Polarity', y='Subjetivity', hue='Cluster', marker='.', data=df, palette='flare')
ax.scatter(real_centroids[:, 0], real_centroids[:, 1], s=100, marker='x', color='red', label='Centroid (real)')
ax.scatter(scaled_centroids[:, 0], scaled_centroids[:, 1], s=100, marker='x', color='green', label='Centroid (scaled)')

plt.legend(loc='lower right')
plt.show()

[Image: resulting scatterplot showing the real (red) and scaled (green) centroids]
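
Another way to keep points and centroids consistent, not part of the answer above, is to plot everything in the standardized space instead of converting the centroids back. The sketch below assumes synthetic data in place of the tweets; the function name plot_scaled_clusters and the axis labels are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def plot_scaled_clusters(data, k=3, seed=0):
    """Cluster standardized data and draw points and centroids in that same space."""
    scaled = StandardScaler().fit_transform(data)
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(scaled)

    fig, ax = plt.subplots()
    # Points and centroids are both in standardized units, so they line up
    ax.scatter(scaled[:, 0], scaled[:, 1], c=kmeans.labels_, marker=".")
    ax.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
               s=100, marker="x", color="red", label="Centroid (scaled)")
    ax.set_xlabel("Polarity (standardized)")
    ax.set_ylabel("Subjetivity (standardized)")
    ax.legend(loc="lower right")
    return ax, kmeans


# Synthetic stand-in for the two sentiment columns (not the real tweets)
rng = np.random.default_rng(0)
demo = np.column_stack([rng.uniform(-1, 1, 300), rng.uniform(0, 1, 300)])
ax, model = plot_scaled_clusters(demo)
```

Call plt.show() afterwards to display the figure. The trade-off is that the axes no longer read as raw polarity and subjectivity scores, so the inverse transform is usually the better choice when the plot is meant for interpretation.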

huangapple
  • Posted on June 6, 2023, 12:14:31
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76411411.html