Why are the kmeans centroids far from the data? Python

Question

I'm making a kmeans model with the data from Twitter, but when I apply the polarity and subjectivity analysis on the scatterplot, the centroids (red x) appear far from the data:

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    # df already holds the 'Polarity' and 'Subjetivity' columns
    data = df[['Polarity', 'Subjetivity']].values

    # Standardize both columns, then cluster the scaled values
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(data)
    k = 3
    kmeans = KMeans(n_clusters=k).fit(scaled_data)
    cluster_labels = kmeans.labels_
    df['Cluster'] = cluster_labels

    # Centroids as returned by the fitted model
    centroides = kmeans.cluster_centers_
    df_kmeans_center = pd.DataFrame(
        {
            'x1': centroides[:, 0],
            'x2': centroides[:, 1]
        }
    )

    # Plot the tweets colored by cluster, then the centroids on top
    sns.scatterplot(x='Polarity', y='Subjetivity', hue='Cluster', data=df,
                    palette="flare")
    sns.scatterplot(data=df_kmeans_center, x='x1', y='x2', marker='X', size=500, color='red')
    plt.title('Seg. Tweets')
    plt.xlabel('Polarity')
    plt.ylabel('Subjetividad')
    plt.show()

The result is this:

[image: Why are the kmeans centroids far from the data? Python]

Answer 1

Score: 2

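If you plot your real data (without any transformation), you can't plot the centroids without rescaling them first. KMeans was fitted on scaled_data, so kmeans.cluster_centers_ is expressed in standardized units, not in the original Polarity/Subjetivity units. You have to invert the transformation: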

    centroids = scaler.inverse_transform(kmeans.cluster_centers_)
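
As a minimal sketch of what inverse_transform does here: a fitted StandardScaler keeps the per-column mean and standard deviation in mean_ and scale_, and undoing the scaling amounts to x = z * scale_ + mean_. The synthetic data below is an assumption standing in for the Polarity/Subjetivity columns, and z is a made-up point in scaled units, not a value from the question.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Polarity/Subjetivity columns (assumed values).
rng = np.random.default_rng(0)
data = np.column_stack([
    rng.uniform(-1.0, 1.0, size=200),  # polarity-like column
    rng.uniform(0.0, 1.0, size=200),   # subjectivity-like column
])

scaler = StandardScaler()
scaled = scaler.fit_transform(data)

# A hypothetical point in scaled units, e.g. one row of cluster_centers_.
z = np.array([[1.5, -0.5]])

# Undo the standardization by hand and with the scaler itself.
manual = z * scaler.scale_ + scaler.mean_
restored = scaler.inverse_transform(z)

print(np.allclose(manual, restored))
```

The same call applied to the whole scaled array gives back the original data, which is why the restored centroids land in the same coordinate system as the plotted points.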

Demo:

    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from textblob import TextBlob

    # Sample
    df = pd.read_csv('vaccination_tweets.csv')
    df['Polarity'] = df.text.apply(lambda x: TextBlob(x).polarity)
    df['Subjetivity'] = df.text.apply(lambda x: TextBlob(x).subjectivity)

    # Scale the two features and cluster the scaled values
    data = df[['Polarity', 'Subjetivity']].values
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(data)
    k = 3
    kmeans = KMeans(n_clusters=k, n_init=10).fit(scaled_data)
    cluster_labels = kmeans.labels_
    df['Cluster'] = cluster_labels

    # Centroid: scaled units (as returned by KMeans) and original units
    scaled_centroids = kmeans.cluster_centers_
    real_centroids = scaler.inverse_transform(scaled_centroids)

    # Plot the unscaled data with both sets of centroids for comparison
    ax = sns.scatterplot(x='Polarity', y='Subjetivity', hue='Cluster', marker='.', data=df, palette='flare')
    ax.scatter(real_centroids[:, 0], real_centroids[:, 1], s=100, marker='x', color='red', label='Centroid (real)')
    ax.scatter(scaled_centroids[:, 0], scaled_centroids[:, 1], s=100, marker='x', color='green', label='Centroid (scaled)')
    plt.legend(loc='lower right')
    plt.show()
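
For readers without the CSV or textblob, the same idea can be checked numerically. This is a rough sketch on synthetic data: the three blobs, their centers, and random_state are assumptions chosen only so the example is reproducible. It compares the inverse-transformed centroids with the per-cluster means of the unscaled data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Three well-separated synthetic blobs (assumed data, not the tweets).
rng = np.random.default_rng(42)
blob_centers = np.array([[-0.6, 0.2], [0.0, 0.5], [0.6, 0.8]])
data = np.vstack([c + rng.normal(scale=0.05, size=(100, 2)) for c in blob_centers])

# Same pipeline as in the demo: scale, then cluster the scaled values.
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled_data)

# Centroids in scaled units, and mapped back to the original units.
scaled_centroids = kmeans.cluster_centers_
real_centroids = scaler.inverse_transform(scaled_centroids)

# Mean of the unscaled points assigned to each cluster.
cluster_means = np.vstack(
    [data[kmeans.labels_ == i].mean(axis=0) for i in range(3)]
)

print(np.allclose(real_centroids, cluster_means, atol=1e-6))
```

Because standardization is a per-column shift and rescale, the mean of a cluster in scaled units maps back to the mean of the same points in original units, so the restored centroids sit inside the data instead of away from it.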

huangapple
  • Posted on June 6, 2023 at 12:14:31
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76411411.html