Why are the kmeans centroids far from the data? Python
Question
I'm building a k-means model on data from Twitter, but when I plot the polarity and subjectivity values on a scatterplot, the centroids (red X) appear far from the data:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
data = df[['Polarity', 'Subjetivity']].values
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
k = 3
kmeans = KMeans(n_clusters=k).fit(scaled_data)
cluster_labels = kmeans.labels_
df['Cluster'] = cluster_labels
centroides = kmeans.cluster_centers_
df_kmeans_center = pd.DataFrame(
{
'x1': centroides[:,0],
'x2': centroides[:,1]
}
)
sns.scatterplot(x='Polarity', y='Subjetivity', hue='Cluster', data=df,
palette="flare")
sns.scatterplot(data=df_kmeans_center, x='x1', y='x2', marker='X', size=500, color='red')
plt.title('Seg. Tweets')
plt.xlabel('Polarity')
plt.ylabel('Subjetividad')
plt.show()
The result is this: [scatterplot image]
Answer 1
Score: 2
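If you plot the original data (without any transformation), you can't plot the centroids without rescaling them first. The model was fit on the standardized data, so kmeans.cluster_centers_ is expressed in scaled units, not in the units of Polarity and Subjetivity. You have to invert the transformation: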
centroids = scaler.inverse_transform(kmeans.cluster_centers_)
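To make the arithmetic behind that call concrete, here is a minimal sketch. It assumes synthetic two-column data and made-up centroid values (neither comes from the question), and only checks that inverse_transform applies x = z * scale_ + mean_ column by column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic two-column data standing in for Polarity / Subjetivity (assumption).
rng = np.random.default_rng(0)
data = np.column_stack([
    rng.uniform(-1.0, 1.0, size=200),  # polarity-like values in [-1, 1]
    rng.uniform(0.0, 1.0, size=200),   # subjectivity-like values in [0, 1]
])

scaler = StandardScaler().fit(data)

# Hypothetical centroids expressed in scaled (z-score) units.
scaled_centroids = np.array([[-1.2, 0.5], [0.0, -1.0], [1.3, 0.8]])

# StandardScaler computes z = (x - mean_) / scale_,
# so the inverse is x = z * scale_ + mean_.
manual = scaled_centroids * scaler.scale_ + scaler.mean_
restored = scaler.inverse_transform(scaled_centroids)

print(np.allclose(manual, restored))
```

Because the scaled values are centered on zero with unit variance, plotting them directly on axes that span the original Polarity and Subjetivity ranges puts the markers in the wrong place.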
Demo:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from textblob import TextBlob
# Sample
df = pd.read_csv('vaccination_tweets.csv')
df['Polarity'] = df.text.apply(lambda x: TextBlob(x).polarity)
df['Subjetivity'] = df.text.apply(lambda x: TextBlob(x).subjectivity)
data = df[['Polarity', 'Subjetivity']].values
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
k = 3
kmeans = KMeans(n_clusters=k, n_init=10).fit(scaled_data)
cluster_labels = kmeans.labels_
df['Cluster'] = cluster_labels
# Centroid
scaled_centroids = kmeans.cluster_centers_
real_centroids = scaler.inverse_transform(scaled_centroids)
ax = sns.scatterplot(x='Polarity', y='Subjetivity', hue='Cluster', marker='.', data=df, palette='flare')
ax.scatter(real_centroids[:, 0], real_centroids[:, 1], s=100, marker='x', color='red', label='Centroid (real)')
ax.scatter(scaled_centroids[:, 0], scaled_centroids[:, 1], s=100, marker='x', color='green', label='Centroid (scaled)')
plt.legend(loc='lower right')
plt.show()
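One way to sanity-check the rescaled centroids without looking at a plot is to compare them with the per-cluster means of the original columns. The sketch below is only an illustration under stated assumptions: it uses synthetic, well-separated blobs from make_blobs instead of the tweet data, and the names demo, real_centroids and cluster_means are chosen here for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic, well-separated blobs (assumption; not the tweet data).
X, _ = make_blobs(
    n_samples=300,
    centers=[[-0.6, 0.2], [0.0, 0.8], [0.6, 0.3]],
    cluster_std=0.05,
    random_state=0,
)
demo = pd.DataFrame(X, columns=['Polarity', 'Subjetivity'])

# Same pipeline as above: standardize, then cluster the scaled data.
scaler = StandardScaler()
scaled = scaler.fit_transform(demo.values)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)
demo['Cluster'] = kmeans.labels_

# Centroids mapped back to the original units.
real_centroids = scaler.inverse_transform(kmeans.cluster_centers_)

# Per-cluster means of the original columns, ordered by cluster label.
cluster_means = demo.groupby('Cluster')[['Polarity', 'Subjetivity']].mean().to_numpy()

print(np.allclose(real_centroids, cluster_means, atol=1e-6))
```

Standardization is an affine map, so it preserves means: once k-means has converged, each rescaled centroid should coincide with the mean of the original points assigned to that cluster, which is why it lands inside the point cloud.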