June 6, 2023 12:14:31

Why are the kmeans centroids far from the data? Python

Question

I'm building a k-means model on data from Twitter, but when I plot the polarity and subjectivity scores in a scatterplot, the centroids (red X) appear far from the data:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
# df is the tweets DataFrame with 'Polarity' and 'Subjetivity' columns
data = df[['Polarity', 'Subjetivity']].values
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
k = 3
kmeans = KMeans(n_clusters=k).fit(scaled_data)
cluster_labels = kmeans.labels_
df['Cluster'] = cluster_labels
centroides = kmeans.cluster_centers_
df_kmeans_center = pd.DataFrame(
    {
        'x1': centroides[:,0],
        'x2': centroides[:,1]
    }
)
sns.scatterplot(x='Polarity', y='Subjetivity', hue='Cluster', data=df,
                palette="flare")
sns.scatterplot(data=df_kmeans_center, x='x1', y='x2', marker='X', size=500, color='red')
plt.title('Seg. Tweets')
plt.xlabel('Polarity')
plt.ylabel('Subjetividad')
plt.show()

The result is this: [scatterplot image]

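Answer 1

Score: 2

If you plot your real data (without any transformation), you can't plot the centroids without rescaling them. The model was fitted on the standardized data, so kmeans.cluster_centers_ is expressed in scaled units, not in the units of your Polarity and Subjetivity columns. You have to invert the transformation: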

centroids = scaler.inverse_transform(kmeans.cluster_centers_)
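
To make the one-liner above less of a black box, here is a minimal sketch of what inverse_transform does for a StandardScaler fitted with its default settings. The toy array and the point_scaled value are made up for illustration and are not taken from the tweet data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on different scales.
raw = np.array([[-0.5, 0.2],
                [0.0, 0.5],
                [0.5, 0.8]])

scaler = StandardScaler()
scaled = scaler.fit_transform(raw)  # computes (raw - mean_) / scale_ per column

# Any point expressed in scaled coordinates, for example a centroid.
point_scaled = np.array([[1.0, -1.0]])

# inverse_transform maps it back to the original units...
point_real = scaler.inverse_transform(point_scaled)

# ...which matches undoing the standardization by hand.
by_hand = point_scaled * scaler.scale_ + scaler.mean_

print(point_real)
print(np.allclose(point_real, by_hand))
```

The same mapping applies row by row to kmeans.cluster_centers_, which is why the inverted centroids land back among the raw points.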

Demo:

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from textblob import TextBlob

# Sample data: score each tweet's polarity and subjectivity with TextBlob
df = pd.read_csv('vaccination_tweets.csv')
df['Polarity'] = df.text.apply(lambda x: TextBlob(x).polarity)
df['Subjetivity'] = df.text.apply(lambda x: TextBlob(x).subjectivity)
data = df[['Polarity', 'Subjetivity']].values

# Standardize the features, then cluster in the scaled space
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
k = 3
kmeans = KMeans(n_clusters=k, n_init=10).fit(scaled_data)
cluster_labels = kmeans.labels_
df['Cluster'] = cluster_labels

# Centroids: cluster_centers_ is in scaled units, so map it back to the original units
scaled_centroids = kmeans.cluster_centers_
real_centroids = scaler.inverse_transform(scaled_centroids)

# Plot the raw data with both sets of centroids for comparison
ax = sns.scatterplot(x='Polarity', y='Subjetivity', hue='Cluster', marker='.', data=df, palette='flare')
ax.scatter(real_centroids[:, 0], real_centroids[:, 1], s=100, marker='x', color='red', label='Centroid (real)')
ax.scatter(scaled_centroids[:, 0], scaled_centroids[:, 1], s=100, marker='x', color='green', label='Centroid (scaled)')
plt.legend(loc='lower right')
plt.show()
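
If you don't have the CSV at hand, the effect can also be checked numerically. The sketch below is only an illustration: it swaps the tweet scores for synthetic blobs (the blob centers, spread and random seed are assumptions, not values from the question) and compares the inverted centroids against the range of the raw data. As a cross-check it also computes the per-cluster means of the raw columns, which should agree with the inverted centroids once k-means has converged.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the tweet scores: three well-separated blobs.
rng = np.random.default_rng(0)
blob_centers = np.array([[-0.5, 0.2], [0.0, 0.5], [0.5, 0.8]])
raw = np.vstack([c + rng.normal(scale=0.05, size=(50, 2)) for c in blob_centers])
df = pd.DataFrame(raw, columns=['Polarity', 'Subjetivity'])

# Same pipeline as in the demo: standardize, then cluster in scaled space.
scaler = StandardScaler()
scaled = scaler.fit_transform(df.values)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)
df['Cluster'] = kmeans.labels_

scaled_centroids = kmeans.cluster_centers_
real_centroids = scaler.inverse_transform(scaled_centroids)

# Bounding box of the raw data, per feature.
lo, hi = raw.min(axis=0), raw.max(axis=0)

# Inverted centroids sit inside the raw data range...
inside = ((real_centroids >= lo) & (real_centroids <= hi)).all()
# ...while at least one scaled centroid coordinate falls outside it.
scaled_outside = ((scaled_centroids < lo) | (scaled_centroids > hi)).any()

# Cross-check: per-cluster means of the raw columns (rows ordered by label).
group_means = df.groupby('Cluster')[['Polarity', 'Subjetivity']].mean().to_numpy()
max_diff = np.abs(group_means - real_centroids).max()

print(inside, scaled_outside, max_diff)
```

The groupby cross-check is an alternative to inverse_transform rather than part of the original answer. It works here because standardization is an affine map, so taking the mean before or after scaling gives the same point.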
