Difference between histplot and pyplot?
Question
I have a CSV file named Price with a single column, 'phil', and 7412 rows:
[image]
I use seaborn's histplot and matplotlib's plot to draw the distribution together with a normal curve, using this code:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statistics
from scipy.stats import norm

df4 = pd.read_csv(r'C:\Users\ThuyNT13\Desktop\price.csv')
sns.histplot(df4['phil'], color='red', kde=True, stat='density')
mean4 = statistics.mean(df4['phil'])
sd4 = statistics.stdev(df4['phil'])
pdf4 = norm.pdf(df4['phil'].sort_values(), mean4, sd4)
plt.plot(df4['phil'].sort_values(), pdf4, label='Philippines', color='blue')
plt.ticklabel_format(style='plain')
plt.show()
The result shows two curves with different shapes:
[image]
Why do they differ, and what does each curve mean?
Answer 1
Score: 1
The KDE (in red) is just a smoothed estimate of the distribution's density.
Since you have a fairly large data set, it is more or less the same as the histogram (at a coarser resolution, obviously, but it follows the histogram).
The pdf you compute is that of the normal distribution whose mean and standard deviation equal those of your data.
The two curves would be roughly the same if your data actually followed a normal distribution.
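import seaborn
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats

# 10000 draws from a normal distribution with mean 3500 and standard deviation 500
df = pd.DataFrame({'phil': np.random.normal(3500, 500, 10000)})
# Histogram plus KDE curve (red), both on a density scale
seaborn.histplot(df.phil, color='red', kde=True, stat='density')
# Normal pdf using the sample's own mean and standard deviation
μ, σ = df.phil.mean(), df.phil.std()
phsort = df.phil.sort_values()
pdf = scipy.stats.norm.pdf(phsort, μ, σ)
plt.plot(phsort, pdf)  # default line color (blue)
plt.show()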
The histogram shows the actual distribution of the data.
The red curve is a smoothed version of the same thing.
The blue curve is the probability density of the normal distribution with the same mean and standard deviation.
And since the data were in fact drawn from a normal distribution, the blue curve unsurprisingly fits the data well (the actual μ and σ for this curve are 3495 and 505, which is well within what is expected when you draw 10000 numbers from a normal distribution with mean 3500 and standard deviation 500).
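As a rough check of the "well within what is expected" remark, the sketch below compares the reported values with the usual standard errors for a sample of 10000. The standard error used for the standard deviation is a large-sample approximation that assumes normal data, and the variable names are only illustrative.

```python
import math

n = 10000
true_mean, true_sd = 3500, 500   # parameters passed to np.random.normal above
obs_mean, obs_sd = 3495, 505     # values reported in the text

se_mean = true_sd / math.sqrt(n)      # standard error of the sample mean
se_sd = true_sd / math.sqrt(2 * n)    # approximate standard error of the sample std

# Express each deviation in units of its standard error.
z_mean = (obs_mean - true_mean) / se_mean
z_sd = (obs_sd - true_sd) / se_sd

print(f"mean: SE = {se_mean:.2f}, deviation = {z_mean:+.2f} SE")
print(f"std:  SE = {se_sd:.2f}, deviation = {z_sd:+.2f} SE")
```

Both deviations come out at under two standard errors, which is what ordinary sampling noise looks like.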
Now let's do the same thing with a distribution that is not normal at all:
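# 10000 draws from a uniform distribution between 500 and 7000
df = pd.DataFrame({'phil': np.random.uniform(500, 7000, 10000)})
seaborn.histplot(df.phil, color='red', kde=True, stat='density')
# Normal pdf with the same mean and standard deviation as the uniform sample
μ, σ = df.phil.mean(), df.phil.std()
phsort = df.phil.sort_values()
pdf = scipy.stats.norm.pdf(phsort, μ, σ)
plt.plot(phsort, pdf)
plt.show()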
Same as before: the histogram is the distribution of the actual data (drawn uniformly between 500 and 7000), and the red curve is just a smoothed version of it.
The blue curve is the normal distribution with μ=3775 (the mean of my uniform data) and σ=1876 (the standard deviation of my data).
It does not fit the data at all, of course: the mean and standard deviation are the same, but one distribution is normal and the other is not.
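The same comparison can be put into numbers without plotting. The sketch below uses only the standard library: it builds a density histogram by hand (what stat='density' draws), fits a normal distribution with the sample's mean and standard deviation, and reports the largest gap between the two as a fraction of the normal curve's peak. The helper name `max_relative_gap`, the bin count and the seed are arbitrary choices for illustration, not part of the original answer.

```python
import random
from statistics import NormalDist

def max_relative_gap(sample, bins=30):
    """Largest gap between the histogram density and the fitted normal pdf,
    as a fraction of the fitted normal pdf's peak height."""
    fitted = NormalDist.from_samples(sample)  # sample mean and sample std
    lo, hi = min(sample), max(sample)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in sample:
        # Clamp the index so the largest value lands in the last bin.
        counts[min(int((x - lo) / width), bins - 1)] += 1
    peak = fitted.pdf(fitted.mean)
    gap = 0.0
    for i, count in enumerate(counts):
        density = count / (len(sample) * width)  # bar height on a density scale
        center = lo + (i + 0.5) * width
        gap = max(gap, abs(density - fitted.pdf(center)))
    return gap / peak

rng = random.Random(0)
normal_sample = [rng.gauss(3500, 500) for _ in range(10000)]
uniform_sample = [rng.uniform(500, 7000) for _ in range(10000)]

print(f"normal data:  {max_relative_gap(normal_sample):.2f}")
print(f"uniform data: {max_relative_gap(uniform_sample):.2f}")
```

For the normal sample the gap should stay small (only sampling noise), while for the uniform sample it should be a large share of the peak, even though the fitted curve has exactly the sample's mean and standard deviation.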
The same goes for your data: they obviously do not follow a normal distribution. So your histogram and your red curve follow the distribution of your data, while the blue curve shows what the distribution would look like if the data were normally distributed with the same mean and standard deviation.
You can see how long the right tail of your data is compared with the left tail.
Very large values, even if there are only a few of them, pull the mean to the right.
(It is a bit like comparing median income with mean income: the mean is artificially high because of the superrich, who are few in number but very rich.) So, unsurprisingly, the normal distribution (which is symmetric, while your data are not) has its mean further to the right.
For the same reason, it also has a larger standard deviation than the actual concentration of your data would suggest, because of all those large values. Most of the data has to fall within about two standard deviations of the mean, and since the mean has been pulled away from the main group, the standard deviation has to be large.
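To see this effect on made-up numbers, the sketch below draws a right-skewed, strictly positive sample and compares the mean with the median, and the standard deviation with a spread estimate based on the quartiles. The log-normal shape and its parameters are assumptions chosen only to mimic a long right tail; they are not derived from the asker's file. The factor 1.349 is the ratio of interquartile range to standard deviation for a normal distribution.

```python
import random
import statistics

rng = random.Random(1)
# Hypothetical "prices": log-normal, so always positive with a long right tail.
prices = [rng.lognormvariate(9.2, 1.0) for _ in range(7412)]

mean = statistics.fmean(prices)
median = statistics.median(prices)
sd = statistics.stdev(prices)

# Spread of the central half of the data, rescaled to be comparable with
# a standard deviation if the data were normal (IQR = 1.349 * sd).
q1, _, q3 = statistics.quantiles(prices, n=4)
robust_sd = (q3 - q1) / 1.349

print(f"median = {median:.0f}, mean = {mean:.0f}")
print(f"quartile-based spread = {robust_sd:.0f}, standard deviation = {sd:.0f}")
```

With a tail like this, the mean lands well above the median, and the standard deviation is noticeably larger than the spread of the bulk of the data.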
The fitted normal distribution is therefore more spread out (both curves extend indefinitely, of course, but the normal one has a larger "95% interval"). Also, since your data are called "price", they obviously cannot be negative. A normal distribution, on the other hand, is symmetric: with a peak at 10000 or so and values up to 200000, that is 190000 above the peak, it would expect as many data points around -180000. Or, more realistically, since values around 50000 are not the majority but not exceptional either, it would expect the same number of values around -30000. So the normal curve is wider, and therefore has a lower peak, since the total has to be the same (the area under each curve is 1).
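The trade-off between width and peak height, and the weight a wide normal curve puts on negative values, can be illustrated with the standard library's `NormalDist`. The mean and the two standard deviations below are hypothetical round numbers, not estimates from the asker's data.

```python
from statistics import NormalDist

# Hypothetical numbers, not taken from the asker's file.
mean = 20000
narrow = NormalDist(mean, 10000)
wide = NormalDist(mean, 30000)

for name, dist in (("narrow", narrow), ("wide", wide)):
    peak = dist.pdf(mean)      # height of the curve at its center
    below_zero = dist.cdf(0)   # share of the area lying over negative prices
    print(f"{name}: peak = {peak:.2e}, area below 0 = {below_zero:.1%}")
```

Because the total area is fixed at 1, tripling the standard deviation divides the peak height by three, and a much larger share of the curve ends up over impossible negative prices.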
Long story short: your data do not follow a normal distribution, so, unsurprisingly, the normal density curve does not look like the density curve of your data.