找到从计算的伽玛分布中的百分位数和值

huangapple go评论101阅读模式
英文:

Finding Percentiles and Values From Calculated Gamma Distribution

问题

Here is the translated text from your provided content:

背景

我正在使用Numpy(ndarray)计算一系列最佳拟合伽马曲线的工作,这是一个关于这个问题起源的先前问题,可以在这里找到。

以前使用过Scipy(scipy.gamma.stats),但是该库并不针对多维数组进行优化,因此编写了一个最基本的函数来满足目标。我已经成功地拟合了一个曲线(尽管不像Scipy那样干净),该曲线如下所示。

当前问题

我想要获取给定值的百分位数,反之亦然,沿着计算得到的伽马分布。但是,我得到的拟合曲线上的期望值与预期值不符。例如,提供50th百分位数会得到一个值为4.471的值,这与下面显示的曲线拟合不符。可以进行哪些修改或整体更改以从提供的数据中得出百分位数和值?

图表

找到从计算的伽玛分布中的百分位数和值

代码

  1. import sys, os, math
  2. import numpy as np
  3. import scipy as sci
  4. import matplotlib.pyplot as plt
  5. data = np.array([0.00, 0.00, 11.26399994, 0.00, 0.00, 0.00, 0.00, 0.00, ... ]) # 这里是你的数据
  6. def scigamma(data):
  7. # 这是你的函数实现
  8. return cdf
  9. scicdf = scigamma(data)
  10. # 这是更多的代码,包括伽马分布拟合和绘图等。
  1. 请注意,这是您提供的内容的翻译,其中包括代码部分。如果您需要任何进一步的帮助或有其他问题,请随时提问。
  2. <details>
  3. <summary>英文:</summary>
  4. **Background**
  5. I am working on computing a series of best-fit gamma curves for a 2-D dataset in Numpy (ndarray), a prior question for the genesis of this can be found [here][1].
  6. Scipy was previously utilized (scipy.gamma.stats), however this library is not optimized for multi-dimensional arrays and a barebones function was written to meet the objective. I&#39;ve successfully fit (albeit not as cleanly as Scipy) a curve to the dataset, which is provided below.
  7. **Current Issue**
  8. I want to obtain the percentile of a given value and vice versa along the calculated gamma distribution. However, I&#39;m not obtaining expected values off the fitted curve. For example, providing the 50th percentile yields a value of 4.471, which does not match up with the curve fit shown below. What modifications or wholesale alterations can be made to yield both percentiles and values from supplied data?
  9. **Graph**
  10. [![Gamma Fitting Output][2]][2]
  11. **Code**
  12. import sys, os, math
  13. import numpy as np
  14. import scipy as sci
  15. import matplotlib.pyplot as plt
  16. data = np.array([0.00, 0.00, 11.26399994, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 17.06399918, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 8.33279991, 0.00, 7.54879951, 0.00, 0.00, 0.00, 4.58799982, 7.9776001, 0.00, 0.00, 0.00, 0.00, 11.45040035, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 18.73279953, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 8.94559956, 0.00, 7.73040009, 0.00, 0.00, 0.00, 5.03599977, 8.62639999, 0.00, 0.00, 0.00, 0.00, 11.11680031, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 14.37839985, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 8.16479969, 0.00, 7.30719948, 0.00, 0.00, 0.00, 3.41039991, 7.17280006, 0.00, 0.00, 0.00, 0.00, 10.0099199963, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 13.97839928, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 7.6855998, 0.00, 6.86559963, 0.00, 0.00, 0.00, 3.21600008, 7.93599987, 0.00, 0.00, 0.00, 0.00, 11.55999947, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 18.76399994, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 10.0033039951, 0.00, 8.10639954, 0.00, 0.00, 0.00, 4.76480007, 6.87679958, 0.00, 0.00, 0.00, 0.00, 11.42239952, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 19.42639732, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 10.0052400017, 0.00, 8.2567997, 0.00, 0.00, 0.00, 5.08239985, 7.9776001, 0.00, 0.00, 0.00, 0.00, 10.0099839973, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 11.5855999, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 7.88399982, 0.00, 5.96799994, 0.00, 0.00, 0.00, 3.07679987, 7.81360006, 0.00, 0.00, 0.00, 0.00, 11.51119995, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 20.0030959892, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 10.0050879955, 0.00, 8.20479965, 0.00, 0.00, 0.00, 5.51599979, 9.02879906, 0.00, 0.00])
  17. def scigamma(data):
  18. param = sci.stats.gamma.fit(data)
  19. x = np.linspace(0, np.max(data), 250)
  20. cdf = sci.stats.gamma.cdf(x, *param)
  21. value = np.round((sci.stats.gamma.cdf(0.10, *param) * 100), 2)
  22. percentile = np.round((sci.stats.gamma.ppf(50.00, *param) * 100), 2)
  23. return cdf
  24. scicdf = scigamma(data)
  25. # Method of Moments estimation
  26. mean = np.mean(data)
  27. variance = np.var(data)
  28. alpha = mean**2 / variance
  29. beta = variance / mean
  30. # Generate x-axis values for the curves
  31. x = np.linspace(0, np.max(data), 250)
  32. # Calculate the gamma distribution PDF values
  33. pdf = (x ** (alpha - 1) * np.exp(-x / beta)) / (beta ** alpha * np.math.gamma(alpha))
  34. # Calculate the gamma distribution CDF values
  35. cdf = np.zeros_like(x)
  36. cdf[x &gt; 0] = np.cumsum(pdf[x &gt; 0]) / np.sum(pdf[x &gt; 0])
  37. # Estimate the probability of zero values
  38. num_zeros = np.count_nonzero(data == 0)
  39. zero_probability = np.count_nonzero(data == 0) / len(data)
  40. # Calculate the PDF and CDF values at zero
  41. pdf_zero = zero_probability / (beta ** alpha * np.math.gamma(alpha))
  42. cdf_zero = zero_probability
  43. value = 2.50
  44. percentile = 0.50
  45. index = np.argmax(pdf &gt;= value)
  46. # Calculate the percentile using numerical integration
  47. pct = np.trapz(pdf[:index+1], dx=1) + (value - pdf[index]) * (cdf[index] - cdf[index-1]) / (pdf[index-1] - pdf[index])
  48. index = np.argmax(cdf &gt;= percentile)
  49. # Calculate the value using numerical integration
  50. val = np.trapz(cdf[:index+1], dx=1) + (percentile - cdf[index-1]) * (pdf[index] - pdf[index-1]) / (cdf[index] - cdf[index-1])
  51. # Plot the data histogram
  52. plt.hist(data, bins=30, density=True, alpha=0.5, label=&#39;data&#39;)
  53. # Plot the gamma distribution CDF curve
  54. plt.plot(x, cdf, &#39;b&#39;, label=&#39;Gamma CDF | Custom Fit&#39;)
  55. plt.plot(x, scicdf, &#39;k&#39;, label=&#39;Gamma CDF | SciPy Fit&#39;)
  56. # Set plot labels and legend
  57. plt.xlabel(&#39;data&#39;)
  58. plt.ylabel(&#39;Probability&#39;)
  59. plt.legend()
  60. [1]: https://stackoverflow.com/q/75598344/10266106
  61. [2]: https://i.stack.imgur.com/QchWe.png
  62. </details>
  63. # 答案1
  64. **得分**: 1
  65. I'm not clear on why you don't trust `sci.stats`. If it's because of performance, have you actually profiled? It's worth noting that neither your fit nor scipy's fit are particularly good, because your data are strongly skewed toward 0.
  66. When initializing your second `x`, don't start it at 0 because your calculations are not well-defined for that point, which (a) dumps warnings on the console and (b) requires a bunch of indexing gymnastics later.
  67. I'm also not clear on why you're attempting `trapz`. Why integrate when you already have a CDF? Just use the CDF; and also don't `argmax`; use `searchsorted`.
  68. Finally, your plotting is slightly a disaster. For apples-to-apples comparison it's very important that you pass `cumulative` to `hist`.
  69. ```python
  70. import numpy as np
  71. import scipy as sci
  72. import matplotlib.pyplot as plt
  73. data = np.array([
  74. 0. , 0. , 11.26399994, 0. , 0. ,
  75. 0. , 0. , 0. , 0. , 0. ,
  76. 17.06399918, 0. , 0. , 0. , 0. ,
  77. 0. , 0. , 8.33279991, 0. , 7.54879951,
  78. 0. , 0. , 0. , 4.58799982, 7.9776001 ,
  79. 0. , 0. , 0. , 0. , 11.45040035,
  80. 0. , 0. , 0. , 0. , 0. ,
  81. 0. , 0. , 18.73279953, 0. , 0. ,
  82. 0. , 0. , 0. , 0. , 8.94559956,
  83. 0. , 7.73040009, 0. , 0. , 0. ,
  84. 5.03599977, 8.62639999, 0. , 0. , 0. ,
  85. 0. , 11.11680031, 0. , 0. , 0. ,
  86. 0. , 0. , 0. , 0. , 14.37839985,
  87. 0. , 0. , 0. , 0. , 0. ,
  88. 0. , 8.16479969, 0. , 7.30719948, 0. ,
  89. 0. , 0. , 3.41039991, 7.17280006, 0. ,
  90. 0. , 0. , 0. , 10.00992 , 0. ,
  91. 0. , 0. , 0. , 0. , 0. ,
  92. 0. , 13.97839928, 0. , 0. , 0. ,
  93. 0. , 0. , 0. , 7.6855998 , 0. ,
  94. 6.86559963, 0. , 0. , 0. , 3.21600008,
  95. 7.93599987, 0. , 0. , 0. , 0. ,
  96. 11.55999947, 0. , 0. , 0. , 0. ,
  97. 0. , 0. , 0. , 18.76399994, 0. ,
  98. 0. , 0. , 0. , 0. , 0. ,
  99. 10.003304 , 0. , 8.10639954, 0. , 0. ,
  100. 0. , 4.76480007, 6.87679958, 0. , 0. ,
  101. 0. , 0. , 11.42239952, 0. , 0. ,
  102. 0. , 0. , 0. , 0. , 0. ,
  103. 19.42639732, 0. , 0. , 0. , 0. ,
  104. 0. , 0. , 10.00524 , 0. , 8.2567997 ,
  105. 0. , 0. , 0. , 5.08239985, 7.9776001 ,
  106. 0. , 0. , 0. , 0. , 10.009984 ,
  107. 0. , 0. , 0. , 0. , 0. ,
  108. 0. , 0. , 11.5855999 , 0. , 0. ,
  109. 0. , 0. , 0. , 0. , 7.88399982,
  110. 0. , 5.96799994, 0. , 0. , 0. ,
  111. 3.07679987, 7.81360006, 0. , 0. , 0. ,
  112. 0. , 11.51119995, 0. , 0. , 0. ,
  113. 0. , 0. , 0. , 0. , 20.00309599,
  114. 0. , 0. , 0. , 0. , 0. ,
  115. 0. , 10.005088 , 0. , 8.20479965, 0. ,
  116. <details>
  117. <summary>英文:</summary>
  118. I&#39;m not clear on why you don&#39;t trust `sci.stats`. If it&#39;s because of performance, have you actually profiled? It&#39;s worth noting that neither your fit nor scipy&#39;s fit are particularly good, because your data are strongly skewed toward 0.
  119. When initialising your second `x`, don&#39;t start it at 0 because your calculations are not well-defined for that point, which (a) dumps warnings on the console and (b) requires a bunch of indexing gymnastics later.
  120. I&#39;m also not clear on why you&#39;re attempting `trapz`. Why integrate when you already have a CDF? Just use the CDF; and also don&#39;t `argmax`; use `searchsorted`.
  121. Finally, your plotting is slightly a disaster. For apples-to-apples comparison it&#39;s very important that you pass `cumulative` to `hist`.
  122. ```python
  123. import numpy as np
  124. import scipy as sci
  125. import matplotlib.pyplot as plt
  126. data = np.array([
  127. 0. , 0. , 11.26399994, 0. , 0. ,
  128. 0. , 0. , 0. , 0. , 0. ,
  129. 17.06399918, 0. , 0. , 0. , 0. ,
  130. 0. , 0. , 8.33279991, 0. , 7.54879951,
  131. 0. , 0. , 0. , 4.58799982, 7.9776001 ,
  132. 0. , 0. , 0. , 0. , 11.45040035,
  133. 0. , 0. , 0. , 0. , 0. ,
  134. 0. , 0. , 18.73279953, 0. , 0. ,
  135. 0. , 0. , 0. , 0. , 8.94559956,
  136. 0. , 7.73040009, 0. , 0. , 0. ,
  137. 5.03599977, 8.62639999, 0. , 0. , 0. ,
  138. 0. , 11.11680031, 0. , 0. , 0. ,
  139. 0. , 0. , 0. , 0. , 14.37839985,
  140. 0. , 0. , 0. , 0. , 0. ,
  141. 0. , 8.16479969, 0. , 7.30719948, 0. ,
  142. 0. , 0. , 3.41039991, 7.17280006, 0. ,
  143. 0. , 0. , 0. , 10.00992 , 0. ,
  144. 0. , 0. , 0. , 0. , 0. ,
  145. 0. , 13.97839928, 0. , 0. , 0. ,
  146. 0. , 0. , 0. , 7.6855998 , 0. ,
  147. 6.86559963, 0. , 0. , 0. , 3.21600008,
  148. 7.93599987, 0. , 0. , 0. , 0. ,
  149. 11.55999947, 0. , 0. , 0. , 0. ,
  150. 0. , 0. , 0. , 18.76399994, 0. ,
  151. 0. , 0. , 0. , 0. , 0. ,
  152. 10.003304 , 0. , 8.10639954, 0. , 0. ,
  153. 0. , 4.76480007, 6.87679958, 0. , 0. ,
  154. 0. , 0. , 11.42239952, 0. , 0. ,
  155. 0. , 0. , 0. , 0. , 0. ,
  156. 19.42639732, 0. , 0. , 0. , 0. ,
  157. 0. , 0. , 10.00524 , 0. , 8.2567997 ,
  158. 0. , 0. , 0. , 5.08239985, 7.9776001 ,
  159. 0. , 0. , 0. , 0. , 10.009984 ,
  160. 0. , 0. , 0. , 0. , 0. ,
  161. 0. , 0. , 11.5855999 , 0. , 0. ,
  162. 0. , 0. , 0. , 0. , 7.88399982,
  163. 0. , 5.96799994, 0. , 0. , 0. ,
  164. 3.07679987, 7.81360006, 0. , 0. , 0. ,
  165. 0. , 11.51119995, 0. , 0. , 0. ,
  166. 0. , 0. , 0. , 0. , 20.00309599,
  167. 0. , 0. , 0. , 0. , 0. ,
  168. 0. , 10.005088 , 0. , 8.20479965, 0. ,
  169. 0. , 0. , 5.51599979, 9.02879906, 0. ,
  170. 0. ])
  171. def scigamma(data: np.ndarray) -&gt; np.ndarray:
  172. param = sci.stats.gamma.fit(data)
  173. x = np.linspace(0, np.max(data), 250)
  174. cdf = sci.stats.gamma.cdf(x, *param)
  175. return cdf
  176. scicdf = scigamma(data)
  177. # Method of Moments estimation
  178. mean = np.mean(data)
  179. variance = np.var(data)
  180. alpha = mean**2 / variance
  181. beta = variance / mean
  182. # Generate x-axis values for the curves
  183. # Do not start at 0
  184. x = np.linspace(0.1, np.max(data), 250)
  185. # Calculate the gamma distribution PDF values
  186. pdf = x**(alpha - 1) * np.exp(-x / beta) / (beta**alpha * np.math.gamma(alpha))
  187. # Calculate the gamma distribution CDF values
  188. cdf = np.cumsum(pdf) / np.sum(pdf)
  189. def plot() -&gt; None:
  190. plt.hist(data, bins=50, density=True, cumulative=True, alpha=0.5, label=&#39;data&#39;)
  191. plt.plot(x, cdf, &#39;b&#39;, label=&#39;Gamma CDF | Custom Fit&#39;)
  192. plt.plot(x, scicdf, &#39;k&#39;, label=&#39;Gamma CDF | SciPy Fit&#39;)
  193. plt.xlabel(&#39;data&#39;)
  194. plt.ylabel(&#39;Probability&#39;)
  195. plt.legend()
  196. plt.show()
  197. def check_percentiles() -&gt; None:
  198. # Calculate the percentile using numerical integration
  199. value = 2.50
  200. index = np.searchsorted(x, value)
  201. pct = cdf[index]
  202. print(f&#39;x={value} is at percentile {pct:.3f}&#39;)
  203. # Calculate the value using numerical integration
  204. pct = 0.50
  205. index = np.searchsorted(cdf, pct)
  206. val = cdf[index]
  207. print(f&#39;percentile {pct} is x={val:.3f}&#39;)
  208. check_percentiles()
  209. plot()
  1. x=2.5 is at percentile 0.680
  2. percentile 0.5 is x=0.503

找到从计算的伽玛分布中的百分位数和值

Much more directly, though,

  1. param = sci.stats.gamma.fit(data)
  2. # Calculate the percentile
  3. value = 2.50
  4. pct = scipy.stats.gamma.cdf(value, *param)
  5. print(f&#39;x={value} is at percentile {pct:.3f}&#39;)
  6. # Calculate the value
  7. pct = 0.50
  8. val = scipy.stats.gamma.ppf(pct, *param)
  9. print(f&#39;percentile {pct} is x={val:.3f}&#39;)

It's also worth noting that if you want scipy to produce a fit similar to yours, then pass method=&#39;MM&#39; to fit():

找到从计算的伽玛分布中的百分位数和值

As in the comments, a single gamma will not fit your data well. Consider instead a simple bimodal where the first CDF is a Heaviside step and the second CDF is still assumed gamma:

  1. x = np.linspace(start=0, stop=20, num=250)
  2. # first mode of the bimodal distribution produces a heaviside step in the CDF
  3. nonzero_data = data[data.nonzero()]
  4. heaviside = 1 - nonzero_data.size/data.size
  5. cdf = np.full_like(x, fill_value=heaviside)
  6. # second mode of the bimodal distribution assumed still gamma
  7. params = scipy.stats.gamma.fit(nonzero_data)
  8. print(&#39;Gamma params:&#39;, params)
  9. cdf += scipy.stats.gamma.cdf(x, *params)*(1 - heaviside)
  10. plt.hist(data, bins=50, density=True, cumulative=True, alpha=0.5, label=&#39;data&#39;)
  11. plt.plot(x, cdf, label=&#39;fit&#39;)
  12. plt.xlabel(&#39;data&#39;)
  13. plt.ylabel(&#39;Probability&#39;)
  14. plt.legend(loc=&#39;right&#39;)
  15. plt.show()
  1. Gamma params: (3.9359070026702017, 1.3587246910542263, 2.0440735483494272)

找到从计算的伽玛分布中的百分位数和值

答案2

得分: 0

使用以下变量:

  1. x = np.linspace(0, np.max(data), 250)
  2. cdf[x > 0] = np.cumsum(pdf[x > 0]) / np.sum(pdf[x > 0])

你可以使用 scipy 的 interpolate 来拟合函数,以获取从数值到百分位数和反之亦然的结果:

  1. from scipy import interpolate
  2. f_x_to_percentile = interpolate.interp1d(x, cdf)
  3. x_val = 2.5
  4. print(f"Percentile of value {x_val}: {f_x_to_percentile(x_val)}")
  5. # Percentile of value 2.5: 0.6864894891059873
  6. f_percentile_to_x = interpolate.interp1d(cdf, x)
  7. percentile_val = 0.5
  8. print(f"Value at percentile {percentile_val*100}: {f_percentile_to_x(percentile_val)}")
  9. # Value at percentile 50.0: 1.015695412520532

更新:

如果你决定继续使用 scipy 来拟合你的分布,你可以直接使用 param 对象中的拟合参数,使用 .cdf().ppf() 方法获取结果:

  1. from scipy import stats
  2. param = stats.gamma.fit(data)
  3. x_val = 2.5
  4. percentile = stats.gamma(*param).cdf(x_val)
  5. print(f"Percentile of value {x_val}: {percentile}")
  6. # Percentile of value 2.5: 0.7681790728985926
  7. percentile_val = 0.5
  8. value_at_percentile = stats.gamma(*param).ppf(0.5)
  9. print(f"Value at percentile {percentile_val*100}: {value_at_percentile}")
  10. # Value at percentile 50.0: 0.7536517552749796
英文:

Using the variables below

  1. x = np.linspace(0, np.max(data), 250)
  2. cdf[x &gt; 0] = np.cumsum(pdf[x &gt; 0]) / np.sum(pdf[x &gt; 0])

you could fit a curve using scipy's interpolate to fit functions to obtain percentile from values and vice versa:

  1. from scipy import interpolate
  2. f_x_to_percentile = interpolate.interp1d(x, cdf)
  3. x_val = 2.5
  4. print(f&quot;Percentile of value {x_val}: {f_x_to_percentile(x_val)}&quot;)
  5. # Percentile of value 2.5: 0.6864894891059873
  6. f_percentile_to_x= interpolate.interp1d(cdf, x)
  7. percentile_val = 0.5
  8. print(f&quot;Value at percentile {percentile_val*100}: {f_percentile_to_x(percentile_val)}&quot;)
  9. # Value at percentile 50.0: 1.015695412520532

Update:

If you decide to continue using scipy to fit your distribution, you can directly use the fitted parameters in your param object to obtain the results using the .cdf() and .ppf() methods:

  1. from scipy import stats
  2. param = stats.gamma.fit(data)
  3. x_val = 2.5
  4. percentile = stats.gamma(*param).cdf(x_val)
  5. print(f&quot;Percentile of value {x_val}: {percentile}&quot;)
  6. # Percentile of value 2.5: 0.7681790728985926
  7. percentile_val = 0.5
  8. value_at_percentile = stats.gamma(*param).ppf(0.5)
  9. print(f&quot;Value at percentile {percentile_val*100}: {value_at_percentile}&quot;)
  10. # Value at percentile 50.0: 0.7536517552749796

huangapple
  • 本文由 发表于 2023年5月23日 00:15:20
  • 转载请务必保留本文链接:https://go.coder-hub.com/76308118.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定