Python fit and chi square comparison

Question
I want to know which function fits my data better by calculating a chi-square value for each fit and then comparing them.
The data I have could look either like a Gaussian or roughly like a horizontal line, so my goal is to fit my data to both functions each time (using scipy.optimize.curve_fit) and then calculate the chi-square value (using scipy.stats). But can I directly compare both values and take the 'lowest' one? Both fits use the same number of data points but a different number of parameters (hence not the same degrees of freedom): 4 parameters for my Gaussian and 1 for my horizontal line.
import numpy as np

# Define the horizontal-line function
def horiz(x, c):
    return np.full_like(x, c)

# Define the Gaussian function
def gauss(x, H, A, x0, sigma):
    return H + A * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2))
Example of code below (moy, x, Smooth_data and Counts_list come from my data):

from scipy.optimize import curve_fit
from scipy import stats

p0_c = [moy]  # initial guess for the constant level
popt_c, pcov = curve_fit(horiz, x, Smooth_data, p0=p0_c)
fit_c = horiz(x, *popt_c)
chi_c, p_c = stats.chisquare(Counts_list[i], fit_c)
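For context, here is a minimal, self-contained sketch of the comparison I have in mind. The synthetic data, the assumed per-point uncertainty sigma_y, and the reduced chi-square comparison are illustrative stand-ins, not my real pipeline; dividing by the degrees of freedom is one conventional way to account for the different parameter counts:

import numpy as np
from scipy.optimize import curve_fit

def horiz(x, c):
    return np.full_like(x, c)

def gauss(x, H, A, x0, sigma):
    return H + A * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2))

# synthetic stand-in for the real data: flat background plus a small bump
rng = np.random.default_rng(0)
x = np.linspace(-10.0, 10.0, 200)
y = 5.0 + 2.0 * np.exp(-(x - 1.0) ** 2 / 2.0) + rng.normal(0.0, 0.5, x.size)

# fit both models
popt_c, _ = curve_fit(horiz, x, y, p0=[y.mean()])
popt_g, _ = curve_fit(gauss, x, y, p0=[y.mean(), 1.0, 0.0, 1.0])

# chi-square against an assumed per-point uncertainty
sigma_y = 0.5
chi2_c = np.sum(((y - horiz(x, *popt_c)) / sigma_y) ** 2)
chi2_g = np.sum(((y - gauss(x, *popt_g)) / sigma_y) ** 2)

# reduced chi-square: divide by (number of points - number of parameters)
red_c = chi2_c / (x.size - 1)  # horizontal line: 1 parameter
red_g = chi2_g / (x.size - 4)  # Gaussian: 4 parameters
print(red_c, red_g)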
Answer 1
Score: 0
You could use something like the Bayesian Information Criterion (BIC) for model selection, which accounts for the different number of parameters in each model by penalizing models with more parameters.
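Concretely, for a model with k free parameters fitted to n data points with maximized likelihood L, the standard definition (written here in plain text) is:

BIC = k * ln(n) - 2 * ln(L)

A lower BIC indicates the preferred model, and the k * ln(n) term is the penalty that grows with the number of parameters.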
Mocking up a version of your example, this could be done with the following steps.
- Set up model functions and generate some simulated data containing a Gaussian "bump":

import numpy as np
from matplotlib import pyplot as plt
from scipy.optimize import curve_fit

def gauss(x, H, A, x0, sigma):
    return H + A * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2))

def horiz(x, c):
    return np.full_like(x, c)

# generate some data containing a Gaussian bump
n = 10_000  # number of data points
x = np.linspace(0, 1_000_000, n, dtype=float)

# random Gaussian noise
y = np.random.randn(n) * 5e-8

# Gaussian bump parameters
sigma = 30_000
x0 = 400_000
H = 0
A = 4.5e-7

y += gauss(x, H, A, x0, sigma)
- Use curve_fit to get a best-fit, maximum-likelihood model (I've had to scale the parameters to get the fitting to work reasonably):

# scale factors
scaleamp = 1e8
scalerange = 1e-5

# initial parameter guesses
iniguess = (0.1e-7 * scaleamp, 1e-7 * scaleamp, 350000.0 * scalerange, 10000.0 * scalerange)

# fit Gaussian bump model
gaussfit = curve_fit(gauss, x * scalerange, y * scaleamp, p0=iniguess)

# fit horizontal line model
linefit = curve_fit(horiz, x, y * scaleamp, p0=(0.0,))
- Generate the best-fit models and calculate the (log-)likelihood for each model given the data. For the likelihood function, because we don't have an estimate of the uncertainty in the data, we can use the Student's t-like distribution defined in Eqn. 3.38 of Sivia (pg. 53); a plain-text form of that expression is given after these steps:

Hf, Af, x0f, sigmaf = gaussfit[0]

# rescale best fit parameters to generate the Gaussian model
bestgaussfit = gauss(x, Hf / scaleamp, Af / scaleamp, x0f / scalerange, sigmaf / scalerange)

# get log likelihood
logLgauss = -((n - 1) / 2) * np.log(np.sum((y - bestgaussfit)**2))

# generate the best fit horizontal line model
bestlinefit = horiz(x, linefit[0][0] / scaleamp)

# get log likelihood
logLhoriz = -((n - 1) / 2) * np.log(np.sum((y - bestlinefit)**2))
- Generate the BIC for both models:

ngaussparams = 4  # number of Gaussian model parameters
bicgauss = ngaussparams * np.log(n) - 2 * logLgauss

nlineparams = 1  # number of parameters for the flat line
bicline = nlineparams * np.log(n) - 2 * logLhoriz
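For reference, the log-likelihood used in step 3 comes from marginalizing the unknown noise level out of a Gaussian likelihood, which (up to an additive constant, and paraphrasing the Sivia result rather than quoting Eqn. 3.38 exactly) gives, in plain-text form:

log L = -((N - 1) / 2) * log( sum_i (y_i - m_i)^2 )

where N is the number of data points and m_i is the model prediction for the i-th point; this is exactly the quantity computed as logLgauss and logLhoriz above.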
With the BIC, the model with the lower value is the preferred one. In this case the BIC for the Gaussian model is much lower (by about 16000 for the data I generated).
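A short way to wrap up the comparison (the print labels and the delta variable are just illustrative; a BIC difference of more than about 10 is usually taken as strong evidence for the lower-BIC model):

delta = bicline - bicgauss
print(f"BIC (Gaussian):  {bicgauss:.1f}")
print(f"BIC (flat line): {bicline:.1f}")
print(f"Preferred model: {'Gaussian' if delta > 0 else 'flat line'} (delta BIC = {abs(delta):.1f})")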