Python拟合和卡方比较

huangapple go评论71阅读模式
英文:

Python fit and chi square comparison

问题

我想知道通过计算卡方值来确定哪个函数更适合我的数据,然后进行比较。

我拥有的数据可能看起来像高斯分布,也可能像一条水平线,所以我的目标是每次都使用scipy.curve_fit将我的数据拟合到这两个函数,然后计算卡方值(使用scipy.stats)。但我可以直接比较这两个值并选择"最小"的值吗?因为这两个函数具有相同数量的数据点,但不同的参数(因此自由度不同):高斯分布有4个参数,而水平线只有1个参数。

Python拟合和卡方比较
Python拟合和卡方比较
Python拟合和卡方比较
Python拟合和卡方比较
Python拟合和卡方比较

def horiz(x, c):
    return np.full_like(x, c)
# 定义高斯函数
def gauss(x, H, A, x0, sigma):
    return H + A * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2))

以下是代码示例:

p0_c = [moy]
popt_c, pcov = curve_fit(horiz, x, Smooth_data, p0=p0_c)
fit_c = horiz(x, *popt_c)
chi_c, p_c = stats.chisquare(Counts_list[i], fit_c)

结束。

英文:

I want to know which functions fits better to my data by calculating the chi square value and then compare it.

The data i have could like either like a gaussian or kind of like an horizontal line, so my goal would be to fit each time my data (using scipy.curve fit) to both functions and then calculate the chisquare value (using scipy.stats), but can i directly compare both values and take the 'lowest' one as both functions have the same number of points but different parameters (hence not the same df): 4 parameters for my gaussian and 1 for my horizontal line ?

Python拟合和卡方比较
Python拟合和卡方比较
Python拟合和卡方比较
Python拟合和卡方比较
Python拟合和卡方比较

def horiz(x,c):
    return np.full_like(x, c)
#Define the Gaussian function
def gauss(x, H, A, x0, sigma):
    return H + A * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2))

example of code below:

p0_c = [moy]
popt_c, pcov = curve_fit(horiz, x, Smooth_data, p0 = p0_c)
fit_c = horiz(x, *popt_c)
chi_c, p_c = stats.chisquare(Counts_list[i], fit_c )

end.

答案1

得分: 0

您可以使用类似于贝叶斯信息准则(BIC)进行模型选择,该准则考虑了每个模型中参数数量的不同,并对具有更多参数的模型施加了惩罚因子。

模拟您示例的版本,可以按照以下步骤完成。

  1. 设置模型函数并生成包含高斯“峰”的一些模拟数据。

    import numpy as np
    from matplotlib import pyplot as plt
    from scipy.optimize import curve_fit
    
    def gauss(x, H, A, x0, sigma):
        return H + A * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2))
    
    def horiz(x,c):
        return np.full_like(x, c)
    
    # 生成包含高斯峰的一些数据
    n = 10_000  # 数据点的数量
    x = np.linspace(0, 1_000_000, n, dtype=float)
    
    # 随机高斯噪声
    y = np.random.randn(n) * 5e-8
    
    # 高斯峰参数
    sigma = 30_000
    x0 = 400_000
    H = 0
    A = 4.5e-7
    
    y += gauss(x, H, A, x0, sigma)
    
  2. 使用curve_fit来获得最佳拟合、最大似然模型(我不得不缩放参数以使拟合合理工作)。

    # 缩放因子
    scaleamp = 1e8
    scalerange = 1e-5
    
    # 初始参数猜测
    iniguess = (0.1e-7 * scaleamp, 1e-7 * scaleamp, 350000.0 * scalerange, 10000.0 * scalerange)
    
    # 拟合高斯峰模型
    gaussfit = curve_fit(gauss, x * scalerange, y * scaleamp, p0=iniguess)
    
    # 拟合水平线模型
    linefit = curve_fit(horiz, x, y * scaleamp, p0=(0.0,))
    
  3. 生成最佳拟合模型并计算给定数据的(对数)似然度(对于似然函数,因为我们没有数据不确定性的估计,我们可以使用Sivia中第53页的方程3.38中定义的类似于学生t分布)。

    Hf, Af, x0f, sigmaf = gaussfit[0]
    
    # 重新缩放最佳拟合参数以生成高斯模型
    bestgaussfit = gauss(x, Hf / scaleamp, Af / scaleamp, x0f / scalerange, sigmaf / scalerange)
    
    # 获取对数似然度
    logLgauss = -((n - 1) / 2) * np.log(np.sum((y - bestgaussfit)**2))
    
    # 生成最佳拟合水平线模型
    bestlinefit = horiz(x, linefit[0][0] / scaleamp)
    
    # 获取对数似然度
    logLhoriz = -((n - 1) / 2) * np.log(np.sum((y - bestlinefit)**2))
    
  4. 为两个模型生成BIC。

    ngaussparams = 4  # 高斯模型参数的数量
    bicgauss = ngaussparams * np.log(n) - 2 * logLgauss
    
    nlineparams = 1  # 平坦线的参数数量
    bicline = nlineparams * np.log(n) - 2 * logLhoriz
    

使用BIC,较低参数数量的模型是首选模型。在这种情况下,高斯模型的BIC要低得多(对于我生成的数据,低约16000)。

英文:

You could use something like the Bayesian Information Criterion (BIC) for model selection, which accounts for the different numbers of parameters in each model with a penalty factor for more parameters.

Mocking up a version of your example, this could be done with the following steps.

  1. Set up model functions and generate some simulated data containing a Gaussian "bump".

    import numpy as np
    from matplotlib import pyplot as plt
    from scipy.optimize import curve_fit
    
    def gauss(x, H, A, x0, sigma):
        return H + A * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2))
    
    def horiz(x,c):
        return np.full_like(x, c)
    
    # generate some data containing a Gaussian bump
    n = 10_000  # number of data points
    x = np.linspace(0, 1_000_000, n, dtype=float)
    
    # random Gaussian noise
    y = np.random.randn(n) * 5e-8
    
    # Gaussian bump parameters
    sigma = 30_000
    x0 = 400_000
    H = 0
    A = 4.5e-7
    
    y += gauss(x, H, A, x0, sigma)
    
  2. Use curve_fit to get a best fit, maximum likelihood, model (I've had to scale the parameters to get the fitting to work reasonably).

    # scale factors
    scaleamp = 1e8
    scalerange = 1e-5
    
    # initial parameter guesses
    iniguess = (0.1e-7 * scaleamp, 1e-7 * scaleamp, 350000.0 * scalerange, 10000.0 * scalerange)
    
    # fit Gaussian bump model
    gaussfit = curve_fit(gauss, x * scalerange, y * scaleamp, p0=iniguess)
    
    # fit horizontal line model
    linefit = curve_fit(horiz, x, y * scaleamp, p0=(0.0,))
    
  3. Generate the best fit models and calculate the (log)likelihood for that model given the data (for the likelihood function, because we don't have an estimate of the uncertainty in the data, we can use the Student's t-like distribution defined in Eqn 3.38 in Sivia on pg. 53).

    Hf, Af, x0f, sigmaf = gaussfit[0]
    
    # rescale best fit parameters to generate the Gaussian model
    bestgaussfit = gauss(x, Hf / scaleamp, Af / scaleamp, x0f / scalerange, sigmaf / scalerange)
    
    # get log likelihood
    logLgauss = -((n - 1) / 2) * np.log(np.sum((y - bestgaussfit)**2))
    
    # generate the best fit horizontal line model
    bestlinefit = horiz(x, linefit[0][0] / scaleamp)
    
    # get log likelihood
    logLhoriz = -((n - 1) / 2) * np.log(np.sum((y - bestlinefit)**2))
    
  4. Generate the BIC for both models.

    ngaussparams = 4  # number of Gaussian model parameters
    bicgauss = ngaussparams * np.log(n) - 2 * logLgauss
    
    nlineparams = 1  # number of parameters for the flat line
    bicline = nlineparams * np.log(n) - 2 * logLhoriz
    

With the BIC, the model with the lower number is the preferred model. In this case the BIC for the Gaussian model is a lot lower (by about 16000 for the data I generated).

huangapple
  • 本文由 发表于 2023年6月1日 16:45:49
  • 转载请务必保留本文链接:https://go.coder-hub.com/76380131.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定