英文:
Having trouble fitting an exponential decay model to these data in R:
问题
-
我的问题:我有大约300行的数据(比附加的两个变量更多,但只与我的问题相关),需要从 "seed_prev_num"(x)预测 "xaq"(Y)。它们的关系似乎是指数衰减的,但我所有尝试使用非线性最小二乘法(NLS)都产生了交替的错误,要么是"缺失值或产生无穷大",要么是"初始参数估计的梯度矩阵奇异"。
- 注意:数据集中没有零、空值或无穷大。
- 这个模型的目的是为了在阈值化其他模型的业务规则中提供指导。目前它是一种启发式/最佳猜测性能的方法,我们希望更全面地自动化这个过程。
- 还很重要的是,我能够导出系数,以便从这个模型中为其他利益相关者提供Excel方程式。
- 我宁愿不移除离群值,因为它们对我们的数据具有意义,从主观上讲。
-
我尝试过:对数变换、多个起始函数、NLS参数(y的最大值和指数函数中b的不同值)、样条回归以及分割成“更高”和“更低”的数据集以分别建模。对数变换模型表现还可以,但在中间情况下失去了一些能力,不幸的是,对于我们的业务规则来说,这是大部分数据会出现的地方。
-
这里是一些 dput() 输出的代码,让您开始工作,如果您有问题,请告诉我!
structure(list(seed_prev_num = c(2671089.52136, 1723563.978132,
1519478.513963, 10359757.487541, 435039.876186, 400860.993997,
4824556.66506, 7345496.18374, 7167489.404247, 1541803.448572,
802314.685373, 2589889.980437, 1578945.817656, 593092.510586,
9524646.88053, 1517305.29024, 12332649.496439, 1225895.745565,
1723563.978132, 4029354.348235, 486209.416573, 2473918.859946,
2200685.368227, 1687211.87222, 34608587.788775, 127034.804899,
2174606.683551, 694443.762395, 1297414.562631, 269677.307445),
xaq = c(3.140014793, 9.363136176, 10.842283188, 2.495966588,
6.017711172, 3.124199113, 3.369492219, 2.702313072, 2.587612668,
3.000256279, 2.360256095, 4.987947212, 7.002252252, 16.762824783,
2.35521676, 10.671484375, 3.007449177, 7.73650282, 4.812929849,
2.976955136, 17.230394149, 4.272320716, 5.298590538, 6.414285714,
2.321626944, 9.362363919, 3.303806668, 10.004836415, 8.26450434,
5.726739927)), row.names = c(NA, -30L), class = c("tbl_df",
"tbl", "data.frame"))
英文:
- My problem: I have about 300 rows of data (more variables than the two attached, but only relevant on my end) and need to predict "xaq" (Y) from "seed_prev_num" (x). Their relationship seems to be an exponential decay, but all my attempts with NLS have yielded alternating errors of Missing value or an infinity produced or singular gradient matrix at initial parameter estimates.
- Note: there are NO zeros, nulls or infinities in the dataset.
- The purpose of this model is to have something that guides business rules in thresholding other models. Right now it is heuristic/best guess for performance and we want to more fully automate that process.
- It is also important that I can export the coefficients to give another stakeholder an Excel equation from this model
Here is the graphical relationship between the two variables:
- I would prefer not to have to remove outliers as they are meaningful to our data, subjectively.
-
I have tried: log-transform, multiple starting functions, NLS parameters (max of y and different values for b in the exponential function), spline regression and splitting into "higher" and "lower" datasets to model each separately. The log-transform model performs OK, but loses some power in the middle cases, which unfortunately for our business rules, is where the bulk of our data will be arriving.
-
Here's some code output from dput() to get you started, please let me know if you have questions below!
structure(list(seed_prev_num = c(2671089.52136, 1723563.978132,
1519478.513963, 10359757.487541, 435039.876186, 400860.993997,
4824556.66506, 7345496.18374, 7167489.404247, 1541803.448572,
802314.685373, 2589889.980437, 1578945.817656, 593092.510586,
9524646.88053, 1517305.29024, 12332649.496439, 1225895.745565,
1723563.978132, 4029354.348235, 486209.416573, 2473918.859946,
2200685.368227, 1687211.87222, 34608587.788775, 127034.804899,
2174606.683551, 694443.762395, 1297414.562631, 269677.307445),
xaq = c(3.140014793, 9.363136176, 10.842283188, 2.495966588,
6.017711172, 3.124199113, 3.369492219, 2.702313072, 2.587612668,
3.000256279, 2.360256095, 4.987947212, 7.002252252, 16.762824783,
2.35521676, 10.671484375, 3.007449177, 7.73650282, 4.812929849,
2.976955136, 17.230394149, 4.272320716, 5.298590538, 6.414285714,
2.321626944, 9.362363919, 3.303806668, 10.004836415, 8.26450434,
5.726739927)), row.names = c(NA, -30L), class = c("tbl_df",
"tbl", "data.frame"))
答案1
得分: 2
log-linear fit:
m1 <- lm(log(xaq) ~ seed_prev_num, dd)
假设偏差是对数正态,即log(xaq)
服从具有恒定方差的正态分布。我不明白你所说的“在中间情况下失去了一些效力”是什么意思...
log-link Gaussian GLM:
这拟合了模型y ~ Normal(exp(a+b*x), sigma)
m2 <- glm(xaq ~ seed_prev_num, family = gaussian(link = "log"), dd)
nls with log-link GLM starting values:
这拟合了相同的模型。
m3 <- nls(xaq ~ a*exp(b*seed_prev_num),
data = dd,
start = list(a = exp(coef(m2)[1]), b = coef(m2)[2]))
这两者都为TRUE:
all.equal(coef(m2)[["(Intercept)"]], log(coef(m3)[["a"]]), tolerance = 1e-5)
all.equal(coef(m2)[["seed_prev_num"]], coef(m3)[["b"]], tolerance = 1e-5)
这个技巧可能是你需要的全部。如果使用对数线性拟合的起始值(start = list(a = exp(coef(m1)[1]), b = coef(m1)[2]))
), 它也适用。
nls with arbitrary starting values but scaled x:
我怀疑主要问题是你的x变量的缩放...
m4 <- nls(xaq ~ a*exp(b*seed_prev_num/1e7),
data = dd,
start = list(a = 1, b = 1))
all.equal(coef(m4)[["b"]]/1e7, coef(m3)[["b"]],
tolerance = 1e-4)
拟合结果并没有显著改善:你的管理层可能不得不接受这些嘈杂的数据事实。
- 我还尝试过
mgcv::gam
与对数链接的高斯GLM和seed_prev_num
的平滑函数,但它最终回到指数拟合。无论如何,这个模型都会违反你的“容易导出系数并在Excel中计算”的要求。 - 一个幂指数模型:
m6 <- nls(xaq ~ a*exp(b*(seed_prev_num/1e7)^c),
data = dd,
start = list(a = 9, b = -2, c = 1.1)
)
稍微好一些(增加了预测和观察之间的平方相关性,也就是R^2的一种形式,提高到0.34),但根据AIC来说并不是更好的模型...
英文:
These all seem to work fine:
log-linear fit
m1 <- lm(log(xaq) ~ seed_prev_num, dd)
This assumes that the deviation is log-normal, i.e. that log(xaq)
is Normally distributed with a constant variance. I don't understand what you mean by "loses some power in the middle cases" ...
log-link Gaussian GLM
This fits the model y ~ Normal(exp(a+b*x), sigma)
m2 <- glm(xaq ~ seed_prev_num, family = gaussian(link = "log"), dd)
nls with log-link GLM starting values
This fits the same model.
m3 <- nls(xaq ~ a*exp(b*seed_prev_num),
data = dd,
start = list(a = exp(coef(m2)[1]), b = coef(m2)[2]))
These are both TRUE:
all.equal(coef(m2)[["(Intercept)"]], log(coef(m3)[["a"]]), tolerance = 1e-5)
all.equal(coef(m2)[["seed_prev_num"]], coef(m3)[["b"]], tolerance = 1e-5)
This trick is probably all you need. It also works if you use the starting values from the log-linear fit (start = list(a = exp(coef(m1)[1]), b = coef(m1)[2]))
).
nls with arbitrary starting values but scaled x
I suspect that the primary problem is the scaling of your x variable ...
m4 <- nls(xaq ~ a*exp(b*seed_prev_num/1e7),
data = dd,
start = list(a = 1, b = 1))
all.equal(coef(m4)[["b"]]/1e7, coef(m3)[["b"]],
tolerance = 1e-4)
The fits don't get much better: your management may have to live with the fact that these are noisy data.
- I also tried
mgcv::gam
with a log-link Gaussian GLM and a smooth function ofseed_prev_num
, but it collapses back to an exponential fit. In any case, this model would violate your "easy to export coefficients and compute in Excel" rubric. - a power-exponential model:
m6 <- nls(xaq ~ a*exp(b*(seed_prev_num/1e7)^c),
data = dd,
start = list(a = 9, b = -2, c = 1.1)
)
does slightly better (increases the squared correlation between predictions and observed, aka one form of R^2, to 0.34) but is not a better model according to AIC ...
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论