2023年5月17日 23:43:17go评论99阅读模式

英文:

Having trouble fitting an exponential decay model to these data in R:

问题

我的问题：我有大约300行的数据（比附加的两个变量更多，但只与我的问题相关），需要从 "seed_prev_num"（x）预测 "xaq"（Y）。它们的关系似乎是指数衰减的，但我所有尝试使用非线性最小二乘法（NLS）都产生了交替的错误，要么是"缺失值或产生无穷大"，要么是"初始参数估计的梯度矩阵奇异"。
- 注意：数据集中没有零、空值或无穷大。
- 这个模型的目的是为了在阈值化其他模型的业务规则中提供指导。目前它是一种启发式/最佳猜测性能的方法，我们希望更全面地自动化这个过程。
- 还很重要的是，我能够导出系数，以便从这个模型中为其他利益相关者提供Excel方程式。

这是两个变量之间的图形关系：

我宁愿不移除离群值，因为它们对我们的数据具有意义，从主观上讲。

我尝试过：对数变换、多个起始函数、NLS参数（y的最大值和指数函数中b的不同值）、样条回归以及分割成“更高”和“更低”的数据集以分别建模。对数变换模型表现还可以，但在中间情况下失去了一些能力，不幸的是，对于我们的业务规则来说，这是大部分数据会出现的地方。
这里是一些 dput() 输出的代码，让您开始工作，如果您有问题，请告诉我！

structure(list(seed_prev_num = c(2671089.52136, 1723563.978132, 
1519478.513963, 10359757.487541, 435039.876186, 400860.993997, 
4824556.66506, 7345496.18374, 7167489.404247, 1541803.448572, 
802314.685373, 2589889.980437, 1578945.817656, 593092.510586, 
9524646.88053, 1517305.29024, 12332649.496439, 1225895.745565, 
1723563.978132, 4029354.348235, 486209.416573, 2473918.859946, 
2200685.368227, 1687211.87222, 34608587.788775, 127034.804899, 
2174606.683551, 694443.762395, 1297414.562631, 269677.307445), 
    xaq = c(3.140014793, 9.363136176, 10.842283188, 2.495966588, 
    6.017711172, 3.124199113, 3.369492219, 2.702313072, 2.587612668, 
    3.000256279, 2.360256095, 4.987947212, 7.002252252, 16.762824783, 
    2.35521676, 10.671484375, 3.007449177, 7.73650282, 4.812929849, 
    2.976955136, 17.230394149, 4.272320716, 5.298590538, 6.414285714, 
    2.321626944, 9.362363919, 3.303806668, 10.004836415, 8.26450434, 
    5.726739927)), row.names = c(NA, -30L), class = c("tbl_df", 
"tbl", "data.frame"))

英文:

My problem: I have about 300 rows of data (more variables than the two attached, but only relevant on my end) and need to predict "xaq" (Y) from "seed_prev_num" (x). Their relationship seems to be an exponential decay, but all my attempts with NLS have yielded alternating errors of Missing value or an infinity produced or singular gradient matrix at initial parameter estimates.

Note: there are NO zeros, nulls or infinities in the dataset.
The purpose of this model is to have something that guides business rules in thresholding other models. Right now it is heuristic/best guess for performance and we want to more fully automate that process.
It is also important that I can export the coefficients to give another stakeholder an Excel equation from this model

Here is the graphical relationship between the two variables:

I would prefer not to have to remove outliers as they are meaningful to our data, subjectively.

I have tried: log-transform, multiple starting functions, NLS parameters (max of y and different values for b in the exponential function), spline regression and splitting into "higher" and "lower" datasets to model each separately. The log-transform model performs OK, but loses some power in the middle cases, which unfortunately for our business rules, is where the bulk of our data will be arriving.
Here's some code output from dput() to get you started, please let me know if you have questions below!

structure(list(seed_prev_num = c(2671089.52136, 1723563.978132, 
1519478.513963, 10359757.487541, 435039.876186, 400860.993997, 
4824556.66506, 7345496.18374, 7167489.404247, 1541803.448572, 
802314.685373, 2589889.980437, 1578945.817656, 593092.510586, 
9524646.88053, 1517305.29024, 12332649.496439, 1225895.745565, 
1723563.978132, 4029354.348235, 486209.416573, 2473918.859946, 
2200685.368227, 1687211.87222, 34608587.788775, 127034.804899, 
2174606.683551, 694443.762395, 1297414.562631, 269677.307445), 
    xaq = c(3.140014793, 9.363136176, 10.842283188, 2.495966588, 
    6.017711172, 3.124199113, 3.369492219, 2.702313072, 2.587612668, 
    3.000256279, 2.360256095, 4.987947212, 7.002252252, 16.762824783, 
    2.35521676, 10.671484375, 3.007449177, 7.73650282, 4.812929849, 
    2.976955136, 17.230394149, 4.272320716, 5.298590538, 6.414285714, 
    2.321626944, 9.362363919, 3.303806668, 10.004836415, 8.26450434, 
    5.726739927)), row.names = c(NA, -30L), class = c(&quot;tbl_df&quot;, 
&quot;tbl&quot;, &quot;data.frame&quot;))

答案1

得分: 2

log-linear fit:

m1 <- lm(log(xaq) ~ seed_prev_num, dd)

假设偏差是对数正态，即log(xaq)服从具有恒定方差的正态分布。我不明白你所说的“在中间情况下失去了一些效力”是什么意思...

log-link Gaussian GLM:

这拟合了模型y ~ Normal(exp(a+b*x), sigma)

m2 <- glm(xaq ~ seed_prev_num,  family = gaussian(link = "log"), dd)

nls with log-link GLM starting values:

这拟合了相同的模型。

m3 <- nls(xaq ~ a*exp(b*seed_prev_num),
          data = dd,
          start = list(a = exp(coef(m2)[1]), b = coef(m2)[2]))

这两者都为TRUE：

all.equal(coef(m2)[["(Intercept)"]], log(coef(m3)[["a"]]), tolerance = 1e-5)
all.equal(coef(m2)[["seed_prev_num"]], coef(m3)[["b"]], tolerance = 1e-5)

这个技巧可能是你需要的全部。如果使用对数线性拟合的起始值(start = list(a = exp(coef(m1)[1]), b = coef(m1)[2]))), 它也适用。

nls with arbitrary starting values but scaled x:

我怀疑主要问题是你的x变量的缩放...

m4 <- nls(xaq ~ a*exp(b*seed_prev_num/1e7),
          data = dd,
          start = list(a = 1, b = 1))
all.equal(coef(m4)[["b"]]/1e7, coef(m3)[["b"]],
     tolerance = 1e-4)

拟合结果并没有显著改善：你的管理层可能不得不接受这些嘈杂的数据事实。

我还尝试过mgcv::gam与对数链接的高斯GLM和seed_prev_num的平滑函数，但它最终回到指数拟合。无论如何，这个模型都会违反你的“容易导出系数并在Excel中计算”的要求。
一个幂指数模型：

m6 <- nls(xaq ~ a*exp(b*(seed_prev_num/1e7)^c),
          data = dd,
          start = list(a = 9, b = -2, c =  1.1)
)

稍微好一些（增加了预测和观察之间的平方相关性，也就是R^2的一种形式，提高到0.34），但根据AIC来说并不是更好的模型...

英文:

These all seem to work fine:

log-linear fit

m1 &lt;- lm(log(xaq) ~ seed_prev_num, dd)

This assumes that the deviation is log-normal, i.e. that log(xaq) is Normally distributed with a constant variance. I don't understand what you mean by "loses some power in the middle cases" ...

log-link Gaussian GLM

This fits the model y ~ Normal(exp(a+b*x), sigma)

m2 &lt;- glm(xaq ~ seed_prev_num,  family = gaussian(link = &quot;log&quot;), dd)

nls with log-link GLM starting values

This fits the same model.

m3 &lt;- nls(xaq ~ a*exp(b*seed_prev_num),
          data = dd,
          start = list(a = exp(coef(m2)[1]), b = coef(m2)[2]))

These are both TRUE:

all.equal(coef(m2)[[&quot;(Intercept)&quot;]], log(coef(m3)[[&quot;a&quot;]]), tolerance = 1e-5)
all.equal(coef(m2)[[&quot;seed_prev_num&quot;]], coef(m3)[[&quot;b&quot;]], tolerance = 1e-5)

This trick is probably all you need. It also works if you use the starting values from the log-linear fit (start = list(a = exp(coef(m1)[1]), b = coef(m1)[2]))).

nls with arbitrary starting values but scaled x

I suspect that the primary problem is the scaling of your x variable ...

m4 &lt;- nls(xaq ~ a*exp(b*seed_prev_num/1e7),
          data = dd,
          start = list(a = 1, b = 1))
all.equal(coef(m4)[[&quot;b&quot;]]/1e7, coef(m3)[[&quot;b&quot;]],
     tolerance = 1e-4)

The fits don't get much better: your management may have to live with the fact that these are noisy data.

I also tried mgcv::gam with a log-link Gaussian GLM and a smooth function of seed_prev_num, but it collapses back to an exponential fit. In any case, this model would violate your "easy to export coefficients and compute in Excel" rubric.
a power-exponential model:

m6 &lt;- nls(xaq ~ a*exp(b*(seed_prev_num/1e7)^c),
          data = dd,
          start = list(a = 9, b = -2, c =  1.1)
)

does slightly better (increases the squared correlation between predictions and observed, aka one form of R^2, to 0.34) but is not a better model according to AIC ...

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

在R中将指数衰减模型拟合到这些数据时遇到了困难：

问题

答案1

log-linear fit

log-link Gaussian GLM

nls with log-link GLM starting values

nls with arbitrary starting values but scaled x

R: 基于嵌套群组计算比例

R Shiny 使用 Highcharter 的 hw_grid 不起作用

Regex for determining if any characters were input besides allowed input, and correcting the input accordingly

如何在使用R terra读取COG时忽略/强制缩放和偏移？

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。