2023-05-29 06:30:14

Perform linear regression over all columns of a data frame where the first column is the predictor

Question

I have searched StackOverflow for answers to this question but am still struggling - apologies if this looks too much like a duplicate question.

I have a dataframe similar to this:

df <- data.frame(Cohort = c('con', 'con', 'dis', 'dis', 'con', 'dis'),
                 Sex = c('M', 'F', 'M', 'F', 'M', 'M'),
                 P1 = c(50, 40, 70, 80, 45, 75),
                 P2 = c(10, 9, 15, 13, 10, 8))

I want to perform a linear regression on all numeric columns of my dataframe using "Cohort" as the predictor (with the intent of adding features, such as "Sex", in future analysis).

I subset my dataframe to drop all irrelevant columns (in this toy example, Sex):

new_df <- df[, setdiff(names(df), "Sex")]

Then I perform the regression like this:

fit <- lapply(new_df[-1], function(y) summary(lm(y ~ Cohort, data = new_df)))

When I test this on a small subset of my df (~5 columns) it works fine. In reality my df is ~7300 columns. When I run the command on the full dataframe I get this error:

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
NA/NaN/Inf in 'y'

I then assumed it was an issue with NA values, but the following returns 0:

sum(is.na(new_df))

I have also tried na.action = na.omit, but that did not resolve the error either.

My end goal is to perform these regressions and extract the p-value and r-squared values using anova(fit)$'Pr(>F)' and summary(fit)$r.squared, respectively.

How can I correct this error, or is there a better method to do this? Additionally, going forward, how can I do this without subsetting my dataframe when I add other features to the regression?

EDIT:

@Parfait A dput() example of my df:

dput(new_data[1:4, 1:4])
structure(list(Cohort = c("Disease", "Disease", "Control", "Control"), 
    seq.10010.10 = c(8.33449676839042, 8.39959836912012, 8.34385193344212, 
    8.43546191447928), seq.10011.65 = c(11.5222872738433, 11.7652860987237, 
    11.1661630826461, 11.008848763327), seq.10012.5 = c(10.5414838640543, 
    10.6862378767518, 10.5408061105915, 10.726558779105)), class = c("soma_adat", 
"data.frame"), row.names = c("258633854330_1", "258633854330_3", 
"258633854330_5", "258633854330_6"))
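
A note on the error text above: it names NA, NaN and Inf, whereas is.na() flags only NA and NaN. So sum(is.na(new_df)) can be 0 while a column still holds Inf or -Inf. Whether that is what happens in the real data is not known from the post. The following is only a minimal sketch of a check on a made-up data frame; toy_df and the helper non_finite_cols are hypothetical names, not part of the original code.

```r
# Toy data (an assumption): P2 contains -Inf, e.g. what log(0) would produce.
toy_df <- data.frame(
  Cohort = c("con", "con", "dis", "dis", "con", "dis"),
  P1 = c(50, 40, 70, 80, 45, 75),
  P2 = c(10, 9, -Inf, 13, 10, 8)
)

# is.na() does not flag infinite values, so this count stays at 0.
na_count <- sum(is.na(toy_df))

# Hypothetical helper: names of numeric columns that hold any
# NA, NaN, Inf or -Inf value.
non_finite_cols <- function(d) {
  num <- d[vapply(d, is.numeric, logical(1))]
  names(num)[vapply(num, function(x) any(!is.finite(x)), logical(1))]
}

bad_cols <- non_finite_cols(toy_df)
print(na_count)
print(bad_cols)
```

Columns reported this way could then be inspected or cleaned before calling lm.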

Answer 1

Score: 1

Consider passing each column name into the function and building the formula dynamically with reformulate. You can also wrap the call in tryCatch so that all columns are processed and the ones raising errors are captured. The code below returns a list of data frames of statistics extracted from the model results.

fit_df_list <- sapply(
    colnames(new_df)[-1],
    function(col) {
        # On success return a one-row data frame, on failure a message string
        tryCatch({
          # Build the formula `<col> ~ Cohort` from the column name
          fml <- reformulate("Cohort", col)
          fit <- lm(fml, data = new_df)
          results <- summary(fit)
          data.frame(
              variable = col,
              r_squared = results$r.squared,
              f_stat = results$fstatistic["value"],
              f_pvalue = anova(fit)$'Pr(>F)'[1]
          )
        }, error = \(e) paste("Error on", col, ":", conditionMessage(e))
        )
    },
    simplify = FALSE
)
# FILTER FOR PROBLEMATIC COLUMNS
fit_err_list <- Filter(is.character, fit_df_list)
# BUILD SINGLE, MASTER DATA FRAME
fit_df <- do.call(rbind, Filter(is.data.frame, fit_df_list))
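
On the follow-up about adding features without subsetting: since reformulate accepts a vector of term labels, one option is to keep the full data frame and list the predictors by name. The sketch below runs on the toy df from the question. It assumes that every numeric column not named as a predictor is a response, and it uses the hypothetical names predictors, responses and fit_stats. Note one swap from the code above: the p-value here is the overall model F-test computed with pf(), which coincides with anova(fit)$'Pr(>F)'[1] only when there is a single predictor.

```r
df <- data.frame(
  Cohort = c("con", "con", "dis", "dis", "con", "dis"),
  Sex = c("M", "F", "M", "F", "M", "M"),
  P1 = c(50, 40, 70, 80, 45, 75),
  P2 = c(10, 9, 15, 13, 10, 8)
)

# Add or remove features here; the data frame itself is left untouched.
predictors <- c("Cohort", "Sex")

# Assumption: every numeric column that is not a predictor is a response.
responses <- setdiff(names(df)[vapply(df, is.numeric, logical(1))], predictors)

fit_stats <- do.call(rbind, lapply(responses, function(col) {
  # Builds e.g. P1 ~ Cohort + Sex
  fit <- lm(reformulate(predictors, response = col), data = df)
  s <- summary(fit)
  f <- s$fstatistic
  data.frame(
    variable = col,
    r_squared = s$r.squared,
    # p-value of the overall F-test for the whole model
    f_pvalue = unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
  )
}))
print(fit_stats)
```

The tryCatch wrapper from the code above can be kept around the body of the function in the same way if some columns are expected to fail.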
