July 28, 2023, 00:08:57

Apply a function across multiple columns of one dataframe using input from another dataframe (R)

Question

I have some large data frames containing a mix of character columns and numbers stored as characters, and I'm trying to quickly calculate frequencies for them without using a loop.

Let's use the following dataframe as an example for the sake of this question:

    df <- data.frame(
      id = paste0("SubID_", 1:100),
      score = as.character(sample(1:100, 100, replace = TRUE)),
      dob = sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by = "day"), 100)
    )

I used the following function to find the most frequent value in the data:

    # Taken from:
    # https://www.tutorialspoint.com/r/r_mean_median_mode.htm
    mode <- function(v) {
      uniqv <- unique(v)
      uniqv[which.max(tabulate(match(v, uniqv)))]
    }
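
To make the `tabulate(match(...))` idiom concrete, here is a small worked example on a toy vector. The vector `v` and the intermediate names `idx`, `counts`, and `result` are made up for illustration and do not come from the question's data. Note also that defining a function called `mode` masks base R's `mode()` for the session.

```r
# Toy input, chosen only for illustration.
v <- c("b", "a", "b", "c", "b", "a")

uniqv  <- unique(v)          # distinct values in order of first appearance: "b" "a" "c"
idx    <- match(v, uniqv)    # position of each element within uniqv: 1 2 1 3 1 2
counts <- tabulate(idx)      # how often each position occurs: 3 2 1
result <- uniqv[which.max(counts)]  # value with the highest count

print(result)  # "b"
```

On ties, `which.max()` returns the first maximum, so the value that appears earliest in the data is reported.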

To get an output like the following:

    f <- data.frame(sapply(df, mode))
          sapply.df..mode.
    id             SubID_1
    score               84
    dob              10739

Where the row names are basically the column names from the initial dataframe. (It's all getting appended to one file for a data summary report).

What I'd like to do from here is estimate what proportion of each column is made up of its most frequent value, which I attempted to do with the following function:

    library(stringr)  # str_count() comes from stringr

    frequencycounter <- function(x, df, f) {
      sum(str_count(df[, x], f[x, ])) / length(df[, x])
    }

Here "x" is a character value naming a column.

However, whenever I try lapply or sapply on it, it takes a while to run to completion:

    lapply(colnames(df), frequencycounter, df = df, f = f)
    # y and z are not arguments of frequencycounter(), so this call errors:
    lapply(list(colnames(df)), frequencycounter, y = df, z = f)
    sapply(colnames(df), frequencycounter, df = df, f = f)

I'm sure there's a mutate or dplyr solution that calls summarize and is much faster than this, but it just isn't jumping out at me.
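
A note on the `frequencycounter()` attempt above: `str_count()` treats its pattern as a regular expression and counts matches anywhere inside each string, so a mode such as "SubID_1" is also counted inside "SubID_10" or "SubID_100". The `dob` column is likely affected in a different way, since its mode was coerced to the number 10739 in `f`, which would not be expected to match dates written out as text. The sketch below illustrates the substring problem in base R only, using `grepl(..., fixed = TRUE)` as a stand-in for substring matching so that it does not depend on stringr. The names `ids`, `target`, `substring_hits`, and `exact_hits` are assumptions for illustration.

```r
# Same ids as in the question's example data frame.
ids <- paste0("SubID_", 1:100)
target <- "SubID_1"

# Substring matching: also hits SubID_10 ... SubID_19 and SubID_100.
substring_hits <- sum(grepl(target, ids, fixed = TRUE))

# Exact comparison: hits only the one id that equals the target.
exact_hits <- sum(ids == target)

print(substring_hits / length(ids))  # 0.12
print(exact_hits / length(ids))      # 0.01
```

Comparing with `==` gives the intended proportion and avoids regular-expression matching altogether.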

Answer 1

Score: 1

We can modify the mode() function that you provided:

    mostfrequent <- function(v) {
      uniqv <- unique(v)
      max(tabulate(match(v, uniqv))) / length(v)
    }
    data.frame(sapply(df, mostfrequent))
          sapply.df..mostfrequent.
    id                        0.01
    score                     0.04
    dob                       0.01
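
Since the results are headed for a summary report, the modal value and its share could also be collected side by side in one data frame. The following is a minimal base R sketch, not part of the original answer: `mode_report()` is a hypothetical helper name, the `set.seed()` call is added only for reproducibility, and the code assumes the columns contain no missing values.

```r
set.seed(42)  # assumption: fixed seed so the example is reproducible

df <- data.frame(
  id = paste0("SubID_", 1:100),
  score = as.character(sample(1:100, 100, replace = TRUE)),
  dob = sample(seq(as.Date("1999/01/01"), as.Date("2000/01/01"), by = "day"), 100)
)

# Hypothetical helper: one row per column, holding the modal value and its share.
mode_report <- function(data) {
  rows <- lapply(data, function(v) {
    uniqv  <- unique(v)
    counts <- tabulate(match(v, uniqv))
    top    <- which.max(counts)          # first maximum wins on ties
    data.frame(
      mode  = as.character(uniqv[top]),  # as.character() keeps dates readable
      share = counts[top] / length(v)
    )
  })
  report <- do.call(rbind, rows)
  rownames(report) <- names(data)        # row names are the original column names
  report
}

report <- mode_report(df)
print(report)
```

Counting once per column and reusing the counts for both the mode and the share avoids the second pass over the data that `frequencycounter()` needed. Converting the mode with `as.character()` also means dates are reported as dates rather than as their underlying day numbers.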
