Apply a function across multiple columns of one dataframe using input from another dataframe (R)

Question

I have some large dataframes with a mix of character columns and numbers stored as characters, and I'm trying to quickly calculate frequencies for them without using a loop.

Let's use the following dataframe as an example for the sake of this question:

df <- data.frame(
  id = paste0("SubID_", 1:100),
  score = as.character(sample(1:100, 100, replace=TRUE)),
  dob = sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 100)
)

I used the following function to find the most frequent value in the data:

  # Taken from:
  # https://www.tutorialspoint.com/r/r_mean_median_mode.htm
  mode <- function(v) {
    uniqv <- unique(v)
    uniqv[which.max(tabulate(match(v, uniqv)))]
  }

To get an output like the following:

    f <- data.frame(sapply(df, mode))
          sapply.df..mode.
    id             SubID_1
    score               84
    dob              10739

Where the row names are basically the column names from the initial dataframe. (It's all getting appended to one file for a data summary report).

What I'd like to do from here is get an estimate of how much of the data is comprised of the most frequent score, which I attempted to do with the following function:

  library(stringr)  # provides str_count()

  frequencycounter <- function(x, df, f) {
    sum(str_count(df[, x], f[x, ])) / length(df[, x])
  }

Where "x" is a character value representing a column name.

However, whenever I try lapply or sapply on it, it takes a while to run to completion:

    lapply(colnames(df), frequencycounter, df = df, f = f)
    lapply(list(colnames(df)), frequencycounter, y = df, z = f)
    sapply(colnames(df), frequencycounter, df = df, f = f)

I'm sure there's a dplyr solution using mutate or summarize that is much faster than this, but it just isn't jumping out at me.

Answer 1

Score: 1

We can modify the mode() function that you provided:

mostfrequent <- function(v){
  uniqv <- unique(v)
  max(tabulate(match(v, uniqv)))/length(v)
}
data.frame(sapply(df, mostfrequent))
      sapply.df..mostfrequent.
id                        0.01
score                     0.04
dob                       0.01
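
As a rough check of the approach above, here is a minimal sketch on a small fixed data frame where the proportions can be worked out by hand. The names toy, props, summary_df, and prop_mode are hypothetical and only used for illustration; vapply() is swapped in for sapply() as one possible way to get a named numeric vector with a guaranteed type.

```r
# Proportion of a vector taken up by its most frequent value
# (same logic as mostfrequent() in the answer).
mostfrequent <- function(v) {
  uniqv <- unique(v)
  max(tabulate(match(v, uniqv))) / length(v)
}

# Small fixed data frame (hypothetical, for illustration only).
toy <- data.frame(
  id    = c("a", "b", "c", "d"),
  score = c("7", "7", "7", "9"),
  dob   = as.Date(c("1999-01-01", "1999-01-01", "1999-06-15", "1999-12-31"))
)

# vapply() returns a named numeric vector, one entry per column.
props <- vapply(toy, mostfrequent, numeric(1))

# Store the result with an explicit column name instead of the
# auto-generated "sapply.df..mostfrequent." header.
summary_df <- data.frame(column = names(props), prop_mode = unname(props))
print(summary_df)
```

Here id has no repeated value (1 of 4), score has "7" three times (3 of 4), and dob has one date twice (2 of 4). Note that this counts exact matches per value, whereas str_count() treats its second argument as a regular expression and counts matches inside each string, so a pattern such as "SubID_1" would also match "SubID_10" and the two approaches can give different proportions.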

huangapple
  • Posted on July 28, 2023, 00:08:57
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76781611.html