Apply a function across multiple columns of one dataframe using input from another dataframe (R)
Question
I have some large dataframes with a mix of character values and numbers stored as characters, and I'm trying to quickly calculate frequencies for them without using a loop.
Let's use the following dataframe as an example for the sake of this question:
df <- data.frame(
  id = paste0("SubID_", 1:100),
  score = as.character(sample(1:100, 100, replace = TRUE)),
  dob = sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by = "day"), 100)
)
I used the following function to find the most frequent value in the data:
# Taken from:
# https://www.tutorialspoint.com/r/r_mean_median_mode.htm
mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
To get an output like the following:
f <- data.frame(sapply(df, mode))
      sapply.df..mode.
id             SubID_1
score               84
dob              10739
Where the row names are basically the column names from the initial dataframe. (It's all getting appended to one file for a data summary report).
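One detail in the output above: dob shows up as 10739 rather than as a date. This is most likely because sapply() simplifies its results into a single character vector, which drops the Date class and leaves only the underlying count of days since 1970-01-01. Below is a minimal sketch of one workaround, assuming it is acceptable to report every mode as text. The names stat_mode, mode_as_text, toy and f_text are hypothetical and not part of the original post.

```r
# Most frequent value, as in the question
# (renamed here to avoid masking base::mode; the name is an assumption)
stat_mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Hypothetical helper: convert the mode to text before the results are
# combined, so a Date stays readable instead of becoming a day count
mode_as_text <- function(v) as.character(stat_mode(v))

# Small illustrative data frame (not the data from the question)
toy <- data.frame(
  id  = c("a", "b", "c"),
  dob = as.Date(c("1999-05-27", "1999-05-27", "1999-01-01"))
)

# vapply() checks that every column yields exactly one character value
f_text <- data.frame(mode = vapply(toy, mode_as_text, character(1)))
print(f_text)
```

The row names of f_text are still the column names of the input, so it can be appended to a report in the same way as f.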
What I'd like to do from here is estimate what proportion of the data the most frequent value accounts for, which I attempted to do with the following function:
library(stringr)  # provides str_count()

frequencycounter <- function(x, df, f) {
  sum(str_count(df[, x], f[x, ])) / length(df[, x])
}
Where "x" is a character value representing a column name.
However, whenever I try lapply or sapply on it, it takes a while to run to completion:
lapply(colnames(df),frequencycounter,df=df,f=f)
lapply(list(colnames(df)),frequencycounter,y=df,z=f)
sapply(colnames(df),frequencycounter,df=df,f=f)
I'm sure there's a mutate or dplyr solution that calls summarize and is much faster than this, but it just isn't jumping out at me.
Answer 1
Score: 1
We can modify the mode() function that you provided:
mostfrequent <- function(v) {
  uniqv <- unique(v)
  max(tabulate(match(v, uniqv))) / length(v)
}
data.frame(sapply(df, mostfrequent))
      sapply.df..mostfrequent.
id                        0.01
score                     0.04
dob                       0.01
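If the report needs the modal value and its share side by side, the two functions can be folded into a single pass per column. The following is a minimal sketch using base R only. The names summarise_mode, toy and report, and the column names value and proportion, are hypothetical and chosen for illustration. It counts values with match(), which tests exact equality, rather than the regular-expression matching that str_count() performs.

```r
# Hypothetical helper: returns the most frequent value (as text) and the
# fraction of the column that it accounts for
summarise_mode <- function(v) {
  uniqv  <- unique(v)
  counts <- tabulate(match(v, uniqv), nbins = length(uniqv))
  top    <- which.max(counts)  # first value with the highest count
  data.frame(
    value      = as.character(uniqv[top]),
    proportion = counts[top] / length(v)
  )
}

# Small illustrative data frame (not the data from the question)
toy <- data.frame(
  id    = paste0("SubID_", 1:4),
  score = c("84", "84", "7", "84"),
  dob   = as.Date(c("1999-05-27", "1999-01-01", "1999-05-27", "1999-03-02"))
)

# One row per input column; row names are the original column names
report <- do.call(rbind, lapply(toy, summarise_mode))
print(report)
```

Converting the value to character inside the helper keeps dates readable and lets columns of different types share one result column. When several values tie for the highest count, which.max() picks the one that appears first in the column.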