Randomly sample a dataframe until all individuals are detected
# Question
An example of my data is shown below (though I have hundreds of rows). Each ID is unique, and a single individual may be associated with multiple IDs (for example, individuals A and D).
    ID  individual
    1   A
    2   B
    3   A
    4   C
    5   D
    6   D
    7   D
I would like to randomly select an ID 1000 times with the opportunity to re-sample the same ID, and store how many unique individuals are accumulated over this sampling scheme.
I would then like to generate a plot that shows how many IDs would need to be selected to accumulate all of the unique individuals, so that the curve reaches an asymptote, with IDs on the x-axis and individuals on the y-axis.
EDIT:
My description of the desired plot above is unclear. The edit is as follows:
I would like to generate a plot that shows the number of times an ID has to be selected (with replacement) for all the unique individuals to accumulate, NOT the number of unique IDs associated with unique individuals. The curve should reach an asymptote, with the number of times IDs were selected on the x-axis, once all of the unique individuals have accumulated.
For example, if there were 500 IDs associated with 200 individuals, I would like to sample the pool of 500 IDs 1000 times (or however many times), putting each ID back in the pool after it is sampled, to see how many times we would have to sample that pool for all 200 individuals to accumulate.
# Answer 1
**Score**: 1
<details>
<summary>English:</summary>
> I would like to randomly select an ID 1000 times with the opportunity
> to re-sample the same ID, and store how many unique individuals are
> accumulated over this sampling scheme.
You can use `slice_sample` from `dplyr` to sample rows from a data frame. For example:
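```R
library(dplyr)
N_id <- 2000
N_individuals <- 50
# Simulated data: each ID is unique and is assigned a random individual
df_full <- data.frame(id = 1:N_id,
                      individual = sample(1:N_individuals, N_id, replace = TRUE))
# Draw 1000 rows (IDs) with replacement
df_sample <- slice_sample(df_full, n = 1000, replace = TRUE)
# Number of distinct individuals among the sampled IDs
unique_individuals <- length(unique(df_sample$individual))
```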
> I would then like to generate a plot that shows how many IDs would
> need to be selected to accumulate all of the unique individuals, so
> that the curve reaches an asymptote, with IDs on the x-axis and
> individuals on the y-axis.
You can wrap that into a function that returns the count for different numbers of IDs, individuals, and draws, and then plot the results using `ggplot` or another plotting tool. However, this strikes me as a combinatorics question that is perhaps better suited to https://math.stackexchange.com/, since the values depend entirely on the numbers of IDs and individuals.
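To make the combinatorics remark concrete: if each draw picks one of the N IDs uniformly at random with replacement, an individual that owns k_i of the IDs is missed by a single draw with probability 1 - k_i/N, so the expected number of distinct individuals after n draws is the sum over individuals of 1 - (1 - k_i/N)^n. The sketch below assumes uniform sampling of IDs; `expected_unique` and `df_toy` are illustrative names, not part of the original answer.

```R
# Toy data from the question: 7 unique IDs, 4 individuals
df_toy <- data.frame(id = 1:7,
                     individual = c("A", "B", "A", "C", "D", "D", "D"))

# Hypothetical helper: expected number of distinct individuals after n draws
# with replacement, assuming every ID is equally likely on each draw
expected_unique <- function(individual, n) {
  # p[i] = probability that a single draw hits individual i
  p <- as.numeric(table(individual)) / length(individual)
  sapply(n, function(k) sum(1 - (1 - p)^k))
}

n_draws <- 0:50
exp_curve <- expected_unique(df_toy$individual, n_draws)

plot(n_draws, exp_curve, type = "l",
     xlab = "Number of draws", ylab = "Expected unique individuals")
abline(h = length(unique(df_toy$individual)), lty = 2)  # asymptote
```

The curve starts at 0, equals exactly 1 after the first draw, and approaches the total number of individuals from below. A simulated curve should scatter around it.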
Edit: in a function:
```R
library(dplyr)

N_id <- 2000
N_individuals <- 50
N_draws <- 1000

# Simulate the data, draw n_draws IDs with replacement, and report how many
# distinct individuals were seen
sample_df_parameterized <- function(n_id, n_individuals, n_draws) {
  df_full <- data.frame(id = 1:n_id,
                        individual = sample(1:n_individuals, n_id, replace = TRUE))
  df_sample <- slice_sample(df_full, n = n_draws, replace = TRUE)
  unique_individuals <- length(unique(df_sample$individual))
  result_df <- data.frame(n_id = n_id,
                          n_individuals = n_individuals,
                          n_draws = n_draws,
                          unique_individuals = unique_individuals)
  return(result_df)
}

sample_df_parameterized(n_id = N_id,
                        n_individuals = N_individuals,
                        n_draws = N_draws)
```
</details>
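Because the question's edit asks for the number of draws on the x-axis, one option is to record the running count of distinct individuals after every draw rather than only the final total. The following is a minimal sketch in base R, not the answer's own code: `accumulate_individuals` is a hypothetical helper, the seed is arbitrary, and `cumsum(!duplicated(x))` is used to count first appearances.

```R
set.seed(1)  # arbitrary seed so the run is reproducible

# Simulated data in the style of the answer: 2000 unique IDs, up to 50 individuals
df_full <- data.frame(id = 1:2000,
                      individual = sample(1:50, 2000, replace = TRUE))

# Hypothetical helper: draw n_draws rows with replacement and return the
# running number of distinct individuals after each draw
accumulate_individuals <- function(df, n_draws) {
  rows <- sample(nrow(df), n_draws, replace = TRUE)
  seen <- df$individual[rows]
  data.frame(draw = seq_len(n_draws),
             unique_individuals = cumsum(!duplicated(seen)))
}

curve_df <- accumulate_individuals(df_full, n_draws = 1000)

# First draw at which every individual in the data has been seen (NA if never)
n_total <- length(unique(df_full$individual))
first_complete <- match(n_total, curve_df$unique_individuals)

plot(curve_df$draw, curve_df$unique_individuals, type = "l",
     xlab = "Number of draws (with replacement)",
     ylab = "Cumulative unique individuals")
abline(h = n_total, lty = 2)  # total number of individuals in the data
```

If `first_complete` comes back as `NA`, the chosen number of draws was too small for this run to reach the asymptote, and `n_draws` can be increased.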
# Answer 2
**Score**: 1
Here's an attempt in base R, with a custom function to do the counting of cumulative unique values:
Larger example data:
```R
set.seed(2)
# One row per unique ID, each assigned to one of 100 individuals
dat <- data.frame(ID=1:500, individual=sample(1:100, 500, replace=TRUE))
length(unique(dat$individual))
## 100
```
Sample, count cumulative values, and plot:
```R
# Draw 1000 rows with replacement
tmp <- dat[sample(seq_len(nrow(dat)), 1000, replace=TRUE),]
# Running count of distinct values seen so far
cumfun <- function(x) lengths(Reduce(union, x, accumulate=TRUE))
idcum <- cumfun(tmp$ID)
indcum <- cumfun(tmp$individual)
plot(idcum, indcum, type="l")
```
You could also spruce up the plot a bit by adding a line of best fit of your choosing and some nicer axes:
```R
plot(idcum, indcum, type="l", ylim=c(0,100), las=1,
     xlab="Cumulative ID count", ylab="Cumulative Individuals count",
     cex.lab=0.8, cex.axis=0.8, lty=2)
# Saturating curve x/(a + b*x), fitted by nonlinear least squares
f <- function(x,a,b) {x/(a+b*x)}
fit <- nls(indcum ~ f(idcum,a,b), start=c(a=1,b=1))
curve(do.call(f, c(list(x), coef(fit))), add=TRUE, col="red")
```
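A single run gives only one realisation of the curve. To estimate how many draws are typically needed before every individual has appeared, the sampling can be repeated many times. This is a rough sketch on similar simulated data (with one row per unique ID); `draws_needed`, the chunk size, and the 200 repetitions are illustrative assumptions rather than part of the answer.

```R
set.seed(2)
dat <- data.frame(ID = 1:500, individual = sample(1:100, 500, replace = TRUE))

# Hypothetical helper: keep drawing rows with replacement, in chunks, until
# every individual present in the data has been seen; return the number of
# draws that were needed
draws_needed <- function(df, chunk = 500) {
  n_total <- length(unique(df$individual))
  seen <- df$individual[0]  # empty vector of the right type
  repeat {
    rows <- sample(nrow(df), chunk, replace = TRUE)
    seen <- c(seen, df$individual[rows])
    counts <- cumsum(!duplicated(seen))  # running count of distinct individuals
    hit <- match(n_total, counts)        # first draw where all have been seen
    if (!is.na(hit)) return(hit)
  }
}

n_needed <- replicate(200, draws_needed(dat))

summary(n_needed)
hist(n_needed, main = "", xlab = "Draws needed to see all individuals")
```

The spread of `n_needed` is usually wide, because the last few individuals to appear tend to be those with only one or two IDs in the pool.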