英文:
Replace values in one column with randomly generated values using group_by and mutate, while making sure every set of values is unique in R
问题
ifelse(!duplicated(data.new$ID), generateRandomString(), data.new$ID)
英文:
I am writing a function to replace ID's in one column with randomly generated ones. Suppose I have a dataset:
df <- data.frame(ID = c(1, 1, 2, 2, 1, 3), Name = c("Joseph", "Joseph", "Leo", "Leo", "Joseph", "David"))
ID Name
1 1 Joseph
2 1 Joseph
3 2 Leo
4 2 Leo
5 1 Joseph
6 3 David
The goal of the function is to group the dataset by ID and replace all unique ID's with randomly generated ones, like this:
library(tidyverse)
generateRandomString <- function() {
sample(LETTERS, 1)
}
replaceUniqueID <- function(data, column_id) {
name.id <- data[grep(column_id, colnames(data))]
data.new <- data %>%
group_by(data[grep(column_id, colnames(data))]) %>%
mutate_at(column_id, funs(generateRandomString()))
data.new
}
replaceUniqueID(df, "ID")
# A tibble: 6 × 2
# Groups: ID [3]
ID Name
<chr> <chr>
1 D Joseph
2 D Joseph
3 R Leo
4 R Leo
5 D Joseph
6 Q David
My problem is, I want the code to make sure every ID gets replaced with unique string, e.g. David
and Leo
can't both have D
as a replacement for their respective ID's.
I tried to come up with an ifelse
statement within mutate
but was unable to figure out how to do that.
答案1
得分: 0
I will only translate the code part for you:
我认为你可以使用翻译词典来相对容易地解决这个问题。
genRandomIDs <- function(ID, min_length = 2) {
ID <- unique(ID)
len <- max(min_length, length(ID) %/% 26 + 1)
ltrs <- do.call(paste0,
do.call(expand.grid, replicate(len, LETTERS, simplify=FALSE))
)
sample(ltrs, length(ID))
}
set.seed(42)
IDdict <- df %>%
distinct(ID) %>%
mutate(newID = genRandomIDs(ID))
IDdict
# ID newID
# 1 1 OV
# 2 2 IM
# 3 3 WF
df %>%
left_join(IDdict, by = "ID")
# ID Name newID
# 1 1 Joseph OV
# 2 1 Joseph OV
# 3 2 Leo IM
# 4 2 Leo IM
# 5 1 Joseph OV
# 6 3 David WF
这是代码的翻译部分。
英文:
I think you can use a translation dictionary to solve this relatively easily.
genRandomIDs <- function(ID, min_length = 2) {
ID <- unique(ID)
len <- max(min_length, length(ID) %/% 26 + 1)
ltrs <- do.call(paste0,
do.call(expand.grid, replicate(len, LETTERS, simplify=FALSE))
)
sample(ltrs, length(ID))
}
set.seed(42)
IDdict <- df %>%
distinct(ID) %>%
mutate(newID = genRandomIDs(ID))
IDdict
# ID newID
# 1 1 OV
# 2 2 IM
# 3 3 WF
df %>%
left_join(IDdict, by = "ID")
# ID Name newID
# 1 1 Joseph OV
# 2 1 Joseph OV
# 3 2 Leo IM
# 4 2 Leo IM
# 5 1 Joseph OV
# 6 3 David WF
Walk-through:
genRandomIDs
is just a helper function that internally produces a vector of alln
-long letter permutations (combined withpaste0
) and samples from them;- the
do.call(expand.grid, ...)
gives us a frame that expands on eachlen
grouping of letters; that is,expand.grid(LETTERS[1:3],LETTERS[1:3],LETTERS[1:3])
gives us3^3
permutations of three letters - the
do.call(paste0, ...)
takes that frame fromexpand.grid
(which is really just a glorifiedlist
) and produces one string per "row".
- the
distinct(ID)
reduces yourdf
to just one row perID
;- since we produce one
newID
for each uniqueID
, we now have a 1-to-1 mapping from old-to-new; - the
left_join
assigns thenewID
for each row (if you aren't familiar with merges/joins, see https://stackoverflow.com/q/1299871/3358272, https://stackoverflow.com/q/5706437/3358272)
Note: this does not really scale well: since we explode the possible combinations with expand.grid
, for a min-length of 2 letters, we produce 676 (26^2
) permutations, not a problem. 26^3
produces 17576 possible combinations, whether or not we have that many IDs to uniquify. 26^4
(4 letters) produces 456976, and its delay is "palpable". Five letters is over 11 million, which becomes "stupid" to try to scale to that length (assuming you have that many unique ID
s or choose a string of that long.
However ... while inefficient, this method is guaranteed to give you unique newID
s. There are other ways that may be guaranteed at the expense of a (however small) increase in complexity).
Okay, the "increased complexity" here for a more efficient process:
num2alpha <- function(num, chr = letters, zero = "", sep = "") {
len <- length(chr)
stopifnot(len > 1)
signs <- ifelse(!is.na(num) & sign(num) < 0, "-", "")
num <- as.integer(abs(num))
is0 <- !is.na(num) & num < 1e-9
# num[num < 1] <- NA
out <- character(length(num))
mult <- 0
while (any(!is.na(num) & num > 0)) {
not0 <- !is.na(num) & num > 0
out[not0] <- paste0(chr[(num[not0] - 1) %% len + 1], sep, out[not0])
num[not0] <- (num[not0] - 1) %/% len
}
if (nzchar(sep)) out <- sub(paste0(sep, "$"), "", out)
out[is0] <- zero
out[is.na(num)] <- NA
out[!is.na(out)] <- paste0(signs[!is.na(out)], out[!is.na(out)])
out
}
IDdict <- df %>%
distinct(ID) %>%
mutate(newID = num2alpha(row_number()))
IDdict
# ID newID
# 1 1 a
# 2 2 b
# 3 3 c
df %>%
left_join(IDdict, by = "ID")
# ID Name newID
# 1 1 Joseph a
# 2 1 Joseph a
# 3 2 Leo b
# 4 2 Leo b
# 5 1 Joseph a
# 6 3 David c
The num2alpha
works more efficiently (using lower-case here, easily changed by using num2alpha(.., chr=LETTERS)
), though it is deterministic here. If you are at all concerned about that, then
IDdict <- df %>%
distinct(ID) %>%
mutate(newID = sample(num2alpha(row_number())))
will randomize them for you.
Note that this produces single-letter strings up through 26, then cycles through 2-digit and 3-digit. It also recognized negatives, and while the defatul
num2alpha(c(-5, 0, NA, 1, 25:27, 51:53, 999999), zero="0")
# [1] "-e" "0" NA "a" "y" "z" "aa" "ay" "az" "ba" "bdwgm"
(Note that this is not a simple base-converter, since we're ignoring "0"-values. Try num2alpha(14:16, c(1:9, LETTERS[1:6]), zero="0")
. Perhaps it can be made to be more general.)
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论