2023年5月14日 22:33:16go评论92阅读模式

英文:

Replace values in one column with randomly generated values using group_by and mutate, while making sure every set of values is unique in R

问题

ifelse(!duplicated(data.new$ID), generateRandomString(), data.new$ID)

英文:

I am writing a function to replace ID's in one column with randomly generated ones. Suppose I have a dataset:

df &lt;- data.frame(ID = c(1, 1, 2, 2, 1, 3), Name = c(&quot;Joseph&quot;, &quot;Joseph&quot;, &quot;Leo&quot;, &quot;Leo&quot;, &quot;Joseph&quot;, &quot;David&quot;))

  ID   Name
1  1 Joseph
2  1 Joseph
3  2    Leo
4  2    Leo
5  1 Joseph
6  3  David

The goal of the function is to group the dataset by ID and replace all unique ID's with randomly generated ones, like this:

library(tidyverse)
generateRandomString &lt;- function() {
    sample(LETTERS, 1)
}
replaceUniqueID &lt;- function(data, column_id) {
  name.id &lt;- data[grep(column_id, colnames(data))]
  data.new &lt;- data %&gt;%
      group_by(data[grep(column_id, colnames(data))]) %&gt;% 
      mutate_at(column_id, funs(generateRandomString()))
data.new
}
replaceUniqueID(df, &quot;ID&quot;)
# A tibble: 6 &#215; 2
# Groups:   ID [3]
  ID    Name  
  &lt;chr&gt; &lt;chr&gt; 
1 D     Joseph
2 D     Joseph
3 R     Leo   
4 R     Leo   
5 D     Joseph
6 Q     David

My problem is, I want the code to make sure every ID gets replaced with unique string, e.g. David and Leo can't both have D as a replacement for their respective ID's.

I tried to come up with an ifelse statement within mutate but was unable to figure out how to do that.

答案1

得分: 0

I will only translate the code part for you:

我认为你可以使用翻译词典来相对容易地解决这个问题。
genRandomIDs <- function(ID, min_length = 2) {
  ID <- unique(ID)
  len <- max(min_length, length(ID) %/% 26 + 1)
  ltrs <- do.call(paste0,
    do.call(expand.grid, replicate(len, LETTERS, simplify=FALSE))
  )
  sample(ltrs, length(ID))
}
set.seed(42)
IDdict <- df %>%
  distinct(ID) %>%
  mutate(newID = genRandomIDs(ID))
IDdict
#   ID newID
# 1  1    OV
# 2  2    IM
# 3  3    WF
df %>%
  left_join(IDdict, by = "ID")
#   ID   Name newID
# 1  1 Joseph    OV
# 2  1 Joseph    OV
# 3  2    Leo    IM
# 4  2    Leo    IM
# 5  1 Joseph    OV
# 6  3  David    WF

这是代码的翻译部分。

英文:

I think you can use a translation dictionary to solve this relatively easily.

genRandomIDs &lt;- function(ID, min_length = 2) {
  ID &lt;- unique(ID)
  len &lt;- max(min_length, length(ID) %/% 26 + 1)
  ltrs &lt;- do.call(paste0,
    do.call(expand.grid, replicate(len, LETTERS, simplify=FALSE))
  )
  sample(ltrs, length(ID))
}
set.seed(42)
IDdict &lt;- df %&gt;%
  distinct(ID) %&gt;%
  mutate(newID = genRandomIDs(ID))
IDdict
#   ID newID
# 1  1    OV
# 2  2    IM
# 3  3    WF
df %&gt;%
  left_join(IDdict, by = &quot;ID&quot;)
#   ID   Name newID
# 1  1 Joseph    OV
# 2  1 Joseph    OV
# 3  2    Leo    IM
# 4  2    Leo    IM
# 5  1 Joseph    OV
# 6  3  David    WF

Walk-through:

genRandomIDs is just a helper function that internally produces a vector of all n-long letter permutations (combined with paste0) and samples from them;
- the do.call(expand.grid, ...) gives us a frame that expands on each len grouping of letters; that is, expand.grid(LETTERS[1:3],LETTERS[1:3],LETTERS[1:3]) gives us 3^3 permutations of three letters
- the do.call(paste0, ...) takes that frame from expand.grid (which is really just a glorified list) and produces one string per "row".
distinct(ID) reduces your df to just one row per ID;
since we produce one newID for each unique ID, we now have a 1-to-1 mapping from old-to-new;
the left_join assigns the newID for each row (if you aren't familiar with merges/joins, see https://stackoverflow.com/q/1299871/3358272, https://stackoverflow.com/q/5706437/3358272)

Note: this does not really scale well: since we explode the possible combinations with expand.grid, for a min-length of 2 letters, we produce 676 (26^2) permutations, not a problem. 26^3 produces 17576 possible combinations, whether or not we have that many IDs to uniquify. 26^4 (4 letters) produces 456976, and its delay is "palpable". Five letters is over 11 million, which becomes "stupid" to try to scale to that length (assuming you have that many unique IDs or choose a string of that long.

However ... while inefficient, this method is guaranteed to give you unique newIDs. There are other ways that may be guaranteed at the expense of a (however small) increase in complexity).

Okay, the "increased complexity" here for a more efficient process:

num2alpha &lt;- function(num, chr = letters, zero = &quot;&quot;, sep = &quot;&quot;) {
  len &lt;- length(chr)
  stopifnot(len &gt; 1)
  signs &lt;- ifelse(!is.na(num) &amp; sign(num) &lt; 0, &quot;-&quot;, &quot;&quot;)
  num &lt;- as.integer(abs(num))
  is0 &lt;- !is.na(num) &amp; num &lt; 1e-9
  # num[num &lt; 1] &lt;- NA
  out &lt;- character(length(num))
  mult &lt;- 0
  while (any(!is.na(num) &amp; num &gt; 0)) {
    not0 &lt;- !is.na(num) &amp; num &gt; 0
    out[not0] &lt;- paste0(chr[(num[not0] - 1) %% len + 1], sep, out[not0])
    num[not0] &lt;- (num[not0] - 1) %/% len
  }
  if (nzchar(sep)) out &lt;- sub(paste0(sep, &quot;$&quot;), &quot;&quot;, out)
  out[is0] &lt;- zero
  out[is.na(num)] &lt;- NA
  out[!is.na(out)] &lt;- paste0(signs[!is.na(out)], out[!is.na(out)])
  out
}
IDdict &lt;- df %&gt;%
  distinct(ID) %&gt;%
  mutate(newID = num2alpha(row_number()))
IDdict
#   ID newID
# 1  1     a
# 2  2     b
# 3  3     c
df %&gt;%
  left_join(IDdict, by = &quot;ID&quot;)
#   ID   Name newID
# 1  1 Joseph     a
# 2  1 Joseph     a
# 3  2    Leo     b
# 4  2    Leo     b
# 5  1 Joseph     a
# 6  3  David     c

The num2alpha works more efficiently (using lower-case here, easily changed by using num2alpha(.., chr=LETTERS)), though it is deterministic here. If you are at all concerned about that, then

IDdict &lt;- df %&gt;%
  distinct(ID) %&gt;%
  mutate(newID = sample(num2alpha(row_number())))

will randomize them for you.

Note that this produces single-letter strings up through 26, then cycles through 2-digit and 3-digit. It also recognized negatives, and while the defatul

num2alpha(c(-5, 0, NA, 1, 25:27, 51:53, 999999), zero=&quot;0&quot;)
#  [1] &quot;-e&quot;    &quot;0&quot;     NA      &quot;a&quot;     &quot;y&quot;     &quot;z&quot;     &quot;aa&quot;    &quot;ay&quot;    &quot;az&quot;    &quot;ba&quot;    &quot;bdwgm&quot;

(Note that this is not a simple base-converter, since we're ignoring "0"-values. Try num2alpha(14:16, c(1:9, LETTERS[1:6]), zero="0"). Perhaps it can be made to be more general.)

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Replace values in one column with randomly generated values using group_by and mutate, while making sure every set of values is unique in R

问题

答案1

如何使用R中的ggplot更改我的箱线图的顺序？

在Reddit上按时间段统计字符串提及次数（dplyr）。

如何将发生在时间接近的事件分组。

R/Shiny – 带有复选框和总计数的可滚动数据表格

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。