How to remove character strings that are detected/contained within other character strings, but only within a specified group_by()-argument

Question

Let's say I have:

    > w
       digest    gene          seq
    1     InS  AB0583          AAB
    2     InS  AB0583        AABKR
    3     InS  AB0583      GFHGHGG
    4     PAC PU83022          EUT
    5     PAC PU83022      HSFSFJF
    6     PAC PU83022        EUTCK
    7     PAC PU83022       EUTCKJ
    8     InS PO93853         HDGJ
    9     InS PO93853        HDGJU
    10    InS PO93853       YTYEYD
    11    InS PO93853 YTYEYDJHSGSG
    12    InS PO93853   SALGHAGGEE

I have applied two different methods to identify proteins (denoted by their gene name, w$gene). The methods are encoded in w$digest. As you can see, sequences in w$seq may overlap within each w$gene within each w$digest, e.g. EUT is contained in EUTCK, which is in turn contained in EUTCKJ.

I want to know how many unique amino acids were identified (each letter in w$seq is one amino acid). Therefore, I need to remove every character string that is detected within another character string, but only within each group defined by group_by(digest, gene). The character string with the most characters should be kept.

I am looking for a tidyverse solution.

Help needed:

(1) Count the number of characters, and arrange as follows:

    w <- w %>%
      mutate(count = str_count(seq)) %>%
      arrange(digest, gene, count)

So that

    > w
      digest    gene     seq count
    1    InS  AB0583     AAB     3
    2    InS  AB0583   AABKR     5
    3    InS  AB0583 GFHGHGG     7
    4    InS PO93853    HDGJ     4
    5    InS PO93853   HDGJU     5
    6    InS PO93853  YTYEYD     6

(2) group_by(digest, gene), and now remove rows that contain a w$seq that is detected within another w$seq (within this grouping), and keep the row where the w$seq has most characters.

Output

    > w
       digest    gene          seq count
    1     InS  AB0583          AAB     3 #* found within:
    2     InS  AB0583        AABKR     5 #*
    3     InS  AB0583      GFHGHGG     7
    4     InS PO93853         HDGJ     4 #** found within:
    5     InS PO93853        HDGJU     5 #**
    6     InS PO93853       YTYEYD     6 #***
    7     InS PO93853   SALGHAGGEE    10
    8     InS PO93853 YTYEYDJHSGSG    12 #***
    9     PAC PU83022          EUT     3 #****
    10    PAC PU83022        EUTCK     5 #****
    11    PAC PU83022       EUTCKJ     6 #****
    12    PAC PU83022      HSFSFJF     7

Therefore, the expected output is:

    > w
      digest    gene          seq count
    1    InS  AB0583        AABKR     5
    2    InS  AB0583      GFHGHGG     7
    3    InS PO93853        HDGJU     5
    4    InS PO93853   SALGHAGGEE    10
    5    InS PO93853 YTYEYDJHSGSG    12
    6    PAC PU83022       EUTCKJ     6
    7    PAC PU83022      HSFSFJF     7

Data

    w <- data.frame(
      digest = c(rep("InS", 3), rep("PAC", 4), rep("InS", 5)),
      gene = c(rep("AB0583", 3), rep("PU83022", 4), rep("PO93853", 5)),
      seq = c("AAB", "AABKR", "GFHGHGG",
              "EUT", "HSFSFJF", "EUTCK", "EUTCKJ",
              "HDGJ", "HDGJU", "YTYEYD", "YTYEYDJHSGSG", "SALGHAGGEE")
    )

Answer 1

Score: 3


For each group in group_by() you could make a new list column where each row contains all the seq values for that group. You could then do a row-wise operation where you count the number of times each value of seq shows up in all the values. Keeping the ones that only show up once will give you the result you want.

    library(dplyr)
    library(stringr)
    w <- data.frame(
      digest = c(rep("InS", 3), rep("PAC", 4), rep("InS", 5)),
      gene = c(rep("AB0583", 3), rep("PU83022", 4), rep("PO93853", 5)),
      seq = c("AAB", "AABKR", "GFHGHGG",
              "EUT", "HSFSFJF", "EUTCK", "EUTCKJ",
              "HDGJ", "HDGJU", "YTYEYD", "YTYEYDJHSGSG", "SALGHAGGEE")
    )
    # add the sequence length and sort within digest/gene
    w <- w %>%
      mutate(count = str_count(seq)) %>%
      arrange(digest, gene, count)
    w %>% group_by(digest, gene) %>%
      mutate(all_vals = list(seq)) %>%              # every row carries its group's sequences
      rowwise() %>%
      mutate(win = sum(grepl(seq, all_vals))) %>%   # number of group members containing seq
      filter(win == 1) %>%                          # keep sequences found only in themselves
      dplyr::select(-c(win, all_vals))
    #> # A tibble: 7 × 4
    #> # Rowwise:  digest, gene
    #>   digest gene    seq          count
    #>   <chr>  <chr>   <chr>        <int>
    #> 1 InS    AB0583  AABKR            5
    #> 2 InS    AB0583  GFHGHGG          7
    #> 3 InS    PO93853 HDGJU            5
    #> 4 InS    PO93853 SALGHAGGEE      10
    #> 5 InS    PO93853 YTYEYDJHSGSG    12
    #> 6 PAC    PU83022 EUTCKJ           6
    #> 7 PAC    PU83022 HSFSFJF          7

<sup>Created on 2023-05-25 with reprex v2.0.2</sup>
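
One thing to keep in mind with the approach above: grepl() treats seq as a regular expression, and a sequence that occurs twice in the same group is counted twice, so both copies are dropped. Neither matters for the example data. The following is only a sketch of one way to guard against both cases; the helper name is_maximal() is hypothetical and not part of the answer. It matches literally with fixed = TRUE and collapses exact duplicates with distinct() first.

```r
library(dplyr)

# Hypothetical helper (not from the answer above): TRUE for every string that
# is not contained in any other, different string of the same vector.
is_maximal <- function(x) {
  vapply(x, function(s) {
    others <- setdiff(x, s)               # distinct strings other than s
    !any(grepl(s, others, fixed = TRUE))  # literal substring test, no regex
  }, logical(1), USE.NAMES = FALSE)
}

w <- data.frame(
  digest = c(rep("InS", 3), rep("PAC", 4), rep("InS", 5)),
  gene = c(rep("AB0583", 3), rep("PU83022", 4), rep("PO93853", 5)),
  seq = c("AAB", "AABKR", "GFHGHGG",
          "EUT", "HSFSFJF", "EUTCK", "EUTCKJ",
          "HDGJ", "HDGJU", "YTYEYD", "YTYEYDJHSGSG", "SALGHAGGEE"),
  stringsAsFactors = FALSE
)

res <- w %>%
  distinct(digest, gene, seq) %>%  # collapse exact duplicates within a group
  group_by(digest, gene) %>%
  filter(is_maximal(seq)) %>%      # evaluated once per digest/gene group
  ungroup()

res
```

On the example data this keeps the same seven sequences as the answer above. Because filter() runs the helper once per group, no list column or rowwise() step is needed.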

Answer 2

Score: 1


Edit:

Here is a potential solution:

    library(tidyverse)
    w <- data.frame(
      digest = c(rep("InS", 3), rep("PAC", 4), rep("InS", 5)),
      gene = c(rep("AB0583", 3), rep("PU83022", 4), rep("PO93853", 5)),
      seq = c("AAB", "AABKR", "GFHGHGG",
              "EUT", "HSFSFJF", "EUTCK", "EUTCKJ",
              "HDGJ", "HDGJU", "YTYEYD", "YTYEYDJHSGSG", "SALGHAGGEE")
    )
    w %>%
      mutate(count = str_count(seq)) %>%
      arrange(digest, gene, count) %>%
      group_by(digest, gene) %>%
      # join the group's sequences into one string and keep those that occur in it exactly once
      filter(str_count(paste0(seq, collapse = "_"), seq) == 1)
    #> # A tibble: 7 × 4
    #> # Groups:   digest, gene [3]
    #>   digest gene    seq          count
    #>   <chr>  <chr>   <chr>        <int>
    #> 1 InS    AB0583  AABKR            5
    #> 2 InS    AB0583  GFHGHGG          7
    #> 3 InS    PO93853 HDGJU            5
    #> 4 InS    PO93853 SALGHAGGEE      10
    #> 5 InS    PO93853 YTYEYDJHSGSG    12
    #> 6 PAC    PU83022 EUTCKJ           6
    #> 7 PAC    PU83022 HSFSFJF          7

<sup>Created on 2023-05-25 with reprex v2.0.2</sup>
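
To see why the filter() condition above works, it may help to run its two pieces on a single group outside the pipeline. This is only an illustration using the PAC / PU83022 sequences from the question; the variable names seqs, joined and hits are made up for the example.

```r
library(stringr)

# The four sequences of the PAC / PU83022 group from the question
seqs <- c("EUT", "EUTCK", "EUTCKJ", "HSFSFJF")

# Step 1: join the whole group into one string; "_" does not occur in any
# sequence, so a match can never span two neighbouring sequences
joined <- paste0(seqs, collapse = "_")
joined
#> [1] "EUT_EUTCK_EUTCKJ_HSFSFJF"

# Step 2: count how often each sequence occurs in the joined string
hits <- str_count(joined, seqs)
hits
#> [1] 3 2 1 1

# Step 3: a sequence that occurs exactly once is only found in itself
seqs[hits == 1]
#> [1] "EUTCKJ"  "HSFSFJF"
```

EUT is counted three times (in itself, in EUTCK and in EUTCKJ), so it is dropped, while EUTCKJ and HSFSFJF each occur once and are kept. Note that str_count() interprets each sequence as a regular expression, which is harmless for plain amino-acid letters.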


Original answer:

This is a bit awkward, but it 'works':

    library(tidyverse)
    w <- data.frame(
      digest = c(rep("InS", 3), rep("PAC", 4), rep("InS", 5)),
      gene = c(rep("AB0583", 3), rep("PU83022", 4), rep("PO93853", 5)),
      seq = c("AAB", "AABKR", "GFHGHGG",
              "EUT", "HSFSFJF", "EUTCK", "EUTCKJ",
              "HDGJ", "HDGJU", "YTYEYD", "YTYEYDJHSGSG", "SALGHAGGEE")
    )
    w %>%
      mutate(count = str_count(seq)) %>%
      arrange(digest, gene, count) %>%
      group_by(digest, gene) %>%
      mutate(strings = paste0(seq, collapse = "|")) %>%                    # all sequences of the group
      rowwise() %>%
      mutate(strings = gsub(paste0("\\b", seq, "\\b"), "", strings)) %>%   # remove the row's own sequence
      filter(!grepl(seq, strings)) %>%                                     # drop it if still found elsewhere
      select(-strings) %>%
      ungroup()
    #> # A tibble: 7 × 4
    #>   digest gene    seq          count
    #>   <chr>  <chr>   <chr>        <int>
    #> 1 InS    AB0583  AABKR            5
    #> 2 InS    AB0583  GFHGHGG          7
    #> 3 InS    PO93853 HDGJU            5
    #> 4 InS    PO93853 SALGHAGGEE      10
    #> 5 InS    PO93853 YTYEYDJHSGSG    12
    #> 6 PAC    PU83022 EUTCKJ           6
    #> 7 PAC    PU83022 HSFSFJF          7

<sup>Created on 2023-05-25 with reprex v2.0.2</sup>

Answer 3

Score: 1


Using base R:

    # add count and sort
    w$count <- nchar(w$seq)
    w <- w[ with(w, order(digest, gene, count)), ]
    # subset
    w[ sapply(Map(grepl, w$seq, ave(w$seq, w[ 1:2 ], FUN = list)), sum) == 1, ]
    #    digest    gene          seq count
    # 2     InS  AB0583        AABKR     5
    # 3     InS  AB0583      GFHGHGG     7
    # 9     InS PO93853        HDGJU     5
    # 12    InS PO93853   SALGHAGGEE    10
    # 11    InS PO93853 YTYEYDJHSGSG    12
    # 7     PAC PU83022       EUTCKJ     6
    # 5     PAC PU83022      HSFSFJF     7
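
The one-liner above is compact but dense. As a rough sketch of the same idea written out step by step, the loop below visits each digest/gene group and keeps a sequence only if it is found in exactly one member of its group, namely itself. The names grp, keep, kept and totals are made up for this example. The final aggregate() call rests on an assumption about the question's goal: that the number of identified amino acids per group can be taken as the summed length of the retained sequences.

```r
w <- data.frame(
  digest = c(rep("InS", 3), rep("PAC", 4), rep("InS", 5)),
  gene = c(rep("AB0583", 3), rep("PU83022", 4), rep("PO93853", 5)),
  seq = c("AAB", "AABKR", "GFHGHGG",
          "EUT", "HSFSFJF", "EUTCK", "EUTCKJ",
          "HDGJ", "HDGJU", "YTYEYD", "YTYEYDJHSGSG", "SALGHAGGEE"),
  stringsAsFactors = FALSE
)
w$count <- nchar(w$seq)

# One grouping factor per digest/gene combination that actually occurs
grp  <- interaction(w$digest, w$gene, drop = TRUE)
keep <- logical(nrow(w))

for (g in levels(grp)) {
  idx <- which(grp == g)   # rows belonging to this group
  s   <- w$seq[idx]        # the group's sequences
  # keep a sequence if it is contained in exactly one group member (itself)
  keep[idx] <- vapply(s, function(p) sum(grepl(p, s, fixed = TRUE)) == 1,
                      logical(1))
}

kept <- w[keep, ]

# Assumption: residues per group = summed length of the retained sequences
totals <- aggregate(count ~ digest + gene, data = kept, FUN = sum)
totals
```

Unlike the one-liner, this version does not need the data to be sorted first, and fixed = TRUE makes the substring test literal instead of a regular expression.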

huangapple
  • Published on 2023-05-25 17:34:05
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76330831.html