How to replace specific string within string using dictionary from data frame?

Question

I have a dictionary of bipartite (two-word) taxonomic names stored in a data frame (ddf), with one column of old names (ddf$old) and one column of new names (ddf$new). I would like to find the old names in a list of strings (listn) and replace them with the new names. The taxonomic names are only part of each string: every name is preceded by a string of varying length and is sometimes followed by additional text. Some old names may occur more than once in the list, and some may never appear.

    # Sample list; the real one has >300000 entries
    listn <- c("AB001440.1.1538 Pseudomonas coronafaciens pv. atropurpurea", "HG530070.1.1349 Trueperella pyogenes",
               "ET631036.837.2346 Jonquetella anthropi", "AB001448.1.1538 Pseudomonas savastanoi pv. phaseolicola",
               "HG530249.1.1462 Paucibacter toxinivorans", "HG530235.1.1493 Paucibacter toxinivorans",
               "AB001781.1.1507 Chlamydia psittaci", "AB001785.1.1507 Chlamydia felis",
               "AB001804.1.1507 Chlamydia psittaci", "AB001794.1.1507 Chlamydia psittaci")
    # Sample dictionary; the real one has >400 entries
    ddf <- data.frame(old = c("Lactobacillus casei", "Trueperella pyogenes", "Pseudomonas savastanoi"),
                      new = c("Newbacillus casei", "Newperella pyogenes", "Newudomonas savastanoi"))

I tried:

    listn2 <- stringr::str_replace_all(string = listn,
                                       pattern = ddf$old,
                                       replacement = ddf$new)
    list2 <- gsub(ddf$old, ddf$new, listn)

Neither method worked. I expect each old two-word name to be replaced by the corresponding new two-word name from the dictionary in the data frame. The old and new names can differ in length.
I also tried ChatGPT, but the script it provided did not solve the problem:

    # Sample input data
    selected_strings <- c("John likes apples", "Mary eats bananas", "David enjoys grapes")
    dictionary <- data.frame(old_name = c("John", "Mary", "David"),
                             new_name = c("Peter", "Alice", "Michael"),
                             stringsAsFactors = FALSE)
    # Function to replace bipartite names within selected strings
    replace_names <- function(strings, dictionary) {
      # Split the strings into individual words
      words <- unlist(strsplit(strings, " "))
      # Replace the names using the dictionary
      replaced_words <- words
      for (i in seq_len(nrow(dictionary))) {
        replaced_words[words == dictionary$old_name[i]] <- dictionary$new_name[i]
      }
      # Reconstruct the modified strings
      modified_strings <- sapply(strsplit(strings, " "), function(x) paste(replaced_words[x], collapse = " "))
      return(modified_strings)
    }
    # Call the function to replace names within selected strings
    modified_strings <- replace_names(selected_strings, dictionary)
    # Print the modified strings
    print(modified_strings)

Answer 1

Score: 1

You can use a named replacement:

    library(tidyverse)
    str_replace_all(listn, deframe(ddf))
     [1] "AB001440.1.1538 Pseudomonas coronafaciens pv. atropurpurea"
     [2] "HG530070.1.1349 Newperella pyogenes"
     [3] "ET631036.837.2346 Jonquetella anthropi"
     [4] "AB001448.1.1538 Newudomonas savastanoi pv. phaseolicola"
     [5] "HG530249.1.1462 Paucibacter toxinivorans"
     [6] "HG530235.1.1493 Paucibacter toxinivorans"
     [7] "AB001781.1.1507 Chlamydia psittaci"
     [8] "AB001785.1.1507 Chlamydia felis"
     [9] "AB001804.1.1507 Chlamydia psittaci"
    [10] "AB001794.1.1507 Chlamydia psittaci"

This is the same as:

    str_replace_all(listn, setNames(ddf$new, ddf$old))
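In both calls the second argument is a named character vector: the names are the old taxonomic names and the values are the new ones, so every pattern is tried against every string. By default str_replace_all treats those names as regular expressions, so a "." in a name would match any character. If literal matching is preferred, or stringr is not available, the same idea can be sketched in base R. The helper name replace_literal below is hypothetical (it is not part of the original answer), and the loop applies the replacements one after another, in dictionary order.

```r
# Sample data copied from the question.
listn <- c("AB001440.1.1538 Pseudomonas coronafaciens pv. atropurpurea", "HG530070.1.1349 Trueperella pyogenes",
           "ET631036.837.2346 Jonquetella anthropi", "AB001448.1.1538 Pseudomonas savastanoi pv. phaseolicola",
           "HG530249.1.1462 Paucibacter toxinivorans", "HG530235.1.1493 Paucibacter toxinivorans",
           "AB001781.1.1507 Chlamydia psittaci", "AB001785.1.1507 Chlamydia felis",
           "AB001804.1.1507 Chlamydia psittaci", "AB001794.1.1507 Chlamydia psittaci")
ddf <- data.frame(old = c("Lactobacillus casei", "Trueperella pyogenes", "Pseudomonas savastanoi"),
                  new = c("Newbacillus casei", "Newperella pyogenes", "Newudomonas savastanoi"),
                  stringsAsFactors = FALSE)

# Named character vector: names are the old names, values are the new names.
lookup <- setNames(ddf$new, ddf$old)

# Hypothetical helper (an assumption for illustration, not from the answer):
# replace every old name with its new name, matching the text literally.
replace_literal <- function(strings, lookup) {
  for (old_name in names(lookup)) {
    # fixed = TRUE disables regex interpretation of the old name.
    strings <- gsub(old_name, lookup[[old_name]], strings, fixed = TRUE)
  }
  strings
}

listn2 <- replace_literal(listn, lookup)
```

Each gsub call is vectorised over the whole of listn, so the loop runs once per dictionary entry rather than once per string. Old names that never occur in listn simply leave the strings unchanged.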

huangapple
  • Published on July 17, 2023, 23:04:03
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76705786.html