How to replace specific strings within a string using a dictionary from a data frame?

Question

I have a dictionary of two-word taxonomic names in the form of a data frame (ddf), with one column of old names (ddf$old) and one column of new names (ddf$new). I would like to find the old names and replace them with the new names in a vector of strings (listn). However, each taxonomic name is only part of a string. Some old names may occur more than once, and some may never appear in listn. Each old name is preceded by a string of varying length and is sometimes followed by additional text.

# sample list, the real one has >300000 entries
listn <- c("AB001440.1.1538 Pseudomonas coronafaciens pv. atropurpurea", "HG530070.1.1349 Trueperella pyogenes",
           "ET631036.837.2346 Jonquetella anthropi", "AB001448.1.1538 Pseudomonas savastanoi pv. phaseolicola",
           "HG530249.1.1462 Paucibacter toxinivorans", "HG530235.1.1493 Paucibacter toxinivorans",
           "AB001781.1.1507 Chlamydia psittaci", "AB001785.1.1507 Chlamydia felis",
           "AB001804.1.1507 Chlamydia psittaci", "AB001794.1.1507 Chlamydia psittaci")

# sample dictionary, the real one has >400 entries
ddf <- data.frame(old = c("Lactobacillus casei", "Trueperella pyogenes", "Pseudomonas savastanoi"),
                  new = c("Newbacillus casei", "Newperella pyogenes", "Newudomonas savastanoi"))

I tried:

listn2 <- stringr::str_replace_all(string = listn,
                                   pattern = ddf$old,
                                   replacement = ddf$new)

list2 <- gsub(ddf$old, ddf$new, listn)

Neither approach worked. I expect each two-word old name to be replaced by the corresponding two-word new name from the data frame. The old and new names can differ in length.
I also tried ChatGPT, but the script it produced did not solve the problem:

# Sample input data
selected_strings <- c("John likes apples", "Mary eats bananas", "David enjoys grapes")
dictionary <- data.frame(old_name = c("John", "Mary", "David"),
                         new_name = c("Peter", "Alice", "Michael"),
                         stringsAsFactors = FALSE)

# Function to replace bipartite names within selected strings
replace_names <- function(strings, dictionary) {
  # Split the strings into individual words
  words <- unlist(strsplit(strings, " "))

  # Replace the names using the dictionary
  replaced_words <- words
  for (i in seq_len(nrow(dictionary))) {
    replaced_words[words == dictionary$old_name[i]] <- dictionary$new_name[i]
  }

  # Reconstruct the modified strings
  modified_strings <- sapply(strsplit(strings, " "), function(x) paste(replaced_words[x], collapse = " "))

  return(modified_strings)
}

# Call the function to replace names within selected strings
modified_strings <- replace_names(selected_strings, dictionary)

# Print the modified strings
print(modified_strings)

Answer 1

Score: 1

You can pass a named vector to str_replace_all(), where the names are the patterns to find and the values are their replacements:

library(tidyverse)
str_replace_all(listn, deframe(ddf))

 [1] "AB001440.1.1538 Pseudomonas coronafaciens pv. atropurpurea"
 [2] "HG530070.1.1349 Newperella pyogenes"
 [3] "ET631036.837.2346 Jonquetella anthropi"
 [4] "AB001448.1.1538 Newudomonas savastanoi pv. phaseolicola"
 [5] "HG530249.1.1462 Paucibacter toxinivorans"
 [6] "HG530235.1.1493 Paucibacter toxinivorans"
 [7] "AB001781.1.1507 Chlamydia psittaci"
 [8] "AB001785.1.1507 Chlamydia felis"
 [9] "AB001804.1.1507 Chlamydia psittaci"
[10] "AB001794.1.1507 Chlamydia psittaci"

This is the same as:

str_replace_all(listn, setNames(ddf$new, ddf$old))
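For comparison, here is a minimal base R sketch of the same idea. It may also help explain why the attempts in the question did not work. gsub() uses only the first element of a multi-element pattern (and warns about it), and str_replace_all() with an unnamed pattern vector pairs patterns with strings element by element instead of applying every pattern to every string. The helper name replace_fixed is hypothetical and not part of any package. It assumes the old names should be matched literally, so it passes fixed = TRUE, which also avoids treating characters such as "." as regex metacharacters.

```r
# Sample data from the question
listn <- c("AB001440.1.1538 Pseudomonas coronafaciens pv. atropurpurea", "HG530070.1.1349 Trueperella pyogenes",
           "ET631036.837.2346 Jonquetella anthropi", "AB001448.1.1538 Pseudomonas savastanoi pv. phaseolicola",
           "HG530249.1.1462 Paucibacter toxinivorans", "HG530235.1.1493 Paucibacter toxinivorans",
           "AB001781.1.1507 Chlamydia psittaci", "AB001785.1.1507 Chlamydia felis",
           "AB001804.1.1507 Chlamydia psittaci", "AB001794.1.1507 Chlamydia psittaci")

ddf <- data.frame(old = c("Lactobacillus casei", "Trueperella pyogenes", "Pseudomonas savastanoi"),
                  new = c("Newbacillus casei", "Newperella pyogenes", "Newudomonas savastanoi"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: apply each old -> new pair in turn to the whole vector.
# fixed = TRUE treats each old name as a literal string, not a regular expression.
replace_fixed <- function(strings, old, new) {
  for (i in seq_along(old)) {
    strings <- gsub(old[i], new[i], strings, fixed = TRUE)
  }
  strings
}

listn2 <- replace_fixed(listn, ddf$old, ddf$new)

# Show only the entries that changed
listn2[listn2 != listn]
```

With the sample data, only the second and fourth entries change. The loop runs once per dictionary row (a few hundred iterations for the real dictionary), and each gsub() call is vectorized over all strings. Replacements are applied in dictionary order, so if one old name is a substring of another, or a new name matches a later old name, the order of the rows matters.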

huangapple
  • Posted on July 17, 2023 at 23:04:03
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76705786.html