重新组织一个数据框按列数值 R

huangapple go评论131阅读模式
英文:

Reorganize a dataframe in by column values R

问题

基于第一列的最后两个字符 'AH',我想要将Res1、Pos1和Chain1设置为与 'A' 相关的部分,因此第一行应该是:

File_pdb Res1 Pos1 Chain1 Res2 Pos2 Chain2
6wps_AH CYS 361 A TRP 105 H

这应该是一个通用的操作,因为有些行可能已经是正确的,而且在同一个数据框中可能有不同的行,有些是正确的,有些不是。谢谢!

以下是数据:

data.table::as.data.table(structure(list(File_pdb = c("6wps_AH", "6wps_AH", "6wps_AH", 
"6wps_AH", "6wps_AH", "6wps_AH", "6wps_AH", "6wps_AH", "6wps_AH", 
"6wps_AH", "6wps_AH", "6wps_AH", "6wps_AH", "6wps_AH", "6wps_AH", 
"6wps_AH"), Res1 = c("TRP", "GLU", "PHE", "SER", "ILE", "TYR", 
"GLY", "PHE", "LEU", "TRP", "PHE", "PRO", "ILE", "TRP", "ALA", 
"THR"), Pos1 = c("105", "108", "106", "109", "111", "100", "103", 
"106", "110", "105", "106", "28", "111", "105", "104", "30"), 
    Chain1 = c("H", "H", "H", "H", "H", "H", "H", "H", "H", "H", 
    "H", "H", "H", "H", "H", "H"), Res2 = c("CYS", "LYS", "LYS", 
    "THR", "THR", "ASN", "GLU", "GLU", "GLU", "PRO", "PRO", "LEU", 
    "ASN", "GLU", "GLU", "ASN"), Pos2 = c("361", "356", "356", 
    "345", "345", "343", "340", "340", "340", "337", "337", "335", 
    "343", "340", "340", "334"), Chain2 = c("A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A"
    )), row.names = c(NA, -16L), class = c("data.table", "data.frame"
)))

希望这可以帮助你。如果有其他问题,请告诉我。

英文:

I have this df

File_pdb Res1 Pos1 Chain1 Res2 Pos2 Chain2
6wps_AH  TRP  105      H  CYS  361      A
6wps_AH  GLU  108      H  LYS  356      A
6wps_AH  PHE  106      H  LYS  356      A
6wps_AH  SER  109      H  THR  345      A
6wps_AH  ILE  111      H  THR  345      A
6wps_AH  TYR  100      H  ASN  343      A

Based on the last two character on the first column 'AH', I-'d like to have as Res1, Pos1 and Chain1 what is referring to 'A' so CYS 361 A. So, the first line would be

File_pdb Res1 Pos1 Chain1 Res2 Pos2 Chain2
6wps_AH  CYS  361      A  TRP  105      H

It should be something general since it can happen that some rows are already right and also, we can have different rows in the same df, some right and others not. Thanks in advance!

here the data:

data.table::as.data.table(structure(list(File_pdb = c("6wps_AH", "6wps_AH", "6wps_AH", 
"6wps_AH", "6wps_AH", "6wps_AH", "6wps_AH", "6wps_AH", "6wps_AH", 
"6wps_AH", "6wps_AH", "6wps_AH", "6wps_AH", "6wps_AH", "6wps_AH", 
"6wps_AH"), Res1 = c("TRP", "GLU", "PHE", "SER", "ILE", "TYR", 
"GLY", "PHE", "LEU", "TRP", "PHE", "PRO", "ILE", "TRP", "ALA", 
"THR"), Pos1 = c("105", "108", "106", "109", "111", "100", "103", 
"106", "110", "105", "106", "28", "111", "105", "104", "30"), 
    Chain1 = c("H", "H", "H", "H", "H", "H", "H", "H", "H", "H", 
    "H", "H", "H", "H", "H", "H"), Res2 = c("CYS", "LYS", "LYS", 
    "THR", "THR", "ASN", "GLU", "GLU", "GLU", "PRO", "PRO", "LEU", 
    "ASN", "GLU", "GLU", "ASN"), Pos2 = c("361", "356", "356", 
    "345", "345", "343", "340", "340", "340", "337", "337", "335", 
    "343", "340", "340", "334"), Chain2 = c("A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A"
    )), row.names = c(NA, -16L), class = c("data.table", "data.frame"
)))

答案1

得分: 2

这是一个可行的方法(使用data.table),可以完成您的需求。最后的列顺序未保留,但可以使用setcolorder或类似方法手动处理。

我在这里使用magrittr::%>%进行管道操作;这并非必需,可以使用data.table的管道操作([...][...][...])或使用临时对象来替代。

library(data.table)
quux %>%
  .[, i := .I] %>%
  melt(id.vars = c("File_pdb", "i"), measure.vars = patterns(Res="^Res", Pos="^Pos", Chain="^Chain")) %>%
  .[, last2 := strsplit(substring(File_pdb, nchar(File_pdb) - 1, nchar(File_pdb)), "")] %>%
  .[, num := mapply(function(nm, l2) match(nm, l2), Chain, last2) ] %>%
  .[, c("variable", "last2") := NULL ] %>%
  dcast(File_pdb + i ~ num, value.var = c("Res", "Pos", "Chain"), sep = "") %>%
  .[, i := NULL] %>%
  .[]
#     File_pdb   Res1   Res2   Pos1   Pos2 Chain1 Chain2
#       <char> <char> <char> <char> <char> <char> <char>
#  1:  6wps_AH    CYS    TRP    361    105      A      H
#  2:  6wps_AH    LYS    GLU    356    108      A      H
#  3:  6wps_AH    LYS    PHE    356    106      A      H
#  4:  6wps_AH    THR    SER    345    109      A      H
#  5:  6wps_AH    THR    ILE    345    111      A      H
#  6:  6wps_AH    ASN    TYR    343    100      A      H
#  7:  6wps_AH    GLU    GLY    340    103      A      H
#  8:  6wps_AH    GLU    PHE    340    106      A      H
#  9:  6wps_AH    GLU    LEU    340    110      A      H
# 10:  6wps_AH    PRO    TRP    337    105      A      H
# 11:  6wps_AH    PRO    PHE    337    106      A      H
# 12:  6wps_AH    LEU    PRO    335     28      A      H
# 13:  6wps_AH    ASN    ILE    343    111      A      H
# 14:  6wps_AH    GLU    TRP    340    105      A      H
# 15:  6wps_AH    GLU    ALA    340    104      A      H
# 16:  6wps_AH    ASN    THR    334     30      A      H

顺便说一句:您可能更喜欢保持数据的长格式(使用melt),而不是将其重新塑造为宽格式。许多分析过程更喜欢长格式(包括ggplot2用于绘图),尽管如果您的工具集已经适应宽格式,这可能会带来一些挑战。

英文:

Here's a working approach (using data.table) that does what you need. The final column order is not preserved, but that can be handled manually using setcolorder or similar.

I'm using magrittr::%&gt;% for piping here; this is not required and can be removed in favor of data.table-piping ([...][...][...]) for much of it or the use of temporary objects.

library(data.table)
quux %&gt;%
  .[, i := .I] %&gt;%
  melt(id.vars = c(&quot;File_pdb&quot;, &quot;i&quot;), measure.vars = patterns(Res=&quot;^Res&quot;, Pos=&quot;^Pos&quot;, Chain=&quot;^Chain&quot;)) %&gt;%
  .[, last2 := strsplit(substring(File_pdb, nchar(File_pdb) - 1, nchar(File_pdb)), &quot;&quot;)] %&gt;%
  .[, num := mapply(function(nm, l2) match(nm, l2), Chain, last2) ] %&gt;%
  .[, c(&quot;variable&quot;, &quot;last2&quot;) := NULL ] %&gt;%
  dcast(File_pdb + i ~ num, value.var = c(&quot;Res&quot;, &quot;Pos&quot;, &quot;Chain&quot;), sep = &quot;&quot;) %&gt;%
  .[, i := NULL] %&gt;%
  .[]
#     File_pdb   Res1   Res2   Pos1   Pos2 Chain1 Chain2
#       &lt;char&gt; &lt;char&gt; &lt;char&gt; &lt;char&gt; &lt;char&gt; &lt;char&gt; &lt;char&gt;
#  1:  6wps_AH    CYS    TRP    361    105      A      H
#  2:  6wps_AH    LYS    GLU    356    108      A      H
#  3:  6wps_AH    LYS    PHE    356    106      A      H
#  4:  6wps_AH    THR    SER    345    109      A      H
#  5:  6wps_AH    THR    ILE    345    111      A      H
#  6:  6wps_AH    ASN    TYR    343    100      A      H
#  7:  6wps_AH    GLU    GLY    340    103      A      H
#  8:  6wps_AH    GLU    PHE    340    106      A      H
#  9:  6wps_AH    GLU    LEU    340    110      A      H
# 10:  6wps_AH    PRO    TRP    337    105      A      H
# 11:  6wps_AH    PRO    PHE    337    106      A      H
# 12:  6wps_AH    LEU    PRO    335     28      A      H
# 13:  6wps_AH    ASN    ILE    343    111      A      H
# 14:  6wps_AH    GLU    TRP    340    105      A      H
# 15:  6wps_AH    GLU    ALA    340    104      A      H
# 16:  6wps_AH    ASN    THR    334     30      A      H

Side note: you might prefer to keep your data in the long (melted) format instead of reshaping back to the wide format. Many analytic processes prefer long (including ggplot2 for plotting), though it does come with some challenges if your toolsets are already set for wide format.

huangapple
  • 本文由 发表于 2023年6月9日 00:08:41
  • 转载请务必保留本文链接:https://go.coder-hub.com/76433810.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定