How do I write a loop to manipulate dataframes in R?

Question

I am trying to build one large table of results from a large set of data frames in R. I have over 1000 tables to combine, so a loop or at least a function is necessary to do this efficiently, but I can't get it working properly. I'm not sure whether plyr::adply would be more efficient, and I'm less familiar with that style of scripting.

The tables must be imported into R, cut down to two columns, filtered, and then the resulting column needs to be appended to a growing table. The outcome would be a table with 1000 columns, one column from each data frame I am importing.

This is my manual adjustment to the file that I am trying to put into loop/function form:

#import files
x <- read.delim("~/maf_files/patient1.maf", comment.char="#")
#cut down to three relevant columns
df <- x[,c(1,9,16)]
#filter for Missense Mutations only
df2 <- df[df$Variant_Classification == "Missense_Mutation",]

library(tidyverse)

#filter for unique gene names
df3<-df2 %>% 
  group_by(Hugo_Symbol) %>% 
  filter(n() == 1)

#change column names
colnames(df3)[3] <- "patient1"
#remove Missense Column
df3$Variant_Classification <- NULL
#change column values to 1
df3$patient1 <- 1

#append column to existing table - should have 1000 columns
test <- full_join(test, df3, by = "Hugo_Symbol")

This is what I tried to do first, making a function that uses all of these steps:

read_maf <- function(file){
x <- read.delim("~/maf_files/file.maf", comment.char="#")
df <- x[,c(1,9,16)]
df2 <- df[df$Variant_Classification == "Missense_Mutation",]
df3<-df2 %>%
group_by(Hugo_Symbol) %>%
filter(n() == 1)

colnames(df3)[3] <- "patient1"
df3$Variant_Classification <- NULL
df3$patient1 <- 1
test <- full_join(test, df3, by = "Hugo_Symbol")

})

Answer 1

Score: 0

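Try something like the code below. It is your function, which reads and adjusts each individual file. To bind the results into a single data frame, you can use purrr::map_dfr, which stacks them by row. list_of_files should be a character vector of all the files you want to read and bind, like so:

library(dplyr)
library(glue)

# file names inside ~/maf_files
list_of_files <- c('path1.maf', 'path2.maf')

read_maf <- function(file){
  # build the full path from the file name
  x <- read.delim(glue("~/maf_files/{file}"), comment.char="#")
  # keep the three relevant columns
  df <- x[, c(1, 9, 16)]
  # keep missense mutations only
  df2 <- df[df$Variant_Classification == "Missense_Mutation", ]
  # keep genes that appear exactly once
  df3 <- df2 %>%
    group_by(Hugo_Symbol) %>%
    filter(n() == 1)

  colnames(df3)[3] <- "patient1"
  df3$Variant_Classification <- NULL
  df3$patient1 <- 1
  return(df3)
}

big_frame <- purrr::map_dfr(list_of_files, read_maf)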
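The question asks for one column per patient rather than stacked rows, so here is a rough sketch of that variant. It assumes dplyr and purrr are installed, selects columns by name instead of by position, and names each column after its file. read_maf_wide, the demo directory, and the toy data are all made up for illustration.

```r
library(dplyr)
library(purrr)

# Hypothetical helper: returns Hugo_Symbol plus one indicator column that is
# named after the file (e.g. "patient1" for ".../patient1.maf").
read_maf_wide <- function(path) {
  patient <- tools::file_path_sans_ext(basename(path))
  maf <- read.delim(path, comment.char = "#", stringsAsFactors = FALSE)

  kept <- maf %>%
    filter(Variant_Classification == "Missense_Mutation") %>%
    group_by(Hugo_Symbol) %>%
    filter(n() == 1) %>%   # keep genes that occur exactly once
    ungroup()

  out <- data.frame(Hugo_Symbol = kept$Hugo_Symbol, stringsAsFactors = FALSE)
  out[[patient]] <- rep(1, nrow(out))
  out
}

# Toy input so the sketch runs on its own: two tiny tab-separated files.
maf_dir <- file.path(tempdir(), "maf_join_demo")
dir.create(maf_dir, showWarnings = FALSE)
write.table(
  data.frame(
    Hugo_Symbol = c("TP53", "KRAS", "KRAS", "BRCA1"),
    Variant_Classification = c("Missense_Mutation", "Missense_Mutation",
                               "Missense_Mutation", "Silent")
  ),
  file.path(maf_dir, "patient1.maf"),
  sep = "\t", quote = FALSE, row.names = FALSE
)
write.table(
  data.frame(
    Hugo_Symbol = c("TP53", "EGFR"),
    Variant_Classification = c("Missense_Mutation", "Missense_Mutation")
  ),
  file.path(maf_dir, "patient2.maf"),
  sep = "\t", quote = FALSE, row.names = FALSE
)

files <- list.files(maf_dir, pattern = "\\.maf$", full.names = TRUE)

# Read every file, then full-join the per-patient tables pairwise on the gene.
wide <- files %>%
  map(read_maf_wide) %>%
  reduce(function(a, b) full_join(a, b, by = "Hugo_Symbol"))

print(wide)
```

Genes missing from a patient's file come out as NA in that patient's column. With 1000 files the pairwise joins may get slow, so treat this as a starting point and not a tuned solution.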


Answer 2

Score: 0

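Several things are going on here. First, your function always returns the same result because the file argument is never used inside it. Assuming file is meant to be the file location, change read.delim("~/maf_files/file.maf", comment.char="#") to read.delim(file, comment.char="#") inside the function. I would also add return(test) at the end of the function. Check that it works with read_maf("~/maf_files/file.maf"); it should return the same data frame as your first code.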

Second, you need a vector with the location of each file you are going to loop over. You can get it with the following code.

files <- list.files("~/maf_files", pattern = "\\.maf$", full.names = TRUE)
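One thing worth knowing: the pattern argument of list.files is a regular expression, so a bare ".maf" matches any character followed by "maf" anywhere in the name. The sketch below uses only base R to show the difference, and also how a patient label could be derived from each path. The demo directory and file names are hypothetical.

```r
# Toy directory so the sketch runs on its own (hypothetical file names).
maf_dir <- file.path(tempdir(), "maf_list_demo")
dir.create(maf_dir, showWarnings = FALSE)
file.create(file.path(maf_dir, c("patient1.maf", "patient2.maf", "notes.maf.txt")))

# Unescaped pattern: "." matches any character and the match is not anchored,
# so "notes.maf.txt" is picked up as well.
loose <- list.files(maf_dir, pattern = ".maf")

# Escaping the dot and anchoring with "$" keeps only names ending in ".maf".
files <- list.files(maf_dir, pattern = "\\.maf$", full.names = TRUE)

# Derive a label from each path, e.g. for naming one output column per patient.
patients <- tools::file_path_sans_ext(basename(files))

print(loose)
print(basename(files))
print(patients)
```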

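Once the function works properly and you have the file locations, apply the function to each location. Since you want everything in one data frame, I recommend purrr::map_dfr(). It does the same as lapply() but binds the results by row into one data frame at the end. Something like this:

final_df <- purrr::map_dfr(files, read_maf)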
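To make the row-binding behaviour concrete, here is a small sketch on in-memory data, assuming dplyr and purrr are installed. fake_read is a hypothetical stand-in for read_maf. Note that map_dfr() is marked as superseded from purrr 1.0.0 on, in favour of map() followed by list_rbind(), but it is still available.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for read_maf(): returns one small data frame per call.
# The argument is unused here; a real reader would open the file it names.
fake_read <- function(name) {
  data.frame(Hugo_Symbol = c("TP53", "EGFR"), value = 1,
             stringsAsFactors = FALSE)
}

files <- c("patient1", "patient2")

# map_dfr() calls the function on each element and row-binds the results.
stacked <- map_dfr(files, fake_read)

# The same thing spelled out with lapply() and bind_rows().
stacked_base <- bind_rows(lapply(files, fake_read))

# Naming the input and passing .id records which element each row came from.
labelled <- map_dfr(set_names(files), fake_read, .id = "patient")

dim(stacked)               # 4 rows, 2 columns: rows stack, no columns are added
unique(labelled$patient)   # "patient1" "patient2"
```

The .id column is one way to keep track of the source file when every per-file result has the same column names.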

  • Posted by huangapple on June 8, 2023 at 02:03:22
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76425967.html