如何根据特定列自动对数据框进行子集化并存储在单独的数据框中,在R中。

huangapple go评论71阅读模式
英文:

How to automatically subset a dataframe based on certain columns and store in separate dfs in R

问题

我正在使用R处理一个数据集。这个数据集包含一些值列和一些城市列,每个城市作为虚拟变量(0和1)。数据集如下:

df <- data.frame(A=c(1,2,2,3,4,5,1,1,2,3,4,4),
        B=c(4,4,2,3,4,2,1,5,2,2,5,1),
        C=c(rep(0:1, each=3, times=2)),
        D=round(rnorm(12, mean=50, sd=10), 2),
        City1=c(rep(0:1, each=6)),
        City2=c(rep(c(1, 0), c(6,6)))

上述数据集是一个原型。实际数据集中的"City"变量的数量不定,有时一个数据集有2个"City"列,有时有10个"City"列。

我想要一个解决方案,可以根据每个"City"的值创建单独的数据集。例如,代码创建一个基于列"City1"中值为"1"(而不是"0")的数据集,并将其存储在名为"City1"的数据框中。然后,继续处理"City2"列,创建一个基于列"City2"中值为"1"(而不是"0")的数据集,并将其存储在名为"City2"的单独数据框中,以此类推。

我知道有一些代码可以完成这个任务,但这样的话,我每次都要根据"City"变量的名称编写代码,而且每个数据集中的城市数量也不一样。

df1 <- df[df$City1==1,]
df2 <- df[df$City2==1,]

有人可以帮助我解决这个问题吗?提前感谢您。

英文:

I am working with a dataset in R. This dataset contains some columns of values and some columns of cities, each city as dummy variable (0 and 1). The dataset is something like this:

    df&lt;-data.frame(A=c(1,2,2,3,4,5,1,1,2,3,4,4),
    		B=c(4,4,2,3,4,2,1,5,2,2,5,1) ,
    		C=c(rep(0:1, each=3, times=2)),
    		D=round(rnorm(12, mean=50, sd=10), 2) ,
    		City1=c(rep(0:1, each=6)),
    		City2=c(rep(c(1, 0), c(6,6))))

The above dataset is a prototype. The real datasets have varying number of "City" variables, i.e. sometimes a dataset has 2 "City" column, sometimes it has 10 "City" column.
I want a solution that I can create separate datasets based on the values of each "City". For example, the codes creates a dataset based on the "1" values (not "0" values) in column "City1" and store in a dataframe with the name of "City1". Then, goes to the column "City2" and creates a dataset based on the "1" values (not "0" values) in column "City2" and store in a separate dataframe with the name of "City2". And so on.

I know that some codes like the below can do the job, but in this way I have to write the codes each time based on the name of "City" variables, and also the number of cities are varying in each dataset.

df1 &lt;- df[df$City1==1,]
df2 &lt;- df[df$City2==1,]

Does anybody can kindly help me in this problem?
Thank you in advance.

答案1

得分: 3

Identify city columns, then loop through them and split:

#cc &lt;- which(grepl(&quot;^City&quot;, colnames(df)))
# when cities start on 4th column.
cc &lt;- 4:ncol(df)
lapply(cc, function(i){ split(df[, -cc], df[ i ]) })

Edit: to output list as separate dataframes into the environment, we need to name the list items then use list2env:

result &lt;- unlist(
  lapply(cc, function(i){ split(df[, -cc], df[ i ]) }),
  recursive = FALSE)
# make unique names
names(result) &lt;- make.names(names(result), unique = TRUE)

list2env(result, globalenv())
英文:

Identify city columns, then loop through them and split:

#cc &lt;- which(grepl(&quot;^City&quot;, colnames(df)))
# when cities start on 4th column.
cc &lt;- 4:ncol(df)
lapply(cc, function(i){ split(df[, -cc], df[ i ]) })

Edit: to output list as separate dataframes into the environment, we need to name the list items then use list2env:

result &lt;- unlist(
  lapply(cc, function(i){ split(df[, -cc], df[ i ]) }),
  recursive = FALSE)
# make unique names
names(result) &lt;- make.names(names(result), unique = TRUE)

list2env(result, globalenv())

答案2

得分: 2

以下是代码的翻译部分:

这里是使用purrr::maprlang::bind_env的一种方法。这会在全局环境中创建df1df2,请注意不要覆盖现有对象!如果你只想要一个data.frame的列表,那么只需使用map

library(purrr)
library(rlang)

grep("City", names(df), value = TRUE) %>%
  set_names() %>%
  map(~ df[df[[.x]] == 1, ]) %>%
  env_bind(.GlobalEnv, !!! .)

来自OP的数据:

df <- data.frame(A = c(1,2,2,3,4,5,1,1,2,3,4,4),
                 B = c(4,4,2,3,4,2,1,5,2,2,5,1),
                 C = c(rep(0:1, each=3, times=2)),
                 D = round(rnorm(12, mean=50, sd=10), 2),
                 City1 = c(rep(0:1, each=6)),
                 City2 = c(rep(c(1, 0), c(6,6)))
                 )

创建于2023年03月07日,使用reprex包 (v2.0.1)。

英文:

Here is one approach using purrr::map and rlang::bind_env. This creates df1 and df2 in the global environment, watch out to not overwrite existing objects! If you just want a list of data.frames then just stop with map.

library(purrr)
library(rlang)

grep(&quot;City&quot;, names(df), value = TRUE) %&gt;% 
  set_names() %&gt;% 
  map(~ df[df[[.x]] == 1, ]) %&gt;% 
  env_bind(.GlobalEnv, !!! .)

Data from OP

df &lt;- data.frame(A = c(1,2,2,3,4,5,1,1,2,3,4,4),
                 B = c(4,4,2,3,4,2,1,5,2,2,5,1),
                 C = c(rep(0:1, each=3, times=2)),
                 D = round(rnorm(12, mean=50, sd=10), 2),
                 City1 = c(rep(0:1, each=6)),
                 City2 = c(rep(c(1, 0), c(6,6)))
                 )

<sup>Created on 2023-03-07 by the reprex package (v2.0.1)</sup>

答案3

得分: 1

你可以粘贴列,然后分割

Citys <- startsWith(colnames(df), "City")
split(df, do.call("paste", df[Citys]))

或者,使用pivot_longer

library(tidyr)
library(dplyr)
df %>%
  pivot_longer(starts_with("City"), names_to = "Cities") %>%
  filter(value == 1) %>%
  split(.$Cities)

如果你想将列表转换为多个数据框在你的全局环境中使用list2env(your_list, .GlobalEnv)

英文:

You can paste the columns and then split:

Citys &lt;- startsWith(colnames(df), &quot;City&quot;)
split(df, do.call(&quot;paste&quot;, df[Citys]))

Or, with pivot_longer:

library(tidyr)
library(dplyr)
df %&gt;% 
  pivot_longer(starts_with(&quot;City&quot;), names_to = &quot;Cities&quot;) %&gt;% 
  filter(value == 1) %&gt;% 
  split(.$Cities)

Use list2env(your_list, .GlobalEnv) if you want to convert the list into multiple data frames in your global environment.

答案4

得分: 0

以下是翻译好的内容:

You can subset df for the columns starting with City using startsWith, test them if they are equal 1 == 1 and get the column where this is the case with max.col. Paste df infront of the column ans use this to split df. Use list2env to get the data.frames in the global environment.

list2env(split(df, paste0("df", max.col(df[startsWith(names(df), "City")] == 1))), globalenv())

df1
#   A B C     D City1 City2
#7  1 1 0 65.30     1     0
#8  1 5 0 45.81     1     0
#9  2 2 0 43.37     1     0
#10 3 2 1 55.14     1     0
#11 4 5 1 59.21     1     0
#12 4 1 1 50.55     1     0

df2
#  A B C     D City1 City2
#1 1 4 0 62.32     0     1
#2 2 4 0 45.78     0     1
#3 2 2 0 54.80     0     1
#4 3 3 1 44.96     0     1
#5 4 4 1 61.42     0     1
#6 5 2 1 51.26     0     1

In case to keep it in a list and assuming City is only coded with 0 or 1 you can try:

split(df, max.col(df[startsWith(names(df), "City")))

Or using lapply and subset df.

lapply(df[startsWith(names(df), "City")], \(i) df[i == 1,])

Benchmark

bench::mark(check = FALSE,
zx8754 = {cc &lt;- which(grepl("^City", colnames(df)))  #Returns something different
  lapply(cc, function(i){ split(df[, -cc], df[ i ]) })},
TimTeaFan = {grep("City", names(df), value = TRUE) %>% 
  set_names() %>% 
  map(~ df[df[.x] == 1, ])},
Maël = split(df, do.call("paste", df[startsWith(colnames(df), "City")])),
GKi = split(df,  max.col(df[startsWith(names(df), "City"))),
GKi2 = lapply(df[startsWith(names(df), "City")], \(i) df[i == 1,])
)
#  expression      min median itr/s…¹ mem_al…² gc/se…³ n_itr  n_gc total…⁴ result
#  <bch:expr> <bch:tm> <bch:>   <dbl> <bch:by>   <dbl> <int> <dbl> <bch:t> <list>
#1 zx8754        495μs  548μs   1621.  11.27KB    10.3   788     5   486ms <NULL>
#2 TimTeaFan     226μs  247μs   3863.       0B    12.3  1877     6   486ms <NULL>
#3 Maël          250μs  264μs   3754.       0B    12.3  1824     6   486ms <NULL>
#4 GKi           302μs  321μs   3051.     240B    12.4  1480     6   485ms <NULL>
#5 GKi2          161μs  177μs   5575.   6.36KB    14.5  2694     7   483ms <NULL>
英文:

You can subset df for the columns starting with City using startsWith, test them if they are equal 1 == 1 and get the column where this is the case with max.col. Paste df infront of the column ans use this to split df. Use list2env to get the data.frames in the global environment.

list2env(split(df, paste0(&quot;df&quot;, max.col(df[startsWith(names(df), &quot;City&quot;)] ==
                                        1))), globalenv())

df1
#   A B C     D City1 City2
#7  1 1 0 65.30     1     0
#8  1 5 0 45.81     1     0
#9  2 2 0 43.37     1     0
#10 3 2 1 55.14     1     0
#11 4 5 1 59.21     1     0
#12 4 1 1 50.55     1     0

df2
#  A B C     D City1 City2
#1 1 4 0 62.32     0     1
#2 2 4 0 45.78     0     1
#3 2 2 0 54.80     0     1
#4 3 3 1 44.96     0     1
#5 4 4 1 61.42     0     1
#6 5 2 1 51.26     0     1

In case to keep it in a list and assuming City is only coded with 0 or 1 you can try:

split(df, max.col(df[startsWith(names(df), &quot;City&quot;)]))

Or using lapply and subset df.

lapply(df[startsWith(names(df), &quot;City&quot;)], \(i) df[i==1,])

Benchmark

bench::mark(check = FALSE,
zx8754 = {cc &lt;- which(grepl(&quot;^City&quot;, colnames(df)))  #Returns something different
  lapply(cc, function(i){ split(df[, -cc], df[ i ]) })},
TimTeaFan = {grep(&quot;City&quot;, names(df), value = TRUE) %&gt;% 
  set_names() %&gt;% 
  map(~ df[df[[.x]] == 1, ])},
Ma&#235;l = split(df, do.call(&quot;paste&quot;, df[startsWith(colnames(df), &quot;City&quot;)])),
GKi = split(df,  max.col(df[startsWith(names(df), &quot;City&quot;)])),
GKi2 = lapply(df[startsWith(names(df), &quot;City&quot;)], \(i) df[i==1,])
)
#  expression      min median itr/s…&#185; mem_al…&#178; gc/se…&#179; n_itr  n_gc total…⁴ result
#  &lt;bch:expr&gt; &lt;bch:tm&gt; &lt;bch:&gt;   &lt;dbl&gt; &lt;bch:by&gt;   &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;bch:t&gt; &lt;list&gt;
#1 zx8754        495&#181;s  548&#181;s   1621.  11.27KB    10.3   788     5   486ms &lt;NULL&gt;
#2 TimTeaFan     226&#181;s  247&#181;s   3863.       0B    12.3  1877     6   486ms &lt;NULL&gt;
#3 Ma&#235;l          250&#181;s  264&#181;s   3754.       0B    12.3  1824     6   486ms &lt;NULL&gt;
#4 GKi           302&#181;s  321&#181;s   3051.     240B    12.4  1480     6   485ms &lt;NULL&gt;
#5 GKi2          161&#181;s  177&#181;s   5575.   6.36KB    14.5  2694     7   483ms &lt;NULL&gt;

huangapple
  • 本文由 发表于 2023年3月7日 16:34:25
  • 转载请务必保留本文链接:https://go.coder-hub.com/75659592.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定