英文:
How to automatically subset a dataframe based on certain columns and store in separate dfs in R
问题
我正在使用R处理一个数据集。这个数据集包含一些值列和一些城市列,每个城市作为虚拟变量(0和1)。数据集如下:
df <- data.frame(A=c(1,2,2,3,4,5,1,1,2,3,4,4),
B=c(4,4,2,3,4,2,1,5,2,2,5,1),
C=c(rep(0:1, each=3, times=2)),
D=round(rnorm(12, mean=50, sd=10), 2),
City1=c(rep(0:1, each=6)),
City2=c(rep(c(1, 0), c(6,6)))
上述数据集是一个原型。实际数据集中的"City"变量的数量不定,有时一个数据集有2个"City"列,有时有10个"City"列。
我想要一个解决方案,可以根据每个"City"的值创建单独的数据集。例如,代码创建一个基于列"City1"中值为"1"(而不是"0")的数据集,并将其存储在名为"City1"的数据框中。然后,继续处理"City2"列,创建一个基于列"City2"中值为"1"(而不是"0")的数据集,并将其存储在名为"City2"的单独数据框中,以此类推。
我知道有一些代码可以完成这个任务,但这样的话,我每次都要根据"City"变量的名称编写代码,而且每个数据集中的城市数量也不一样。
df1 <- df[df$City1==1,]
df2 <- df[df$City2==1,]
有人可以帮助我解决这个问题吗?提前感谢您。
英文:
I am working with a dataset in R. This dataset contains some columns of values and some columns of cities, each city as dummy variable (0 and 1). The dataset is something like this:
df<-data.frame(A=c(1,2,2,3,4,5,1,1,2,3,4,4),
B=c(4,4,2,3,4,2,1,5,2,2,5,1) ,
C=c(rep(0:1, each=3, times=2)),
D=round(rnorm(12, mean=50, sd=10), 2) ,
City1=c(rep(0:1, each=6)),
City2=c(rep(c(1, 0), c(6,6))))
The above dataset is a prototype. The real datasets have varying number of "City" variables, i.e. sometimes a dataset has 2 "City" column, sometimes it has 10 "City" column.
I want a solution that I can create separate datasets based on the values of each "City". For example, the codes creates a dataset based on the "1" values (not "0" values) in column "City1" and store in a dataframe with the name of "City1". Then, goes to the column "City2" and creates a dataset based on the "1" values (not "0" values) in column "City2" and store in a separate dataframe with the name of "City2". And so on.
I know that some codes like the below can do the job, but in this way I have to write the codes each time based on the name of "City" variables, and also the number of cities are varying in each dataset.
df1 <- df[df$City1==1,]
df2 <- df[df$City2==1,]
Does anybody can kindly help me in this problem?
Thank you in advance.
答案1
得分: 3
Identify city columns, then loop through them and split:
#cc <- which(grepl("^City", colnames(df)))
# when cities start on 4th column.
cc <- 4:ncol(df)
lapply(cc, function(i){ split(df[, -cc], df[ i ]) })
Edit: to output list as separate dataframes into the environment, we need to name the list items then use list2env:
result <- unlist(
lapply(cc, function(i){ split(df[, -cc], df[ i ]) }),
recursive = FALSE)
# make unique names
names(result) <- make.names(names(result), unique = TRUE)
list2env(result, globalenv())
英文:
Identify city columns, then loop through them and split:
#cc <- which(grepl("^City", colnames(df)))
# when cities start on 4th column.
cc <- 4:ncol(df)
lapply(cc, function(i){ split(df[, -cc], df[ i ]) })
Edit: to output list as separate dataframes into the environment, we need to name the list items then use list2env:
result <- unlist(
lapply(cc, function(i){ split(df[, -cc], df[ i ]) }),
recursive = FALSE)
# make unique names
names(result) <- make.names(names(result), unique = TRUE)
list2env(result, globalenv())
答案2
得分: 2
以下是代码的翻译部分:
这里是使用purrr::map和rlang::bind_env的一种方法。这会在全局环境中创建df1和df2,请注意不要覆盖现有对象!如果你只想要一个data.frame的列表,那么只需使用map。
library(purrr)
library(rlang)
grep("City", names(df), value = TRUE) %>%
set_names() %>%
map(~ df[df[[.x]] == 1, ]) %>%
env_bind(.GlobalEnv, !!! .)
来自OP的数据:
df <- data.frame(A = c(1,2,2,3,4,5,1,1,2,3,4,4),
B = c(4,4,2,3,4,2,1,5,2,2,5,1),
C = c(rep(0:1, each=3, times=2)),
D = round(rnorm(12, mean=50, sd=10), 2),
City1 = c(rep(0:1, each=6)),
City2 = c(rep(c(1, 0), c(6,6)))
)
创建于2023年03月07日,使用reprex包 (v2.0.1)。
英文:
Here is one approach using purrr::map and rlang::bind_env. This creates df1 and df2 in the global environment, watch out to not overwrite existing objects! If you just want a list of data.frames then just stop with map.
library(purrr)
library(rlang)
grep("City", names(df), value = TRUE) %>%
set_names() %>%
map(~ df[df[[.x]] == 1, ]) %>%
env_bind(.GlobalEnv, !!! .)
Data from OP
df <- data.frame(A = c(1,2,2,3,4,5,1,1,2,3,4,4),
B = c(4,4,2,3,4,2,1,5,2,2,5,1),
C = c(rep(0:1, each=3, times=2)),
D = round(rnorm(12, mean=50, sd=10), 2),
City1 = c(rep(0:1, each=6)),
City2 = c(rep(c(1, 0), c(6,6)))
)
<sup>Created on 2023-03-07 by the reprex package (v2.0.1)</sup>
答案3
得分: 1
你可以粘贴列,然后分割:
Citys <- startsWith(colnames(df), "City")
split(df, do.call("paste", df[Citys]))
或者,使用pivot_longer:
library(tidyr)
library(dplyr)
df %>%
pivot_longer(starts_with("City"), names_to = "Cities") %>%
filter(value == 1) %>%
split(.$Cities)
如果你想将列表转换为多个数据框在你的全局环境中使用list2env(your_list, .GlobalEnv)。
英文:
You can paste the columns and then split:
Citys <- startsWith(colnames(df), "City")
split(df, do.call("paste", df[Citys]))
Or, with pivot_longer:
library(tidyr)
library(dplyr)
df %>%
pivot_longer(starts_with("City"), names_to = "Cities") %>%
filter(value == 1) %>%
split(.$Cities)
Use list2env(your_list, .GlobalEnv) if you want to convert the list into multiple data frames in your global environment.
答案4
得分: 0
以下是翻译好的内容:
You can subset df for the columns starting with City using startsWith, test them if they are equal 1 == 1 and get the column where this is the case with max.col. Paste df infront of the column ans use this to split df. Use list2env to get the data.frames in the global environment.
list2env(split(df, paste0("df", max.col(df[startsWith(names(df), "City")] == 1))), globalenv())
df1
# A B C D City1 City2
#7 1 1 0 65.30 1 0
#8 1 5 0 45.81 1 0
#9 2 2 0 43.37 1 0
#10 3 2 1 55.14 1 0
#11 4 5 1 59.21 1 0
#12 4 1 1 50.55 1 0
df2
# A B C D City1 City2
#1 1 4 0 62.32 0 1
#2 2 4 0 45.78 0 1
#3 2 2 0 54.80 0 1
#4 3 3 1 44.96 0 1
#5 4 4 1 61.42 0 1
#6 5 2 1 51.26 0 1
In case to keep it in a list and assuming City is only coded with 0 or 1 you can try:
split(df, max.col(df[startsWith(names(df), "City")))
Or using lapply and subset df.
lapply(df[startsWith(names(df), "City")], \(i) df[i == 1,])
Benchmark
bench::mark(check = FALSE,
zx8754 = {cc <- which(grepl("^City", colnames(df))) #Returns something different
lapply(cc, function(i){ split(df[, -cc], df[ i ]) })},
TimTeaFan = {grep("City", names(df), value = TRUE) %>%
set_names() %>%
map(~ df[df[.x] == 1, ])},
Maël = split(df, do.call("paste", df[startsWith(colnames(df), "City")])),
GKi = split(df, max.col(df[startsWith(names(df), "City"))),
GKi2 = lapply(df[startsWith(names(df), "City")], \(i) df[i == 1,])
)
# expression min median itr/s…¹ mem_al…² gc/se…³ n_itr n_gc total…⁴ result
# <bch:expr> <bch:tm> <bch:> <dbl> <bch:by> <dbl> <int> <dbl> <bch:t> <list>
#1 zx8754 495μs 548μs 1621. 11.27KB 10.3 788 5 486ms <NULL>
#2 TimTeaFan 226μs 247μs 3863. 0B 12.3 1877 6 486ms <NULL>
#3 Maël 250μs 264μs 3754. 0B 12.3 1824 6 486ms <NULL>
#4 GKi 302μs 321μs 3051. 240B 12.4 1480 6 485ms <NULL>
#5 GKi2 161μs 177μs 5575. 6.36KB 14.5 2694 7 483ms <NULL>
英文:
You can subset df for the columns starting with City using startsWith, test them if they are equal 1 == 1 and get the column where this is the case with max.col. Paste df infront of the column ans use this to split df. Use list2env to get the data.frames in the global environment.
list2env(split(df, paste0("df", max.col(df[startsWith(names(df), "City")] ==
1))), globalenv())
df1
# A B C D City1 City2
#7 1 1 0 65.30 1 0
#8 1 5 0 45.81 1 0
#9 2 2 0 43.37 1 0
#10 3 2 1 55.14 1 0
#11 4 5 1 59.21 1 0
#12 4 1 1 50.55 1 0
df2
# A B C D City1 City2
#1 1 4 0 62.32 0 1
#2 2 4 0 45.78 0 1
#3 2 2 0 54.80 0 1
#4 3 3 1 44.96 0 1
#5 4 4 1 61.42 0 1
#6 5 2 1 51.26 0 1
In case to keep it in a list and assuming City is only coded with 0 or 1 you can try:
split(df, max.col(df[startsWith(names(df), "City")]))
Or using lapply and subset df.
lapply(df[startsWith(names(df), "City")], \(i) df[i==1,])
Benchmark
bench::mark(check = FALSE,
zx8754 = {cc <- which(grepl("^City", colnames(df))) #Returns something different
lapply(cc, function(i){ split(df[, -cc], df[ i ]) })},
TimTeaFan = {grep("City", names(df), value = TRUE) %>%
set_names() %>%
map(~ df[df[[.x]] == 1, ])},
Maël = split(df, do.call("paste", df[startsWith(colnames(df), "City")])),
GKi = split(df, max.col(df[startsWith(names(df), "City")])),
GKi2 = lapply(df[startsWith(names(df), "City")], \(i) df[i==1,])
)
# expression min median itr/s…¹ mem_al…² gc/se…³ n_itr n_gc total…⁴ result
# <bch:expr> <bch:tm> <bch:> <dbl> <bch:by> <dbl> <int> <dbl> <bch:t> <list>
#1 zx8754 495µs 548µs 1621. 11.27KB 10.3 788 5 486ms <NULL>
#2 TimTeaFan 226µs 247µs 3863. 0B 12.3 1877 6 486ms <NULL>
#3 Maël 250µs 264µs 3754. 0B 12.3 1824 6 486ms <NULL>
#4 GKi 302µs 321µs 3051. 240B 12.4 1480 6 485ms <NULL>
#5 GKi2 161µs 177µs 5575. 6.36KB 14.5 2694 7 483ms <NULL>
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。


评论