How to automatically subset a dataframe based on certain columns and store in separate dfs in R
Question
I am working with a dataset in R. It contains some value columns and some city columns, with each city encoded as a dummy variable (0 and 1). The dataset looks something like this:
df<-data.frame(A=c(1,2,2,3,4,5,1,1,2,3,4,4),
B=c(4,4,2,3,4,2,1,5,2,2,5,1) ,
C=c(rep(0:1, each=3, times=2)),
D=round(rnorm(12, mean=50, sd=10), 2) ,
City1=c(rep(0:1, each=6)),
City2=c(rep(c(1, 0), c(6,6))))
The above dataset is a prototype. The real datasets have a varying number of "City" variables: sometimes a dataset has 2 "City" columns, sometimes 10.
I want a solution that creates separate datasets based on the values of each "City" column. For example, the code creates a dataset from the rows with "1" (not "0") in column "City1" and stores it in a data frame named "City1". Then it moves on to column "City2", creates a dataset from the rows with "1" (not "0") in that column, and stores it in a separate data frame named "City2". And so on.
I know that code like the following does the job, but this way I have to rewrite the code each time based on the names of the "City" variables, and the number of cities also varies between datasets.
df1 <- df[df$City1==1,]
df2 <- df[df$City2==1,]
Can anybody kindly help me with this problem?
Thank you in advance.
Answer 1
Score: 3
Identify city columns, then loop through them and split:
#cc <- which(grepl("^City", colnames(df)))
# when cities start on 4th column.
cc <- 4:ncol(df)
lapply(cc, function(i){ split(df[, -cc], df[ i ]) })
Edit: to output the list as separate data frames into the environment, name the list items and then use list2env:
result <- unlist(
lapply(cc, function(i){ split(df[, -cc], df[ i ]) }),
recursive = FALSE)
# make unique names
names(result) <- make.names(names(result), unique = TRUE)
list2env(result, globalenv())
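If only the rows where a city equals 1 are wanted, and each result should carry the name of its city column, a minimal base R sketch could look like the following. It assumes every dummy column name starts with "City"; the fixed seed and the names city_cols and by_city are hypothetical and used only for illustration.

```r
# Example data from the question; the seed is only for reproducibility
set.seed(1)
df <- data.frame(A = c(1,2,2,3,4,5,1,1,2,3,4,4),
                 B = c(4,4,2,3,4,2,1,5,2,2,5,1),
                 C = rep(0:1, each = 3, times = 2),
                 D = round(rnorm(12, mean = 50, sd = 10), 2),
                 City1 = rep(0:1, each = 6),
                 City2 = rep(c(1, 0), c(6, 6)))

# Assumption: every dummy column name starts with "City"
city_cols <- grep("^City", names(df), value = TRUE)

# One data frame per city, keeping only the rows flagged with 1;
# setNames() makes the list items carry the city column names
by_city <- lapply(setNames(city_cols, city_cols),
                  function(cl) df[df[[cl]] == 1, ])

names(by_city)
sapply(by_city, nrow)
```

Because the list is already named after the city columns, list2env(by_city, globalenv()) would then create objects called City1, City2, and so on.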
Answer 2
Score: 2
Here is one approach using purrr::map and rlang::env_bind. This creates City1 and City2 in the global environment, so take care not to overwrite existing objects! If you just want a list of data.frames, stop after map.
library(purrr)
library(rlang)
grep("City", names(df), value = TRUE) %>%
set_names() %>%
map(~ df[df[[.x]] == 1, ]) %>%
env_bind(.GlobalEnv, !!! .)
Data from OP
df <- data.frame(A = c(1,2,2,3,4,5,1,1,2,3,4,4),
B = c(4,4,2,3,4,2,1,5,2,2,5,1),
C = c(rep(0:1, each=3, times=2)),
D = round(rnorm(12, mean=50, sd=10), 2),
City1 = c(rep(0:1, each=6)),
City2 = c(rep(c(1, 0), c(6,6)))
)
Created on 2023-03-07 by the reprex package (v2.0.1)
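To avoid overwriting objects in the global environment, the same kind of list can be bound into a separate environment instead. This is a base R sketch rather than part of the original answer; city_env and the small example data frame are hypothetical.

```r
# Small hypothetical data frame with two city dummy columns
df <- data.frame(A = 1:4,
                 City1 = c(1, 1, 0, 0),
                 City2 = c(0, 0, 1, 1))

city_cols <- grep("^City", names(df), value = TRUE)
by_city <- lapply(setNames(city_cols, city_cols),
                  function(cl) df[df[[cl]] == 1, ])

# Bind the list items into a fresh environment rather than .GlobalEnv
city_env <- list2env(by_city, envir = new.env())

ls(city_env)
city_env$City1
```

The data frames are then reachable as city_env$City1 or get("City2", envir = city_env), and nothing in the workspace is replaced.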
Answer 3
Score: 1
You can paste the columns and then split:
Citys <- startsWith(colnames(df), "City")
split(df, do.call("paste", df[Citys]))
Or, with pivot_longer:
library(tidyr)
library(dplyr)
df %>%
pivot_longer(starts_with("City"), names_to = "Cities") %>%
filter(value == 1) %>%
split(.$Cities)
Use list2env(your_list, .GlobalEnv) if you want to convert the list into multiple data frames in your global environment.
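Splitting on the pasted columns and subsetting per city behave differently when a row has more than one city set to 1. The sketch below uses a hypothetical three-row data frame to show this: split on the pasted columns creates one group per combination of values, whereas subsetting per column places such a row in every matching city.

```r
# Hypothetical data: row 3 belongs to both cities
df <- data.frame(A = 1:3,
                 City1 = c(1, 0, 1),
                 City2 = c(0, 1, 1))

city_flags <- startsWith(colnames(df), "City")

# One group per combination of dummy values: "0 1", "1 0", "1 1"
by_combo <- split(df, do.call("paste", df[city_flags]))
names(by_combo)

# One group per city; row 3 appears in both City1 and City2
by_city <- lapply(df[city_flags], function(flag) df[flag == 1, ])
sapply(by_city, nrow)
```

With data like the question's, where each row has exactly one city, both approaches produce the same partition of the rows and differ only in the list names.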
Answer 4
Score: 0
You can subset df to the columns starting with City using startsWith, test whether they are equal to 1 with == 1, and get the column where this is the case with max.col. Paste "df" in front of the column index and use this to split df. Use list2env to put the data.frames into the global environment.
list2env(split(df, paste0("df", max.col(df[startsWith(names(df), "City")] == 1))), globalenv())
df1
# A B C D City1 City2
#7 1 1 0 65.30 1 0
#8 1 5 0 45.81 1 0
#9 2 2 0 43.37 1 0
#10 3 2 1 55.14 1 0
#11 4 5 1 59.21 1 0
#12 4 1 1 50.55 1 0
df2
# A B C D City1 City2
#1 1 4 0 62.32 0 1
#2 2 4 0 45.78 0 1
#3 2 2 0 54.80 0 1
#4 3 3 1 44.96 0 1
#5 4 4 1 61.42 0 1
#6 5 2 1 51.26 0 1
To keep the result in a list instead, and assuming City is coded only as 0 or 1, you can try:
split(df, max.col(df[startsWith(names(df), "City")]))
Or use lapply and subset df:
lapply(df[startsWith(names(df), "City")], \(i) df[i==1,])
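One caveat with the max.col approach: by default max.col breaks ties at random (ties.method = "random"), so a row whose City columns are all 0, or that has several 1s, is assigned to an arbitrary column. The sketch below, on a hypothetical data frame, passes ties.method = "first" to make the assignment deterministic. The result is still only meaningful if each row belongs to exactly one city.

```r
# Hypothetical data: row 3 has both cities set, row 4 has none
df <- data.frame(A = 1:4,
                 City1 = c(1, 0, 1, 0),
                 City2 = c(0, 1, 1, 0))

flags <- df[startsWith(names(df), "City")] == 1

# Deterministic tie handling: the tied rows 3 and 4 both go to column 1
idx <- max.col(flags, ties.method = "first")
idx

groups <- split(df, paste0("df", idx))
names(groups)
```

The lapply variant does not have this issue, since it tests each city column separately.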
Benchmark
bench::mark(check = FALSE,
zx8754 = {cc <- which(grepl("^City", colnames(df))) #Returns something different
lapply(cc, function(i){ split(df[, -cc], df[ i ]) })},
TimTeaFan = {grep("City", names(df), value = TRUE) %>%
set_names() %>%
map(~ df[df[[.x]] == 1, ])},
Maël = split(df, do.call("paste", df[startsWith(colnames(df), "City")])),
GKi = split(df, max.col(df[startsWith(names(df), "City")])),
GKi2 = lapply(df[startsWith(names(df), "City")], \(i) df[i==1,])
)
# expression min median itr/s…¹ mem_al…² gc/se…³ n_itr n_gc total…⁴ result
# <bch:expr> <bch:tm> <bch:> <dbl> <bch:by> <dbl> <int> <dbl> <bch:t> <list>
#1 zx8754 495µs 548µs 1621. 11.27KB 10.3 788 5 486ms <NULL>
#2 TimTeaFan 226µs 247µs 3863. 0B 12.3 1877 6 486ms <NULL>
#3 Maël 250µs 264µs 3754. 0B 12.3 1824 6 486ms <NULL>
#4 GKi 302µs 321µs 3051. 240B 12.4 1480 6 485ms <NULL>
#5 GKi2 161µs 177µs 5575. 6.36KB 14.5 2694 7 483ms <NULL>