2023年3月7日 16:34:25go评论94阅读模式

英文:

How to automatically subset a dataframe based on certain columns and store in separate dfs in R

问题

我正在使用R处理一个数据集。这个数据集包含一些值列和一些城市列，每个城市作为虚拟变量（0和1）。数据集如下：

df <- data.frame(A=c(1,2,2,3,4,5,1,1,2,3,4,4),
        B=c(4,4,2,3,4,2,1,5,2,2,5,1),
        C=c(rep(0:1, each=3, times=2)),
        D=round(rnorm(12, mean=50, sd=10), 2),
        City1=c(rep(0:1, each=6)),
        City2=c(rep(c(1, 0), c(6,6)))

上述数据集是一个原型。实际数据集中的"City"变量的数量不定，有时一个数据集有2个"City"列，有时有10个"City"列。

我想要一个解决方案，可以根据每个"City"的值创建单独的数据集。例如，代码创建一个基于列"City1"中值为"1"（而不是"0"）的数据集，并将其存储在名为"City1"的数据框中。然后，继续处理"City2"列，创建一个基于列"City2"中值为"1"（而不是"0"）的数据集，并将其存储在名为"City2"的单独数据框中，以此类推。

我知道有一些代码可以完成这个任务，但这样的话，我每次都要根据"City"变量的名称编写代码，而且每个数据集中的城市数量也不一样。

df1 <- df[df$City1==1,]
df2 <- df[df$City2==1,]

有人可以帮助我解决这个问题吗？提前感谢您。

英文:

I am working with a dataset in R. This dataset contains some columns of values and some columns of cities, each city as dummy variable (0 and 1). The dataset is something like this:

    df&lt;-data.frame(A=c(1,2,2,3,4,5,1,1,2,3,4,4),
    		B=c(4,4,2,3,4,2,1,5,2,2,5,1) ,
    		C=c(rep(0:1, each=3, times=2)),
    		D=round(rnorm(12, mean=50, sd=10), 2) ,
    		City1=c(rep(0:1, each=6)),
    		City2=c(rep(c(1, 0), c(6,6))))

The above dataset is a prototype. The real datasets have varying number of "City" variables, i.e. sometimes a dataset has 2 "City" column, sometimes it has 10 "City" column.
I want a solution that I can create separate datasets based on the values of each "City". For example, the codes creates a dataset based on the "1" values (not "0" values) in column "City1" and store in a dataframe with the name of "City1". Then, goes to the column "City2" and creates a dataset based on the "1" values (not "0" values) in column "City2" and store in a separate dataframe with the name of "City2". And so on.

I know that some codes like the below can do the job, but in this way I have to write the codes each time based on the name of "City" variables, and also the number of cities are varying in each dataset.

df1 &lt;- df[df$City1==1,]
df2 &lt;- df[df$City2==1,]

Does anybody can kindly help me in this problem?
Thank you in advance.

答案1

得分: 3

Identify city columns, then loop through them and split:

#cc &lt;- which(grepl(&quot;^City&quot;, colnames(df)))
# when cities start on 4th column.
cc &lt;- 4:ncol(df)
lapply(cc, function(i){ split(df[, -cc], df[ i ]) })

Edit: to output list as separate dataframes into the environment, we need to name the list items then use list2env:

result &lt;- unlist(
  lapply(cc, function(i){ split(df[, -cc], df[ i ]) }),
  recursive = FALSE)
# make unique names
names(result) &lt;- make.names(names(result), unique = TRUE)

list2env(result, globalenv())

英文:

Identify city columns, then loop through them and split:

#cc &lt;- which(grepl(&quot;^City&quot;, colnames(df)))
# when cities start on 4th column.
cc &lt;- 4:ncol(df)
lapply(cc, function(i){ split(df[, -cc], df[ i ]) })

Edit: to output list as separate dataframes into the environment, we need to name the list items then use list2env:

result &lt;- unlist(
  lapply(cc, function(i){ split(df[, -cc], df[ i ]) }),
  recursive = FALSE)
# make unique names
names(result) &lt;- make.names(names(result), unique = TRUE)

list2env(result, globalenv())

答案2

得分: 2

以下是代码的翻译部分：

这里是使用purrr::map和rlang::bind_env的一种方法。这会在全局环境中创建df1和df2，请注意不要覆盖现有对象！如果你只想要一个data.frame的列表，那么只需使用map。

library(purrr)
library(rlang)

grep("City", names(df), value = TRUE) %>%
  set_names() %>%
  map(~ df[df[[.x]] == 1, ]) %>%
  env_bind(.GlobalEnv, !!! .)

来自OP的数据：

df <- data.frame(A = c(1,2,2,3,4,5,1,1,2,3,4,4),
                 B = c(4,4,2,3,4,2,1,5,2,2,5,1),
                 C = c(rep(0:1, each=3, times=2)),
                 D = round(rnorm(12, mean=50, sd=10), 2),
                 City1 = c(rep(0:1, each=6)),
                 City2 = c(rep(c(1, 0), c(6,6)))
                 )

^{创建于2023年03月07日，使用reprex包 (v2.0.1)。}

英文:

Here is one approach using purrr::map and rlang::bind_env. This creates df1 and df2 in the global environment, watch out to not overwrite existing objects! If you just want a list of data.frames then just stop with map.

library(purrr)
library(rlang)

grep(&quot;City&quot;, names(df), value = TRUE) %&gt;% 
  set_names() %&gt;% 
  map(~ df[df[[.x]] == 1, ]) %&gt;% 
  env_bind(.GlobalEnv, !!! .)

Data from OP

df &lt;- data.frame(A = c(1,2,2,3,4,5,1,1,2,3,4,4),
                 B = c(4,4,2,3,4,2,1,5,2,2,5,1),
                 C = c(rep(0:1, each=3, times=2)),
                 D = round(rnorm(12, mean=50, sd=10), 2),
                 City1 = c(rep(0:1, each=6)),
                 City2 = c(rep(c(1, 0), c(6,6)))
                 )

<sup>Created on 2023-03-07 by the reprex package (v2.0.1)</sup>

答案3

得分: 1

你可以粘贴列，然后分割：

Citys <- startsWith(colnames(df), "City")
split(df, do.call("paste", df[Citys]))

或者，使用pivot_longer：

library(tidyr)
library(dplyr)
df %>%
  pivot_longer(starts_with("City"), names_to = "Cities") %>%
  filter(value == 1) %>%
  split(.$Cities)

如果你想将列表转换为多个数据框在你的全局环境中使用list2env(your_list, .GlobalEnv)。

英文:

You can paste the columns and then split:

Citys &lt;- startsWith(colnames(df), &quot;City&quot;)
split(df, do.call(&quot;paste&quot;, df[Citys]))

Or, with pivot_longer:

library(tidyr)
library(dplyr)
df %&gt;% 
  pivot_longer(starts_with(&quot;City&quot;), names_to = &quot;Cities&quot;) %&gt;% 
  filter(value == 1) %&gt;% 
  split(.$Cities)

Use list2env(your_list, .GlobalEnv) if you want to convert the list into multiple data frames in your global environment.

答案4

得分: 0

以下是翻译好的内容：

You can subset df for the columns starting with City using startsWith, test them if they are equal 1 == 1 and get the column where this is the case with max.col. Paste df infront of the column ans use this to split df. Use list2env to get the data.frames in the global environment.

list2env(split(df, paste0("df", max.col(df[startsWith(names(df), "City")] == 1))), globalenv())

df1
#   A B C     D City1 City2
#7  1 1 0 65.30     1     0
#8  1 5 0 45.81     1     0
#9  2 2 0 43.37     1     0
#10 3 2 1 55.14     1     0
#11 4 5 1 59.21     1     0
#12 4 1 1 50.55     1     0

df2
#  A B C     D City1 City2
#1 1 4 0 62.32     0     1
#2 2 4 0 45.78     0     1
#3 2 2 0 54.80     0     1
#4 3 3 1 44.96     0     1
#5 4 4 1 61.42     0     1
#6 5 2 1 51.26     0     1

In case to keep it in a list and assuming City is only coded with 0 or 1 you can try:

split(df, max.col(df[startsWith(names(df), "City")))

Or using lapply and subset df.

lapply(df[startsWith(names(df), "City")], \(i) df[i == 1,])

Benchmark

bench::mark(check = FALSE,
zx8754 = {cc &lt;- which(grepl("^City", colnames(df)))  #Returns something different
  lapply(cc, function(i){ split(df[, -cc], df[ i ]) })},
TimTeaFan = {grep("City", names(df), value = TRUE) %>% 
  set_names() %>% 
  map(~ df[df[.x] == 1, ])},
Maël = split(df, do.call("paste", df[startsWith(colnames(df), "City")])),
GKi = split(df,  max.col(df[startsWith(names(df), "City"))),
GKi2 = lapply(df[startsWith(names(df), "City")], \(i) df[i == 1,])
)
#  expression      min median itr/s…¹ mem_al…² gc/se…³ n_itr  n_gc total…⁴ result
#  <bch:expr> <bch:tm> <bch:>   <dbl> <bch:by>   <dbl> <int> <dbl> <bch:t> <list>
#1 zx8754        495μs  548μs   1621.  11.27KB    10.3   788     5   486ms <NULL>
#2 TimTeaFan     226μs  247μs   3863.       0B    12.3  1877     6   486ms <NULL>
#3 Maël          250μs  264μs   3754.       0B    12.3  1824     6   486ms <NULL>
#4 GKi           302μs  321μs   3051.     240B    12.4  1480     6   485ms <NULL>
#5 GKi2          161μs  177μs   5575.   6.36KB    14.5  2694     7   483ms <NULL>

英文:

list2env(split(df, paste0(&quot;df&quot;, max.col(df[startsWith(names(df), &quot;City&quot;)] ==
                                        1))), globalenv())

df1
#   A B C     D City1 City2
#7  1 1 0 65.30     1     0
#8  1 5 0 45.81     1     0
#9  2 2 0 43.37     1     0
#10 3 2 1 55.14     1     0
#11 4 5 1 59.21     1     0
#12 4 1 1 50.55     1     0

df2
#  A B C     D City1 City2
#1 1 4 0 62.32     0     1
#2 2 4 0 45.78     0     1
#3 2 2 0 54.80     0     1
#4 3 3 1 44.96     0     1
#5 4 4 1 61.42     0     1
#6 5 2 1 51.26     0     1

In case to keep it in a list and assuming City is only coded with 0 or 1 you can try:

split(df, max.col(df[startsWith(names(df), &quot;City&quot;)]))

Or using lapply and subset df.

lapply(df[startsWith(names(df), &quot;City&quot;)], \(i) df[i==1,])

Benchmark

bench::mark(check = FALSE,
zx8754 = {cc &lt;- which(grepl(&quot;^City&quot;, colnames(df)))  #Returns something different
  lapply(cc, function(i){ split(df[, -cc], df[ i ]) })},
TimTeaFan = {grep(&quot;City&quot;, names(df), value = TRUE) %&gt;% 
  set_names() %&gt;% 
  map(~ df[df[[.x]] == 1, ])},
Ma&#235;l = split(df, do.call(&quot;paste&quot;, df[startsWith(colnames(df), &quot;City&quot;)])),
GKi = split(df,  max.col(df[startsWith(names(df), &quot;City&quot;)])),
GKi2 = lapply(df[startsWith(names(df), &quot;City&quot;)], \(i) df[i==1,])
)
#  expression      min median itr/s…&#185; mem_al…&#178; gc/se…&#179; n_itr  n_gc total…⁴ result
#  &lt;bch:expr&gt; &lt;bch:tm&gt; &lt;bch:&gt;   &lt;dbl&gt; &lt;bch:by&gt;   &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;bch:t&gt; &lt;list&gt;
#1 zx8754        495&#181;s  548&#181;s   1621.  11.27KB    10.3   788     5   486ms &lt;NULL&gt;
#2 TimTeaFan     226&#181;s  247&#181;s   3863.       0B    12.3  1877     6   486ms &lt;NULL&gt;
#3 Ma&#235;l          250&#181;s  264&#181;s   3754.       0B    12.3  1824     6   486ms &lt;NULL&gt;
#4 GKi           302&#181;s  321&#181;s   3051.     240B    12.4  1480     6   485ms &lt;NULL&gt;
#5 GKi2          161&#181;s  177&#181;s   5575.   6.36KB    14.5  2694     7   483ms &lt;NULL&gt;

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

如何根据特定列自动对数据框进行子集化并存储在单独的数据框中，在R中。

问题

答案1

答案2

答案3

答案4

R: Dplyr：如何检查一个变量的值是否包含在另一个变量中

Subsetting a long-data.table using values of a column within the data.table and casting the other values

将多列中的值扩展为二进制值

有效的代码来删除包含非唯一最大值的行？

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论