Create valid range from a factor and apply another factor range in R

Question

I have a CSV file with two columns: the first column holds factor categories representing vessel size ranges, and the second column is a vessel class that falls into that size category. I need to use these data to fill in a new table of different, established vessel size ranges. For example, my initial raw data is in two columns:

dat.start <- data.frame(category = c(rep("1-10", 3), rep("11-20", 3), rep("21-30", 3), rep("32-40", 3), rep("41-50", 3), rep("51-59", 3)), class = rep(c("a", "b", "c"), 6))

When I aggregate class by category, e.g.

ag.dat <- aggregate(class ~ category, data = dat.start, length)

you'll see I get a data frame whose structure, per str(ag.dat), consists of a chr column and an int column.

The next problem is that I need to assign those frequencies of vessel sizes into a table of new, predetermined vessel size categories that are different from the first. For example, below are the new size categories and the frequency of vessel classes based on the original dat.start data:

dat.end <- data.frame(category = c("1-20", "21-50", ">50"), class = c(6, 9, 3))

So my question is how to go from dat.start to dat.end? My first thoughts were to somehow split up the chr string categories and create new dat.start and dat.end ranges that can be numerically interpreted, such as what is produced by cut. But then I drew a blank when it came to going the next step of actually creating the frequencies of vessel class based on new categories. Plus the converting of chr string ranges to numerical ranges also stumped me.

The closest solution I found on the web was, I think, this one: https://stackoverflow.com/questions/71078299/identify-the-matching-range-from-a-list-of-valid-range

but it looks like it's written for Python/Pandas, and I need to do it in R. Thanks.

Answer 1

Score: 2

If you separate "category" into two columns you can make numerical comparisons, e.g.

library(tidyverse)

dat.start <- data.frame(category = c(rep("1-10", 3), rep("11-20", 3), rep("21-30", 3), rep("32-40", 3), rep("41-50", 3), rep("51-59", 3)), class = rep(c("a", "b", "c"), 6))

dat.start
#>    category class
#> 1      1-10     a
#> 2      1-10     b
#> 3      1-10     c
#> 4     11-20     a
#> 5     11-20     b
#> 6     11-20     c
#> 7     21-30     a
#> 8     21-30     b
#> 9     21-30     c
#> 10    32-40     a
#> 11    32-40     b
#> 12    32-40     c
#> 13    41-50     a
#> 14    41-50     b
#> 15    41-50     c
#> 16    51-59     a
#> 17    51-59     b
#> 18    51-59     c

dat.end <- data.frame(category = c("1-20", "21-50", ">50"), class = c(6, 9, 3))

dat.end
#>   category class
#> 1     1-20     6
#> 2    21-50     9
#> 3      >50     3

dat.start %>%
  # convert = TRUE turns "min" and "max" into integers, so the comparisons
  # below are numeric rather than character (lexicographic) comparisons
  separate(category, into = c("min", "max"), sep = "-", convert = TRUE) %>%
  mutate(category = case_when(max <= 20 ~ "1-20",
                              min > 20 & max <= 50 ~ "21-50",
                              min > 50 ~ ">50")) %>%
  summarise(class = n(), .by = category)
#>   category class
#> 1     1-20     6
#> 2    21-50     9
#> 3      >50     3
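
For comparison, here is a base R sketch of the same regrouping with cut(), which the question mentions. It assumes that no original range straddles a new boundary, so the upper bound of each range is enough to place it. The break points c(0, 20, 50, Inf) and the names upper, new_category and result are illustrative choices, not part of the original answer.

```r
dat.start <- data.frame(
  category = c(rep("1-10", 3), rep("11-20", 3), rep("21-30", 3),
               rep("32-40", 3), rep("41-50", 3), rep("51-59", 3)),
  class = rep(c("a", "b", "c"), 6)
)

# Drop everything up to and including the hyphen, leaving the upper bound,
# and parse it as a number (e.g. "11-20" -> 20)
upper <- as.numeric(sub(".*-", "", dat.start$category))

# Assumed new bins: (0, 20], (20, 50], (50, Inf)
new_category <- cut(upper,
                    breaks = c(0, 20, 50, Inf),
                    labels = c("1-20", "21-50", ">50"))

# Count the rows per new bin; responseName names the count column "class"
result <- as.data.frame(table(category = new_category),
                        responseName = "class")
result
#>   category class
#> 1     1-20     6
#> 2    21-50     9
#> 3      >50     3
```

Because cut() returns a factor with all three levels, a bin that receives no rows would still appear in the result with a count of 0.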

Or another potential approach is to use a 'look up' table, e.g.

lookup_table <- setNames(c("1-20", "1-20", "21-50",
                           "21-50", "21-50", ">50"),
                         unique(dat.start$category))
lookup_table
#>    1-10   11-20   21-30   32-40   41-50   51-59 
#>  "1-20"  "1-20" "21-50" "21-50" "21-50"   ">50"

dat.start %>%
  mutate(category = recode(category, !!!lookup_table)) %>%
  summarise(class = n(), .by = category)
#>   category class
#> 1     1-20     6
#> 2    21-50     9
#> 3      >50     3
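
The same lookup idea also works without dplyr, by indexing a named character vector. This is a minimal sketch that assumes every original category has an entry in the table. The mapping is written out by hand here, and the names new_category, counts and result are illustrative.

```r
dat.start <- data.frame(
  category = c(rep("1-10", 3), rep("11-20", 3), rep("21-30", 3),
               rep("32-40", 3), rep("41-50", 3), rep("51-59", 3)),
  class = rep(c("a", "b", "c"), 6)
)

# Names are the original categories, values are the new categories
lookup_table <- c("1-10"  = "1-20",  "11-20" = "1-20",
                  "21-30" = "21-50", "32-40" = "21-50",
                  "41-50" = "21-50", "51-59" = ">50")

# as.character() guards against the column being a factor, in which case
# indexing would use the integer codes instead of the labels
new_category <- unname(lookup_table[as.character(dat.start$category)])

# Fail early if any original category is missing from the lookup table
stopifnot(!anyNA(new_category))

# Fix the level order so the output rows follow the target table
counts <- table(factor(new_category, levels = c("1-20", "21-50", ">50")))
result <- data.frame(category = names(counts), class = as.vector(counts))
result
#>   category class
#> 1     1-20     6
#> 2    21-50     9
#> 3      >50     3
```

Writing the mapping as explicit name = value pairs keeps it independent of the row order in dat.start, unlike building it from unique(dat.start$category).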

Created on 2023-03-07 with reprex v2.0.2

There are many different ways to use a lookup table for this type of task, see https://stackoverflow.com/questions/67081496/canonical-tidyverse-method-to-update-some-values-of-a-vector-from-a-look-up-tabl for more methods / examples

huangapple
  • Posted on March 7, 2023 at 10:08:09
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75657456.html