Identifying non-overlapping values between factors in a dataframe in R

Question

I would like to identify all non-overlapping values between groups (factors) in a dataframe. Let's use iris to illustrate. The iris dataset has measurements of sepal length, sepal width, petal length, and petal width for three plant species (setosa, versicolor, and virginica). All three species overlap in their measurements of sepal length and sepal width. In the measurements of petal length and petal width, setosa overlaps with neither versicolor nor virginica, while versicolor and virginica overlap with each other.

What I want is easy to check manually, for example with per-group ranges or scatter plots:

tapply(iris$Sepal.Length, INDEX = iris$Species, FUN = range)
tapply(iris$Sepal.Width, INDEX = iris$Species, FUN = range)
tapply(iris$Petal.Length, INDEX = iris$Species, FUN = range)
tapply(iris$Petal.Width, INDEX = iris$Species, FUN = range)

# or

library(ggplot2)
ggplot(iris, aes(Species, Sepal.Length)) + geom_point()
ggplot(iris, aes(Species, Sepal.Width)) + geom_point()
ggplot(iris, aes(Species, Petal.Length)) + geom_point()
ggplot(iris, aes(Species, Petal.Width)) + geom_point()

But it's impractical to do this manually for large datasets, so I'd like to write a function that identifies non-overlapping values between factors in dataframes like iris. The output could be a list of matrices with TRUE or FALSE (indicating non-overlap and overlap, respectively), one for each variable in the dataset. For example, the output for iris would be a list of 4 matrices:

$1.Sepal.Length
            setosa   versicolor   virginica
setosa      NA       FALSE        FALSE   
versicolor  FALSE    NA           FALSE   
virginica   FALSE    FALSE        NA   

$2.Sepal.Width
            setosa   versicolor   virginica
setosa      NA       FALSE        FALSE   
versicolor  FALSE    NA           FALSE   
virginica   FALSE    FALSE        NA   

$3.Petal.Length
            setosa   versicolor   virginica
setosa      NA       TRUE         TRUE   
versicolor  TRUE     NA           FALSE   
virginica   TRUE     FALSE        NA   

$4.Petal.Width
            setosa   versicolor   virginica
setosa      NA       TRUE         TRUE   
versicolor  TRUE     NA           FALSE   
virginica   TRUE     FALSE        NA   

I am open to suggestions for different output formats, as long as they identify all non-overlapping values.
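To make the criterion concrete: the matrices above treat two groups as non-overlapping for a variable when the largest value of one group is smaller than the smallest value of the other. Below is a minimal sketch of that test for a single pair of groups. The helper name `ranges_disjoint` is hypothetical, and the sketch assumes there are no missing values and that ranges sharing an endpoint count as overlapping.

```r
# Hypothetical helper: TRUE when the value ranges of x and y do not overlap.
# Assumes x and y are numeric vectors without NA values.
ranges_disjoint <- function(x, y) {
  max(x) < min(y) || max(y) < min(x)
}

# Split one measurement by species and compare two pairs of groups.
petal <- split(iris$Petal.Length, iris$Species)
ranges_disjoint(petal$setosa, petal$versicolor)     # TRUE: the ranges are disjoint
ranges_disjoint(petal$versicolor, petal$virginica)  # FALSE: the ranges overlap
```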

Answer 1

Score: 3

This is one possible solution within the tidyverse:

library(dplyr)

# build custom function
my_fun <- function(x){
    # build tibble from input data (column with metric) and Species vector from iris
    # note: the grouping is hard-coded to iris$Species
    myDf <- dplyr::tibble(Species = as.character(iris$Species), Vals = as.numeric(x)) %>%
        # find min and max value per species
        dplyr::group_by(Species) %>%
        dplyr::summarise(mini = min(Vals), maxi = max(Vals))

    ret <- myDf %>%
        # pair every species with every species (cross_join() needs dplyr >= 1.1.0,
        # where full_join(by = character()) was deprecated)
        dplyr::cross_join(myDf, suffix = c("_1", "_2")) %>%
        # convert operation to row wise
        dplyr::rowwise() %>%
        # if species are the same generate NA, else check whether the ranges overlap;
        # the check is negated because overlapping ranges should give FALSE
        dplyr::mutate(res = ifelse(Species_1 == Species_2, NA, !(dplyr::between(mini_1, mini_2, maxi_2) | dplyr::between(maxi_1, mini_2, maxi_2) | dplyr::between(mini_2, mini_1, maxi_1) | dplyr::between(maxi_2, mini_1, maxi_1)))) %>%
        dplyr::ungroup() %>%
        # keep only the columns needed, then make the tibble wide to get the wanted layout
        dplyr::select(Species_1, Species_2, res) %>%
        tidyr::pivot_wider(names_from = Species_2, values_from = res) %>%
        # a plain data frame is needed to be able to set row names
        as.data.frame()

    # set row names from column
    row.names(ret) <- ret$Species_1
    # remove column used to name rows
    ret$Species_1 <- NULL
    return(ret)
}

purrr::map(iris[, 1:4], ~my_fun(.x))

$Sepal.Length
           setosa versicolor virginica
setosa         NA      FALSE     FALSE
versicolor  FALSE         NA     FALSE
virginica   FALSE      FALSE        NA

$Sepal.Width
           setosa versicolor virginica
setosa         NA      FALSE     FALSE
versicolor  FALSE         NA     FALSE
virginica   FALSE      FALSE        NA

$Petal.Length
           setosa versicolor virginica
setosa         NA       TRUE      TRUE
versicolor   TRUE         NA     FALSE
virginica    TRUE      FALSE        NA

$Petal.Width
           setosa versicolor virginica
setosa         NA       TRUE      TRUE
versicolor   TRUE         NA     FALSE
virginica    TRUE      FALSE        NA
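Because `my_fun` reads `iris$Species` directly, it only works for iris as written. The same range comparison can be sketched in base R with the grouping column passed as an argument. The function names `non_overlap_matrix` and `non_overlap_by_group` are hypothetical and not part of the answer above, and the sketch assumes numeric columns without missing values.

```r
# Hypothetical helper: pairwise non-overlap matrix for one numeric vector
# split by a grouping factor. Assumes no NA values in `values`.
non_overlap_matrix <- function(values, groups) {
  parts <- split(values, droplevels(as.factor(groups)))
  lo <- vapply(parts, min, numeric(1))  # per-group minimum
  hi <- vapply(parts, max, numeric(1))  # per-group maximum
  below <- outer(hi, lo, "<")           # [i, j]: group i lies entirely below group j
  res <- below | t(below)               # disjoint in either direction
  diag(res) <- NA                       # a group is not compared with itself
  res
}

# Hypothetical wrapper: one matrix per numeric column of `df`,
# grouped by the column named in `group_col`.
non_overlap_by_group <- function(df, group_col) {
  is_num <- vapply(df, is.numeric, logical(1))
  is_num[group_col] <- FALSE            # never treat the grouping column as a measurement
  lapply(df[is_num], non_overlap_matrix, groups = df[[group_col]])
}

res <- non_overlap_by_group(iris, "Species")
res$Petal.Length
```

Calling `non_overlap_by_group(iris, "Species")` returns a named list with one logical matrix per measurement column, in the same layout as the output above.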

huangapple
  • Posted on February 14, 2023, 09:02:13
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75442568.html