Question

I would like to identify all non-overlapping values between groups (factors) in a data frame. Let's use iris to illustrate. The iris dataset has measurements of sepal length, sepal width, petal length, and petal width for three plant species (setosa, versicolor, and virginica). All three species overlap in their measurements of sepal length and sepal width. In the measurements of petal length and petal width, setosa overlaps with neither versicolor nor virginica.

What I want can be easily visualized manually using a variety of functions such as range values or scatter plots:

tapply(iris$Sepal.Length, INDEX = iris$Species, FUN = range)
tapply(iris$Sepal.Width, INDEX = iris$Species, FUN = range)
tapply(iris$Petal.Length, INDEX = iris$Species, FUN = range)
tapply(iris$Petal.Width, INDEX = iris$Species, FUN = range)

# or

library(ggplot2)
ggplot(iris, aes(Species, Sepal.Length)) + geom_point()
ggplot(iris, aes(Species, Sepal.Width)) + geom_point()
ggplot(iris, aes(Species, Petal.Length)) + geom_point()
ggplot(iris, aes(Species, Petal.Width)) + geom_point()

But it's impractical to do this manually for large datasets, so I'd like to write a function that identifies non-overlapping values between factors in dataframes like iris. The output could be a list of matrices with TRUE or FALSE (indicating non-overlap and overlap, respectively), one for each variable in the dataset. For example, the output for iris would be a list of 4 matrices:

$1.Sepal.Length
            setosa   versicolor   virginica
setosa      NA       FALSE        FALSE   
versicolor  FALSE    NA           FALSE   
virginica   FALSE    FALSE        NA   

$2.Sepal.Width
            setosa   versicolor   virginica
setosa      NA       FALSE        FALSE   
versicolor  FALSE    NA           FALSE   
virginica   FALSE    FALSE        NA   

$3.Petal.Length
            setosa   versicolor   virginica
setosa      NA       TRUE         TRUE   
versicolor  TRUE     NA           FALSE   
virginica   TRUE     FALSE        NA   

$4.Petal.Width
            setosa   versicolor   virginica
setosa      NA       TRUE         TRUE   
versicolor  TRUE     NA           FALSE   
virginica   TRUE     FALSE        NA

I'm open to suggestions for different output formats, as long as they identify all non-overlapping values.

Answer 1

Score: 3

This is one possible solution using the tidyverse:

library(dplyr)
# tidyr and purrr are called via `::` below, so they must be installed

# build custom function
my_fun <- function(x){
    # build tibble from input data (column with metric) and Species vector from iris
    myDf <- dplyr::tibble(Species = as.character(iris$Species), Vals = as.numeric(x)) %>%
        # find min and max value per species
        dplyr::group_by(Species) %>%
        dplyr::summarise(mini = min(Vals), maxi = max(Vals))

    ret <- myDf %>%
        # pair every species with every species (cross join);
        # cross_join() needs dplyr >= 1.1.0, older versions used full_join(by = character())
        dplyr::cross_join(myDf, suffix = c("_1", "_2")) %>%
        # convert operation to row wise
        dplyr::rowwise() %>%
        # if species are the same generate NA else check if between - negated here because overlapping pairs should be FALSE
        dplyr::mutate(res = ifelse(Species_1 == Species_2, NA, !(dplyr::between(mini_1, mini_2, maxi_2) | dplyr::between(maxi_1, mini_2, maxi_2) | dplyr::between(mini_2, mini_1, maxi_1) | dplyr::between(maxi_2, mini_1, maxi_1) ))) %>%
        # drop the row-wise grouping before reshaping
        dplyr::ungroup() %>%
        # make tibble wide to get the wanted layout (the min/max columns are dropped)
        tidyr::pivot_wider(id_cols = Species_1, names_from = Species_2, values_from = res) %>%
        # need it to be able to set row names
        as.data.frame()

    # set row names from column
    row.names(ret) <- ret$Species_1
    # remove column used to name rows
    ret$Species_1 <- NULL
    return(ret)
}

purrr::map(iris[, 1:4], ~my_fun(.x))

$Sepal.Length
           setosa versicolor virginica
setosa         NA      FALSE     FALSE
versicolor  FALSE         NA     FALSE
virginica   FALSE      FALSE        NA

$Sepal.Width
           setosa versicolor virginica
setosa         NA      FALSE     FALSE
versicolor  FALSE         NA     FALSE
virginica   FALSE      FALSE        NA

$Petal.Length
           setosa versicolor virginica
setosa         NA       TRUE      TRUE
versicolor   TRUE         NA     FALSE
virginica    TRUE      FALSE        NA

$Petal.Width
           setosa versicolor virginica
setosa         NA       TRUE      TRUE
versicolor   TRUE         NA     FALSE
virginica    TRUE      FALSE        NA
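
The function above reads the grouping column from `iris$Species` directly, so it only works for iris. As a rough sketch of how the same range comparison might be generalized, the base R version below takes any data frame and the name of its grouping column. The helper name `non_overlap` and its arguments are made up for illustration and are not part of the original answer. It relies on the fact that two closed ranges are disjoint exactly when one ends before the other starts, and, like the answer, it compares only each group's minimum and maximum, not individual values.

```r
# Hypothetical helper (not from the original answer): for every numeric
# column of `df`, flag pairs of groups whose [min, max] ranges do not overlap.
non_overlap <- function(df, group) {
  g <- factor(df[[group]])          # grouping factor; unused levels are dropped
  lv <- levels(g)
  # numeric measurement columns, excluding the grouping column itself
  num_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  num_cols <- setdiff(num_cols, group)

  lapply(setNames(num_cols, num_cols), function(col) {
    lo <- tapply(df[[col]], g, min, na.rm = TRUE)   # per-group minimum
    hi <- tapply(df[[col]], g, max, na.rm = TRUE)   # per-group maximum
    # entry [i, j] is TRUE when group i ends before group j starts,
    # or group i starts after group j ends
    res <- outer(hi, lo, "<") | outer(lo, hi, ">")
    dimnames(res) <- list(lv, lv)
    diag(res) <- NA                 # a group is not compared with itself
    res
  })
}

non_overlap(iris, "Species")
```

The result is a named list of logical matrices, one per numeric column, in the same layout as the output shown above.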
