Identifying non-overlapping values between factors in a dataframe in R
Question
I would like to identify all non-overlapping values between groups (factors) in a dataframe. Let's use iris to illustrate. The iris dataset has measurements of sepal length, sepal width, petal length, and petal width for three plant species (setosa, versicolor, and virginica). All three species overlap in their measurements of sepal length and sepal width. In measurements of petal length and petal width, setosa overlaps with neither versicolor nor virginica.
What I want can easily be checked by hand, for example by inspecting per-species ranges or scatter plots:
tapply(iris$Sepal.Length, INDEX = iris$Species, FUN = range)
tapply(iris$Sepal.Width, INDEX = iris$Species, FUN = range)
tapply(iris$Petal.Length, INDEX = iris$Species, FUN = range)
tapply(iris$Petal.Width, INDEX = iris$Species, FUN = range)
# or
library(ggplot2)
ggplot(iris, aes(Species, Sepal.Length)) + geom_point()
ggplot(iris, aes(Species, Sepal.Width)) + geom_point()
ggplot(iris, aes(Species, Petal.Length)) + geom_point()
ggplot(iris, aes(Species, Petal.Width)) + geom_point()
But doing this manually is impractical for large datasets, so I'd like to write a function that identifies non-overlapping values between factors in dataframes like iris. The output could be a list of logical matrices, one per variable in the dataset, with TRUE indicating non-overlap and FALSE indicating overlap. For example, the output for iris would be a list of 4 matrices:
$1.Sepal.Length
           setosa versicolor virginica
setosa         NA      FALSE     FALSE
versicolor  FALSE         NA     FALSE
virginica   FALSE      FALSE        NA

$2.Sepal.Width
           setosa versicolor virginica
setosa         NA      FALSE     FALSE
versicolor  FALSE         NA     FALSE
virginica   FALSE      FALSE        NA

$3.Petal.Length
           setosa versicolor virginica
setosa         NA       TRUE      TRUE
versicolor   TRUE         NA     FALSE
virginica    TRUE      FALSE        NA

$4.Petal.Width
           setosa versicolor virginica
setosa         NA       TRUE      TRUE
versicolor   TRUE         NA     FALSE
virginica    TRUE      FALSE        NA
I'm open to other output formats, as long as they identify all non-overlapping values.
Answer 1
Score: 3
This is one possible solution using the tidyverse:
library(dplyr)

# Build a custom function; tidyr and purrr must be installed (called via ::)
my_fun <- function(x) {
  # Build a tibble from the input column (the metric) and the Species vector from iris
  myDf <- dplyr::tibble(Species = as.character(iris$Species), Vals = as.numeric(x)) %>%
    # Find the min and max value per species
    dplyr::group_by(Species) %>%
    dplyr::summarise(mini = min(Vals), maxi = max(Vals))

  ret <- myDf %>%
    # Pair every species with every species (needs dplyr >= 1.1.0; older
    # versions used full_join(myDf, by = character(), suffix = c("_1", "_2")))
    dplyr::cross_join(myDf, suffix = c("_1", "_2")) %>%
    # Evaluate row by row
    dplyr::rowwise() %>%
    # NA if the species are the same; otherwise check whether any endpoint falls
    # inside the other range. Negated, so overlapping ranges give FALSE.
    dplyr::mutate(res = ifelse(
      Species_1 == Species_2,
      NA,
      !(dplyr::between(mini_1, mini_2, maxi_2) |
          dplyr::between(maxi_1, mini_2, maxi_2) |
          dplyr::between(mini_2, mini_1, maxi_1) |
          dplyr::between(maxi_2, mini_1, maxi_1))
    )) %>%
    dplyr::ungroup() %>%
    # Make the tibble wide to get the wanted layout, keeping Species_1 as the row id
    tidyr::pivot_wider(id_cols = Species_1, names_from = Species_2, values_from = res) %>%
    # Convert to a data frame so that row names can be set
    as.data.frame()

  # Set row names from the column, then remove that column
  row.names(ret) <- ret$Species_1
  ret$Species_1 <- NULL
  ret
}

purrr::map(iris[, 1:4], my_fun)
$Sepal.Length
           setosa versicolor virginica
setosa         NA      FALSE     FALSE
versicolor  FALSE         NA     FALSE
virginica   FALSE      FALSE        NA

$Sepal.Width
           setosa versicolor virginica
setosa         NA      FALSE     FALSE
versicolor  FALSE         NA     FALSE
virginica   FALSE      FALSE        NA

$Petal.Length
           setosa versicolor virginica
setosa         NA       TRUE      TRUE
versicolor   TRUE         NA     FALSE
virginica    TRUE      FALSE        NA

$Petal.Width
           setosa versicolor virginica
setosa         NA       TRUE      TRUE
versicolor   TRUE         NA     FALSE
virginica    TRUE      FALSE        NA
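
The function above reads Species directly from iris, so it only works for that dataset. For other dataframes, the same range comparison could be sketched in base R roughly as follows. This is not part of the original answer: the names non_overlap_matrix and non_overlap_by_group are made up for illustration, and the sketch assumes that "non-overlapping" means the closed [min, max] ranges of two groups do not intersect.

```r
# Hypothetical helper (illustrative name): pairwise non-overlap of group ranges.
# `values` is a numeric vector, `groups` a factor (or coercible) of equal length.
non_overlap_matrix <- function(values, groups) {
  groups <- droplevels(as.factor(groups))
  # 2 x k matrix: row 1 = min, row 2 = max, one column per group
  rng <- vapply(split(values, groups), range, numeric(2), na.rm = TRUE)
  lo <- rng[1, ]
  hi <- rng[2, ]
  # Two closed ranges are disjoint exactly when one ends before the other begins
  out <- outer(hi, lo, "<") | outer(lo, hi, ">")
  diag(out) <- NA  # a group compared with itself is not meaningful
  dimnames(out) <- list(levels(groups), levels(groups))
  out
}

# Hypothetical wrapper: apply the helper to every numeric column of `df`,
# grouping by the column named in `group_col`.
non_overlap_by_group <- function(df, group_col) {
  num_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  num_cols <- setdiff(num_cols, group_col)
  lapply(df[num_cols], non_overlap_matrix, groups = df[[group_col]])
}

res <- non_overlap_by_group(iris, "Species")
res$Petal.Length
```

Because the comparison runs through outer(), it is vectorised over all pairs of groups and needs no join or reshape. The result is a named list of logical matrices in the layout requested in the question.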