Identifying non-overlapping values between factor levels in a data frame in R

eufgjt7s  posted on 2023-02-20 in Other

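I want to identify all non-overlapping values between groups. Let's use iris to illustrate. The iris dataset has measurements of sepal length, sepal width, petal length, and petal width for three plant species (setosa, versicolor, and virginica). All three species overlap in their sepal length and sepal width measurements. In the petal length and petal width measurements, setosa does not overlap with versicolor or virginica.
This is easy to check by hand with various functions, such as per-species ranges or scatterplots: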

tapply(iris$Sepal.Length, INDEX = iris$Species, FUN = range)
tapply(iris$Sepal.Width, INDEX = iris$Species, FUN = range)
tapply(iris$Petal.Length, INDEX = iris$Species, FUN = range)
tapply(iris$Petal.Width, INDEX = iris$Species, FUN = range)

# or

library(ggplot2)
ggplot(iris, aes(Species, Sepal.Length)) + geom_point()
ggplot(iris, aes(Species, Sepal.Width)) + geom_point()
ggplot(iris, aes(Species, Petal.Length)) + geom_point()
ggplot(iris, aes(Species, Petal.Width)) + geom_point()
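The four tapply() calls above can also be collapsed into a single pass over the numeric columns. This is only a sketch in base R; the object name `ranges` and the "min"/"max" column labels are chosen here for illustration and are not part of the original question.

```r
# Per-species minimum and maximum for every numeric column of iris.
num_cols <- names(iris)[sapply(iris, is.numeric)]

ranges <- lapply(iris[num_cols], function(v) {
  # split() groups the values by species; range() gives c(min, max) per group
  rng <- t(sapply(split(v, iris$Species), range))
  colnames(rng) <- c("min", "max")
  rng
})

# One 3 x 2 matrix per measurement, with species as row names
ranges$Petal.Length
```

Reading the matrices side by side shows the same picture as the plots: for the petal measurements, the maximum for setosa lies below the minimum of the other two species.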

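For large datasets, however, doing this by hand is impractical, so I would like to write a function that identifies non-overlapping values between the levels of a factor in a data frame like iris. The output could be a list of matrices of TRUE or FALSE (meaning non-overlapping and overlapping, respectively), with one matrix per variable in the dataset. For example, the output for iris would be a list of 4 matrices: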

$1.Sepal.Length
            setosa   versicolor   virginica
setosa      NA       FALSE        FALSE   
versicolor  FALSE    NA           FALSE   
virginica   FALSE    FALSE        NA   

$2.Sepal.Width
            setosa   versicolor   virginica
setosa      NA       FALSE        FALSE   
versicolor  FALSE    NA           FALSE   
virginica   FALSE    FALSE        NA   

$3.Petal.Length
            setosa   versicolor   virginica
setosa      NA       TRUE         TRUE   
versicolor  TRUE     NA           FALSE   
virginica   TRUE     FALSE        NA   

$4.Petal.Width
            setosa   versicolor   virginica
setosa      NA       TRUE         TRUE   
versicolor  TRUE     NA           FALSE   
virginica   TRUE     FALSE        NA

I am open to suggestions for a different output, as long as it identifies all the non-overlapping values.
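One way to make "non-overlapping" precise, assuming it refers to the value ranges of two groups, is that the ranges are disjoint when one group's maximum is below the other group's minimum. The helper name `ranges_disjoint` below is hypothetical and only illustrates the criterion for a single pair of groups.

```r
# Hypothetical helper (not from any package): TRUE when the value ranges
# of x and y do not overlap. Touching endpoints count as overlapping.
ranges_disjoint <- function(x, y) {
  max(x) < min(y) || max(y) < min(x)
}

setosa     <- iris$Petal.Length[iris$Species == "setosa"]
versicolor <- iris$Petal.Length[iris$Species == "versicolor"]
virginica  <- iris$Petal.Length[iris$Species == "virginica"]

ranges_disjoint(setosa, versicolor)     # TRUE:  1.0-1.9 vs 3.0-5.1
ranges_disjoint(versicolor, virginica)  # FALSE: 3.0-5.1 vs 4.5-6.9
```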

0lvr5msh  #1

Here is one possible solution within the tidyverse:

library(dplyr)  # also needs tidyr and purrr installed (called via ::)

# build custom function; note that it reads Species from the global iris data
my_fun <- function(x){
    # build tibble from input data (column with metric) and Species vector from iris
    myDf <- dplyr::tibble(Species = as.character(iris$Species), Vals = as.numeric(x)) %>%
        # find min and max value per species
        dplyr::group_by(Species) %>%
        dplyr::summarise(mini = min(Vals), maxi = max(Vals))

    ret <- myDf %>%
        # pair every species with every species (cross join, dplyr >= 1.1.0;
        # older versions used full_join(myDf, by = character()) instead)
        dplyr::cross_join(myDf, suffix = c("_1", "_2")) %>%
        # convert operation to row wise
        dplyr::rowwise() %>%
        # if species are the same generate NA, else check whether either range
        # contains an endpoint of the other; negated because overlapping
        # ranges should give FALSE
        dplyr::mutate(res = ifelse(
            Species_1 == Species_2,
            NA,
            !(dplyr::between(mini_1, mini_2, maxi_2) |
              dplyr::between(maxi_1, mini_2, maxi_2) |
              dplyr::between(mini_2, mini_1, maxi_1) |
              dplyr::between(maxi_2, mini_1, maxi_1))
        )) %>%
        # drop the row-wise grouping before reshaping
        dplyr::ungroup() %>%
        # make tibble wide to get the wanted layout (min/max columns are dropped)
        tidyr::pivot_wider(id_cols = Species_1, names_from = Species_2, values_from = res) %>%
        # need it to be able to set row names
        as.data.frame()

    # set row names from column
    row.names(ret) <- ret$Species_1
    # remove column used to name rows
    ret$Species_1 <- NULL
    return(ret)
}

# apply to the four measurement columns; returns a named list of data frames
purrr::map(iris[, 1:4], ~my_fun(.x))

$Sepal.Length
           setosa versicolor virginica
setosa         NA      FALSE     FALSE
versicolor  FALSE         NA     FALSE
virginica   FALSE      FALSE        NA

$Sepal.Width
           setosa versicolor virginica
setosa         NA      FALSE     FALSE
versicolor  FALSE         NA     FALSE
virginica   FALSE      FALSE        NA

$Petal.Length
           setosa versicolor virginica
setosa         NA       TRUE      TRUE
versicolor   TRUE         NA     FALSE
virginica    TRUE      FALSE        NA

$Petal.Width
           setosa versicolor virginica
setosa         NA       TRUE      TRUE
versicolor   TRUE         NA     FALSE
virginica    TRUE      FALSE        NA
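The function above is tied to iris because it reads iris$Species directly. As a rough sketch of how the same range comparison could be generalized to any data frame and grouping column using only base R, something like the following might work. The name `non_overlap` and its arguments `data` and `group` are hypothetical, and it assumes that "non-overlapping" means the per-group value ranges are disjoint, as in the answer.

```r
# Hypothetical helper (not from any package): for each numeric column of
# `data`, return a logical matrix that is TRUE where the value ranges of
# two levels of `group` do not overlap, and NA on the diagonal.
non_overlap <- function(data, group) {
  g <- factor(data[[group]])
  # all numeric columns except the grouping column itself
  num_cols <- setdiff(names(data)[sapply(data, is.numeric)], group)

  lapply(data[num_cols], function(v) {
    lo <- tapply(v, g, min, na.rm = TRUE)
    hi <- tapply(v, g, max, na.rm = TRUE)
    # ranges i and j are disjoint when i ends before j starts, or
    # i starts after j ends; equal endpoints count as overlapping
    res <- outer(hi, lo, "<") | outer(lo, hi, ">")
    diag(res) <- NA
    res
  })
}

res <- non_overlap(iris, "Species")
res$Petal.Length
```

Because the comparison is vectorized with outer(), no join or row-wise step is needed, and the result is a named list of logical matrices in the layout requested in the question.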
