R: Mean and sd per group for multiple variables when NAs are present

Posted by 0g0grzrc on 2023-04-27 in Other


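I want to create a table of means and standard deviations for several variables in grouped data. However, the data contain NAs, so I need to pass na.rm = TRUE.
Using iris as an MWE, altered to include NAs: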

library(dplyr)
library(tidyr)  # for drop_na()

irisalt <- iris
irisalt[1, 1] <- NA
irisalt[52, 2] <- NA
irisalt[103, 3] <- NA

First attempt:

irisalt%>%
  group_by(Species)%>%
  summarise(count = n(),
            across(contains("."), c("mean" = mean, "sd" = sd))
  )

 Species    count Sepal.Length_mean Sepal.Length_sd Sepal.Width_mean Sepal.Width_sd Petal.Length_mean Petal.Length_sd
  <fct>      <int>             <dbl>           <dbl>            <dbl>          <dbl>             <dbl>           <dbl>
1 setosa        50             NA             NA                 3.43          0.379              1.46           0.174
2 versicolor    50              5.94           0.516            NA            NA                  4.26           0.470
3 virginica     50              6.59           0.636             2.97          0.322             NA             NA

This is the table layout I need, but I want the means and sds computed with the NAs removed.
Second attempt:

irisalt%>%
  group_by(Species)%>%
  drop_na()%>%
  summarise(count = n(),
            across(contains("."), c("mean" = mean, "sd" = sd))
  )

This drops every row that contains an NA, and so it also changes the means of the variables that do have data in those rows.
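The effect described above can be checked directly. The following is a minimal sketch, assuming dplyr and tidyr are installed; the object names `dropped` and `kept` are only illustrative. It compares the setosa mean of Sepal.Width (which has no NA in that group) with and without drop_na():

```r
library(dplyr)
library(tidyr)

# Same altered data as in the question
irisalt <- iris
irisalt[1, 1] <- NA
irisalt[52, 2] <- NA
irisalt[103, 3] <- NA

# Whole rows removed: each group loses a row, even for columns without NA
dropped <- irisalt %>%
  group_by(Species) %>%
  drop_na() %>%
  summarise(n = n(), Sepal.Width_mean = mean(Sepal.Width))

# NA handled per column: all rows are kept
kept <- irisalt %>%
  group_by(Species) %>%
  summarise(n = n(), Sepal.Width_mean = mean(Sepal.Width, na.rm = TRUE))

dropped
kept
```

In the setosa group the NA sits in Sepal.Length, yet drop_na() also discards that row's Sepal.Width value, so the two setosa means differ.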
Third attempt:

irisalt%>%
  group_by(Species)%>%
  summarise(count = n(),
            across(contains("."), c("mean" = mean(., na.rm = T), "sd" = sd(., na.rm =T)))
  )

Error in `summarise()`:
i In argument: `across(...)`.
Caused by error in `is.data.frame()`:
! 'list' object cannot be coerced to type 'double'

Fourth attempt:

irisalt%>%
  group_by(Species)%>%
  summarise(count = n(),
            across(contains("."), ~c("mean" = mean(., na.rm = T), "sd" = sd(., na.rm =T)))
  )

Species    count Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>      <int>        <dbl>       <dbl>        <dbl>       <dbl>
1 setosa        50        5.00        3.43         1.46        0.246
2 setosa        50        0.356       0.379        0.174       0.105
3 versicolor    50        5.94        2.76         4.26        1.33 
4 versicolor    50        0.516       0.311        0.470       0.198
5 virginica     50        6.59        2.97         5.54        2.03 
6 virginica     50        0.636       0.322        0.555       0.275
Warning message:
Returning more (or less) than 1 row per `summarise()` group was deprecated in dplyr 1.1.0.
i Please use `reframe()` instead.
i When switching from `summarise()` to `reframe()`, remember that `reframe()` always returns an ungrouped data frame and adjust
  accordingly.

These are the numbers I need, but I need one row per group (Species), with separate mean and sd columns for each variable, as in my first attempt.

Source: https://stackoverflow.com/questions/76105000/mean-and-sd-per-group-for-multiple-variables-when-nas-present

2 Answers

xuo3flqw1#

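You were close, but the syntax is slightly off: across() needs a list of functions or lambdas, not the result of calling mean() and sd().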

library(dplyr)

irisalt %>%
  group_by(Species) %>%
  summarise(count = n(),
            # each list element is a lambda applied to every selected column
            across(contains("."), list(mean = ~ mean(., na.rm = TRUE),
                                       sd = ~ sd(., na.rm = TRUE)))
  )

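From the documentation, ?across:
A named list of functions or lambdas, e.g. list(mean = mean, n_miss = ~ sum(is.na(.x))). Each function is applied to each column, and the output is named by combining the function name and the column name using the glue specification in .names.
Note: as of dplyr 1.1.0, summarise() has a .by argument. It is experimental, but it allows one-off per-group computations like this one, so you don't need to pipe into group_by().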

Output

Species    count Sepal.Length_mean Sepal.Length_sd Sepal.Width_mean Sepal.Width_sd Petal.Length_mean Petal.Length_sd Petal.Width_mean Petal.Width_sd
  <fct>      <int>             <dbl>           <dbl>            <dbl>          <dbl>             <dbl>           <dbl>            <dbl>          <dbl>
1 setosa        50              5.00           0.356             3.43          0.379              1.46           0.174            0.246          0.105
2 versicolor    50              5.94           0.516             2.76          0.311              4.26           0.470            1.33           0.198
3 virginica     50              6.59           0.636             2.97          0.322              5.54           0.555            2.03           0.275
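The `.by` note above can be sketched as follows. This is a minimal example, assuming dplyr 1.1.0 or later; the extra `n` entry and the name `res_by` are additions for illustration and are not part of the original answer. Because count = n() counts rows, including those with an NA, a per-column count of non-missing values may be useful alongside the mean and sd:

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for the .by argument

# Same altered data as in the question
irisalt <- iris
irisalt[1, 1] <- NA
irisalt[52, 2] <- NA
irisalt[103, 3] <- NA

res_by <- irisalt %>%
  summarise(
    count = n(),                                   # rows per group, NAs included
    across(contains("."),
           list(mean = ~ mean(.x, na.rm = TRUE),
                sd   = ~ sd(.x, na.rm = TRUE),
                n    = ~ sum(!is.na(.x)))),        # non-missing values per column
    .by = Species                                  # replaces group_by(Species)
  )

res_by
```

With `.by` the grouping applies to this one call only, so the result is an ungrouped data frame with one row per species.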


fhity93d2#

You can use the newer reframe(), and you need to put the summary statistics in a list:

irisalt %>%
  reframe(count = n(),
          across(contains("."),
                 list(mean = ~ mean(., na.rm = TRUE),
                      stdev = ~ sd(., na.rm = TRUE))),
          .by = Species)

Output

Species count Sepal.Length_mean Sepal.Length_stdev Sepal.Width_mean Sepal.Width_stdev Petal.Length_mean Petal.Length_stdev Petal.Width_mean Petal.Width_stdev
1     setosa    50          5.004082          0.3558787         3.428000         0.3790644          1.462000          0.1736640            0.246         0.1053856
2 versicolor    50          5.936000          0.5161711         2.761224         0.3107895          4.260000          0.4699110            1.326         0.1977527
3  virginica    50          6.588000          0.6358796         2.974000         0.3224966          5.544898          0.5553007            2.026         0.2746501

More information on the differences between summarise() and reframe(): https://dplyr.tidyverse.org/reference/reframe.html
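To illustrate that difference, here is a minimal sketch, assuming dplyr 1.1.0 or later; the `stat` column and the name `long` are hypothetical additions. Unlike summarise(), reframe() accepts results with more than one row per group, so the two-rows-per-species layout from the question's fourth attempt can be produced without the deprecation warning, with a label column to tell the rows apart:

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for reframe() and .by

# Same altered data as in the question
irisalt <- iris
irisalt[1, 1] <- NA
irisalt[52, 2] <- NA
irisalt[103, 3] <- NA

long <- irisalt %>%
  reframe(
    stat = c("mean", "sd"),                        # labels the two rows per group
    across(contains("."),
           ~ c(mean(.x, na.rm = TRUE), sd(.x, na.rm = TRUE))),
    .by = Species
  )

long
is_grouped_df(long)  # reframe() always returns an ungrouped data frame
```

For the one-row-per-group table the question asks for, the list-of-lambdas form shown in the answers remains the one to use; this long layout is only an alternative view of the same numbers.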
