Percent of total count by group in R

mm5n2pyu, posted 11 months ago in Other

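I'm trying to produce output that gives, for each factor level, its count as a percentage of the total count in the data frame, but I can't figure out how to keep the grouping column in the output.
I can get the total count to divide by...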

df %>% summarise(sum(num))
# 15

...

df %>% group_by(species) %>% summarise(sum(num))
# A tibble: 3 × 2
#   species                  `sum(num)`
#   <chr>                         <int>
# 1 Farfantepenaeus duorarum          4
# 2 Farfantepenaeus notialis          0
# 3 Farfantepenaeus spp              11


But I can't get it into this form...

# ???
#   species                     Percent
#   <chr>                         <int>
# 1 Farfantepenaeus duorarum       4 / 15 = 0.267
# 2 Farfantepenaeus notialis       0 / 15 = 0.000
# 3 Farfantepenaeus spp           11 / 15 = 0.733


The closest I've gotten is this, but because I used reframe(), it returns ungrouped data and the species column is lost:

df %>% group_by(species) %>% 
  summarise(factor_count=sum(num)) %>% 
  # ungroup() %>% 
  # Warning: Please use `reframe()` instead. When switching from `summarise()`
  # to `reframe()`, remember that `reframe()` always returns an ungrouped data frame.
  reframe(percent=factor_count/sum(df$num))

# A tibble: 3 × 1
  percent
    <dbl>
1   0.267
2   0    
3   0.733


Data:

> dput(df)
structure(list(species = c("Farfantepenaeus notialis", "Farfantepenaeus spp", 
"Farfantepenaeus notialis", "Farfantepenaeus notialis", "Farfantepenaeus duorarum", 
"Farfantepenaeus duorarum", "Farfantepenaeus notialis", "Farfantepenaeus spp", 
"Farfantepenaeus duorarum", "Farfantepenaeus spp", "Farfantepenaeus notialis", 
"Farfantepenaeus duorarum", "Farfantepenaeus spp", "Farfantepenaeus notialis", 
"Farfantepenaeus notialis", "Farfantepenaeus spp", "Farfantepenaeus duorarum", 
"Farfantepenaeus spp", "Farfantepenaeus spp", "Farfantepenaeus duorarum", 
"Farfantepenaeus duorarum", "Farfantepenaeus spp", "Farfantepenaeus spp", 
"Farfantepenaeus spp", "Farfantepenaeus notialis"), num = c(0L, 
0L, 0L, 0L, 1L, 0L, 0L, 2L, 0L, 3L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 3L, 0L, 2L, 4L, 0L)), row.names = c(159897L, 174698L, 
236857L, 190237L, 327321L, 272931L, 304567L, 75538L, 109206L, 
351373L, 280332L, 163966L, 282183L, 341197L, 316962L, 354703L, 
343971L, 95333L, 244258L, 254061L, 87561L, 186908L, 221318L, 
258688L, 97737L), class = "data.frame")

b1zrtrql, answer #1

Two steps: summarize the per-group totals, then divide each total by the sum across all groups.

library(dplyr)
df %>%
  summarize(Percent = sum(num), .by = species) %>%
  mutate(Percent = Percent / sum(Percent))
#                    species   Percent
# 1 Farfantepenaeus notialis 0.0000000
# 2      Farfantepenaeus spp 0.7333333
# 3 Farfantepenaeus duorarum 0.2666667
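
If the raw count should appear next to its share, as in the mock-up in the question, the same two steps can keep both columns. The following is only a sketch: `df_demo` and the column names `count` and `percent` are made up for illustration, and `.by` assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Small stand-in for the asker's data (values invented for illustration;
# the group totals happen to match the question: 4, 0 and 11 out of 15)
df_demo <- data.frame(
  species = c("duorarum", "notialis", "spp", "spp", "duorarum"),
  num     = c(1L, 0L, 5L, 6L, 3L)
)

res <- df_demo %>%
  # step 1: one row per species holding the group total
  summarize(count = sum(num), .by = species) %>%
  # step 2: divide each group total by the sum over all groups
  mutate(percent = count / sum(count))

res
```

Because the denominator is `sum(count)` from the summarized table itself, the shares add up to 1 regardless of how the input was filtered earlier in the pipe.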

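About your code:

  • reframe is unnecessary. For the most part it can stand in for summarise when the number of rows *changes*, though I haven't verified whether or where the two differ significantly. Here it actually drops species.
  • (Almost) *never* use df$ inside a pipe that starts with df: df$num ignores everything done since the start of the pipe, so grouping, filtering, added or changed columns and so on are not reflected in that version of df. There are cases where it is useful or even necessary, but they are rare.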
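
The second point can be illustrated with a small sketch. The data frame `toy` and its values are invented for this example and are not the asker's data; `.by` assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Toy data, invented for illustration
toy <- data.frame(
  species = c("a", "a", "b", "c"),
  num     = c(1L, 3L, 4L, 2L)
)

# Denominator from toy$num: it still includes the rows that filter() dropped
wrong <- toy %>%
  filter(species != "c") %>%
  summarize(n = sum(num), .by = species) %>%
  mutate(percent = n / sum(toy$num))

# Denominator from the data flowing through the pipe
right <- toy %>%
  filter(species != "c") %>%
  summarize(n = sum(num), .by = species) %>%
  mutate(percent = n / sum(n))

sum(wrong$percent)  # 0.8: the shares no longer add up to 1
sum(right$percent)  # 1
```

In the question's code there is no filtering step, so `sum(df$num)` happens to give the right total, but the result silently goes wrong as soon as a step like `filter()` is added upstream.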
yjghlzjz, answer #2

Using xtabs:

> xtabs(num ~ species, df) |> proportions() |> as.data.frame()
                   species         Freq
1 Farfantepenaeus duorarum 0.2666666667
2 Farfantepenaeus notialis 0.0000000000
3      Farfantepenaeus spp 0.7333333333


cgfeq70w, answer #3

Pass the per-row share to the wt argument of count(); count() then sums those weights within each species:

df %>%
    count(species, wt = num/sum(.$num), name = 'percent')

                   species   percent
1 Farfantepenaeus duorarum 0.2666667
2 Farfantepenaeus notialis 0.0000000
3      Farfantepenaeus spp 0.7333333


bejyjqdl, answer #4

Here are two alternative approaches:

Using map_vec:

library(purrr)
library(dplyr)

df %>% 
  summarise(sum_num = sum(num), .by=species) %>% 
  mutate(percent = map_vec(sum_num, ~ .x /  sum(df$num)))


base R:

# credits to @r2evans: 
aggregate(num ~ species, data = df, sum) |>
  transform(percent = num/sum(num))

# or:
df_sums <- aggregate(num ~ species, data = df, sum)
df_sums$percent <- df_sums$num / sum(df$num)

df_sums
#                    species num   percent
# 1 Farfantepenaeus duorarum   4 0.2666667
# 2 Farfantepenaeus notialis   0 0.0000000
# 3      Farfantepenaeus spp  11 0.7333333
