I have the following dataset:
my_data = structure(list(state = c("State A", "State A", "State A", "State A",
"State B", "State B", "State B", "State B", "State A", "State A",
"State A", "State A", "State B", "State B", "State B", "State B"
), city = c("city 1", "city 1", "city 2", "city 2", "city 3",
"city 3", "city 4", "city 4", "city 1", "city 1", "city 2", "city 2",
"city 3", "city 3", "city 4", "city 4"), vaccine = c("yes", "no",
"yes", "no", "yes", "no", "yes", "no", "yes", "no", "yes", "no",
"yes", "no", "yes", "no"), counts = c(1221, 2233, 1344, 887,
9862, 2122, 8772, 2341, 1221, 2233, 1344, 887, 9862, 2122, 8772,
2341), year = c(2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021,
2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022)), row.names = c(NA,
-16L), class = "data.frame")
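My question: for each city, I want to find the percentage of the population that was vaccinated in each year.
The final result might look something like this (I just made up the numbers):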
state city vaccine Relative_Percentage year
1 State A city 1 yes 0.6 2021
2 State A city 1 no 0.4 2021
3 State A city 2 yes 0.3 2021
4 State A city 2 no 0.7 2021
Using this post as an example (Relative frequencies / proportions with dplyr), I tried the following code:
library(dplyr)
my_data %>%
group_by(year, state, city, vaccine) %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
But I don't think my code is correct: all of the percentages are exactly 0.5.
`summarise()` has grouped output by 'year', 'state', 'city'. You can override using the `.groups` argument.
# A tibble: 16 x 6
# Groups: year, state, city [8]
year state city vaccine n freq
<dbl> <chr> <chr> <chr> <int> <dbl>
1 2021 State A city 1 no 1 0.5
2 2021 State A city 1 yes 1 0.5
Can someone show me how to fix this?
Thanks!
1 Answer
For each city, I want to find the percentage of the population that was vaccinated in each year.
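Don't include `vaccine` in the grouping; you can keep `state` in it to distinguish between cities. Also, if you want percentages of `counts`, you need to compute them inside `summarize`: your `summarise(n = n())` dropped `counts`, so it is no longer available afterwards. Using `n` to compute `freq` only gives the share of rows in the data, not the share of people vaccinated. That is why every value came out as 0.5: each city and year has exactly one `yes` row and one `no` row. Since you want to know which `vaccine` value has which frequency, include it in the summary as well.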
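The advice above might be sketched as follows. This assumes dplyr 1.1.0 or later, where `reframe()` is the verb intended for returning several rows per group; the answer itself speaks of `summarize`, and on dplyr 1.0.x `summarise(vaccine, freq = counts / sum(counts))` plays the same role. The data frame is rebuilt compactly here only so the snippet runs on its own, and the name `result` is just for illustration.

```r
library(dplyr)

# Same 16 rows as the dataset in the question, rebuilt compactly
my_data <- data.frame(
  state   = rep(rep(c("State A", "State B"), each = 4), times = 2),
  city    = rep(rep(paste("city", 1:4), each = 2), times = 2),
  vaccine = rep(c("yes", "no"), times = 8),
  counts  = rep(c(1221, 2233, 1344, 887, 9862, 2122, 8772, 2341), times = 2),
  year    = rep(c(2021, 2022), each = 8)
)

result <- my_data %>%
  # Group by everything except vaccine, so each group holds
  # the "yes" and "no" rows of one city in one year
  group_by(year, state, city) %>%
  # Keep vaccine and divide each count by the group total;
  # reframe() always returns an ungrouped data frame
  reframe(vaccine = vaccine, freq = counts / sum(counts))

print(result)
```

For city 1 in 2021, for example, the `yes` row becomes 1221 / (1221 + 2233), and the two values within each city and year add up to 1.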
Frankly, we don't "need" `summarize` at all; we can just `mutate` the column in, since the counts appear to be aggregated already.
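A minimal sketch of the `mutate` variant, under the same assumption that `counts` is already one aggregated figure per city, year, and vaccine status. The data frame is rebuilt so the snippet is self-contained; `with_freq` and `vaccinated` are illustrative names, not part of the original answer.

```r
library(dplyr)

# Same 16 rows as the dataset in the question, rebuilt compactly
my_data <- data.frame(
  state   = rep(rep(c("State A", "State B"), each = 4), times = 2),
  city    = rep(rep(paste("city", 1:4), each = 2), times = 2),
  vaccine = rep(c("yes", "no"), times = 8),
  counts  = rep(c(1221, 2233, 1344, 887, 9862, 2122, 8772, 2341), times = 2),
  year    = rep(c(2021, 2022), each = 8)
)

with_freq <- my_data %>%
  group_by(year, state, city) %>%
  # mutate() keeps every original row and column and adds freq alongside counts
  mutate(freq = counts / sum(counts)) %>%
  # Drop the grouping so later steps operate on the whole table
  ungroup()

# Optional: keep only the vaccinated share per city and year
vaccinated <- with_freq %>%
  filter(vaccine == "yes")

print(vaccinated)
```

Because `mutate` does not drop rows or columns, `counts` stays available next to `freq`, which makes the result easy to check by hand.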