dplyr: subsetting rows by groups with multiple thresholds

Posted by epggiuax on 2023-09-27 in Other

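I'm trying to answer what should be a very simple question: list the top 20% of pharmacies in each province, measured by units sold.
I first computed the number of units sold by each pharmacy, sorted the pharmacies in descending order, and then set a 20% threshold per province.
Here is how I computed the thresholds: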

library(dplyr)

# Count the pharmacies in each province
spad_number <- c_spad_results %>%
  group_by(province) %>%
  summarise(tot_pharma = n())
View(spad_number)

# 20% of each province's pharmacy count, rounded to the nearest integer
spad_number_20_threshold <- spad_number %>%
  mutate(Top_20_threshold = round(tot_pharma * 0.20)) %>%
  dplyr::select(Top_20_threshold, province)

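This gives me the following data frame:

# A tibble: 4 × 2
  Top_20_threshold province
1              248 Berlin
2               38 Genua
3               27 London
4               42 Turin

Now I want a data frame that contains only the top 20% of pharmacies per province, and I couldn't come up with a better solution than this: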

top_20_pharma <- ranked_pharmas %>%
  group_by(province) %>%
  slice(1:spad_number_20_threshold$Top_20_threshold)
View(top_20_pharma)

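But then I keep getting this warning:

Warning messages:
1: In 1:spad_number_20_threshold$Top_20_threshold :
  numerical expression has 4 elements: only the first used
2: In 1:spad_number_20_threshold$Top_20_threshold :
  numerical expression has 4 elements: only the first used
3: In 1:spad_number_20_threshold$Top_20_threshold :
  numerical expression has 4 elements: only the first used
4: In 1:spad_number_20_threshold$Top_20_threshold :
  numerical expression has 4 elements: only the first used

How can I make sure that the rows of my ranked_pharmas data frame are selected with the correct threshold for each province, rather than with only the first value of the threshold data frame?
Thanks a lot in advance for any kind of help!
Best :)
I expected to get 248 pharmacies for Berlin, 38 for Genua, 27 for London and 42 for Turin, but I got more than double that. I also experimented with filter():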

#filter(row_number(desc(total_units)) <= first(spad_number_20_threshold$Top_20_threshold))

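But it also just reads the first value, 248, and returns 248 pharmacies for every group. What I'm missing here is something like VLOOKUP in Excel, but I'm not experienced enough with R :/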
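
The warning arises because `1:x` uses only the first element of `x` when `x` has more than one element, so every group is sliced with the Berlin threshold. One possible way to get the VLOOKUP-like behaviour described above is to join the thresholds onto the ranked data and compare each row's within-group position with its own province's threshold. The following is only a minimal sketch: the toy data, the `pharmacy_id` column and the `thresholds` name are hypothetical stand-ins, since the real structure of `ranked_pharmas` is not shown.

```r
library(dplyr)

# Hypothetical toy data standing in for ranked_pharmas:
# one row per pharmacy, with its province and total units sold.
ranked_pharmas <- tibble(
  province    = rep(c("Berlin", "London"), times = c(10, 5)),
  pharmacy_id = 1:15,
  total_units = c(100, 90, 80, 70, 60, 50, 40, 30, 20, 10,
                  55, 45, 35, 25, 15)
)

# Per-province threshold: 20% of the number of pharmacies, rounded.
thresholds <- ranked_pharmas %>%
  count(province, name = "tot_pharma") %>%
  mutate(Top_20_threshold = round(tot_pharma * 0.20)) %>%
  dplyr::select(province, Top_20_threshold)

# Attach each province's threshold to its rows (the "lookup" step),
# sort within each province, then keep the rows whose position in the
# group does not exceed that province's threshold.
top_20_pharma <- ranked_pharmas %>%
  left_join(thresholds, by = "province") %>%
  group_by(province) %>%
  arrange(desc(total_units), .by_group = TRUE) %>%
  filter(row_number() <= Top_20_threshold) %>%
  ungroup()
```

With this toy data the thresholds are 2 for Berlin and 1 for London, so each province is cut at its own value rather than at the first one in the threshold table.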

r

Source: https://stackoverflow.com/questions/77136438/dplyr-subsetting-rows-by-groups-with-multiple-thresholds

1 Answer

6jjcrrmo #1

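I'm not sure about the structure of c_spad_results, but I believe slice_head() can solve your problem, provided the data set is already sorted in descending order of units within each province.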

c_spad_results |>
  group_by(province) |>
  slice_head(prop = 0.2)
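
Since `slice_head()` simply takes the leading rows of each group, the result depends on the sort order. The sketch below makes the ordering explicit and also shows `slice_max()`, which selects by value without a separate sort. The toy data and the assumption that `total_units` is the ranking column are hypothetical. Note that `prop` is rounded towards zero to get the row count, so the counts can differ slightly from thresholds computed with `round()`.

```r
library(dplyr)

# Hypothetical toy data; total_units is assumed to be the ranking column.
pharmas <- tibble(
  province    = rep(c("Berlin", "London"), times = c(10, 5)),
  total_units = c(30, 100, 10, 70, 50, 90, 20, 80, 40, 60,
                  25, 55, 15, 45, 35)
)

# Sort within each province first, then take the leading 20% of rows.
top_by_head <- pharmas |>
  group_by(province) |>
  arrange(desc(total_units), .by_group = TRUE) |>
  slice_head(prop = 0.2) |>
  ungroup()

# Alternative without a separate sort: keep the rows with the largest
# values. with_ties = FALSE caps the row count even when values tie.
top_by_max <- pharmas |>
  group_by(province) |>
  slice_max(total_units, prop = 0.2, with_ties = FALSE) |>
  ungroup()
```

Both variants keep two rows for Berlin and one for London in this toy data.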

Answered 2023-09-27
