Subset a data frame to have no more than 5 records for each group

Posted by oug3syen on 2023-02-26 in Other

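I have a data frame with a factor called diet. Say its levels are "herbivore", "carnivore", and "omnivore", and there are 3 herbivores, 6 carnivores, and 8 omnivores.
Basically, I want to filter this data frame so that levels with 5 or fewer rows are left unchanged, while levels with more than 5 rows are cut down to 5 (preferably at random).
For example, I would go from 3 herbivores, 6 carnivores, and 8 omnivores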

        diet factor2
1  herbivore       a
2  herbivore       a
3  herbivore       a
4  carnivore       a
5  carnivore       a
6  carnivore       a
7  carnivore       a
8  carnivore       a
9  carnivore       a
10  omnivore       a
11  omnivore       a
12  omnivore       a
13  omnivore       a
14  omnivore       a
15  omnivore       a
16  omnivore       a
17  omnivore       a

to 3 herbivores, 5 carnivores, and 5 omnivores.

        diet factor2
1  herbivore       a
2  herbivore       a
3  herbivore       a
4  carnivore       a
5  carnivore       a
6  carnivore       a
7  carnivore       a
8  carnivore       a
9   omnivore       a
10  omnivore       a
11  omnivore       a
12  omnivore       a
13  omnivore       a

Source: https://stackoverflow.com/questions/75562072/subset-dataframe-to-have-not-more-than-5-records-for-each-group

2 Answers

blpfk2vs1#

We can use slice_sample, which gained a by argument in dplyr 1.1.0:

set.seed(2023)
library(dplyr)
mtcars |>
  slice_sample(n = 5, by = carb) |>
  arrange(carb)                       # for easier visual review

Result:

                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Fiat 128          32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1 # 5 shown out of 7
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Toyota Corolla    33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Fiat X1-9         27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2 # 5 shown out of 10
Volvo 142E        21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
Dodge Challenger  15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
Porsche 914-2     26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 450SLC       15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3 # 3 shown out of 3
Merc 450SE        16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL        17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 280C         17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4 # 5 shown out of 10
Camaro Z28        13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Ferrari Dino      19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6 # 1 shown out of 1
Maserati Bora     15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8 # 1 shown out of 1

Before dplyr 1.1.0 (without the by argument), group explicitly instead:

mtcars |> 
  group_by(carb) |> 
  slice_sample(n = 5) |> 
  ungroup() |>
  arrange(carb)
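
To connect this back to the question, here is a minimal sketch that applies the same slice_sample call to data shaped like the asker's. The data frame name df1 and its construction are assumptions made for illustration (the question only shows the printed rows), and the by argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical reconstruction of the question's data:
# 3 herbivores, 6 carnivores and 8 omnivores.
df1 <- data.frame(
  diet    = factor(rep(c("herbivore", "carnivore", "omnivore"),
                       times = c(3, 6, 8))),
  factor2 = "a"
)

set.seed(2023)  # only to make the random draw reproducible

# Draw up to 5 random rows per diet level; levels with fewer
# than 5 rows keep all of their rows.
capped <- df1 |>
  slice_sample(n = 5, by = diet)

# Rows per diet level after capping
table(capped$diet)
```

Groups smaller than n are returned whole, which is the "leave small levels unchanged" behaviour the question asks for.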

Answered on 2023-02-26

u5rb5r592#

Another dplyr solution:

df1 %>%  
  group_by(diet) %>% 
  slice(1:5) ## or we can use: filter(row_number() <= 5)
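
Note that slice(1:5) keeps the first five rows of each group in their existing order, not a random five. The sketch below makes that visible; df1 is rebuilt here as a hypothetical stand-in for the asker's data, and the id column is added purely for illustration so the kept rows can be identified.

```r
library(dplyr)

# Hypothetical stand-in for the question's data, plus an
# illustrative id column recording the original row position.
df1 <- data.frame(
  id      = seq_len(17),
  diet    = factor(rep(c("herbivore", "carnivore", "omnivore"),
                       times = c(3, 6, 8))),
  factor2 = "a"
)

# slice_head(n = 5) is the named equivalent of slice(1:5) here:
# it keeps the first (up to) 5 rows of each group.
first_five <- df1 %>%
  group_by(diet) %>%
  slice_head(n = 5) %>%
  ungroup()

# The filter() variant mentioned in the comment above keeps the
# same rows, but leaves them in their original row order.
first_five_filter <- df1 %>%
  group_by(diet) %>%
  filter(row_number() <= 5) %>%
  ungroup()
```

If a random selection matters, slice_sample is the better fit; this variant is deterministic.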

In data.table, we can use rowid to number the rows within each group and keep only those numbered 5 or lower.
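
The answer's own data.table code is not legible here, so the following is only a minimal sketch of how a rowid-based filter could look, not a reconstruction of it. The df1 data is a hypothetical stand-in for the asker's data frame, and the shuffled variant at the end is an added illustration for the "preferably random" part of the question.

```r
library(data.table)

# Hypothetical stand-in for the question's data:
# 3 herbivores, 6 carnivores and 8 omnivores.
df1 <- data.frame(
  diet    = factor(rep(c("herbivore", "carnivore", "omnivore"),
                       times = c(3, 6, 8))),
  factor2 = "a"
)
dt <- as.data.table(df1)

# rowid(diet) numbers rows 1, 2, 3, ... within each diet level,
# so this keeps the first (up to) 5 rows per level.
capped <- dt[rowid(diet) <= 5]

# Random variant: shuffle all rows first, then apply the same cap.
set.seed(2023)  # only to make the shuffle reproducible
capped_random <- dt[sample(.N)][rowid(diet) <= 5]

# Rows per diet level after capping
capped[, .N, by = diet]
```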

Answered on 2023-02-26