Computing a mean via logical subsetting versus a piped filter in dplyr appears to return different results

9gm1akwq  posted on 2023-06-19  in  Other

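I have been trying to speed up my code by using filter and join operations. On closer inspection, however, I found that two different approaches return slightly different results for the same calculation (in my case, the mean of a vector).
I am working with time-series data and want to compute a baseline, i.e. the mean of a response value (Mean_Intensity) over a specific time range. My simplified data looks like this: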

time Neuron_ID                             Mean_Intensity DrugApp1
  <dbl> <fct>                                          <dbl> <dbl>
1     0 GLP_1_200nM_16_02_2023_p2_f1_Neuron_1           9.88   300
2     0 GLP_1_200nM_16_02_2023_p2_f1_Neuron_2          11.8    300
3     0 GLP_1_200nM_16_02_2023_p2_f1_Neuron_3           8.45   300
4     0 GLP_1_200nM_16_02_2023_p2_f1_Neuron_4           9.99   300
5     0 GLP_1_200nM_16_02_2023_p2_f1_Neuron_5           4.48   300
6     0 GLP_1_20nM_23_02_2023_p3_f1_Neuron_1            9.89   300

The old approach is as follows:

df_baseline <- df %>%
  filter(time >= (DrugApp1 - 260), time <= (DrugApp1 + 750)) %>%
  group_by(Neuron_ID) %>%
  mutate(F0 = mean(Mean_Intensity[time <= DrugApp1], na.rm = TRUE))

My new code does the subsetting in the filter step instead:

df_baseline_new <- df %>%
  filter(time >= (DrugApp1 - 260), time <= DrugApp1) %>%
  group_by(Neuron_ID) %>%
  mutate(F0 = mean(Mean_Intensity, na.rm = TRUE))

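However, the two approaches return slightly different F0 values for the same variable.
For example, for a given Neuron_ID, the two approaches return F0 values of 2.906231 and 2.911889 respectively.
Because the difference is so small, I suspect that the set of time points used by mutate(F0 = mean(Mean_Intensity[time <= DrugApp1], na.rm = TRUE)) differs in length from the one selected by filter(time >= (DrugApp1 - 260), time <= (DrugApp1 + 750)). I thought it might come down to the <= / < operators including or excluding one extra time point, but I have tried many combinations of these operators and cannot make the results match, and I am not sure how to check which time range mutate(F0 = mean(Mean_Intensity[time <= DrugApp1], na.rm = TRUE)) actually acts on.
I'm aware that filter removes NA values, but I have tidied my data, and df_baseline_new is not missing any values that could explain this.
As things stand, I think the new code is more likely to be correct: I can see that the time points kept in df_baseline_new are the ones I expect. So this is not a big problem, since the new code is substantially faster (2.01 s vs 340 s), but I would like to know what causes the difference. Any help would be appreciated.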
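One way to see which time points the subset inside mutate() actually uses is to summarise that subset per group. The following is a minimal sketch on made-up data; the toy data frame and the column names n_used, t_min and t_max are assumptions for illustration, not part of the original script.

```r
library(dplyr)

# Toy data (hypothetical): 2 neurons, time 0..1000 in steps of 10, drug applied at 300
toy <- expand.grid(time = seq(0, 1000, by = 10), Neuron_ID = c("a", "b"))
toy$Mean_Intensity <- seq_len(nrow(toy))
toy$DrugApp1 <- 300

check <- toy %>%
  filter(time >= (DrugApp1 - 260), time <= (DrugApp1 + 750)) %>%
  group_by(Neuron_ID) %>%
  summarise(
    n_used = sum(time <= DrugApp1),        # how many rows enter the baseline mean
    t_min  = min(time[time <= DrugApp1]),  # earliest time point used
    t_max  = max(time[time <= DrugApp1]),  # latest time point used
    F0     = mean(Mean_Intensity[time <= DrugApp1], na.rm = TRUE)
  )

print(check)
```

If n_used or the time bounds differ from the rows kept by filter(time >= (DrugApp1 - 260), time <= DrugApp1), the two means are being computed over different points.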

Edit

Some example data to reproduce the problem:

structure(list(time = c(0, 0, 0, 10, 10, 10, 20, 20, 20, 30, 
30, 30, 40, 40, 40, 50, 50, 50, 60, 60, 60, 70, 70, 70, 80, 80, 
80, 90, 90, 90, 100, 100, 100, 110, 110, 110, 120, 120, 120, 
130, 130, 130, 140, 140, 140, 150, 150, 150, 160, 160, 160, 170, 
170, 170, 180, 180, 180, 190, 190, 190, 200, 200, 200, 210, 210, 
210, 220, 220, 220, 230, 230, 230, 240, 240, 240, 250, 250, 250, 
260, 260, 260, 270, 270, 270, 280, 280, 280, 290, 290, 290, 300, 
300, 300, 310, 310, 310, 320, 320, 320, 330, 330, 330, 340, 340, 
340, 350, 350, 350, 360, 360, 360, 370, 370, 370, 380, 380, 380, 
390, 390, 390, 400, 400, 400, 410, 410, 410, 420, 420, 420, 430, 
430, 430, 440, 440, 440, 450, 450, 450, 460, 460, 460, 470, 470, 
470, 480, 480, 480, 490, 490, 490, 500, 500, 500, 510, 510, 510, 
520, 520, 520, 530, 530, 530, 540, 540, 540, 550, 550, 550, 560, 
560, 560, 570, 570, 570, 580, 580, 580, 590, 590, 590, 600, 600, 
600, 610, 610, 610, 620, 620, 620, 630, 630, 630, 640, 640, 640, 
650, 650, 650, 660, 660, 660, 670, 670, 670, 680, 680, 680, 690, 
690, 690, 700, 700, 700, 710, 710, 710, 720, 720, 720, 730, 730, 
730, 740, 740, 740, 750, 750, 750, 760, 760, 760, 770, 770, 770, 
780, 780, 780, 790, 790, 790, 800, 800, 800, 810, 810, 810, 820, 
820, 820, 830, 830, 830, 840, 840, 840, 850, 850, 850, 860, 860, 
860, 870, 870, 870, 880, 880, 880, 890, 890, 890, 900, 900, 900, 
910, 910, 910, 920, 920, 920, 930, 930, 930, 940, 940, 940, 950, 
950, 950, 960, 960, 960, 970, 970, 970, 980, 980, 980, 990, 990, 
990, 1000, 1000, 1000, 1010, 1010, 1010), Neuron_ID = structure(c(1L, 
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 
1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 
1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 
1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 
1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 
1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 
1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 
3L), levels = c("POMC_GFP_GLP_1_50nM_23_02_23_p1_f1_Neuron_3", 
"POMC_GFP_GLP_1_50nM_23_02_23_p1_f1_Neuron_4", "POMC_GFP_GLP_1_50nM_23_02_23_p1_f1_Neuron_5"
), class = "factor"), Mean_Intensity = c(17.592, 10.148, 2.753, 
16.496, 9.785, 2.684, 18.887, 9.681, 2.78700000000001, 19.068, 
10.181, 2.71300000000001, 19.463, 10.072, 2.86999999999999, 18.075, 
9.795, 2.70699999999999, 17.515, 9.7, 2.74300000000001, 17.35, 
9.55800000000001, 2.61200000000001, 18.055, 9.35899999999999, 
2.67999999999999, 18.119, 9.11799999999999, 2.68599999999999, 
18.136, 9.298, 2.715, 18.648, 9.54899999999999, 2.744, 18.41, 
9.45399999999999, 2.866, 16.803, 9.75700000000001, 2.941, 18.081, 
9.76000000000001, 2.64, 17.018, 9.574, 3.283, 19.086, 10.122, 
2.98299999999999, 18.874, 9.97699999999999, 2.94799999999999, 
18.416, 9.556, 3.178, 19.367, 9.62, 2.852, 19.236, 9.68599999999999, 
2.875, 19.282, 9.76000000000001, 3.479, 20.024, 9.64100000000001, 
3.15600000000001, 20.177, 9.85499999999999, 3.077, 20.53, 9.29900000000001, 
3.096, 19.449, 9.595, 3.352, 17.926, 9.52499999999999, 3.20099999999999, 
18.146, 9.398, 3.101, 17.706, 9.355, 2.952, 18.222, 9.41800000000001, 
3, 19.932, 9.46600000000001, 2.941, 20.391, 9.47500000000001, 
2.943, 19.975, 9.449, 2.86199999999999, 19.704, 9.63, 3.029, 
20.318, 9.247, 2.711, 21.773, 9.613, 2.756, 21.757, 9.753, 2.78, 
21.396, 9.396, 2.75, 21.837, 9.721, 2.89700000000001, 20.221, 
9.387, 2.718, 20.905, 9.26400000000001, 2.80900000000001, 19.763, 
9.50399999999999, 2.759, 20.885, 9.821, 2.949, 21.624, 9.411, 
2.65400000000001, 21.694, 9.316, 3.167, 21.947, 9.56100000000001, 
3.52000000000001, 23.746, 9.245, 3.14500000000001, 24.241, 9.366, 
3.23099999999999, 23.875, 9.491, 3.28, 24.328, 9.361, 3.03700000000001, 
23.937, 8.99600000000001, 2.78100000000001, 23.383, 9.083, 2.94200000000001, 
23.297, 9.295, 3.29900000000001, 23.183, 9.20100000000001, 3.21300000000001, 
22.493, 9.111, 3.24300000000001, 23.056, 9.09400000000001, 2.967, 
23.406, 9.16199999999999, 3.226, 25.179, 9.277, 3.295, 25.679, 
9.024, 3.134, 25.215, 8.986, 3.19199999999999, 24.382, 9.048, 
3.28, 25.559, 9.33, 3.352, 25.122, 9.575, 3.27200000000001, 25.637, 
9.303, 3.33, 23.172, 9.176, 3.31, 24.396, 9.349, 3.32300000000001, 
23.88, 9.313, 3.321, 23.421, 9.304, 3.19, 24.431, 9.16399999999999, 
3.22999999999999, 25.018, 9.01900000000001, 3.363, 25.232, 9.128, 
3.53999999999999, 25.348, 9.206, 3.43900000000001, 25.399, 9.497, 
3.298, 24.746, 9.069, 2.967, 25.702, 9.107, 3.26299999999999, 
25.793, 9.36500000000001, 3.292, 25.703, 9.111, 3.148, 25.637, 
9.38800000000001, 3.328, 25.759, 9.384, 3.42399999999999, 23.973, 
9.53200000000001, 3.39400000000001, 23.623, 9.649, 3.497, 25.759, 
9.78500000000001, 3.65700000000001, 25.619, 9.64400000000001, 
3.28, 24.842, 9.813, 3.48400000000001, 24.342, 9.759, 3.55, 23.872, 
9.711, 3.474, 21.907, 9.794, 3.26299999999999, 21.298, 9.56, 
2.869, 21.291, 9.752, 3.227, 21.236, 9.76799999999999, 3.15299999999999, 
21.887, 9.93000000000001, 3.169, 22.742, 9.848, 3.22, 22.095, 
10.115, 3.622, 22.043, 10.031, 3.73, 21.19, 10.041, 3.548, 21.572, 
9.992, 3.476, 23.337, 10.255, 3.67099999999999, 23.556, 10.142, 
3.459, 22.683, 10.26, 3.56, 23.389, 10.289, 3.42999999999999, 
23.686, 10.025, 3.15900000000001, 23.687, 10.147, 3.12700000000001
), DrugApp1 = c(300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 
300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 
300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 
300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 
300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 
300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 
300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 
300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 
300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 
300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 
300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 
300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 
300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 
300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 
300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 
300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 
300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 
300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 
300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 
300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 
300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 
300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 
300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 
300, 300, 300, 300, 300, 300, 300, 300, 300, 300)), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -306L), groups = structure(list(
    Neuron_ID = structure(1:3, levels = c("POMC_GFP_GLP_1_50nM_23_02_23_p1_f1_Neuron_3", 
    "POMC_GFP_GLP_1_50nM_23_02_23_p1_f1_Neuron_4", "POMC_GFP_GLP_1_50nM_23_02_23_p1_f1_Neuron_5"
    ), class = "factor"), .rows = structure(list(c(1L, 4L, 7L, 
    10L, 13L, 16L, 19L, 22L, 25L, 28L, 31L, 34L, 37L, 40L, 43L, 
    46L, 49L, 52L, 55L, 58L, 61L, 64L, 67L, 70L, 73L, 76L, 79L, 
    82L, 85L, 88L, 91L, 94L, 97L, 100L, 103L, 106L, 109L, 112L, 
    115L, 118L, 121L, 124L, 127L, 130L, 133L, 136L, 139L, 142L, 
    145L, 148L, 151L, 154L, 157L, 160L, 163L, 166L, 169L, 172L, 
    175L, 178L, 181L, 184L, 187L, 190L, 193L, 196L, 199L, 202L, 
    205L, 208L, 211L, 214L, 217L, 220L, 223L, 226L, 229L, 232L, 
    235L, 238L, 241L, 244L, 247L, 250L, 253L, 256L, 259L, 262L, 
    265L, 268L, 271L, 274L, 277L, 280L, 283L, 286L, 289L, 292L, 
    295L, 298L, 301L, 304L), c(2L, 5L, 8L, 11L, 14L, 17L, 20L, 
    23L, 26L, 29L, 32L, 35L, 38L, 41L, 44L, 47L, 50L, 53L, 56L, 
    59L, 62L, 65L, 68L, 71L, 74L, 77L, 80L, 83L, 86L, 89L, 92L, 
    95L, 98L, 101L, 104L, 107L, 110L, 113L, 116L, 119L, 122L, 
    125L, 128L, 131L, 134L, 137L, 140L, 143L, 146L, 149L, 152L, 
    155L, 158L, 161L, 164L, 167L, 170L, 173L, 176L, 179L, 182L, 
    185L, 188L, 191L, 194L, 197L, 200L, 203L, 206L, 209L, 212L, 
    215L, 218L, 221L, 224L, 227L, 230L, 233L, 236L, 239L, 242L, 
    245L, 248L, 251L, 254L, 257L, 260L, 263L, 266L, 269L, 272L, 
    275L, 278L, 281L, 284L, 287L, 290L, 293L, 296L, 299L, 302L, 
    305L), c(3L, 6L, 9L, 12L, 15L, 18L, 21L, 24L, 27L, 30L, 33L, 
    36L, 39L, 42L, 45L, 48L, 51L, 54L, 57L, 60L, 63L, 66L, 69L, 
    72L, 75L, 78L, 81L, 84L, 87L, 90L, 93L, 96L, 99L, 102L, 105L, 
    108L, 111L, 114L, 117L, 120L, 123L, 126L, 129L, 132L, 135L, 
    138L, 141L, 144L, 147L, 150L, 153L, 156L, 159L, 162L, 165L, 
    168L, 171L, 174L, 177L, 180L, 183L, 186L, 189L, 192L, 195L, 
    198L, 201L, 204L, 207L, 210L, 213L, 216L, 219L, 222L, 225L, 
    228L, 231L, 234L, 237L, 240L, 243L, 246L, 249L, 252L, 255L, 
    258L, 261L, 264L, 267L, 270L, 273L, 276L, 279L, 282L, 285L, 
    288L, 291L, 294L, 297L, 300L, 303L, 306L)), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -3L), .drop = TRUE))

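OK, my problem was that DrugApp1 was actually a numeric vector the same size as my data frame. Because this is a script I run on many different datasets with different variable names (which I extract as DrugApp1_string), I have an earlier piece of code that maps the dataset-specific variable name onto a common DrugApp1. What I intended to write was: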

df$DrugApp1 <- df[,c(DrugApp1_string)]

What I had actually written was:

DrugApp1 <- df[,c(DrugApp1_string)]

This caused the misalignment and, I think, also made my code slow. Very, very annoying.
Thank you very much for your help!
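This kind of misalignment can be reproduced on toy data. dplyr verbs look a name up among the data frame's columns first and fall back to the calling environment if no such column exists, so a stray global DrugApp1 is used silently when the column was never created. The sketch below rests on assumptions: the data and the source column name drug_time are made up, and it may not match exactly what happened in the original script.

```r
library(dplyr)

# Toy data (hypothetical): two neurons with different drug application times
toy <- data.frame(
  Neuron_ID      = rep(c("a", "b"), each = 10),
  time           = rep(seq(0, 900, by = 100), 2),
  Mean_Intensity = rep(0:9, 2),
  drug_time      = rep(c(300, 500), each = 10)
)

# The slip: a free-standing vector with one element per row of the whole data frame
DrugApp1 <- toy[["drug_time"]]

# Inside each group, time has 10 elements but the global DrugApp1 has 20,
# so the comparison is recycled and no longer lines up with the group's rows
bad <- toy %>%
  group_by(Neuron_ID) %>%
  summarise(F0 = mean(Mean_Intensity[time <= DrugApp1], na.rm = TRUE))

# The intended version: store the values as a column of the data frame
toy$DrugApp1 <- toy[["drug_time"]]

# .data$DrugApp1 forces a column lookup and errors if the column is missing
good <- toy %>%
  group_by(Neuron_ID) %>%
  summarise(F0 = mean(Mean_Intensity[time <= .data$DrugApp1], na.rm = TRUE))

print(bad)   # neuron b gets 1.5 because the threshold of neuron a (300) is applied
print(good)  # neuron b gets 2.5, the mean over time <= 500
```

Writing .data$DrugApp1 is a cheap guard in scripts that are reused across datasets, because a missing column then fails loudly instead of picking up a global object with the same name.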

tf7tbtn2 (answer #1)

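You are running the code on different datasets: df in the first block and data_FINAL in the second. To check for differences, run both sets of code on the same data, rename the columns so that you can tell them apart, and add flags marking which rows are included in each mean. Like this: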

library(dplyr)

data_FINAL$id = 1:nrow(data_FINAL) # add ID variable so we can compare

result_new = data_FINAL %>%
  filter(time >= (DrugApp1 - 260), time <= DrugApp1) %>%
  group_by(Neuron_ID) %>%
  mutate(
    included_new = TRUE,
    F0_new = mean(Mean_Intensity, na.rm = TRUE)
  ) 

result_old <- data_FINAL %>% ## use the same data
  filter(time >= (DrugApp1 - 260), time <= (DrugApp1 + 750)) %>%
  group_by(Neuron_ID) %>%
  mutate(
    included_old = time <= DrugApp1,
    F0 = mean(Mean_Intensity[time <= DrugApp1], na.rm = TRUE)
  )

Then you can join the two results and compare them:

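result_old |> 
  full_join(result_new, by = "id", suffix = c("_old", "_new")) |>
  # rows absent from one result have NA flags after the join; treat NA as "not included"
  filter(coalesce(included_old, FALSE) != coalesce(included_new, FALSE))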

This gives you every row that is included in one mean but not in the other.
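As an alternative sketch of the same idea, the rows that feed each mean can be extracted first and then compared with anti_join(), which returns the rows of its first argument that have no match in the second. The toy data below is made up, and the old variant deliberately uses a strict < so that the two baselines differ by one row.

```r
library(dplyr)

# Toy data (hypothetical): one neuron, six time points, drug applied at 300
toy <- data.frame(id = 1:6, time = c(0, 100, 200, 300, 400, 500), DrugApp1 = 300)

# Rows entering the baseline mean under each approach
rows_new <- toy %>%
  filter(time >= (DrugApp1 - 260), time <= DrugApp1)

rows_old <- toy %>%
  filter(time >= (DrugApp1 - 260), time <= (DrugApp1 + 750)) %>%
  filter(time < DrugApp1)  # strict on purpose, to create a difference

only_new <- anti_join(rows_new, rows_old, by = "id")  # used by new but not old
only_old <- anti_join(rows_old, rows_new, by = "id")  # used by old but not new

print(only_new)
print(only_old)
```

Here only_new contains the row at time 300 and only_old is empty, which pinpoints the boundary point as the source of the discrepancy in this toy case.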
