如何自动计算每组RMSE以拟合R中的数据?

kq0g1dla  于 2023-04-03  发布在  其他
关注(0)|答案(1)|浏览(151)

例如,我有如下数据。

genotype=rep(c("A","B","C","D","E"), each=5)
env=rep(c(10,20,30,40,50), time=5)
outcome=c(10,15,17,19,22,12,13,15,18,25,10,11,12,13,18,11,15,20,22,28,10,9,10,12,15)
dataA=data.frame(genotype,env,outcome)

然后,我想在每个基因型的结果和env之间进行拟合,并且还想计算RMSE。所以,我使用了这个代码。

A=anova(lm(outcome~env,data=subset(dataA, genotype=="A")))

##
Response: outcome
          Df Sum Sq Mean Sq F value   Pr(>F)   
env        1   78.4  78.400      84 0.002746 **
Residuals  3    2.8   0.933   
##

A_rmse=sqrt(0.933)
A_rmse= 0.9659193

我需要以同样的方式计算B和直到E基因型,但在我的实际数据中,基因型超过100,所以不可能逐个计算。所以我想知道如何自动计算每个基因型的RMSE(=方差分析表中MSE的平方根)。
你能告诉我怎么做吗?
总是非常感谢,

a9wyjsp7

a9wyjsp71#

一种方法是按基因型使用nest数据。这允许您在列表列中创建包含anova结果的数据框。然后可以使用broom::tidy提取Mean sq值并计算RMSE。
这方面的基础是优秀的教程Running a model on separate groups
首先,如果需要,安装软件包,然后加载:

library(dplyr)
library(tidyr)
library(purrr)
library(broom)

下面是按基因型嵌套数据的样子:

dataA %>% 
  nest(data = -genotype)

# A tibble: 5 × 2
  genotype data            
  <chr>    <list>          
1 A        <tibble [5 × 2]>
2 B        <tibble [5 × 2]>
3 C        <tibble [5 × 2]>
4 D        <tibble [5 × 2]>
5 E        <tibble [5 × 2]>

现在我们可以使用mutatemap添加包含anova结果的列,以及tidy提取的指标:

dataA %>% 
  nest(data = -genotype) %>% 
  mutate(A = map(data, ~anova(lm(outcome~env, data = .))), 
         results = map(A, tidy))

# A tibble: 5 × 4
  genotype data             A               results         
  <chr>    <list>           <list>          <list>          
1 A        <tibble [5 × 2]> <anova [2 × 5]> <tibble [2 × 6]>
2 B        <tibble [5 × 2]> <anova [2 × 5]> <tibble [2 × 6]>
3 C        <tibble [5 × 2]> <anova [2 × 5]> <tibble [2 × 6]>
4 D        <tibble [5 × 2]> <anova [2 × 5]> <tibble [2 × 6]>
5 E        <tibble [5 × 2]> <anova [2 × 5]> <tibble [2 × 6]>

现在我们unnestresultsfilter残差项:

dataA %>% 
  nest(data = -genotype) %>% 
  mutate(A = map(data, ~anova(lm(outcome~env, data = .))), 
         results = map(A, tidy)) %>% 
  unnest(results) %>% 
  filter(term == "Residuals")

# A tibble: 5 × 9
  genotype data             A               term         df sumsq meansq statistic p.value
  <chr>    <list>           <list>          <chr>     <int> <dbl>  <dbl>     <dbl>   <dbl>
1 A        <tibble [5 × 2]> <anova [2 × 5]> Residuals     3  2.80  0.933        NA      NA
2 B        <tibble [5 × 2]> <anova [2 × 5]> Residuals     3 13.1   4.37         NA      NA
3 C        <tibble [5 × 2]> <anova [2 × 5]> Residuals     3  6.4   2.13         NA      NA
4 D        <tibble [5 × 2]> <anova [2 × 5]> Residuals     3  2.7   0.9          NA      NA
5 E        <tibble [5 × 2]> <anova [2 × 5]> Residuals     3  5.9   1.97         NA      NA

最后,使用sqrt(meansq)计算RMSE,然后选择所需的列。
整个过程看起来像这样:

dataA %>% 
  nest(data = -genotype) %>% 
  mutate(A = map(data, ~anova(lm(outcome~env, data = .))), 
         results = map(A, tidy)) %>% 
  unnest(results) %>% 
  filter(term == "Residuals") %>% 
  mutate(rmse = sqrt(meansq)) %>% 
  select(genotype, rmse)

# A tibble: 5 × 2
  genotype  rmse
  <chr>    <dbl>
1 A        0.966
2 B        2.09 
3 C        1.46 
4 D        0.949
5 E        1.40

相关问题