Is there a way to plot a correlation heatmap between two data frames in R? The two data frames have different row names and unequal dimensions

jecbmhm3 posted on 2022-12-20 in Other

I have two different data frames, shown in the attached images Dataframe1 and Dataframe2.
Here is what I tried.

library(readxl)  # read_excel()
library(tibble)  # column_to_rownames()

#First dataframe
structure(list(Label = c("Gene 1", "Gene 2", "Gene 3", "Gene 4", 
"Gene 5", "Gene 6", "Gene 7", "Gene 8", "Gene 9", "Gene 10", 
"Gene 11", "Gene 12", "Gene 13", "Gene 14", "Gene 15", "Gene 16", 
"Gene 17", "Gene 18", "Gene 19", "Gene 20", "Gene 21", "Gene 22", 
"Gene 23", "Gene 24", "Gene 25", "Gene 26", "Gene 27", "Gene 28", 
"Gene 29", "Gene 30"), Count = c(1500, 1600, 1700, 1800, 1900, 
2000, 2100, 2200, 2300, 2400, 2500, 2600, 2700, 2800, 2900, 3000, 
3100, 3200, 3300, 3400, 3500, 3600, 3700, 3800, 3900, 4000, 4100, 
4200, 4300, 4400)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-30L))

df_1 <- read_excel("Demo_data.xlsx", sheet = "Dataframe1")
str(df_1)
View(df_1)

df_1.1 <- column_to_rownames(df_1, 'Label')
View(df_1.1)

df_1.2 <- t(df_1.1)
View(df_1.2)

df_1.2 <- as.data.frame(df_1.2)
str(df_1.2)

typeof(df_1.2)
str(df_1.2)

#Second dataframe
structure(list(Label = c("Control1", "Control2", "Control3", 
"Control4", "Control5", "Control6", "Control7", "Control8", "Control9", 
"Control10", "Control11", "Control12", "Control13", "Control14", 
"Control15", "Control16", "Control17", "Control18", "Control19", 
"Control20", "Control21", "Control22", "Control23", "Control24"
), Count = c(1800, 1400, 1110, 1900, 2500, 2900, 2100, 900, 5000, 
2300, 700, 1400, 3400, 2310, 3322, 2200, 4400, 2100, 1000, 6700, 
4300, 2120, 4800, 4300)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -24L))

df_2 <- read_excel("Demo_data.xlsx", sheet = "Dataframe2")

df_2.1 <- column_to_rownames(df_2, 'Label')
View(df_2.1)

df_2.1 <- t(df_2.1)
View(df_2.1)

df_2.1 <- as.data.frame(df_2.1)
str(df_2.1)

correlation <- cor(df_1.2, df_2.1)
View(correlation)

This is the output I want, but I get NA for every correlation. Any help is highly appreciated.
Desired output (without NA)

hgb9j2n6

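As noted in the comments, it is rather unclear what you are trying to achieve.
If you want to compute the correlation between the Count columns of the two data frames and visualize it with a scatter plot, you can use the following code: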

library(tidyverse)

df_1 <- structure(list(Label = c("Gene 1", "Gene 2", "Gene 3", "Gene 4", 
                                 "Gene 5", "Gene 6", "Gene 7", "Gene 8", "Gene 9", "Gene 10", 
                                 "Gene 11", "Gene 12", "Gene 13", "Gene 14", "Gene 15", "Gene 16", 
                                 "Gene 17", "Gene 18", "Gene 19", "Gene 20", "Gene 21", "Gene 22", 
                                 "Gene 23", "Gene 24", "Gene 25", "Gene 26", "Gene 27", "Gene 28", 
                                 "Gene 29", "Gene 30"), 
                       Count = c(1500, 1600, 1700, 1800, 1900, 2000, 2100, 2200, 2300, 2400, 2500, 
                                 2600, 2700, 2800, 2900, 3000, 3100, 3200, 3300, 3400, 3500, 3600, 
                                 3700, 3800, 3900, 4000, 4100, 4200, 4300, 4400)), 
                  class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -30L))

df_2 <- structure(list(Label = c("Control1", "Control2", "Control3", 
                                 "Control4", "Control5", "Control6", "Control7", "Control8", "Control9", 
                                 "Control10", "Control11", "Control12", "Control13", "Control14", 
                                 "Control15", "Control16", "Control17", "Control18", "Control19", 
                                 "Control20", "Control21", "Control22", "Control23", "Control24"), 
                       Count = c(1800, 1400, 1110, 1900, 2500, 2900, 2100, 900, 5000, 2300, 700, 1400, 
                                 3400, 2310, 3322, 2200, 4400, 2100, 1000, 6700, 4300, 2120, 4800, 4300)), 
                  class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -24L))

dat = left_join(
  df_1 %>% mutate(id=str_extract(Label, "\\d+")),
  df_2 %>% mutate(id=str_extract(Label, "\\d+")), 
  by="id", suffix=c("_gene", "_ctl")
)

dat
#> # A tibble: 30 x 5
#>    Label_gene Count_gene id    Label_ctl Count_ctl
#>    <chr>           <dbl> <chr> <chr>         <dbl>
#>  1 Gene 1           1500 1     Control1       1800
#>  2 Gene 2           1600 2     Control2       1400
#>  3 Gene 3           1700 3     Control3       1110
#>  4 Gene 4           1800 4     Control4       1900
#>  5 Gene 5           1900 5     Control5       2500
#>  6 Gene 6           2000 6     Control6       2900
#>  7 Gene 7           2100 7     Control7       2100
#>  8 Gene 8           2200 8     Control8        900
#>  9 Gene 9           2300 9     Control9       5000
#> 10 Gene 10          2400 10    Control10      2300
#> # ... with 20 more rows

cor(dat$Count_gene, dat$Count_ctl, use="pairwise.complete.obs")
#> [1] 0.5047392

ggplot(dat, aes(x=Count_gene, y=Count_ctl)) + 
  geom_point()
#> Warning: Removed 6 rows containing missing values (`geom_point()`).

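Created on 2022-12-12 with reprex v2.0.2
Basically, I extract the number in each label as id, then merge the two data frames with left_join().
This may look overly complicated, but keeping your data tidy in a single data frame is always a good idea.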
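To make the join step easier to follow, here is a minimal sketch on two tiny data frames. The data frames `genes_small` and `ctl_small` are hypothetical and only mimic the shape of the question's data. It shows that a gene without a matching control keeps its row and gets NA in the control columns.

```r
library(dplyr)
library(stringr)

# Hypothetical miniature versions of the two data frames (not the real data)
genes_small <- data.frame(Label = c("Gene 1", "Gene 2", "Gene 3"),
                          Count = c(1500, 1600, 1700))
ctl_small   <- data.frame(Label = c("Control1", "Control2"),
                          Count = c(1800, 1400))

# Pull the number out of each label and use it as the join key
joined <- left_join(
  genes_small %>% mutate(id = str_extract(Label, "\\d+")),
  ctl_small   %>% mutate(id = str_extract(Label, "\\d+")),
  by = "id", suffix = c("_gene", "_ctl")
)

# "Gene 3" has no control with id "3", so Label_ctl and Count_ctl are NA there
joined
```

Because `left_join()` keeps every row of its first argument, the longer data frame should come first if no gene is to be dropped.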
Note that in your example df_2 stops at id == 24, so the correlation is computed on only the 24 complete observations.
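The effect of the missing values can be checked on a small example. The vectors `x` and `y` below are made up for illustration. With the default settings `cor()` returns NA as soon as one value is missing, while `use = "pairwise.complete.obs"` silently restricts the calculation to the rows where both values are present.

```r
# Made-up vectors: the last two entries of y are missing, as after the join
x <- c(1500, 1600, 1700, 1800, 1900)
y <- c(1800, 1400, 1110, NA, NA)

# Default (use = "everything"): any NA makes the result NA
r_default <- cor(x, y)

# Only the complete pairs are used
r_pairwise <- cor(x, y, use = "pairwise.complete.obs")

# Same result when the incomplete rows are dropped by hand
ok <- complete.cases(x, y)
r_manual <- cor(x[ok], y[ok])

sum(ok)  # number of observations that actually enter the correlation
```

Checking `sum(ok)` is a cheap way to see how many observations a reported correlation really rests on.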
However, a correlation is computed between two vectors, so to get a heatmap you need a whole set of vectors, which you don't seem to have.
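This is also the likely reason the `cor()` call in the question returns only NA: after transposing, each gene and each control is a column with a single observation, and a correlation cannot be estimated from one point. The sketch below first reproduces that situation and then shows what a heatmap could look like if several samples per gene and per control were available. The multi-sample data are simulated, and the sample count of 10 is an assumption chosen for illustration, not something taken from the question.

```r
library(ggplot2)

# One observation per variable, as in the question: every correlation is NA
one_gene <- matrix(c(1500, 1600, 1700), nrow = 1,
                   dimnames = list("Count", c("Gene 1", "Gene 2", "Gene 3")))
one_ctl  <- matrix(c(1800, 1400), nrow = 1,
                   dimnames = list("Count", c("Control1", "Control2")))
na_result <- cor(one_gene, one_ctl)

# Simulated data (NOT from the question): 10 samples in rows, variables in columns
set.seed(1)
genes <- matrix(rnorm(10 * 5), nrow = 10,
                dimnames = list(NULL, paste("Gene", 1:5)))
ctl   <- matrix(rnorm(10 * 4), nrow = 10,
                dimnames = list(NULL, paste0("Control", 1:4)))

# Cross-correlation matrix: one row per gene, one column per control
cm <- cor(genes, ctl)

# Reshape to long format (columns Var1, Var2, r) and draw the heatmap
long <- as.data.frame(as.table(cm), responseName = "r")
p <- ggplot(long, aes(x = Var2, y = Var1, fill = r)) +
  geom_tile() +
  scale_fill_gradient2(limits = c(-1, 1)) +
  labs(x = NULL, y = NULL, fill = "Pearson r")
p
```

The two inputs may have different numbers of columns (here 5 genes and 4 controls), but they must share the same samples in the same row order.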
For your next question, it would be great if you used the reprex package, as I did in this answer.
