Average two dataframes together based on a shared prefix for multiple dataframes in a list

Posted by 7fyelxc5 on 2023-04-27 in Other


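I have a list of dataframes in R. Within this list, every dataframe is part of a "pair". You can tell which two dataframes go together because their names share the same prefix (e.g., "001") but have different suffixes. Apart from the identifier variable (idvar), all data within the dataframes are numeric (col1, col2, col3).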

set.seed(12345)
library(dplyr)

# randomly generate a list of dataframes
# that looks like my real life data
id_vars <- gl(n = 30, k = 2) %>% sprintf("%03d", .)
mystrings <- rep.int(x = c("_AB", "_CD", "_EF", "_GH", "_IJ"), 60/5)
df_names <- paste0(id_vars, mystrings)

the_problem <- replicate(60, {
  data.frame(col1 = as.numeric(sample(1:5, 10, replace = TRUE)),
             col2 = as.numeric(sample(1:5, 10, replace = TRUE)),
             col3 = as.numeric(sample(1:5, 10, replace = TRUE))
             )
}, simplify = FALSE)

names(the_problem) <- df_names

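I want to take the arithmetic mean of each pair of dataframes. Once two dataframes have been averaged together, the information in the suffix no longer matters, so I don't want it in the final product. I'd also like the final product to be a list of dataframes rather than one big dataframe.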

# how I'd like the final product to be structured
the_solution <- replicate(30, {
  data.frame(col1 = as.numeric(sample(1:5, 10, replace = TRUE)),
             col2 = as.numeric(sample(1:5, 10, replace = TRUE)),
             col3 = as.numeric(sample(1:5, 10, replace = TRUE))
  )
}, simplify = FALSE)

new_id <- gl(n = 30, k = 1) %>% sprintf("%03d", .)
the_solution <- mapply(cbind, the_solution, "idvar" = new_id, SIMPLIFY = FALSE)
names(the_solution) <- new_id

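I found this prior SO question, which deals with a similar problem involving dataframes that share a prefix, but its solution does not scale well when I have 30+ unique prefixes. Is there a way to do this without spelling out every unique prefix?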

Source: https://stackoverflow.com/questions/76094962/average-two-dataframes-together-based-on-a-shared-prefix-for-multiple-dataframes

4 Answers


Answer #1 by z5btuh9x

You can group the list of dataframes by prefix and use reduce to compute the mean:

library(tidyverse)

name <- names(the_problem)
set_names(unique(str_extract(name, "^\\d+"))) |> 
map(~ the_problem[startsWith(name, .x)]) |> 
  imap(~ mutate(reduce(.x, `+`) / length(.x), idvars = .y))

Output

$`001`
   col1 col2 col3 idvars
1   3.0  1.0  2.5    001
2   1.5  3.0  3.0    001
3   2.5  3.5  2.5    001
4   3.5  1.5  4.5    001
5   3.5  3.5  2.5    001
6   2.5  3.5  2.5    001
7   2.5  1.5  3.0    001
8   3.0  4.5  4.0    001
9   3.5  3.5  2.5    001
10  2.0  3.5  4.0    001

$`002`
   col1 col2 col3 idvars
1   3.5  4.0  2.0    002
2   3.5  4.0  2.5    002
3   3.5  3.0  4.0    002
4   2.5  4.0  3.0    002
5   3.0  3.5  3.0    002
6   1.5  2.5  3.5    002
7   3.0  3.5  2.5    002
8   3.5  3.5  3.5    002
9   1.5  3.0  3.5    002
10  1.5  3.5  4.5    002
.
.
.

Alternatively, you can collapse everything into one large dataframe, perform the operation, and then split it back up:

library(dplyr)
library(tidyr)

the_problem |> 
  bind_rows(.id = "x") |>
  mutate(rn = row_number(), .by = x) |>
  separate(x, into = c("idvars", "suffix")) |> 
  summarize(across(starts_with("col"), mean),
            .by = c(idvars, rn)) |> 
  select(-rn) |> 
  group_split(idvars)  # split on the idvars column created by separate()
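
The same collapse-then-split idea can also be sketched in base R. This is only an illustration: the small list `toy` and its values are hypothetical stand-ins for `the_problem`, and `aggregate()` is swapped in for the dplyr `summarize()` step.

```r
# Hypothetical toy input: two pairs sharing the prefixes "001" and "002".
toy <- list(
  "001_AB" = data.frame(col1 = c(1, 2), col2 = c(3, 4)),
  "001_CD" = data.frame(col1 = c(3, 4), col2 = c(5, 6)),
  "002_EF" = data.frame(col1 = c(0, 0), col2 = c(2, 2)),
  "002_GH" = data.frame(col1 = c(2, 4), col2 = c(4, 6))
)

# Stack everything, recording the prefix and the within-frame row number.
big <- do.call(rbind, Map(function(d, nm) {
  cbind(idvars = sub("_.*", "", nm), rn = seq_len(nrow(d)), d)
}, toy, names(toy)))

# Average each column within (idvars, rn), then split back into a list.
avg <- aggregate(cbind(col1, col2) ~ idvars + rn, data = big, FUN = mean)
avg <- avg[order(avg$idvars, avg$rn), ]
out <- split(avg[c("col1", "col2", "idvars")], avg$idvars)

out[["001"]]
```

Keeping the row number `rn` as a grouping variable is what makes this an element-wise average of the paired frames rather than a single mean per column.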

Answered 2023-04-27

Answer #2 by cig3rfwq

Something like this should work:

prefixes = unique(id_vars)
result = lapply(prefixes, \(prefix) {
  i = grep(prefix, names(the_problem))
  Reduce(f = "+", the_problem[i]) / length(i)
  })

names(result) = prefixes
result[["001"]]
#    col1 col2 col3
# 1   3.0  5.0  4.0
# 2   3.0  2.0  2.5
# 3   3.0  2.5  2.0
# 4   5.0  2.5  3.0
# 5   3.0  2.5  1.5
# 6   2.0  2.0  2.5
# 7   2.0  5.0  3.0
# 8   3.5  1.0  5.0
# 9   4.0  3.0  4.0
# 10  2.5  3.0  2.0
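
One caveat, offered as a sketch rather than a correction: `grep()` treats the prefix as an unanchored regular expression, so it matches anywhere in the name. With the question's names (three digits followed by a suffix) that is harmless, but with other naming schemes it could pull in extra dataframes. The names below are hypothetical and only illustrate the difference from an exact prefix match with `startsWith()`.

```r
# Hypothetical names, not from the question: "01" is a substring of "101_AB".
nms <- c("01_AB", "01_CD", "101_AB", "101_CD")

# Unanchored regex match: also picks up the "101" entries.
loose <- grep("01", nms)

# Exact match on "<prefix>_" at the start of the name.
strict <- which(startsWith(nms, paste0("01", "_")))

print(loose)   # 1 2 3 4
print(strict)  # 1 2
```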

Answered 2023-04-27

Answer #3 by k97glaaz

Define a Mean function and a prefix vector, then use tapply. No packages are used.

Mean <- function(x) Reduce(`+`, x) / length(x)
prefix <- sub("_.*", "", names(the_problem))
tapply(the_problem, prefix, Mean, simplify = FALSE)

If we know there are always exactly 2 objects to average, we can define Mean as either of the following:

Mean <- function(x) (x[[1]] + x[[2]]) / 2

Mean <- function(x) do.call(`+`, x) / 2
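
The question also asks for an identifier column in each result. A minimal sketch of one way to add it, assuming the column should be named `idvar` as in the asker's `the_solution`. The list `toy` is a hypothetical stand-in for `the_problem`, and `split()` plus `lapply()` are used here in place of `tapply()`.

```r
# Hypothetical toy input: two pairs sharing the prefixes "001" and "002".
toy <- list(
  "001_AB" = data.frame(col1 = c(1, 2), col2 = c(3, 4)),
  "001_CD" = data.frame(col1 = c(3, 4), col2 = c(5, 6)),
  "002_EF" = data.frame(col1 = c(0, 0), col2 = c(2, 2)),
  "002_GH" = data.frame(col1 = c(2, 4), col2 = c(4, 6))
)

Mean <- function(x) Reduce(`+`, x) / length(x)
prefix <- sub("_.*", "", names(toy))

# split() groups the list by prefix; lapply() averages each group.
avg <- lapply(split(toy, prefix), Mean)

# Attach each group's prefix as an idvar column (name assumed from the question).
avg <- Map(function(d, id) cbind(d, idvar = id), avg, names(avg))

avg[["001"]]
```

The identifier is added after averaging because the `+` in `Mean` only makes sense for numeric columns.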

Answered 2023-04-27

Answer #4 by rggaifut

A data.table solution:

library(data.table)

interim <- rbindlist(lapply(the_problem, setDT, keep.rownames = TRUE), idcol = "idvar")
interim[, cohort := sub('(.*?)_.*', '\\1', idvar)]
avg_cols <- c("col1","col2","col3")

avg_dt <- interim[,lapply(.SD, mean), .SDcols = avg_cols, by = .(cohort, rn)]
split(avg_dt, avg_dt$cohort)

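This extends to more averaged columns (by changing avg_cols), different functions (by changing the function in lapply(.SD)), and different ID structures (by modifying the regular expression).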
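
To illustrate the last point, here is a sketch of how the prefix-extraction pattern might be adapted to another naming scheme. The second set of names is hypothetical and not taken from the question.

```r
# Names as in the question: prefix and suffix separated by "_".
p1 <- sub("(.*?)_.*", "\\1", c("001_AB", "001_CD"))

# Hypothetical scheme: the separator is "-" and the prefix itself contains "_".
p2 <- sub("(.*?)-.*", "\\1", c("grp_001-AB", "grp_001-CD"))

print(p1)  # "001" "001"
print(p2)  # "grp_001" "grp_001"
```

Only the separator in the pattern changes; the rest of the grouping code can stay the same.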

Answered 2023-04-27
