Converting a tree back to a data frame structure while keeping the original rows

Posted by vybvopom on 2023-06-03 in Other

Let's assume the following data:

x <- structure(list(parent = c("Acme Inc.", "Acme Inc.", "Acme Inc.", 
"Accounting", "Accounting", "Research", "Research", "IT", "IT", 
"IT"), child = c("Accounting", "Research", "IT", "New Software", 
"New Accounting Standards", "New Product Line", "New Labs", "Outsource", 
"Go agile", "Switch to R"), misc = c("a", "b", "c", "d", "e", 
"f", "g", "h", "i", "j")), row.names = c(NA, 10L), class = "data.frame")

       parent                    child misc
1   Acme Inc.               Accounting    a
2   Acme Inc.                 Research    b
3   Acme Inc.                       IT    c
4  Accounting             New Software    d
5  Accounting New Accounting Standards    e
6    Research         New Product Line    f
7    Research                 New Labs    g
8          IT                Outsource    h
9          IT                 Go agile    i
10         IT              Switch to R    j

I can now convert this into a tree structure with the data.tree package.

my_tree <- data.tree::FromDataFrameNetwork(x)

What I actually want is the level information, more or less in wide format, which in theory I can get with

my_data <- data.tree::ToDataFrameTypeCol(my_tree)

which gives:

    level_1    level_2                  level_3
1 Acme Inc. Accounting             New Software
2 Acme Inc. Accounting New Accounting Standards
3 Acme Inc.   Research         New Product Line
4 Acme Inc.   Research                 New Labs
5 Acme Inc.         IT                Outsource
6 Acme Inc.         IT                 Go agile
7 Acme Inc.         IT              Switch to R

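However, as you can see, this output has fewer rows than the original data (7 rows instead of 10). If I read it correctly, this is because the function only returns the leaves. What I want instead is to enrich every row of the original data frame with the full level information for that particular child. For example, we know that "Accounting" is at level 2, so I would like to add that information to the original data as new columns.
The expected result is: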

       parent                    child misc   level_1    level_2                  level_3
1   Acme Inc.               Accounting    a Acme Inc. Accounting                       NA
2   Acme Inc.                 Research    b Acme Inc.   Research                       NA
3   Acme Inc.                       IT    c Acme Inc.         IT                       NA
4  Accounting             New Software    d Acme Inc. Accounting             New Software
5  Accounting New Accounting Standards    e Acme Inc. Accounting New Accounting Standards
6    Research         New Product Line    f Acme Inc.   Research         New Product Line
7    Research                 New Labs    g Acme Inc.   Research                 New Labs
8          IT                Outsource    h Acme Inc.         IT                Outsource
9          IT                 Go agile    i Acme Inc.         IT                 Go agile
10         IT              Switch to R    j Acme Inc.         IT              Switch to R

I'm stuck here and don't know how to proceed. Any ideas?

Answer 1 (jhkqcmku)

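This may not be the most elegant solution, but it does seem to work. First, join the two data frames on the keys parent == level_2 and child == level_3. Next, join the result with the tree data again, this time on the keys parent == level_1 and child == level_2. This second join picks up the rows that were left unmatched by the first one. You can then coalesce the different level_1 and level_2 variables to consolidate the information from both joins. Finally, distinct() removes the duplicates created during the joins.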

library(dplyr)
library(data.tree)
  
x <- structure(list(parent = c("Acme Inc.", "Acme Inc.", "Acme Inc.", 
                               "Accounting", "Accounting", "Research", "Research", "IT", "IT", 
                               "IT"), child = c("Accounting", "Research", "IT", "New Software", 
                                                "New Accounting Standards", "New Product Line", "New Labs", "Outsource", 
                                                "Go agile", "Switch to R"), misc = c("a", "b", "c", "d", "e", 
                                                                                     "f", "g", "h", "i", "j")), row.names = c(NA, 10L), class = "data.frame")

my_tree <- data.tree::FromDataFrameNetwork(x)
my_data <- data.tree::ToDataFrameTypeCol(my_tree)

left_join(x, my_data, join_by(parent==level_2, child==level_3), keep=TRUE) %>% 
  left_join(my_data %>% select(level_1, level_2), join_by(parent==level_1, child==level_2), keep=TRUE) %>%
  mutate(level_1 = coalesce(level_1.x, level_1.y), 
         level_2 = coalesce(level_2.x, level_2.y)) %>% 
  select(parent:misc, level_1, level_2, level_3) %>% 
  distinct()
#>        parent                    child misc   level_1    level_2
#> 1   Acme Inc.               Accounting    a Acme Inc. Accounting
#> 2   Acme Inc.                 Research    b Acme Inc.   Research
#> 3   Acme Inc.                       IT    c Acme Inc.         IT
#> 4  Accounting             New Software    d Acme Inc. Accounting
#> 5  Accounting New Accounting Standards    e Acme Inc. Accounting
#> 6    Research         New Product Line    f Acme Inc.   Research
#> 7    Research                 New Labs    g Acme Inc.   Research
#> 8          IT                Outsource    h Acme Inc.         IT
#> 9          IT                 Go agile    i Acme Inc.         IT
#> 10         IT              Switch to R    j Acme Inc.         IT
#>                     level_3
#> 1                      <NA>
#> 2                      <NA>
#> 3                      <NA>
#> 4              New Software
#> 5  New Accounting Standards
#> 6          New Product Line
#> 7                  New Labs
#> 8                 Outsource
#> 9                  Go agile
#> 10              Switch to R

Created on 2023-06-01 with reprex v2.0.2
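As a side note, if only the numeric depth of each child is needed (for example, that "Accounting" sits at level 2), it can be read from the tree itself. The following is a minimal sketch, assuming the data.tree package is installed and that node names are unique; the column name level is an illustrative choice, not part of the original question.

```r
library(data.tree)

# Example data from the question
x <- data.frame(
  parent = c("Acme Inc.", "Acme Inc.", "Acme Inc.", "Accounting", "Accounting",
             "Research", "Research", "IT", "IT", "IT"),
  child  = c("Accounting", "Research", "IT", "New Software",
             "New Accounting Standards", "New Product Line", "New Labs",
             "Outsource", "Go agile", "Switch to R"),
  misc   = letters[1:10]
)

my_tree <- FromDataFrameNetwork(x)

# Get() collects one value per node, named by node name (the root is level 1)
node_level <- my_tree$Get("level")

# Look up each row's child by name; this assumes node names are unique
x$level <- unname(node_level[x$child])
```

This only adds the depth, not the full chain of ancestors, so it complements the joins above rather than replacing them.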

Answer 2 (krcsximq)

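I decided to go with a solution that bypasses the tree structure entirely (in my real use case this also improved performance by a factor of roughly 10,000).
What I do is essentially create a loop in which, on each iteration, I match the parents to their children and then move up one level, until there are no more parents.
Then I tidy the data a bit so that it has the structure and order I want.
Using the example data from my original question, I do this: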

library(tidyverse)    
still_open <- nrow(x)
i <- 2
x2 <- x |> 
  mutate(level_1 = child)

while (still_open != 0)
{
  x2 <- x2 |> 
    mutate("level_{i}" := parent[match(.data[[paste0("level_", i - 1)]], child)], .after = all_of(paste0("level_", i - 1)))
  
  still_open <- x2 |> 
    pull(paste0("level_", i)) |> 
    na.omit() |> 
    length()
  
  i <- i + 1
}

x2 <- x2 |> 
  pivot_longer(cols = starts_with("level_")) |> 
  filter(!is.na(value)) |> 
  mutate(value = rev(value), .by = child) |> 
  pivot_wider(names_from  = name,
              values_from = value)

which gives:

# A tibble: 10 × 6
   parent     child                    misc  level_1   level_2    level_3                 
   <chr>      <chr>                    <chr> <chr>     <chr>      <chr>                   
 1 Acme Inc.  Accounting               a     Acme Inc. Accounting NA                      
 2 Acme Inc.  Research                 b     Acme Inc. Research   NA                      
 3 Acme Inc.  IT                       c     Acme Inc. IT         NA                      
 4 Accounting New Software             d     Acme Inc. Accounting New Software            
 5 Accounting New Accounting Standards e     Acme Inc. Accounting New Accounting Standards
 6 Research   New Product Line         f     Acme Inc. Research   New Product Line        
 7 Research   New Labs                 g     Acme Inc. Research   New Labs                
 8 IT         Outsource                h     Acme Inc. IT         Outsource               
 9 IT         Go agile                 i     Acme Inc. IT         Go agile                
10 IT         Switch to R              j     Acme Inc. IT         Switch to R
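
The same walk up the hierarchy can also be sketched in base R, without the reshaping step. This is only an illustration under the assumptions that every child has exactly one parent and that the data contain no cycles; path_to_root() is a hypothetical helper introduced here, not a function from any package.

```r
# Example data from the question
x <- data.frame(
  parent = c("Acme Inc.", "Acme Inc.", "Acme Inc.", "Accounting", "Accounting",
             "Research", "Research", "IT", "IT", "IT"),
  child  = c("Accounting", "Research", "IT", "New Software",
             "New Accounting Standards", "New Product Line", "New Labs",
             "Outsource", "Go agile", "Switch to R"),
  misc   = letters[1:10]
)

# Hypothetical helper: follow parent links upward from `node` and return
# the path ordered from the root down to `node`.
# Assumes one parent per child and no cycles (a cycle would loop forever).
path_to_root <- function(node, parent, child) {
  path <- node
  repeat {
    p <- parent[match(path[1], child)]  # parent of the current top of the path
    if (is.na(p)) break                 # no parent found: the root is reached
    path <- c(p, path)
  }
  path
}

paths <- lapply(x$child, path_to_root, parent = x$parent, child = x$child)
depth <- max(lengths(paths))

# Pad shorter paths with NA so that every row gets `depth` level columns
levels_mat <- t(vapply(
  paths,
  function(p) c(p, rep(NA_character_, depth - length(p))),
  character(depth)
))
colnames(levels_mat) <- paste0("level_", seq_len(depth))

result <- cbind(x, as.data.frame(levels_mat))
```

Because each path is built root-first, no reversal is needed afterwards, and result keeps the original row order and the misc column.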
