Reshaping a complex long-wide dataset into a long dataset in R

oalqel3c  posted on 2023-02-14  in  Other

I have a complex dataset, shown below:

df1 <- tibble::tribble(~"Canada > London",  ~"",    ~"Notes",   ~"United Kingdom > London", ~"",    ~"",
"Restaurant",   "Price",    "Range",    "Restaurant",   "Price",    "Range",
"Fried beef",   "27",   "25-30",    "Fried beef",   "29",   "25 - 35",
"Fried potato", "5",    "3 - 8",    "Fried potato", "8",    "3 - 8",
"Bar",  "Price",    "Range",    "Price",    "Range",    "",
"Beer Lager",   "5",    "4 - 8",    "Beer Lager",   "6",    "4 - 8",
"Beer Dark",    "4",    "3 - 7",    "Beer Dark",    "5",    "3 - 7")

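Or, as a visual representation (image not preserved):

It is long in its parameters (e.g. Beer Lager, Beer Dark, ...) and wide in its data entries (many wide elements, such as Canada > London or United Kingdom > London).
The desired output is two datasets, as follows:
1. The first dataset (values):

2. The second dataset (ranges):

Any suggestions would be greatly appreciated :)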

Answer 1 (44u64gxh)

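Your data is neither wide nor long. It is a messy table that needs some cleaning before it becomes tidy data. Once it is tidy, you can use tidyr::pivot_wider to get the tables you want: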

library(dplyr)
library(tidyr)
library(purrr)

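# Tidy one block of three columns (product, price, range) taken from the messy table
tidy_data <- function(.data, cols) {
  .data <- .data[cols]
  # the first column name of the block holds the place, e.g. "Canada > London"
  place <- names(.data)[[1]]
  
  .data |> 
    rename(product = 1, price = 2, range = 3) |> 
    # drop the embedded header rows (the "Restaurant" and "Bar" lines)
    filter(!price %in% c("Price", "Range")) |>
    mutate(place = place)
}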

df1_tidy <- purrr::map_dfr(list(1:3, 4:6), tidy_data, .data = df1)

df1_tidy |> 
  select(place, product, price) |> 
  pivot_wider(names_from = product, values_from = price)
#> # A tibble: 2 × 5
#>   place                   `Fried beef` `Fried potato` `Beer Lager` `Beer Dark`
#>   <chr>                   <chr>        <chr>          <chr>        <chr>      
#> 1 Canada > London         27           5              5            4          
#> 2 United Kingdom > London 29           8              6            5

df1_tidy |> 
  select(place, product, range) |> 
  pivot_wider(names_from = product, values_from = range, names_glue = "{product} Range")
#> # A tibble: 2 × 5
#>   place                   `Fried beef Range` Fried potato Rang…¹ Beer …² Beer …³
#>   <chr>                   <chr>              <chr>               <chr>   <chr>  
#> 1 Canada > London         25-30              3 - 8               4 - 8   3 - 7  
#> 2 United Kingdom > London 25 - 35            3 - 8               4 - 8   3 - 7  
#> # … with abbreviated variable names ¹​`Fried potato Range`, ²​`Beer Lager Range`,
#> #   ³​`Beer Dark Range`
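
Both tables above keep prices and ranges as character columns. If numeric values are needed, one possible follow-up is sketched below. It is only an illustration: the tidy data is retyped by hand so the block runs on its own, and the column names range_low and range_high are hypothetical, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Tidy data in the shape produced by tidy_data(), retyped here so the block is self-contained
df1_tidy <- tibble::tribble(
  ~product,       ~price, ~range,    ~place,
  "Fried beef",   "27",   "25-30",   "Canada > London",
  "Fried potato", "5",    "3 - 8",   "Canada > London",
  "Beer Lager",   "5",    "4 - 8",   "Canada > London",
  "Beer Dark",    "4",    "3 - 7",   "Canada > London",
  "Fried beef",   "29",   "25 - 35", "United Kingdom > London",
  "Fried potato", "8",    "3 - 8",   "United Kingdom > London",
  "Beer Lager",   "6",    "4 - 8",   "United Kingdom > London",
  "Beer Dark",    "5",    "3 - 7",   "United Kingdom > London"
)

df1_typed <- df1_tidy |>
  # prices arrive as text because they shared a column with header strings
  mutate(price = as.numeric(price)) |>
  # split "25-30" and "25 - 35" alike: a hyphen with optional surrounding spaces
  separate(range, into = c("range_low", "range_high"),
           sep = "\\s*-\\s*", convert = TRUE)

# Same pivot as above, now yielding numeric columns
df1_typed |>
  select(place, product, price) |>
  pivot_wider(names_from = product, values_from = price)
```

The regular expression in sep is what absorbs the inconsistent spacing in the range strings; convert = TRUE then turns the two new columns into numbers.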

Answer 2 (rsl1atfo)

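I agree with @stefan. You actually have 4 tables, or 2 tables, depending on how you look at it. Below is an implementation of 2 functions that start the cleaning and formatting process. The first function splits the dfs by row; the second splits them by column. After that, it is much easier to format, clean, and merge the dfs into one.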

library(tidyverse)

df0 = tibble::tribble(~"Canada > London",  ~"",    ~"Notes",   ~"United Kingdom > London", ~"",    ~"",
                       "Restaurant",   "Price",    "Range",    "Restaurant",   "Price",    "Range",
                       "Fried beef",   "27",   "25-30",    "Fried beef",   "29",   "25 - 35",
                       "Fried potato", "5",    "3 - 8",    "Fried potato", "8",    "3 - 8",
                       "Bar",  "Price",    "Range",    "Price",    "Range",    "",
                       "Beer Lager",   "5",    "4 - 8",    "Beer Lager",   "6",    "4 - 8",
                       "Beer Dark",    "4",    "3 - 7",    "Beer Dark",    "5",    "3 - 7")

split_rows = function(df){
  
  # row indices where each sub-df starts (its header row has "Price" in column 2)
  df_breaks = which(df[[2]] == "Price")
  
  # list to populate in loop with sub-dfs
  df_list = list()
  
  for(i in seq_along(df_breaks)){
    
    # get start of sub-df
    start = df_breaks[i]
    
    # get end of sub-df
    if(i == length(df_breaks)){
      end = nrow(df) # if it's the last one, set it to the last row of the original df
    }
    else{
      end = df_breaks[i+1]-1 # else, set it to the next start - 1
    }
    
    # subset df
    df_temp = df[start:end,]
    
    # first row as header
    colnames(df_temp) = as.character(unlist(df_temp[1,]))
    df_temp = df_temp[-1,]
    
    # append to df_list
    df_list = append(df_list,list(df_temp))
  }
  
  return(df_list)
}

split_cols = function(df_list,second_df_col_start = 4){
  df_list = lapply(df_list, function(df){
    df1 = df[,1:(second_df_col_start-1)]
    df2 = df[,second_df_col_start:ncol(df)]
    return(list(df1,df2))
    })
  
  return(df_list)
}

output = split_rows(df0) %>% 
  split_cols()

Output:

[[1]]
[[1]][[1]]
# A tibble: 2 × 3
  Restaurant   Price Range
  <chr>        <chr> <chr>
1 Fried beef   27    25-30
2 Fried potato 5     3 - 8

[[1]][[2]]
# A tibble: 2 × 3
  Restaurant   Price Range  
  <chr>        <chr> <chr>  
1 Fried beef   29    25 - 35
2 Fried potato 8     3 - 8  

[[2]]
[[2]][[1]]
# A tibble: 2 × 3
  Bar        Price Range
  <chr>      <chr> <chr>
1 Beer Lager 5     4 - 8
2 Beer Dark  4     3 - 7

[[2]][[2]]
# A tibble: 2 × 3
  Price      Range ``   
  <chr>      <chr> <chr>
1 Beer Lager 6     4 - 8
2 Beer Dark  5     3 - 7
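
The answer stops before the final merge. A minimal sketch of that step might look like the following. It assumes the nested structure printed above (section, then place) and rebuilds it by hand as plain data frames so the block runs on its own. The helper make_tbl, the places vector, and the output column names are all hypothetical and not part of the original answer.

```r
library(dplyr)

# Hypothetical helper: build a 3-column table and apply the header seen in the printed output
make_tbl <- function(header, items, prices, ranges) {
  tbl <- data.frame(items, prices, ranges)
  names(tbl) <- header
  tbl
}

# Stand-in for `output`, including the shifted header of [[2]][[2]]
output <- list(
  list(
    make_tbl(c("Restaurant", "Price", "Range"),
             c("Fried beef", "Fried potato"), c("27", "5"), c("25-30", "3 - 8")),
    make_tbl(c("Restaurant", "Price", "Range"),
             c("Fried beef", "Fried potato"), c("29", "8"), c("25 - 35", "3 - 8"))
  ),
  list(
    make_tbl(c("Bar", "Price", "Range"),
             c("Beer Lager", "Beer Dark"), c("5", "4"), c("4 - 8", "3 - 7")),
    make_tbl(c("Price", "Range", ""),
             c("Beer Lager", "Beer Dark"), c("6", "5"), c("4 - 8", "3 - 7"))
  )
)

# Place labels, assumed to come from columns 1 and 4 of the original header row
places <- c("Canada > London", "United Kingdom > London")

combined <- bind_rows(lapply(output, function(section) {
  # the section label ("Restaurant" or "Bar") is the first header of the first table
  section_name <- names(section[[1]])[1]
  bind_rows(Map(function(tbl, place_name) {
    # rename by position, because the headers of the last table are shifted
    names(tbl) <- c("product", "price", "range")
    mutate(tbl, category = section_name, place = place_name)
  }, section, places))
}))

combined
```

Renaming by position rather than by name is a deliberate choice here: the header of the last sub-table is shifted by one column in the source data, so its names cannot be trusted. The resulting long table can then be reshaped with pivot_wider as in the first answer.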
