Spread rows into a single row per author ID in R

Posted by ljsrvy3e on 2023-07-31 in Other

I have a large dataset in an Excel worksheet, for example:

name    author_id   rating
Name1   Id 1        0.72
Name2   Id 2        0.9
Name3   Id 1        0.74
Name4   Id 1        0.76
Name5   Id 2        0.88
Name1   Id 3        0.8
Name3   Id 2        0.86
Name6   Id 3        0.84
Name3   Id 3        0.84
Name4   Id 4        0.78
Name1   Id 4        0.84
Name6   Id 5        0.84
Etc.    …             …

I want to group the rows by author_id and save the output to a text file in the following format:

Name1   0.72  Name3  0.74  Name4  0.76
Name2   0.9   Name5  0.88  Name3  0.86
Name1   0.8   Name6  0.84  Name3  0.84
Name4   0.78  Name1  0.84
Name6   0.84


I tried the code below, but it does not work.

library(readxl)
library(dplyr)
library(openxlsx)

df = read_excel("/Users/admin/Documents/restaurants.xlsx")

df_grp_author_id = df %>%
  group_by(author_id)  %>%
  summarise(name = paste(name), rating, collapse = ', ', rating = paste(rating))
     
write.table(df_grp_author_id , file = "transactionsBy_author_id.txt", sep = "\t", row.names = FALSE)

Answer #1 (xwbd5t1u)

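You need to create a row-number sequence within each author_id and then pivot the data to wide format. Note that the `.by` argument requires dplyr 1.1.0 or later, and `names_vary` requires tidyr 1.2.0 or later.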

library(dplyr)
library(tidyr)

df %>%
  mutate(n = row_number(), .by = author_id) %>%
  pivot_wider(id_cols = author_id, names_from = n, names_vary = "slowest",
              values_from = c(name, rating))

# # A tibble: 5 × 7
#   author_id name_1 rating_1 name_2 rating_2 name_3 rating_3
#   <chr>     <chr>     <dbl> <chr>     <dbl> <chr>     <dbl>
# 1 Id 1      Name1      0.72 Name3      0.74 Name4      0.76
# 2 Id 2      Name2      0.9  Name5      0.88 Name3      0.86
# 3 Id 3      Name1      0.8  Name6      0.84 Name3      0.84
# 4 Id 4      Name4      0.78 Name1      0.84 NA        NA   
# 5 Id 5      Name6      0.84 NA        NA    NA        NA


Data

df <- read.table(text =
"name    author_id   rating
Name1   'Id 1'        0.72
Name2   'Id 2'        0.9
Name3   'Id 1'        0.74
Name4   'Id 1'        0.76
Name5   'Id 2'        0.88
Name1   'Id 3'        0.8
Name3   'Id 2'        0.86
Name6   'Id 3'        0.84
Name3   'Id 3'        0.84
Name4   'Id 4'        0.78
Name1   'Id 4'        0.84
Name6   'Id 5'        0.84", header = TRUE)
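
The pivot above still has to be written to the text file the question asks for. A minimal sketch of that step, assuming a tab-separated file with no header, no author_id column, and empty fields instead of NA is what is wanted (the `out_file` location is hypothetical, chosen only for illustration):

```r
library(dplyr)
library(tidyr)

# Sample data from the question, with author_id stored as a string
df <- data.frame(
  name = c("Name1", "Name2", "Name3", "Name4", "Name5", "Name1",
           "Name3", "Name6", "Name3", "Name4", "Name1", "Name6"),
  author_id = paste("Id", c(1, 2, 1, 1, 2, 3, 2, 3, 3, 4, 4, 5)),
  rating = c(0.72, 0.9, 0.74, 0.76, 0.88, 0.8,
             0.86, 0.84, 0.84, 0.78, 0.84, 0.84)
)

# Number the rows within each author, then pivot to name/rating pairs
wide <- df %>%
  mutate(n = row_number(), .by = author_id) %>%
  pivot_wider(id_cols = author_id, names_from = n, names_vary = "slowest",
              values_from = c(name, rating))

# Hypothetical output location; replace with your own path
out_file <- file.path(tempdir(), "transactionsBy_author_id.txt")

# Drop the id column and write NA as an empty field, without header or quotes
write.table(select(wide, -author_id), file = out_file, sep = "\t",
            na = "", quote = FALSE, row.names = FALSE, col.names = FALSE)
```

Because every row of the wide table has the same number of columns, shorter groups end with empty tab-separated fields in the file.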

Answer #2 (cmssoen2)

This approach converts the data to wide format. The result has one column per name and one row per author_id.

library(tidyr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df <- structure(list(name = c("Name1", "Name2", "Name3", "Name4", "Name5", 
                              "Name1", "Name3", "Name6", "Name3", "Name4", "Name1", "Name6"
), author_id = c(1, 2, 1, 1, 2, 3, 2, 3, 3, 4, 4, 5), rating = c(0.72, 
                                                                 0.9, 0.74, 0.76, 0.88, 0.8, 0.86, 0.84, 0.84, 0.78, 0.84, 0.84
)), row.names = c(NA, -12L), class = c("tbl_df", "tbl", "data.frame"
))

df %>%
  pivot_wider(names_from = name, values_from = rating)
#> # A tibble: 5 × 7
#>   author_id Name1 Name2 Name3 Name4 Name5 Name6
#>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1         1  0.72  NA    0.74  0.76 NA    NA   
#> 2         2 NA      0.9  0.86 NA     0.88 NA   
#> 3         3  0.8   NA    0.84 NA    NA     0.84
#> 4         4  0.84  NA   NA     0.78 NA    NA   
#> 5         5 NA     NA   NA    NA    NA     0.84

Created on 2023-07-20 with reprex v2.0.2
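
The table above has one column per name, which is not the same as the name/rating pairs per line shown in the question. If the goal is only the text file, one possible alternative is to repair the original `summarise()` attempt: there, `collapse` was passed to `summarise()` instead of `paste()`, so nothing was actually collapsed. A rough sketch, assuming tab-separated output and dplyr 1.1.0 or later for `.by` (the `out_file` location is hypothetical):

```r
library(dplyr)

# Sample data from this answer (numeric author_id)
df <- data.frame(
  name = c("Name1", "Name2", "Name3", "Name4", "Name5", "Name1",
           "Name3", "Name6", "Name3", "Name4", "Name1", "Name6"),
  author_id = c(1, 2, 1, 1, 2, 3, 2, 3, 3, 4, 4, 5),
  rating = c(0.72, 0.9, 0.74, 0.76, 0.88, 0.8,
             0.86, 0.84, 0.84, 0.78, 0.84, 0.84)
)

# Build one string per author: name and rating pairs joined by tabs.
# `collapse` belongs inside paste(), not summarise().
lines_by_author <- df %>%
  summarise(line = paste(name, rating, sep = "\t", collapse = "\t"),
            .by = author_id)

# Hypothetical output location; replace with your own path
out_file <- file.path(tempdir(), "transactionsBy_author_id.txt")

# One line per author, with no padding for shorter groups
writeLines(lines_by_author$line, out_file)
```

Unlike a pivot, this writes lines of different lengths, so authors with fewer entries get no trailing empty fields.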
