Condensing large data in R to a csv without using NULLs or a LIST

am46iovg  posted on 2023-04-03  in Other

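First post:
I am preparing data for read.transactions() from the arules package and need to condense unique invoice data (500k+ rows) so that each unique invoice and its associated information fits on a single line, like this:
Invoice001,CustomerID,Country,StockCodeXYZ,StockCode123
Invoice002...etc
However, the data as read repeats the invoice once per StockCode, like this:
Invoice001,CustomerID,Country,StockCodeXYZ
Invoice001,CustomerID,Country,StockCode123
Invoice002...etc
I have been trying pivot_wider() followed by unite(), but that generates 285M+ mostly empty cells in a LIST, which I am struggling to resolve and cannot write to csv or read into arules. I have also tried keep(~!is.null(.)), discard(is.null), and compact() without success. I am open to any method that achieves the expected result above.
I also feel I should be able to solve this with the built-in read.transactions() function from arules, but I get various errors when I try different things there as well.
The data comes from the University of California, Irvine and can be found here: https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx
Any help would be greatly appreciated.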

library(readxl)
library(arules)  # provides read.transactions()
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"
destfile <- "Online_20Retail.xlsx"
curl::curl_download(url, destfile)
Online_20Retail <- read_excel(destfile)

trans <- read.transactions(????????????)

zkure5ic1#

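A single invoice, "573585", has more than 1,000 items, so a wide layout needs a corresponding number of columns. Even if you keep only the stock codes from the invoice items, you still end up with a little over 1,000 columns.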

library(dplyr)

Online_20Retail %>% 
    dplyr::transmute(new = paste0(InvoiceNo, ", ", 
                                  CustomerID, ", ", 
                                  Country, ", "), 
                     StockCode) %>% 
    dplyr::group_by(new) %>% 
    dplyr::summarise(output = paste(StockCode, collapse = ", ")) %>%
    dplyr::transmute(mystring = paste0(new, output)) 
    # to get a character vector instead of a tibble/data frame, append "%>% dplyr::pull(mystring)" to the line above

# A tibble: 25,900 x 1
   mystring                                                                                                                                         
   <chr>                                                                                                                                            
 1 536365, 17850, United Kingdom, 85123A, 71053, 84406B, 84029G, 84029E, 22752, 21730                                                               
 2 536366, 17850, United Kingdom, 22633, 22632                                                                                                      
 3 536367, 13047, United Kingdom, 84879, 22745, 22748, 22749, 22310, 84969, 22623, 22622, 21754, 21755, 21777, 48187                                
 4 536368, 13047, United Kingdom, 22960, 22913, 22912, 22914                                                                                        
 5 536369, 13047, United Kingdom, 21756                                                                                                             
 6 536370, 12583, France, 22728, 22727, 22726, 21724, 21883, 10002, 21791, 21035, 22326, 22629, 22659, 22631, 22661, 21731, 22900, 21913, 22540, 22~
 7 536371, 13748, United Kingdom, 22086                                                                                                             
 8 536372, 17850, United Kingdom, 22632, 22633                                                                                                      
 9 536373, 17850, United Kingdom, 85123A, 71053, 84406B, 20679, 37370, 21871, 21071, 21068, 82483, 82486, 82482, 82494L, 84029G, 84029E, 22752, 217~
10 536374, 15100, United Kingdom, 21258                                                                                                             
# ... with 25,890 more rows
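
Since the goal is a transactions object for arules, it may be possible to skip the wide csv entirely. The following is a minimal sketch, not part of the original answer, using a small made-up data frame (`retail`, with hypothetical values but the same `InvoiceNo` and `StockCode` column names as the data set). It shows two routes that arules supports: coercing a list of items per invoice to transactions, and reading a long-format file with `format = "single"`. It assumes a reasonably recent arules version in which `read.transactions()` accepts the `header` argument, and it leaves out `CustomerID` and `Country`, which would otherwise be treated as items.

```r
library(arules)

# Toy stand-in for the Online Retail data (hypothetical values,
# same column names as the real file)
retail <- data.frame(
  InvoiceNo = c("536365", "536365", "536366", "536366", "536367"),
  StockCode = c("85123A", "71053", "22633", "22632", "84879"),
  stringsAsFactors = FALSE
)

# Route 1: one character vector of stock codes per invoice,
# de-duplicated, then coerced to a transactions object
items_by_invoice <- lapply(split(retail$StockCode, retail$InvoiceNo), unique)
trans <- as(items_by_invoice, "transactions")

# Route 2: write the long (one row per invoice/item) table to csv and
# let read.transactions() group it; cols names the transaction-id and
# item columns, in that order
tmp <- tempfile(fileext = ".csv")
write.csv(retail, tmp, row.names = FALSE)
trans2 <- read.transactions(tmp,
                            format = "single",
                            header = TRUE,
                            sep = ",",
                            cols = c("InvoiceNo", "StockCode"))

summary(trans)
```

With the real data, `retail` would be replaced by `Online_20Retail`. If the one-line-per-invoice csv is still wanted, the character vector produced by `dplyr::pull(mystring)` in the answer above could be written with `writeLines()`.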
