Compress large data in R into a CSV without NULLs or a LIST

Posted by am46iovg on 2023-04-03 in Other


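Original post:
I am preparing data for `read.transactions()` from the arules package and need to compress the invoice data (500k+ rows) so that each unique invoice and its associated information fits on a single line, like this:
Invoice001,CustomerID,Country,StockCodeXYZ,StockCode123
Invoice002...etc
However, the data as read repeats the invoice once for each StockCode, like this:
Invoice001,CustomerID,Country,StockCodeXYZ
Invoice001,CustomerID,Country,StockCode123
Invoice002...etc
I have been trying pivot_wider() followed by unite(), but that generates 285M+ mostly empty cells in a LIST, which I am struggling to resolve and cannot write to csv or read into arules. I have also tried keep(~!is.null(.)), discard(is.null), and compact() without success. I am open to any approach that produces the expected result above.
That said, I feel I should be able to solve this with the built-in read.transactions() function in arules, but I also get various errors when I try different things there.
The data comes from UC Irvine and can be found here: https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx
Any help would be greatly appreciated.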

library(readxl)
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"
destfile <- "Online_20Retail.xlsx"
curl::curl_download(url, destfile)
Online_20Retail <- read_excel(destfile)

trans <- read.transactions(????????????)

r

Source: https://stackoverflow.com/questions/72619764/compress-large-data-in-r-into-csv-without-nulls-or-list

1 Answer


zkure5ic1#

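A single invoice, "573585", has more than 1,000 items, so a one-row-per-invoice layout produces a corresponding number of columns. Even if you keep only the stock codes from the invoice items, you still end up with a little over 1,000 columns.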
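
To see where a figure like that comes from, here is a minimal sketch that counts line items per invoice. `toy` and its values are made up as a stand-in for `Online_20Retail`; only the column names come from the question. It also assumes a wide layout with one positional item column per line item, which may not be exactly what `pivot_wider()` produced in the question.

```r
# Toy stand-in for Online_20Retail (hypothetical values, real column names)
toy <- data.frame(
  InvoiceNo = c("A1", "A1", "A1", "A2", "A3", "A3"),
  StockCode = c("s1", "s2", "s3", "s1", "s2", "s4"),
  stringsAsFactors = FALSE
)

# Number of line items on each invoice
items_per_invoice <- table(toy$InvoiceNo)

# A one-row-per-invoice table needs as many item columns as the largest invoice has items
n_item_cols <- max(items_per_invoice)

# Total item cells in that wide table, and how many of them would be empty
n_cells <- length(items_per_invoice) * n_item_cols
n_empty <- n_cells - nrow(toy)
```

On the real data the same three lines can be run against `Online_20Retail$InvoiceNo`. The larger the gap between the biggest invoice and a typical one, the more of the wide table is padding.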

library(dplyr)

Online_20Retail %>% 
    dplyr::transmute(new = paste0(InvoiceNo, ", ", 
                                  CustomerID, ", ", 
                                  Country, ", "), 
                     StockCode) %>% 
    dplyr::group_by(new) %>% 
    dplyr::summarise(output = paste(StockCode, collapse = ", ")) %>%
    dplyr::transmute(mystring = paste0(new, output)) 
    # you might want to append "%>% dplyr::pull(mystring)" to the line above to get a character vector instead of a tibble/data frame

# A tibble: 25,900 x 1
   mystring                                                                                                                                         
   <chr>                                                                                                                                            
 1 536365, 17850, United Kingdom, 85123A, 71053, 84406B, 84029G, 84029E, 22752, 21730                                                               
 2 536366, 17850, United Kingdom, 22633, 22632                                                                                                      
 3 536367, 13047, United Kingdom, 84879, 22745, 22748, 22749, 22310, 84969, 22623, 22622, 21754, 21755, 21777, 48187                                
 4 536368, 13047, United Kingdom, 22960, 22913, 22912, 22914                                                                                        
 5 536369, 13047, United Kingdom, 21756                                                                                                             
 6 536370, 12583, France, 22728, 22727, 22726, 21724, 21883, 10002, 21791, 21035, 22326, 22629, 22659, 22631, 22661, 21731, 22900, 21913, 22540, 22~
 7 536371, 13748, United Kingdom, 22086                                                                                                             
 8 536372, 17850, United Kingdom, 22632, 22633                                                                                                      
 9 536373, 17850, United Kingdom, 85123A, 71053, 84406B, 20679, 37370, 21871, 21071, 21068, 82483, 82486, 82482, 82494L, 84029G, 84029E, 22752, 217~
10 536374, 15100, United Kingdom, 21258                                                                                                             
# ... with 25,890 more rows
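
The remaining step from the question, writing these strings to a csv and reading them into arules, could look roughly like the sketch below. It is not part of the original answer. `toy`, `basket_lines`, and `out_file` are hypothetical names, and the toy values only mimic the real data. The lines are ragged (one field per item, no padding), so no NULLs or list columns are involved. The `read.transactions()` call assumes the arules package is installed and uses its "basket" format, with `cols = 1` to treat the first field as the transaction id. Note that CustomerID and Country would then be read as items alongside the stock codes, which may or may not be what you want.

```r
# Toy stand-in for Online_20Retail (hypothetical values, real column names)
toy <- data.frame(
  InvoiceNo  = c("536365", "536365", "536366", "536367"),
  CustomerID = c(17850, 17850, 17850, 13047),
  Country    = c("United Kingdom", "United Kingdom", "United Kingdom", "United Kingdom"),
  StockCode  = c("85123A", "71053", "22633", "84879"),
  stringsAsFactors = FALSE
)

# One comma-separated string of stock codes per invoice
codes_by_invoice <- split(toy$StockCode, toy$InvoiceNo)
items <- vapply(codes_by_invoice, paste, character(1), collapse = ",")

# match() returns the first row of each invoice, which supplies CustomerID and Country
first <- toy[match(names(items), toy$InvoiceNo), ]

# One line per invoice: id, customer, country, then however many items it has
basket_lines <- paste(first$InvoiceNo, first$CustomerID, first$Country, items, sep = ",")

# Write the ragged lines as plain text; no padding cells are created
out_file <- tempfile(fileext = ".csv")
writeLines(basket_lines, out_file)

# Assumption: arules is installed; skipped otherwise
if (requireNamespace("arules", quietly = TRUE)) {
  trans <- arules::read.transactions(
    out_file,
    format = "basket",
    sep = ",",
    cols = 1,             # first field (InvoiceNo) is the transaction id
    rm.duplicates = TRUE  # drop repeated items within an invoice
  )
  print(trans)
}
```

If only the stock codes should count as items, leaving CustomerID and Country out of `basket_lines` is probably the simpler route.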
