I've been experimenting with polars, and the key feature I'm most interested in is *larger-than-RAM* operations.
I downloaded some files from HERE to play with. From the website: *"First line in each file is header; 1 line corresponds to 1 record."* Warning: the total download is large (~1.3GB)! This experiment was done on an AWS server (t2.medium: 2 CPUs, 4GB RAM).
wget https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Shoes_v1_00.tsv.gz \
https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Office_Products_v1_00.tsv.gz \
https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Software_v1_00.tsv.gz \
https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Personal_Care_Appliances_v1_00.tsv.gz \
https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Watches_v1_00.tsv.gz
gunzip *
Here are the files on disk, and the results of wc -l:
drwxrwxr-x 3 ubuntu ubuntu 4096 Jun 2 12:44 ../
-rw-rw-r-- 1 ubuntu ubuntu 1243069057 Nov 25 2017 amazon_reviews_us_Office_Products_v1_00.tsv
-rw-rw-r-- 1 ubuntu ubuntu 44891575 Nov 25 2017 amazon_reviews_us_Personal_Care_Appliances_v1_00.tsv
-rw-rw-r-- 1 ubuntu ubuntu 1570176560 Nov 25 2017 amazon_reviews_us_Shoes_v1_00.tsv
-rw-rw-r-- 1 ubuntu ubuntu 249565371 Nov 25 2017 amazon_reviews_us_Software_v1_00.tsv
-rw-rw-r-- 1 ubuntu ubuntu 412542975 Nov 25 2017 amazon_reviews_us_Watches_v1_00.tsv
$ find . -type f -exec cat {} + | wc -l
8398139
$ find . -name '*.tsv' | xargs wc -l
2642435 ./amazon_reviews_us_Office_Products_v1_00.tsv
341932 ./amazon_reviews_us_Software_v1_00.tsv
85982 ./amazon_reviews_us_Personal_Care_Appliances_v1_00.tsv
4366917 ./amazon_reviews_us_Shoes_v1_00.tsv
960873 ./amazon_reviews_us_Watches_v1_00.tsv
8398139 total
Now, if I count the rows using polars and our fancy new lazy functions:
import polars as pl

csvfile = "~/data/amazon/*.tsv"
(
    pl.scan_csv(csvfile, separator = '\t')
    .select(
        pl.count()
    )
    .collect()
)
shape: (1, 1)
┌─────────┐
│ count │
│ --- │
│ u32 │
╞═════════╡
│ 4186305 │
└─────────┘
So wc -l says 8,398,139 but polars says 4,186,305. Strange... maybe it's a data issue. Let's look at just the column of interest:
csvfile = "~/data/amazon/*.tsv"
(
... pl.scan_csv(csvfile, separator = '\t')
... .select(
... pl.col("product_category").count()
... )
... .collect()
... )
shape: (1, 1)
┌──────────────────┐
│ product_category │
│ --- │
│ u32 │
╞══════════════════╡
│ 7126095 │
└──────────────────┘
Next, the same query collected with .collect(streaming = True).
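To be explicit, this is the identical scan with only the collect call changed:

csvfile = "~/data/amazon/*.tsv"
(
    pl.scan_csv(csvfile, separator = '\t')
    .select(
        pl.col("product_category").count()
    )
    # streaming=True asks polars to process the scan in batches
    # instead of materializing the whole dataset in memory at once
    .collect(streaming = True)
)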
shape: (1, 1)
┌──────────────────┐
│ product_category │
│ --- │
│ u32 │
╞══════════════════╡
│ 7125569 │
└──────────────────┘
OK, still off by about a million? Let's come at it from the bottom up:
csvfile = "~/data/amazon/*.tsv"
(
pl.scan_csv(csvfile, separator = '\t')
.groupby(["product_category"])
.agg(pl.col("product_category").count().alias("counts"))
.collect(streaming = True)
.filter(pl.col('counts') > 100)
.sort(pl.col("counts"), descending = True)
.select(
pl.col('counts').sum()
)
)
shape: (1, 1)
┌─────────┐
│ counts │
│ --- │
│ u32 │
╞═════════╡
│ 7125553 │
└─────────┘
Close, although it's yet another different count...
Some more checking using R:
library(vroom)
library(purrr)
library(glue)
library(logger)
amazon <- list.files("~/data/amazon/", full.names = TRUE)
f <- function(file){
  df <- vroom(file, col_select = 'product_category', show_col_types = FALSE)
  log_info(glue("File [{basename(file)}] has [{nrow(df)}] rows"))
}
walk(amazon, f)
INFO [2023-06-02 14:23:40] File [amazon_reviews_us_Office_Products_v1_00.tsv] has [2633651] rows
INFO [2023-06-02 14:23:41] File [amazon_reviews_us_Personal_Care_Appliances_v1_00.tsv] has [85898] rows
INFO [2023-06-02 14:24:06] File [amazon_reviews_us_Shoes_v1_00.tsv] has [4353998] rows
INFO [2023-06-02 14:24:30] File [amazon_reviews_us_Software_v1_00.tsv] has [331152] rows
INFO [2023-06-02 14:24:37] File [amazon_reviews_us_Watches_v1_00.tsv] has [943763] rows
Total: 8348462
OK, to hell with it, this is basically a random number generation exercise; nothing agrees with anything.
Surely, though, if this were a data hygiene problem, the error should be constant? Do you know why there are such large discrepancies?
1 Answer
It usually helps in cases like this to state the size of the download: the full set is ~1.3 GB, while the smallest file is only about 17.6 MB compressed.
I tried debugging this with pandas, and it couldn't read any of these files; it fails at line 1598:
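A minimal sketch of that failure mode, assuming default parser settings (the file name here is just one of the set above; the exact error text depends on the pandas version):

import pandas as pd

# With default quote handling, a lone " opens a quoted field that never
# closes, so the parser either swallows many rows into one giant field
# or raises a ParserError partway through the file.
df = pd.read_csv(
    "~/data/amazon/amazon_reviews_us_Software_v1_00.tsv",
    sep="\t",
)
print(len(df))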
The problem is single " characters; you need to disable the default quote handling. With that change, I get a total count of 8398134 every time (8398139 lines minus the 5 header rows), in polars, pandas, duckdb, and pyarrow alike, as sketched below.
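Minimal sketches of turning quoting off in each of the four libraries. The paths and the tab separator come from the question; the quoting switches (quote_char, quoting, quote) are each library's own option for disabling quote handling, and the duckdb quote='' form in particular is worth checking against your installed version:

import csv
import glob
import os

import duckdb
import pandas as pd
import polars as pl
import pyarrow.csv as pacsv

# glob does not expand "~", so expand it up front
files = glob.glob(os.path.expanduser("~/data/amazon/*.tsv"))

# polars: quote_char=None disables quote handling entirely
(
    pl.scan_csv("~/data/amazon/*.tsv", separator = '\t', quote_char = None)
    .select(pl.count())
    .collect()
)

# pandas: quoting=csv.QUOTE_NONE treats every " as a literal character
total = sum(len(pd.read_csv(f, sep='\t', quoting=csv.QUOTE_NONE)) for f in files)

# duckdb: an empty quote string disables quoting in read_csv_auto
# (the Python '\t' becomes a literal tab inside the SQL string)
query = (
    f"SELECT count(*) FROM read_csv_auto("
    f"'{os.path.expanduser('~/data/amazon')}/*.tsv', sep='\t', quote='')"
)
duckdb.sql(query)

# pyarrow: quote_char=False in ParseOptions turns quoting off
total = sum(
    pacsv.read_csv(
        f,
        parse_options=pacsv.ParseOptions(delimiter='\t', quote_char=False),
    ).num_rows
    for f in files
)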