Progress (in bytes) when reading a CSV from a URL with pandas

8zzbczxx · posted 2023-09-27 in Other

Some of the CSV files I need to read are very large (multiple GB), so I am trying to implement a progress bar that indicates how many bytes out of the total have been read when loading a CSV from a URL with pandas.
I am trying to implement something like this:

from tqdm import tqdm
import requests
import contextlib
import urllib.request
import pandas as pd

url = "https://public.tableau.com/views/PPBOpenDataDownloads/UseOfForce-All.csv?:showVizHome=no"

response = requests.get(url, params=None, stream=True)
response.raise_for_status()
total_size = int(response.headers.get('Content-Length', 0))

block_size = 1000
df = []
last_position = 0
cur_position = 1
with tqdm(desc=url, total=total_size,
          unit='iB',
          unit_scale=True,
          unit_divisor=1024) as bar:
    with contextlib.closing(urllib.request.urlopen(url=url)) as rd:
        # Create TextFileReader
        reader = pd.read_csv(rd, chunksize=block_size)
        for chunk in reader:
            df.append(chunk)
            # Here I would like to calculate the current file position: cur_position 
            bar.update(cur_position - last_position)
            last_position = cur_position

Is there a way to get the current file position from a pandas TextFileReader? Perhaps a TextFileReader equivalent of ftell in C++?
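pandas' TextFileReader does not expose the position of the underlying stream, but one workaround is to wrap the raw stream in a small file-like object whose read() counts the bytes handed to pandas, effectively an ftell substitute. A minimal sketch (ByteCountingReader is a made-up name; demonstrated on an in-memory buffer standing in for the HTTP response):

```python
import io
import pandas as pd

class ByteCountingReader:
    """File-like wrapper whose read() tracks how many bytes were consumed."""
    def __init__(self, raw):
        self._raw = raw
        self.bytes_read = 0  # plays the role of ftell()

    def read(self, size=-1):
        data = self._raw.read(size)
        self.bytes_read += len(data)
        return data

# In-memory buffer standing in for the HTTP response stream
buf = io.BytesIO(b"a,b\n1,2\n3,4\n")
rd = ByteCountingReader(buf)
df = pd.read_csv(rd)
print(rd.bytes_read)  # -> 12, every byte pandas pulled from the stream
```

In the chunked loop above, `bar.update()` could then be fed the difference between successive `bytes_read` values after each chunk. Note that with `chunksize` pandas buffers ahead, so the count tracks bytes fetched from the stream, not rows already parsed.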


rlcwz9us1#

Not thoroughly tested, but you can implement a custom class with a read() method that reads line by line from the requests response and updates the tqdm bar:

import requests
import pandas as pd
from tqdm import tqdm

url = "https://public.tableau.com/views/PPBOpenDataDownloads/UseOfForce-All.csv?:showVizHome=no"

class TqdmReader:
    def __init__(self, resp):
        total_size = int(resp.headers.get("Content-Length", 0))

        self.resp = resp
        self.bar = tqdm(
            desc=resp.url,
            total=total_size,
            unit="iB",
            unit_scale=True,
            unit_divisor=1024,
        )

        self.reader = self.read_from_stream()

    def read_from_stream(self):
        for line in self.resp.iter_lines():
            line += b"\n"
            self.bar.update(len(line))
            yield line

    def read(self, n=0):
        try:
            return next(self.reader)
        except StopIteration:
            return b""  # return bytes at EOF to match the lines yielded above

with requests.get(url, params=None, stream=True) as resp:
    df = pd.read_csv(TqdmReader(resp))

print(len(df))

Prints:

https://public.tableau.com/views/PPBOpenDataDownloads/UseOfForce-All.csv?:showVizHome=no: 100%|██████████████████████████████████████████████████████████████████████████████| 2.09M/2.09M [00:00<00:00, 2.64MiB/s]
7975
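As a variant, tqdm's built-in tqdm.wrapattr helper instruments an object's read method for you, so no custom class is needed; with a live request you would wrap resp.raw (the undecoded socket stream). Sketched here on an in-memory buffer standing in for the response:

```python
import io
import pandas as pd
from tqdm import tqdm

# Live version (not run here):
#   with requests.get(url, stream=True) as resp:
#       total = int(resp.headers.get("Content-Length", 0))
#       with tqdm.wrapattr(resp.raw, "read", total=total, desc=url) as raw:
#           df = pd.read_csv(raw)

data = b"a,b\n1,2\n3,4\n"
buf = io.BytesIO(data)  # stand-in for resp.raw
with tqdm.wrapattr(buf, "read", total=len(data)) as wrapped:
    df = pd.read_csv(wrapped)

print(len(df))  # -> 2
```

One caveat: if the server responds with Content-Encoding: gzip, resp.raw yields compressed bytes that pandas cannot parse directly; you would need to set resp.raw.decode_content = True, after which the byte count no longer matches Content-Length exactly.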

xuo3flqw2#

Here is another example of a chunked pandas CSV reader that displays some progress information even when the total length or record count is not available.

  • It is not always possible to know in advance the size or the total number of rows of a CSV (or another pandas reader format)
  • In this example, a simple chunk-filtering loop extracts some rows from a larger dataset to build a smaller dataset that fits in RAM
  • The dataset in the example is a StackOverflow dump
  • We use tqdm with its Jupyter notebook support to display an HTML progress bar, which looks cleaner than the text-mode bar in a notebook
  • Because we operate on chunks, we do not know where the file really ends, so we do not know how long the whole operation will take; this changes if you pass tqdm(total=...), which gives you an automatic estimate, but the total must be obtained outside the pandas reader
  • Whether or not we know the total, we can always display status information such as elapsed time and how many rows have already been processed
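The total= remark above can be illustrated with a two-pass approach: first count rows outside the pandas reader (a cheap line count), then hand that count to tqdm so the bar shows a percentage. A sketch on an in-memory CSV:

```python
import io
import pandas as pd
from tqdm.auto import tqdm

csv_text = "a\n" + "\n".join(str(i) for i in range(10))  # header + 10 rows

# Pass 1: total obtained outside the pandas reader
total_rows = sum(1 for _ in io.StringIO(csv_text)) - 1  # minus the header line

# Pass 2: chunked read with a known total
rows_seen = 0
with tqdm(total=total_rows) as bar:
    with pd.read_csv(io.StringIO(csv_text), chunksize=4) as reader:
        for chunk in reader:
            rows_seen += len(chunk)
            bar.update(len(chunk))

print(rows_seen)  # -> 10
```

For a remote file, the same idea works with the Content-Length header and a byte-based bar instead of a row count.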

from tqdm.auto import tqdm
import pandas as pd
from pandas.io.parsers.readers import TextFileReader

chunk_size = 2**16  # 64k rows at a time
tags_regex = r"python"  # placeholder tag filter; the real pattern is defined in the full code linked below
result_df: pd.DataFrame | None = None
matched_chunks: list[pd.DataFrame] = []
match_count = row_count = 0

with tqdm() as progress_bar:

    reader: TextFileReader

    rows_read = 0

    with pd.read_csv("csv/Posts.csv", chunksize=chunk_size) as reader:
        chunk: pd.DataFrame
        for chunk in reader:
            
            # Make Tags column regex friendly
            chunk["Tags"] = chunk["Tags"].fillna("")
            
            # Find posts in this chunk that match our tag filter
            matched_chunk = chunk.loc[chunk["Tags"].str.contains(tags_regex, case=False, regex=True)]
            
            matched_chunks.append(matched_chunk)

            match_count += len(matched_chunk)
            row_count += len(chunk)

            last = chunk.iloc[-1]

            # Show the date the filter has progressed to.
            # We are finished when reaching 2023-06
            progress_bar.set_postfix({
                "Date": last["CreationDate"],      
                "Matches": match_count,      
                "Total rows": f"{row_count:,}",
            })

            # Display rows read as a progress bar,
            # but we do not know the end
            progress_bar.update(len(chunk))

result_df = pd.concat(matched_chunks)

Full code here
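Since the full code is only linked rather than shown, the filtering loop above can also be exercised end-to-end on a small in-memory CSV; the Tags values and the regex here are made up for illustration:

```python
import io
import pandas as pd
from tqdm.auto import tqdm

csv_data = io.StringIO(
    "Id,Tags,CreationDate\n"
    "1,<python><pandas>,2023-01-01\n"
    "2,<java>,2023-02-01\n"
    "3,<python>,2023-03-01\n"
)
tags_regex = "python"  # made-up filter pattern

matched_chunks = []
with tqdm() as progress_bar:
    with pd.read_csv(csv_data, chunksize=2) as reader:
        for chunk in reader:
            chunk["Tags"] = chunk["Tags"].fillna("")
            matched = chunk.loc[chunk["Tags"].str.contains(tags_regex, case=False, regex=True)]
            matched_chunks.append(matched)
            progress_bar.set_postfix({"Matches": sum(len(m) for m in matched_chunks)})
            progress_bar.update(len(chunk))

result_df = pd.concat(matched_chunks)
print(list(result_df["Id"]))  # -> [1, 3]
```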
