Cleaning a large CSV file efficiently

Posted by goqiplq2 on 2023-04-27 in Other


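Question: I have a CSV file containing a large amount of data, and I need to clean and filter it using Python.
For example, the file has a column of dates in "YYYY-MM-DD" format, but some entries are malformed or missing. I need to normalize these entries so that they all use the correct format, and drop any row that has no date.
How can I clean and filter a large CSV file in Python with the shortest possible runtime? My current code is: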

import csv

# Read the CSV file into a list of dictionaries
data = []
with open('data.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        data.append(row)

# Loop through the data and clean up the date column
for i in range(len(data)):
    if 'date' in data[i]:
        date = data[i]['date']
        if date:
            try:
                year, month, day = date.split('-')
                year = int(year)
                month = int(month)
                day = int(day)
                if year < 1000 or year > 9999 or month < 1 or month > 12 or day < 1 or day > 31:
                    raise ValueError('Invalid date format')
                data[i]['date'] = f'{year}-{month:02d}-{day:02d}'
            except ValueError:
                del data[i]['date']

# Loop through the data and remove rows with missing dates
clean_data = []
for row in data:
    if 'date' in row and row['date']:
        clean_data.append(row)


Source: https://stackoverflow.com/questions/76054181/cleaning-a-large-csv-file-efficiently

2 Answers


ctehm74n1#

I suggest using the pandas module, which handles CSV data efficiently in Python. For example, the following code solves your problem:

import pandas as pd

# Read the CSV file into a pandas dataframe
df = pd.read_csv('data.csv')

# Clean up the date column
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Remove rows with missing dates
df.dropna(subset=['date'], inplace=True)

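Note that this no longer checks that your dates are in "YYYY-MM-DD" format; it accepts any format pandas can interpret as a date. Since that is more flexible, it may be an advantage. If it is not, you can adjust the code as needed.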
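If you do want to keep the strict "YYYY-MM-DD" check, one possible adjustment is to pass an explicit `format` to `pd.to_datetime`. This is only a sketch: the in-memory sample below is made up for illustration and stands in for data.csv, and it assumes the column is named `date` as in the question.

```python
import io

import pandas as pd

# Hypothetical sample standing in for data.csv (not from the question)
sample_csv = io.StringIO(
    "id,date\n"
    "1,2023-04-27\n"
    "2,27/04/2023\n"
    "3,\n"
    "4,2023-13-01\n"
)

df = pd.read_csv(sample_csv)

# With an explicit format, values that do not match YYYY-MM-DD
# (or are not real dates, such as month 13) are coerced to NaT
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d', errors='coerce')

# Remove rows whose date was missing or could not be parsed
df = df.dropna(subset=['date'])

print(df)
```

Only the first row of the sample survives here: the second uses a different layout, the third has no date, and the fourth has an impossible month.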

Posted 2023-04-27

ilmyapht2#

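Pandas is great, but if speed is the concern I would look at something like Dask, Ray, or Modin. The nice thing about Modin is that its syntax is the same as Pandas, so if you use the Pandas solution given by @mathbreaker: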

import pandas as pd

# Read the CSV file into a pandas dataframe
df = pd.read_csv('data.csv')

# Clean up the date column
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Remove rows with missing dates
df.dropna(subset=['date'], inplace=True)

you can switch to Modin by changing the import line to:

import modin.pandas as pd

This should help a lot if you are working with large datasets.
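Since only the import changes, one way to make that swap easy to try is to write the cleaning steps once against whichever module you pass in. This is a rough sketch rather than a benchmark: `clean_dates` and the sample data are hypothetical names for illustration, and whether Modin is actually faster will depend on your data size and machine.

```python
import io

import pandas  # swap for `import modin.pandas` if Modin is installed


def clean_dates(pd_module, source):
    """Drop rows whose 'date' value is missing or unparseable.

    pd_module is any module exposing the pandas calls used here,
    for example pandas itself or modin.pandas.
    """
    df = pd_module.read_csv(source)
    df['date'] = pd_module.to_datetime(df['date'], errors='coerce')
    return df.dropna(subset=['date'])


# Hypothetical sample standing in for data.csv
sample = io.StringIO("id,date\n1,2023-04-27\n2,not a date\n3,\n")
cleaned = clean_dates(pandas, sample)
print(cleaned)
```

With a real file you would pass the path (for example 'data.csv') instead of the in-memory sample, and time both variants on your own data before committing to one.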

Posted 2023-04-27
