Reducing memory usage when processing very large CSV files

qcuzuvrc · asked 12 months ago · in Other

I'm a beginner, and I'm trying to write code that reduces memory usage when working with very large CSV files by converting the columns' data types. Applying what I've learned, I wrote the following code:

# These are lines of code I wrote by adapting what I learned in the following lesson:
# https://www.udemy.com/course/corso-python-e-data-science/learn/lecture/13091796#content

# To test the code I use a csv file relating to the passengers embarked on Titanic.
# I believe I downloaded it from the following website,
# but I'm not sure because I didn't note the URL when I downloaded it:
# https://github.com/datasciencedojo/datasets/blob/master/titanic.csv

import numpy as np
import pandas as pd
import tkinter as tk
from tkinter import filedialog

xInt_Types = ["int64", "int32", "int16", "int8", "uint64", "uint32", "uint16", "uint8"]
xFloat_Types = ["float64", "float32", "float16"]

# The user selects the CSV file with the data they want to work on
xPath = filedialog.askopenfilename()
xFile = pd.read_csv(xPath)

# I store the label assigned to each column of the CSV file
xColumnName = xFile.columns

# This Function allows you to calculate the weight of the data contained in a DataFrame
def f_Size(xDataFrame):
    xSomma = 0
    for xDataType in ["float16", "float32", "float64", "uint8", "uint16", "uint32", "uint64", "int8", "int16", "int32", "int64", "object"]:
        xSel_DataType = xDataFrame.select_dtypes(include=[xDataType])
        # Sum (not average) the bytes used by the columns of this dtype;
        # index=False avoids counting the index once per dtype
        xTot_Bytes = xSel_DataType.memory_usage(deep=True, index=False).sum()
        xMedia_Megabytes = xTot_Bytes / 1024**2
        xSomma += xMedia_Megabytes
    return xSomma

# This function allows you to summarize the main characteristics of different number formats
def f_Info(xInt_Types, xFloat_Types):
    for xInt in xInt_Types:
        print(np.iinfo(xInt))
    for xFlt in xFloat_Types:
        print(np.finfo(xFlt))

# This function calculates the weight of the data contained in a DataFrame or in a single column (Series)
def f_M_U(xPandas_Data):
    if isinstance(xPandas_Data, pd.DataFrame):
        xMem_Usage_b = xPandas_Data.memory_usage(deep=True).sum()
    else:
        xMem_Usage_b = xPandas_Data.memory_usage(deep=True)
    return xMem_Usage_b

# This function allows you to retrieve the characteristics of each object attribute (column) of the csv file
def f_CSV_Attr(xFile, xColumnName):
    # Select the object columns once, then describe each of them
    xFile_Object = xFile.select_dtypes(include=["object"]).copy()
    for xCol in xFile_Object.columns:
        print(xFile_Object[xCol].describe())

#To save memory I'll transform:
     #Int64 to Int32 and so on
     #Float64 into Float32 and so on
     #Object into Categories

def f_Zip(xFile, xInt_Types, xFloat_Types, xColumnName):
    xOptimize_File = xFile.copy()
    # In xOptimize_File we replace the original columns with the converted ones
    for xCol in xColumnName:
        if str(xFile[xCol].dtype) in xInt_Types:
            # downcast="integer" keeps signed types; "unsigned" would skip columns with negative values
            xOptimize_File[xCol] = pd.to_numeric(xFile[xCol], downcast="integer")
        elif str(xFile[xCol].dtype) in xFloat_Types:
            xOptimize_File[xCol] = pd.to_numeric(xFile[xCol], downcast="float")
        else:
            xNum_Unique = len(xFile[xCol].unique())
            xNum_Total = len(xFile[xCol])
            # Only convert to category when there are few distinct values
            if xNum_Unique / xNum_Total < 0.5:
                # Assign by column label (xCol), not by the column's values
                xOptimize_File[xCol] = xFile[xCol].astype("category")

    print(xOptimize_File)
    return xOptimize_File

print('The weight of the selected file is equal to:', f_Size(xFile))
xOptimize_File = f_Zip(xFile, xInt_Types, xFloat_Types, xColumnName)
print('The weight of the converted file is equal to:', f_Size(xOptimize_File))

I'm trying to reduce memory usage, but the code I wrote doesn't work.
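For reference, the three conversions the question is attempting (int downcast, float downcast, object to category) can be sketched minimally on a hypothetical toy frame, not the Titanic data:

```python
import pandas as pd

# A hypothetical toy frame with one column per conversion case.
df = pd.DataFrame({
    "a": pd.Series([1, 2, 3, 4, 5], dtype="int64"),
    "b": pd.Series([1.0, 2.5, 3.5, 4.5, 5.5], dtype="float64"),
    "c": ["x", "x", "y", "x", "x"],  # only 2 distinct values out of 5
})

out = df.copy()
for col in df.columns:
    if pd.api.types.is_integer_dtype(df[col]):
        # "integer" keeps signed dtypes; "unsigned" would skip negative columns
        out[col] = pd.to_numeric(df[col], downcast="integer")
    elif pd.api.types.is_float_dtype(df[col]):
        out[col] = pd.to_numeric(df[col], downcast="float")
    elif df[col].nunique() / len(df[col]) < 0.5:
        # Few distinct values: a category column stores each string only once
        out[col] = df[col].astype("category")

print(out.dtypes)
```

On this frame, `a` downcasts to `int8`, `b` to `float32`, and `c` becomes `category`.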

elcex8rz #1

The fastest, lowest-memory, and by far simplest option is to use the csv module's reader: it iterates over the input CSV line by line and returns each row as a list of strings. From that list of strings, just convert each field to the type you expect.
Given this input:

Col1,Col2,Col3,Col4
81.23,23938,false,ZUITO
67.57,27480,false,JLMAU
97.55,63965,false,SAVPW
53.15,23907,false,NTZLZ
14.90,29321,false,IQUVA
65.27,45452,true,MLUYP
12.66,81050,true,YLWBC
56.12,39627,true,CCRAW
44.83,18004,true,ASSOP
77.43,23267,true,SXOLU

import csv

MyRow = tuple[float, int, bool, str]

rows: list[MyRow] = []
with open("input.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    next(reader)  # discard header
    for row in reader:
        rows.append(
            (
                float(row[0]),
                int(row[1]),
                row[2] == "true",
                row[3],
            )
        )

print(rows)

Prints:

[
    (81.23, 23938, False, "ZUITO"),
    (67.57, 27480, False, "JLMAU"),
    (97.55, 63965, False, "SAVPW"),
    (53.15, 23907, False, "NTZLZ"),
    (14.9, 29321, False, "IQUVA"),
    (65.27, 45452, True, "MLUYP"),
    (12.66, 81050, True, "YLWBC"),
    (56.12, 39627, True, "CCRAW"),
    (44.83, 18004, True, "ASSOP"),
    (77.43, 23267, True, "SXOLU"),
]

I like type hints, so I created MyRow to help ensure that I use the correct casts, and the correct number of casts, to match my expectations. The type hints have no bearing on how the values are computed/converted at runtime. The conversions for Col1 and Col2 will throw a ValueError if the fields are not a valid float and/or int.
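Because the casts happen per field, it's also easy to report exactly which line a malformed value is on. A minimal sketch, using a hypothetical in-memory input rather than a file:

```python
import csv
import io

# Hypothetical malformed input: Col2 in the second data row is not an int.
bad = io.StringIO("Col1,Col2\n81.23,23938\n67.57,oops\n")

reader = csv.reader(bad)
next(reader)  # discard header
rows = []
error_line = None
for lineno, row in enumerate(reader, start=2):  # line 1 was the header
    try:
        rows.append((float(row[0]), int(row[1])))
    except ValueError as exc:
        error_line = lineno
        print(f"line {lineno}: {exc}")
```

Here the first row converts cleanly; the second fails on `int("oops")`, so only the good row lands in `rows` and the bad line number is reported.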
