Merging with Pandas and inserting into SQLite

vom3gejh  posted on 2022-11-14  in SQLite

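I'm trying to merge CSV files and put them into an SQLite table. It works, but once I've done that I can't update the table without duplicating rows. I can add the new merged columns, but not after the last row.
Reading the CSV files and selecting specific columns: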

import pandas as pd
import sqlite3

pd.set_option('display.max_columns', 6)
dia1 = pd.read_csv('dia0705.csv', header=1, sep=";", dtype='unicode')[["EO", "NOME ACIONISTA", "CPF/CNPJ"]]
dia2 = pd.read_csv('dia0712.csv', header=1, sep=";", dtype='unicode')[["EO", "NOME ACIONISTA", "CPF/CNPJ"]]

dia1['CPF/CNPJ'] = dia1['CPF/CNPJ'].astype(str)
dia2['CPF/CNPJ'] = dia2['CPF/CNPJ'].astype(str)

dia1['NOME ACIONISTA'] = dia1['NOME ACIONISTA'].astype(str)
dia2['NOME ACIONISTA'] = dia2['NOME ACIONISTA'].astype(str)

Then I use the merge command and rename the columns to match my SQLite table:

merge1 = pd.merge(dia1, dia2, how='outer', on=["NOME ACIONISTA", "CPF/CNPJ"],)  # indicator=True)

merge1.rename(columns={"EO_x": "dia0705"}, inplace=True)
merge1.rename(columns={"EO_y": "dia0712"}, inplace=True)
merge1.rename(columns={"NOME ACIONISTA": "Nome_Acionista"}, inplace=True)
merge1.rename(columns={"CPF/CNPJ": "CPF_CNPJ"}, inplace=True)

I repeat this because I want to merge multiple CSV files:

merge2 = pd.merge(merge1, dia3, how='outer', on=["Nome_Acionista", "CPF_CNPJ"],)

dia4 = pd.read_csv('220913_completo.csv', header=1, sep=";", dtype='unicode')[["EO", "NOME ACIONISTA", "CPF/CNPJ"]]
dia4.rename(columns={"NOME ACIONISTA": "Nome_Acionista"}, inplace=True)
dia4.rename(columns={"CPF/CNPJ": "CPF_CNPJ"}, inplace=True)
dia4.rename(columns={"EO": "dia0913"}, inplace=True)

merge3 = pd.merge(merge2, dia4, how='outer', on=["Nome_Acionista", "CPF_CNPJ"],)

If I connect it to my SQLite table, it works:

connection = sqlite3.connect('2022.db')
c = connection.cursor()

merge3.to_sql(
        name='acoes',
        con=connection,
        if_exists='append',
        index=False,
    )

My table has a name column, an ID-number column, and 365 further columns, one for each day of the year.

zf9nrax1  #1

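Since the contents of each file are almost exactly the same, you can loop through the files. A few specifics about the code below:
1. There is a manually written list of file names (including .csv) to loop through.
2. The EO column is renamed to the file name (without the file extension).
Some suggestions for how these could be changed:
1. Instead of a manually written list, os and/or glob can be used to loop through the files in a directory (filtering can be included, e.g. only files ending in .csv).
2. If the column names should differ from the file names, a dictionary can map each file name to its column name. Alternatively, use a second list in the same order as the file list. This is harder if you loop through the files in a directory.
Also note that the first file is read before the loop. If you loop through the files in a folder instead, you could start with an empty merged_data = pd.DataFrame() and add an if statement along the lines of "if merged_data.empty, create the DataFrame from this file; otherwise merge the file into it".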

import pandas as pd

files = ["file1.csv", "file2.csv"] # list of file names
# otherwise, you could use os and/or glob to loop through the files in a folder

# the first file, don't include this in the list above.  If part of a loop, could add in an enumeration and if == first item...
merged_data = pd.read_csv('dia0705.csv', header=1, sep=";", dtype='unicode')[["EO", "NOME ACIONISTA", "CPF/CNPJ"]]
merged_data['CPF/CNPJ'] = merged_data['CPF/CNPJ'].astype(str)
merged_data['NOME ACIONISTA'] = merged_data['NOME ACIONISTA'].astype(str)

merged_data.rename(columns={"NOME ACIONISTA": "Nome_Acionista",
                     "CPF/CNPJ": "CPF_CNPJ",
                     "EO": "dia0705"}, inplace=True)

# loop through all files in list
for file in files:
    # read_csv and select specific columns
    to_merge = pd.read_csv(file, header=1, sep=";", dtype='unicode')[["EO", "NOME ACIONISTA", "CPF/CNPJ"]]
    # change column names
    to_merge.rename(columns={"NOME ACIONISTA": "Nome_Acionista",
                             "CPF/CNPJ": "CPF_CNPJ",
                             "EO": file.split(".")[0]}, inplace=True)
    # merge the merged_data and to_merge DataFrames
    merged_data = pd.merge(merged_data, to_merge,
                           how='outer',
                           on=["Nome_Acionista", "CPF_CNPJ"],)
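
The glob-based variation suggested in this answer could look roughly like the sketch below. It is only an illustration under a few assumptions: the helper names read_day and merge_folder are made up here, every CSV in the folder has the same layout as the ones in the question, and each file name (without extension) is an acceptable column name. Using functools.reduce removes the need to special-case the first file.

```python
from functools import reduce
from pathlib import Path

import pandas as pd

# Key columns shared by every file after renaming (names taken from the question).
KEYS = ["Nome_Acionista", "CPF_CNPJ"]


def read_day(path):
    """Read one CSV and name its EO column after the file (hypothetical helper)."""
    path = Path(path)
    df = pd.read_csv(path, header=1, sep=";", dtype=str)
    df = df[["EO", "NOME ACIONISTA", "CPF/CNPJ"]]
    return df.rename(columns={"NOME ACIONISTA": "Nome_Acionista",
                              "CPF/CNPJ": "CPF_CNPJ",
                              "EO": path.stem})  # e.g. dia0705.csv -> dia0705


def merge_folder(folder, pattern="*.csv"):
    """Outer-merge every matching CSV in `folder` on the key columns."""
    # sorted() keeps the column order stable from run to run.
    frames = [read_day(p) for p in sorted(Path(folder).glob(pattern))]
    if not frames:
        return pd.DataFrame(columns=KEYS)
    return reduce(
        lambda left, right: pd.merge(left, right, how="outer", on=KEYS),
        frames,
    )
```

Calling merge_folder("some_folder") would then return one row per name/ID pair and one column per file. If a file contains the same name/ID pair more than once, the outer merge multiplies those rows, so deduplicating each file first may be necessary.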

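On the duplicated rows from the question: if_exists='append' adds every row of the DataFrame again on each run. One possible way around this, assuming it is acceptable to rebuild the whole table from the merged DataFrame each time, is if_exists='replace'. The function name write_snapshot is made up for this sketch; the table name acoes and the database file 2022.db come from the question.

```python
import sqlite3
from contextlib import closing

import pandas as pd


def write_snapshot(merged, db_path, table="acoes"):
    """Rewrite `table` from `merged` and return the resulting row count."""
    # closing() makes sure the connection is closed when the block ends.
    with closing(sqlite3.connect(db_path)) as connection:
        # 'replace' drops the existing table and recreates it from the
        # DataFrame, so rerunning the script does not duplicate rows.
        merged.to_sql(name=table, con=connection,
                      if_exists="replace", index=False)
        count = connection.execute(
            f'SELECT COUNT(*) FROM "{table}"'
        ).fetchone()[0]
    return count
```

For example, write_snapshot(merged_data, '2022.db') could be called after each new CSV has been merged in. Note that 'replace' also discards any column types or constraints that were defined on the table by hand, since pandas recreates the table from the DataFrame.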