SQLite: How to create a single table with data from two CSV files in Python, using a common id?

polhcujo posted on 2023-05-18 in SQLite

There are two files, client.csv (15 GB, 160 million rows) and phone.csv.
Sample of client.csv:
| id | name | email | contract_number | address_id |
| --- | --- | --- | --- | --- |
| 123 | Pupkin Vasya | Pupkin_Vasya@mail.ru | 43784578 | 5283512 |
Sample of phone.csv:
| id | phone |
| --- | --- |
| 123 | 7999999999 |
In the end we should get:
| id | phone | name | email |
| --- | --- | --- | --- |
| 123 | 7999999999 | Pupkin Vasya | Pupkin_Vasya@mail.ru |
I only need to write the required data to a CSV file or an SQLite table.
In theory, I need to read the file with the phone numbers and use its id to look up the data in the other file, but it is not clear how to do that.
I tried this code:

import pandas as pd

df_1 = pd.DataFrame({'id':[123], 'name': ['Pupkin Vasya'], 'email': ['Pupkin_Vasya@mail.ru'], 'usless_info': [1]})
df_2 = pd.DataFrame({'id':[123], 'phone': [79999999999], 'usless_info': [1]})
df = pd.merge(left=df_2[['id', 'phone']], right=df_1[['id', 'name', 'email']], how='left', on='id')
df.to_csv('final.csv', index=False)

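But the files are larger than my available memory.
I need a solution that does not use the Pandas library.
UPD: If an id in client.csv has no data in phone.csv, the data from client.csv should still go to the final file, with the "phone" column left empty.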


pcrecxhr #1

Here is a solution in pseudo-Python:

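# Assumes both files are sorted by id in ascending order.
c = read_next_client()
p = read_next_phone()
while c is not None and p is not None:
    if c.id == p.id:
        # Matching ids: write the merged record and advance both files.
        write_combined_record(c, p)
        c = read_next_client()
        p = read_next_phone()
    elif c.id < p.id:
        # No phone for this client: write it with an empty phone.
        write_combined_record(c, None)
        c = read_next_client()
    else:
        # Phone without a matching client: skip it.
        p = read_next_phone()
# Phones are exhausted: write the remaining clients without a phone.
while c is not None:
    write_combined_record(c, None)
    c = read_next_client()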
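
The pseudocode leaves `read_next_client`, `read_next_phone`, and `write_combined_record` undefined. As a minimal runnable sketch of the same loop, the version below works on any two iterables of tuples instead of files. The function name `merge_sorted` and the tuple layouts are assumptions made for illustration, and both inputs are assumed to be sorted by id in ascending order with at most one phone per id.

```python
from typing import Iterable, Iterator, Optional, Tuple

Client = Tuple[int, str, str]  # (id, name, email), hypothetical layout
Phone = Tuple[int, str]        # (id, phone), hypothetical layout


def merge_sorted(
    clients: Iterable[Client], phones: Iterable[Phone]
) -> Iterator[Tuple[int, Optional[str], str, str]]:
    """Left-join clients with phones; both inputs must be sorted by id."""
    client_iter, phone_iter = iter(clients), iter(phones)
    # next(it, None) plays the role of read_next_*: None means "exhausted".
    c = next(client_iter, None)
    p = next(phone_iter, None)
    while c is not None and p is not None:
        if c[0] == p[0]:
            # Matching ids: emit the merged record, advance both sides.
            yield (c[0], p[1], c[1], c[2])
            c = next(client_iter, None)
            p = next(phone_iter, None)
        elif c[0] < p[0]:
            # This client has no phone: emit it with phone=None.
            yield (c[0], None, c[1], c[2])
            c = next(client_iter, None)
        else:
            # Phone without a client: skip it.
            p = next(phone_iter, None)
    # Phones are exhausted: emit the remaining clients without a phone.
    while c is not None:
        yield (c[0], None, c[1], c[2])
        c = next(client_iter, None)
```

Because it is a generator, only one client and one phone are held in memory at a time, so the inputs could just as well be rows streamed from `csv.reader` with the id converted to `int`.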

3okqufwl #2

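Here is something that should be very close to a solution (unfortunately, I don't have the actual files to test with). The basic idea is to iterate over both input files at the same time, keeping them in sync by the id field, and to write the output file as you go.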

import csv

with (
    open('client.csv', newline='', encoding='utf-8-sig') as client_file,
    open('phone.csv', newline='', encoding='utf-8-sig') as phone_file,
    open('final.csv', 'w', newline='', encoding='utf-8') as final_file,
):
    client_csv = csv.reader(client_file)
    phone_csv = csv.reader(phone_file)
    final_csv = csv.writer(final_file)
    # CSV headers. csv.reader yields lists, so compare against lists.
    assert next(client_csv) == [
        'id', 'name', 'email', 'contract_number', 'address_id'
    ], 'Bad headers in client.csv'
    assert next(phone_csv) == ['id', 'phone'], 'Bad headers in phone.csv'
    final_csv.writerow(('id', 'phone', 'name', 'email'))
    try:
        # Initialize lookaheads for the two files.
        c_id = ''
        c_id, name, email, _, _ = next(client_csv)
        p_id, phone = next(phone_csv)
        while True:
            if p_id == c_id:
                # Matching row -- merge and advance both files.
                final_csv.writerow((c_id, phone, name, email))
                c_id = ''
                c_id, name, email, _, _ = next(client_csv)
                p_id, phone = next(phone_csv)
            elif int(p_id) > int(c_id):
                # Phone is ahead of client.
                final_csv.writerow((c_id, '', name, email))
                c_id = ''
                c_id, name, email, _, _ = next(client_csv)
            else:
                # Client is ahead of phone.
                p_id, phone = next(phone_csv)
    except StopIteration:
        # We hit the end of one of the files.
        if c_id:
            final_csv.writerow((c_id, '', name, email))
    # Exhaust any remaining entries in client.
    for c_id, name, email, _, _ in client_csv:
        final_csv.writerow((c_id, '', name, email))
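
The loop above only works if both files are already sorted by numeric id. If that is not guaranteed, one option (since the question allows an SQLite table as the result) is to load both files into an on-disk SQLite database and let it do the join. The following is a rough sketch under assumptions, not a tested solution for the real 15 GB file: the function name `build_final_table` and the table names `client`, `phone`, and `final` are made up for illustration, and the CSV headers are assumed to be the ones asserted in the code above.

```python
import csv
import sqlite3


def build_final_table(client_path, phone_path, db_path):
    """Load both CSV files into SQLite and create a joined `final` table."""
    conn = sqlite3.connect(db_path)
    with conn:  # commit on success, roll back on error
        conn.execute("CREATE TABLE client (id INTEGER, name TEXT, email TEXT)")
        conn.execute("CREATE TABLE phone (id INTEGER, phone TEXT)")
        # Rows are streamed through generators, so neither file is
        # read into memory as a whole.
        with open(client_path, newline="", encoding="utf-8-sig") as f:
            conn.executemany(
                "INSERT INTO client VALUES (?, ?, ?)",
                ((r["id"], r["name"], r["email"]) for r in csv.DictReader(f)),
            )
        with open(phone_path, newline="", encoding="utf-8-sig") as f:
            conn.executemany(
                "INSERT INTO phone VALUES (?, ?)",
                ((r["id"], r["phone"]) for r in csv.DictReader(f)),
            )
        # The index lets the join look up each client's phone by id.
        conn.execute("CREATE INDEX phone_id_idx ON phone (id)")
        # LEFT JOIN keeps clients without a phone; their phone is NULL.
        conn.execute(
            "CREATE TABLE final AS "
            "SELECT c.id AS id, p.phone AS phone, "
            "c.name AS name, c.email AS email "
            "FROM client AS c LEFT JOIN phone AS p ON p.id = c.id"
        )
    return conn
```

Calling `build_final_table('client.csv', 'phone.csv', 'final.db')` would leave the merged rows in the `final` table of `final.db`. The database file needs roughly as much disk space as the data loaded into it, and a client with several phone rows would appear once per phone.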

ztmd8pv5 #3

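You can use csv.DictReader. Open three files: client.csv, phone.csv, and final.csv.
Then read client.csv and store it in a dictionary.
After that, iterate over phone.csv and check whether each id is found in the dictionary. If it is, update the dictionary entry with the phone value.
The code is as follows: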

import csv

csv.field_size_limit(10**5)

with open('client.csv', 'r') as client_file, open('phone.csv', 'r') as phone_file, open('final.csv', 'w', newline='') as final_file:
    client_reader = csv.DictReader(client_file)
    phone_reader = csv.DictReader(phone_file)

    merged_data = {}

    for row in client_reader:
        client_id = row['id']
        merged_data[client_id] = {'id': client_id, 'name': row['name'], 'email': row['email']}

    for row in phone_reader:
        phone_id = row['id']
        if phone_id in merged_data:
            merged_data[phone_id]['phone'] = row['phone']

    csv_writer = csv.DictWriter(final_file, fieldnames=['id', 'phone', 'name', 'email'])
    csv_writer.writeheader()

    for row in merged_data.values():
        csv_writer.writerow(row)
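
Note that this keeps every row of client.csv in a dictionary, which may not fit in memory for a 15 GB file. A possible variant, assuming phone.csv is small enough to hold in memory (the question does not state its size), is to put only the phones in a dictionary and stream client.csv row by row. The function name `merge_streaming` is hypothetical, and the column names are taken from the code above.

```python
import csv


def merge_streaming(client_path, phone_path, final_path):
    """Left-join client.csv with phone.csv, holding only phones in memory."""
    # Assumption: phone.csv fits in memory as an {id: phone} dictionary.
    with open(phone_path, newline="", encoding="utf-8-sig") as phone_file:
        phones = {row["id"]: row["phone"] for row in csv.DictReader(phone_file)}

    with open(client_path, newline="", encoding="utf-8-sig") as client_file, \
         open(final_path, "w", newline="", encoding="utf-8") as final_file:
        writer = csv.DictWriter(
            final_file, fieldnames=["id", "phone", "name", "email"]
        )
        writer.writeheader()
        # Each client row is written as soon as it is read, so the large
        # file is never held in memory as a whole.
        for row in csv.DictReader(client_file):
            writer.writerow({
                "id": row["id"],
                # Clients without a phone get an empty "phone" column.
                "phone": phones.get(row["id"], ""),
                "name": row["name"],
                "email": row["email"],
            })
```

This version does not need the files to be sorted, and `merge_streaming('client.csv', 'phone.csv', 'final.csv')` writes the rows in the order they appear in client.csv. If an id occurs more than once in phone.csv, only the last phone is kept.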
