python-3.x 如何比较两个csv文件与一些匹配的条件,并编写一个新的csv的基础上?

wgx48brx  于 2023-05-08  发布在  Python
关注(0)|答案(2)|浏览(180)

我试图合并两个csv文件成一个适当的匹配值。我的想法是扫描两个csv文件,并检查是否两列说namehotel_name是相同的,然后添加新的csv文件中的行。到目前为止,我已经尝试了这个:

import csv

columns = [
    "query",
    "name",
    "hotel_name",
    "hotel_address",
    "hotel_star_ratings",
    "number_of_rooms",
    "number_of_reviews",
    "hotel_user_rating",
]

data = []

with open("source.csv", "r") as source, open("target.csv", "r") as target:
    source_reader = csv.DictReader(source)
    target_reader = csv.DictReader(target)

    for db_row in source_reader:
        for so_row in target_reader:
            if db_row["name"] == so_row["hotel_name"]:
                db_row["hotel_name"] = so_row["hotel_name"]
                db_row["hotel_address"] = so_row["hotel_address"]
                db_row["hotel_star_ratings"] = so_row["hotel_star_ratings"]
                db_row["number_of_rooms"] = so_row["number_of_rooms"]
                db_row["number_of_reviews"] = so_row["number_of_reviews"]
                db_row["hotel_user_rating"] = so_row["hotel_user_rating"]
        data.append(
            {
                "query": db_row["query"],
                "name": db_row["name"],
                "hotel_name": db_row["hotel_name"],
                "hotel_address": db_row["hotel_address"],
                "hotel_star_ratings": db_row["hotel_star_ratings"],
                "number_of_rooms": db_row["number_of_rooms"],
                "number_of_reviews": db_row["number_of_reviews"],
                "hotel_user_rating": db_row["hotel_user_rating"],
            }
        )

with open("output.csv", "w") as output:
    writer = csv.DictWriter(output, fieldnames=columns)
    writer.writeheader()

    writer.writerows(data)

但它没有工作,它只输出一行。
我正在扫描这两个csv文件:

source.csv

query,name,hotel_name,hotel_address,hotel_star_ratings,number_of_rooms,number_of_reviews,hotel_user_rating
"ubytování, Holešov, okres Kroměříž, Česká republika",Hotel Panský Pivovar
"ubytování, Holešov, okres Kroměříž, Česká republika",Hotel Tacl
"ubytování, Holešov, okres Kroměříž, Česká republika",Hotel Slavia Holešov
"ubytování, Holešov, okres Kroměříž, Česká republika",Brusenka Restaurant a Pension

target.csv

hotel_name,hotel_address,hotel_star_ratings,number_of_rooms,number_of_reviews,hotel_user_rating
Hotel Panský Pivovar,"Masarykova 19, Holešov, 769 01, Česká republika",4,3,489,9.3
Pension Brusenka,"Brusné 3, Bystřice pod Hostýnem, 768 61, Česká republika",3b,12,556,9.2
Hotel Tacl,"Palackého 518, Holešov, 769 01, Česká republika",3,4,76,8.9

我的预期输出是这样的:

query,name,hotel_name,hotel_address,hotel_star_ratings,number_of_rooms,number_of_reviews,hotel_user_rating
"ubytování, Holešov, okres Kroměříž, Česká republika",Hotel Panský Pivovar,Hotel Panský Pivovar,"Masarykova 19, Holešov, 769 01, Česká republika",4,3,489,9.3
"ubytování, Holešov, okres Kroměříž, Česká republika",Hotel Tacl,Hotel Tacl,"Palackého 518, Holešov, 769 01, Česká republika",3,4,76,8.9
"ubytování, Holešov, okres Kroměříž, Česká republika",Hotel Slavia Holešov,,,,,,
"ubytování, Holešov, okres Kroměříž, Česká republika",Brusenka Restaurant a Pension,Pension Brusenka,"Brusné 3, Bystřice pod Hostýnem, 768 61, Česká republika",3b,12,556,9.2
rvpgvaaj

rvpgvaaj1#

又一次尝试:

import csv

with open("source.csv", "r") as source, open("target.csv", "r") as target:
    source_reader = csv.DictReader(source)
    target_reader = csv.DictReader(target)

    dct1 = {}
    for line in source_reader:
        dct1[line["name"]] = line

    dct2 = {}
    for line in target_reader:
        dct2[line["hotel_name"]] = line

    out = []
    for name in dct1:
        if name in dct2:
            out.append({**dct1[name], **dct2[name]})
        else:
            out.append(dct1[name])

    print(out)

图纸:

[
    {
        "query": "ubytování, Holešov, okres Kroměříž, Česká republika",
        "name": "Hotel Panský Pivovar",
        "hotel_name": "Hotel Panský Pivovar",
        "hotel_address": "Masarykova 19, Holešov, 769 01, Česká republika",
        "hotel_star_ratings": "4",
        "number_of_rooms": "3",
        "number_of_reviews": "489",
        "hotel_user_rating": "9.3",
    },
    {
        "query": "ubytování, Holešov, okres Kroměříž, Česká republika",
        "name": "Hotel Tacl",
        "hotel_name": "Hotel Tacl",
        "hotel_address": "Palackého 518, Holešov, 769 01, Česká republika",
        "hotel_star_ratings": "3",
        "number_of_rooms": "4",
        "number_of_reviews": "76",
        "hotel_user_rating": "8.9",
    },
    {
        "query": "ubytování, Holešov, okres Kroměříž, Česká republika",
        "name": "Hotel Slavia Holešov",
        "hotel_name": None,
        "hotel_address": None,
        "hotel_star_ratings": None,
        "number_of_rooms": None,
        "number_of_reviews": None,
        "hotel_user_rating": None,
    },
    {
        "query": "ubytování, Holešov, okres Kroměříž, Česká republika",
        "name": "Brusenka Restaurant a Pension",
        "hotel_name": None,
        "hotel_address": None,
        "hotel_star_ratings": None,
        "number_of_rooms": None,
        "number_of_reviews": None,
        "hotel_user_rating": None,
    },
]

然后,您可以将列表值(dicts)保存为新的CSV文件。

wlwcrazw

wlwcrazw2#

这个嵌套循环结构是您的主要问题:

for db_row in source_reader:
        for so_row in target_reader:

对于source_reader中的每一行,你都试图循环target_reader中的每一行......但是source_readertarget_reader都是迭代器,所以在第一次循环之后,for so_row in target_reader是一个无操作的:迭代器已经用尽,内部循环的内容将不会执行。
与嵌套这些循环不同的是,首先读取一个文件,然后读取第二个文件,然后合并结果。就像这样:

import csv

fieldnames: list[str] = [
    "query",
    "hotel_name",
    "hotel_address",
    "hotel_star_ratings",
    "number_of_rooms",
    "number_of_reviews",
    "hotel_user_rating",
]

db: dict[str, dict] = {}

# Read 'target.csv' and populdate the 'db' dictionary
with open("target.csv", "r") as target:
    target_reader = csv.DictReader(target)
    for row in target_reader:
        db[row["hotel_name"]] = row

# Read 'source.csv' and update matching entries in
# 'db' with the value of the `query` column
with open("source.csv", "r") as source:
    source_reader = csv.DictReader(source)
    for row in source_reader:
        if row["name"] in db:
            db[row["name"]]["query"] = row["query"]

# Write out the reuslts
with open("output.csv", "w") as output:
    writer = csv.DictWriter(output, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(db.values())

给定示例输入数据,此代码将生成:

query,hotel_name,hotel_address,hotel_star_ratings,number_of_rooms,number_of_reviews,hotel_user_rating
"ubytování, Holešov, okres Kroměříž, Česká republika",Hotel Panský Pivovar,"Masarykova 19, Holešov, 769 01, Česká republika",4,3,489,9.3
,Pension Brusenka,"Brusné 3, Bystřice pod Hostýnem, 768 61, Česká republika",3b,12,556,9.2
"ubytování, Holešov, okres Kroměříž, Česká republika",Hotel Tacl,"Palackého 518, Holešov, 769 01, Česká republika",3,4,76,8.9

这与您想要的输出并不完全匹配--我省略了重复的hotel name列--但是如果需要,您可以随时将其添加回去。

相关问题