重新格式化csv文件

l2osamch  于 2023-06-03  发布在  其他
关注(0)|答案(2)|浏览(289)

我有一个只有两个条目的csv文件。这就是:

Meat One,['Abattoirs', 'Exporters', 'Food Delivery', 'Butchers Retail', 'Meat Dealers-Retail', 'Meat Freezer', 'Meat Packers']

第一个是标题,第二个是业务标题。
问题出在第二个入口。
下面是我的代码:

import csv

with open('phonebookCOMPK-Directory.csv', "rt") as textfile:
    reader = csv.reader(textfile)

    for row in reader:
        row5 = row[5].replace("[", "").replace("]", "")
        listt = [(''.join(row5))]
        print (listt[0])

它打印:

'Abattoirs', 'Exporters', 'Food Delivery', 'Butchers Retail', 'Meat Dealers-Retail', 'Meat Freezer', 'Meat Packers'

我需要做的是,我想创建一个包含这些单词的列表,然后像这样打印它们,使用for循环分别打印每个项目:

Abattoirs
Exporters
Food Delivery
Butchers Retail
Meat Dealers-Retail
Meat Freezer
Meat Packers

事实上,我正试图重新格式化我目前的csv文件,并清理它,以便它可以更精确和更容易理解。
csv的完整第一行是这样的:

Meat One,+92-21-111163281,Al Shaheer Corporation,Retailers,2008,"['Abattoirs', 'Exporters', 'Food Delivery', 'Butchers Retail', 'Meat Dealers-Retail', 'Meat Freezer', 'Meat Packers']","[[' Outlets Address : Shop No. Z-10, Station Shopping Complex, MES Market, Malir-Cantt, Karachi. Landmarks : MES Market, Station Shopping Complex City : Karachi UAN : +92-21-111163281 '], [' Outlets Address : Shop 13, Ground Floor, Plot 14-D, Sky Garden, Main Tipu Sultan Road, KDA Scheme No.1, Karachi. Landmarks : Nadra Chowrangi, Sky Garden, Tipu Sultan Road City : Karachi UAN : +92-21-111163281 '], ["" Outlets Address : Near Jan's Broast, Boat Basin, Khayaban-e-Roomi, Block 5, Clifton, Karachi. Landmarks : Boat Basin, Jans Broast, Khayaban-e-Roomi City : Karachi UAN : +92-21-111163281 View Map ""], [' Outlets Address : Gulistan-e-Johar, Karachi. Landmarks : Perfume Chowk City : Karachi UAN : +92-21-111163281 '], [' Outlets Address : Tee Emm Mart, Creek Vista Appartments, Khayaban-e-Shaheen, Phase VIII, DHA, Karachi. Landmarks : Creek Vista Appartments, Nueplex Cinema, Tee Emm Mart, The Place City : Karachi Mobile : 0302-8333666 '], [' Outlets Address : Y-Block, DHA, Lahore. Landmarks : Y-Block City : Lahore UAN : +92-42-111163281 '], [' Outlets Address : Adj. PSO, Main Bhittai Road, Jinnah Supermarket, F-7 Markaz, Islamabad. Landmarks : Bhittai Road, Jinnah Super Market, PSO Petrol Pump City : Islamabad UAN : +92-51-111163281 ']]","Agriculture, fishing & Forestry > Farming equipment & services > Abattoirs in Pakistan"

First column is Name
Second column is Number
Third column is Owner
Forth column is Business type
Fifth column is Y.O.E
Sixth column is Business Headings
Seventh column is Outlets (List of lists containing every branch address)
Eighth column is classification

没有限制使用csv.reader,我愿意接受任何可用的技术来清理我的文件。

a7qyws3x

a7qyws3x1#

从两个独立的任务来考虑它:

  • 从“脏”源(此CSV文件)收集一些数据项
  • 将这些数据存储在某个地方,以便易于以编程方式访问和操作(根据您想对它做什么)

处理脏CSV

一种方法是使用函数deserialize_business()从CSV中的每个传入行提取结构化的业务信息。这个函数可能很复杂,因为这是任务的本质,但仍然建议将其拆分为自包含的较小函数(如get_outlets()get_headings()等)。这个函数可以返回一个字典,但根据你的需要,它可以是一个[named]元组,一个自定义对象,等等。
该函数将是用于该特定CSV数据源的“适配器”。
反序列化函数示例:

def deserialize_business(csv_line):
    """
    Distills structured business information from given raw CSV line.
    Returns a dictionary like {name, phone, owner,
    btype, yoe, headings[], outlets[], category}.
    """

    pieces = [piece.strip("[[\"\']] ") for piece in line.strip().split(',')]

    name = pieces[0]
    phone = pieces[1]
    owner = pieces[2]
    btype = pieces[3]
    yoe = pieces[4]

    # after yoe headings begin, until substring Outlets Address
    headings = pieces[4:pieces.index("Outlets Address")]

    # outlets go from substring Outlets Address until category
    outlet_pieces = pieces[pieces.index("Outlets Address"):-1]

    # combine each individual outlet information into a string
    # and let ``deserialize_outlet()`` deal with that
    raw_outlets = ', '.join(outlet_pieces).split("Outlets Address")
    outlets = [deserialize_outlet(outlet) for outlet in raw_outlets]

    # category is the last piece
    category = pieces[-1]

    return {
        'name': name,
        'phone': phone,
        'owner': owner,
        'btype': btype,
        'yoe': yoe,
        'headings': headings,
        'outlets': outlets,
        'category': category,
    }

调用它的示例:

with open("phonebookCOMPK-Directory.csv") as f:
    lineno = 0

    for line in f:
        lineno += 1

        try:
            business = deserialize_business(line)

        except:
            # Bad line formatting?
            log.exception(u"Failed to deserialize line #%s!", lineno)

        else:
            # All is well
            store_business(business)

存储数据

您将使用store_business()函数获取数据结构并将其写入某个位置。也许它是另一个结构更好的CSV,也许是多个CSV,一个JSON文件,或者你可以使用SQLite关系数据库工具,因为Python内置了它。
这一切都取决于你以后想做什么。

关系示例

在这种情况下,您的数据将被拆分到多个表中。(我使用的是“table”这个词,但它可以是一个CSV文件,尽管你也可以使用SQLite DB,因为Python内置了它。
确定所有可能业务标题的表格:

business heading ID, name
1, Abattoirs
2, Exporters
3, Food Delivery
4, Butchers Retail
5, Meat Dealers-Retail
6, Meat Freezer
7, Meat Packers

确定所有可能类别的表格:

category ID, parent category, name
1, NULL, "Agriculture, fishing & Forestry"
2, 1, "Farming equipment & services"
3, 2, "Abattoirs in Pakistan"

企业标识表:

business ID, name, phone, owner, type, yoe, category
1, Meat One, +92-21-111163281, Al Shaheer Corporation, Retailers, 2008, 3

描述其插座的表格:

business ID, city, address, landmarks, phone
1, Karachi UAN, "Shop 13, Ground Floor, Plot 14-D, Sky Garden, Main Tipu Sultan Road, KDA Scheme No.1, Karachi", "Nadra Chowrangi, Sky Garden, Tipu Sultan Road", +92-21-111163281
1, Karachi UAN, "Near Jan's Broast, Boat Basin, Khayaban-e-Roomi, Block 5, Clifton, Karachi", "Boat Basin, Jans Broast, Khayaban-e-Roomi", +92-21-111163281

描述其标题的表格:

business ID, business heading ID
1, 1
1, 2
1, 3
…

处理所有这些需要一个复杂的store_business()函数。如果使用关系型的方式保存数据,可能值得研究SQLite和一些ORM框架。

e3bfsja2

e3bfsja22#

你可以替换这行:

print(listt[0])

其中:

print(*listt[0], sep='\n')

相关问题