Python: best way to group a marker-delimited text file into a CSV

e4eetjau posted on 2021-08-25 in Java

I have a file like this:

DOWNLOADPRINTER
S
CCODEPAGE 4103
CPAGENAME PAGE
CR_PAGE_INFO_START
FAVOURITE_FOOD Cookies
FAVOURITE_CAR AUDI
CR_PAGE_INFO_END
CR_ADDR_LINE_BEGIN
Adress_Post_code 1234
Adress_City GeorgeTown
CR_ADDR_LINE_END
CR_PERSONAL_INFO_START
FIRST_NAME John
LAST_NAME Doe
CR_PERSONAL_INFO_END
CR_ADDR_LINE_BEGIN
Adress_Post_code 1234
Adress_City GeorgeTown
CR_ADDR_LINE_END
CR_PERSONAL_INFO_START
FIRST_NAME Jane
LAST_NAME Doe
CR_PERSONAL_INFO_END
CR_ADDR_LINE_BEGIN
...
(arbitrary number of records; the attributes always appear in the same order and number)
...
CR_PERSONAL_INFO_END
DOWNLOADPRINTER
S
CCODEPAGE 4103
CPAGENAME PAGE
CR_PAGE_INFO_START
FAVOURITE_FOOD Donuts
FAVOURITE_CAR AUDI
CR_PAGE_INFO_END
CR_ADDR_LINE_BEGIN
Adress_Post_code 1234
Adress_City GeorgeTown
CR_ADDR_LINE_END
CR_PERSONAL_INFO_START
FIRST_NAME Jennifer
LAST_NAME Doe

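The file contains roughly 10,000 records in 1,000 datasets.
I would like to group it by the different attributes into a properly formatted CSV like this: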

FAVOURITE_FOOD, FAVOURITE_CAR, Adress_Post_code, Adress_City, FIRST_NAME, LAST_NAME
Cookies, Audi, 1234, GeorgeTown, John, Doe
Cookies, Audi, 1234, GeorgeTown, Jane, Doe
......
Donuts, Audi, 1234, GeorgeTown, Jennifer, Doe

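The goal is to ignore all of these parameters: CR_..., DOWNLOADPRINTER, the S line, CCODEPAGE.
The special parameters are FAVOURITE_FOOD and FAVOURITE_CAR: they appear once per dataset, but they must prefix every row of that dataset.
Current approach: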

import csv
import os
import re
path = os.path.dirname(__file__)
filename = '/input.TXT'
output = 'output.csv'
attributes = ('FAVOURITE_FOOD', 'FAVOURITE_CAR', 'Adress_Post_code', 'Adress_City', 'FIRST_NAME','LAST_NAME' )

# don't parse everything while testing

num_lines = 5000

with open(path + filename, 'r') as file:
    with open('output.csv', 'w') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(attributes)
        for i in range(num_lines):
            line = next(file).strip()
            if str(line).startswith('FAVOURITE_FOOD'):
                prefix = ''
                print('new dataset found')
                prefix = re.sub('FAVOURITE_FOOD', '', line)
                print(prefix)
                continue
            if str(line).startswith('FAVOURITE_CAR'):
                prefix += ',' + re.sub('FAVOURITE_CAR', '', line)
                print(prefix)
                continue
            if str(line).startswith('Adress_City'):
                line = re.sub('Adress_City', '', line)
                # don't allow whitespace
                line = re.sub(' ', '', line)
                out_file.write(prefix + line)
                # how to continue with the other files?
                # I would like to stick with writer.writerow and not out_file.write
c8ib6hqw #1

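To use csv.writer, you have to build a list for each row. The first two elements are common to a whole group of rows. I suggest using some of the ignored elements as sentinel values, so that you know where a new row or group of rows starts and ends. That way you can fill in the current row each time you find a relevant field, and write it to the file when you reach the sentinel line: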

...

# don't parse everything while testing

num_lines = 5000

sentinel = 'CR_PERSONAL_INFO_END'
reset = 'CR_PAGE_INFO_START'

# BEWARE do not forget newline='' for csv writers

with open(path + filename, 'r') as file, open(output, 'w', newline='') as out_file:
    writer = csv.writer(out_file)
    writer.writerow(attributes)      # write headers
    row = [''] * len(attributes)     # prepare an empty row
    for i, line in enumerate(file):  # loop
        if i >= num_lines:           # a maximum of num_lines times
            break
        line = line.strip()
        if line == reset:            # new dataset: reset all attributes
            row = [''] * len(attributes)
        elif line == sentinel:
            writer.writerow(row)     # write a row
            row[2:] = [''] * (len(attributes) - 2)  # reset all fields but the first two
        else:
            # split into name and value (the value may contain spaces)
            fields = line.split(maxsplit=1)
            try:
                # look up the first token of the line in attributes
                ix = attributes.index(fields[0])
                row[ix] = fields[1]  # if found, store the value in the current row
            except (ValueError, IndexError):
                # ignored marker, blank line, or attribute without a value
                pass

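With your sample data (leaving out the elided part and closing the last record with CR_PERSONAL_INFO_END), it gives the expected output: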

FAVOURITE_FOOD,FAVOURITE_CAR,Adress_Post_code,Adress_City,FIRST_NAME,LAST_NAME
Cookies,AUDI,1234,GeorgeTown,John,Doe
Cookies,AUDI,1234,GeorgeTown,Jane,Doe
Donuts,AUDI,1234,GeorgeTown,Jennifer,Doe
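
To try the sentinel idea without touching any files, the same logic can be sketched as a generator that works on any iterable of lines. This is only an illustration under some assumptions: iter_rows, SHARED and the in-memory sample are hypothetical names that do not appear in the question or the answer, the elided part of the sample is left out, and the last record is assumed to end with CR_PERSONAL_INFO_END.

```python
import csv
import io

ATTRIBUTES = ('FAVOURITE_FOOD', 'FAVOURITE_CAR', 'Adress_Post_code',
              'Adress_City', 'FIRST_NAME', 'LAST_NAME')
SHARED = 2  # assumption: the first two attributes are shared within a dataset


def iter_rows(lines, attributes=ATTRIBUTES, shared=SHARED,
              reset='CR_PAGE_INFO_START', sentinel='CR_PERSONAL_INFO_END'):
    """Yield one list per record (hypothetical helper, not from the answer)."""
    row = [''] * len(attributes)
    for line in lines:
        line = line.strip()
        if line == reset:            # a new dataset starts: clear everything
            row = [''] * len(attributes)
        elif line == sentinel:       # a record is complete: emit a copy
            yield list(row)
            row[shared:] = [''] * (len(attributes) - shared)
        else:
            fields = line.split(maxsplit=1)
            # keep only "NAME value" lines whose name is a wanted attribute
            if len(fields) == 2 and fields[0] in attributes:
                row[attributes.index(fields[0])] = fields[1]


# Shortened copy of the sample data, with the closing sentinel added.
sample = """\
DOWNLOADPRINTER
S
CCODEPAGE 4103
CPAGENAME PAGE
CR_PAGE_INFO_START
FAVOURITE_FOOD Cookies
FAVOURITE_CAR AUDI
CR_PAGE_INFO_END
CR_ADDR_LINE_BEGIN
Adress_Post_code 1234
Adress_City GeorgeTown
CR_ADDR_LINE_END
CR_PERSONAL_INFO_START
FIRST_NAME John
LAST_NAME Doe
CR_PERSONAL_INFO_END
CR_ADDR_LINE_BEGIN
Adress_Post_code 1234
Adress_City GeorgeTown
CR_ADDR_LINE_END
CR_PERSONAL_INFO_START
FIRST_NAME Jane
LAST_NAME Doe
CR_PERSONAL_INFO_END
DOWNLOADPRINTER
S
CCODEPAGE 4103
CPAGENAME PAGE
CR_PAGE_INFO_START
FAVOURITE_FOOD Donuts
FAVOURITE_CAR AUDI
CR_PAGE_INFO_END
CR_ADDR_LINE_BEGIN
Adress_Post_code 1234
Adress_City GeorgeTown
CR_ADDR_LINE_END
CR_PERSONAL_INFO_START
FIRST_NAME Jennifer
LAST_NAME Doe
CR_PERSONAL_INFO_END
"""

rows = list(iter_rows(io.StringIO(sample)))

# Write to an in-memory buffer; a real file would be opened with newline=''.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(ATTRIBUTES)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Separating the parsing from the writing keeps csv.writer.writerow (or writerows) as the only place where output is produced, which is what the question asked for, and it makes the grouping logic easy to check on a small string before running it on the full file.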
