I have a file like this:
DOWNLOADPRINTER
S
CCODEPAGE 4103
CPAGENAME PAGE
CR_PAGE_INFO_START
FAVOURITE_FOOD Cookies
FAVOURITE_CAR AUDI
CR_PAGE_INFO_END
CR_ADDR_LINE_BEGIN
Adress_Post_code 1234
Adress_City GeorgeTown
CR_ADDR_LINE_END
CR_PERSONAL_INFO_START
FIRST_NAME John
LAST_NAME Doe
CR_PERSONAL_INFO_END
CR_ADDR_LINE_BEGIN
Adress_Post_code 1234
Adress_City GeorgeTown
CR_ADDR_LINE_END
CR_PERSONAL_INFO_START
FIRST_NAME Jane
LAST_NAME Doe
CR_PERSONAL_INFO_END
CR_ADDR_LINE_BEGIN
...
(arbitrary number of records; the attributes always appear in the same order and number)
...
CR_PERSONAL_INFO_END
DOWNLOADPRINTER
S
CCODEPAGE 4103
CPAGENAME PAGE
CR_PAGE_INFO_START
FAVOURITE_FOOD Donuts
FAVOURITE_CAR AUDI
CR_PAGE_INFO_END
CR_ADDR_LINE_BEGIN
Adress_Post_code 1234
Adress_City GeorgeTown
CR_ADDR_LINE_END
CR_PERSONAL_INFO_START
FIRST_NAME Jennifer
LAST_NAME Doe
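The file contains about 10,000 records spread over 1,000 datasets.
I want to group it by the different attributes into a properly formatted CSV like this: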
FAVOURITE_FOOD, FAVOURITE_CAR, Adress_Post_code, Adress_City, FIRST_NAME, LAST_NAME
Cookies, Audi, 1234, GeorgeTown, John, Doe
Cookies, Audi, 1234, GeorgeTown, Jane, Doe
......
Donuts, Audi, 1234, GeorgeTown, Jennifer, Doe
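The goal is to ignore all of these parameters: CR_..., DOWNLOADPRINTER, the line S, and CCODEPAGE.
The special parameters are FAVOURITE_FOOD and FAVOURITE_CAR: each occurs once per dataset, but must be the prefix of every row in that dataset.
Current approach: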
    import csv
    import os
    import re

    # directory of this script (the original used the undefined name `file`)
    path = os.path.dirname(__file__)
    filename = 'input.TXT'
    output = 'output.csv'
    attributes = ('FAVOURITE_FOOD', 'FAVOURITE_CAR', 'Adress_Post_code', 'Adress_City', 'FIRST_NAME', 'LAST_NAME')
    ## don't parse everything while testing
    num_lines = 5000
    with open(os.path.join(path, filename), 'r') as file:
        with open(output, 'w', newline='') as out_file:
            writer = csv.writer(out_file)
            writer.writerow(attributes)
            for i in range(num_lines):
                line = next(file).strip()
                if line.startswith('FAVOURITE_FOOD'):
                    print('new dataset found')
                    prefix = re.sub('FAVOURITE_FOOD', '', line)
                    print(prefix)
                    continue
                if line.startswith('FAVOURITE_CAR'):
                    prefix += ',' + re.sub('FAVOURITE_CAR', '', line)
                    print(prefix)
                    continue
                if line.startswith('Adress_City'):
                    line = re.sub('Adress_City', '', line)
                    ## don't allow whitespace
                    line = re.sub(' ', '', line)
                    out_file.write(prefix + line)
                    ## how to continue with the other files?
                    ## I would like to stick with writer.writerow and not out_file.write
1 answer
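To use csv.writer, you have to build a list for each row. The first two elements are common to a whole group of rows. I suggest using some of the ignored lines as sentinel values, so that you know when a new row or group of rows starts and ends. That way you can store a value each time you find a relevant field, and write the row to the file when you reach the sentinel line. With your sample data, this gives the expected result.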
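The sentinel idea described above can be sketched roughly as follows. This is a minimal illustration, not necessarily the code originally posted with this answer. It assumes that CR_PAGE_INFO_START opens a new dataset and CR_PERSONAL_INFO_END closes one person's row, as in the sample file. The function name `convert` and the constants `PREFIX_FIELDS` and `ROW_FIELDS` are made up for this example. Values are written exactly as they appear in the input, so the car comes out as AUDI rather than Audi.

```python
import csv
import io

ATTRIBUTES = ('FAVOURITE_FOOD', 'FAVOURITE_CAR', 'Adress_Post_code',
              'Adress_City', 'FIRST_NAME', 'LAST_NAME')
PREFIX_FIELDS = ATTRIBUTES[:2]   # appear once per dataset (hypothetical name)
ROW_FIELDS = ATTRIBUTES[2:]      # appear once per person (hypothetical name)


def convert(lines, out_file):
    """Write one CSV row per person, prefixed with the dataset-level fields."""
    writer = csv.writer(out_file)
    writer.writerow(ATTRIBUTES)
    prefix = {}  # dataset-level values, shared by a group of rows
    row = {}     # values of the row currently being collected

    def flush():
        # Emit the collected row, if any, then start a fresh one.
        if row:
            values = {**prefix, **row}
            writer.writerow([values.get(name, '') for name in ATTRIBUTES])
            row.clear()

    for raw in lines:
        key, _, value = raw.strip().partition(' ')
        if key == 'CR_PAGE_INFO_START':      # sentinel: a new dataset begins
            flush()
            prefix.clear()
        elif key == 'CR_PERSONAL_INFO_END':  # sentinel: one row is complete
            flush()
        elif key in PREFIX_FIELDS:
            prefix[key] = value.strip()
        elif key in ROW_FIELDS:
            row[key] = value.strip()
        # every other line (DOWNLOADPRINTER, S, CCODEPAGE, CR_...) is ignored

    flush()  # the input may end without a closing sentinel, as in the sample


# Small in-memory demo; for a real file, pass the open input file and an
# output file opened with newline=''.
demo = io.StringIO()
convert(["CR_PAGE_INFO_START", "FAVOURITE_FOOD Cookies", "FAVOURITE_CAR AUDI",
         "CR_PAGE_INFO_END", "CR_ADDR_LINE_BEGIN", "Adress_Post_code 1234",
         "Adress_City GeorgeTown", "CR_ADDR_LINE_END", "CR_PERSONAL_INFO_START",
         "FIRST_NAME John", "LAST_NAME Doe", "CR_PERSONAL_INFO_END"], demo)
print(demo.getvalue())
```

Because each row is built as a list and passed to writer.writerow, there is no need to strip whitespace or join commas by hand, and the file can be iterated line by line instead of calling next() a fixed number of times.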