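I'm trying to create a dataframe with these variables: bidder_rank, bidder_id, bid_total, bidder_info.
I wrote a regex pattern that seems to work on regex101. However, the Python script keeps failing for reasons I can't figure out.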
# imports
import os
import pandas as pd
import re
# text
text = '''
1 A) $11,644,939.00 VC0000007181 S.T. RHOADES CONSTRUCTION, INC. Phone (530)223-9322
B) 210 Days * 10000 8585 COMMERCIAL WAY CSLB# 00930684
A+B) $13,744,939.00 REDDING CA 96002
2 A) $12,561,053.00 VC0000007021 GR SUNDBERG, INC. Phone (707)825-6565
B) 210 Days * 10000 5211 BOYD ROAD CSLB# 00732695
A+B) $14,661,053.00 ARCATA CA 95521 Fax (707)825-6563
3 A) $13,098,288.00 VC1800001127 CALIFORNIA HIGHWAY CONSTRUCTION GROUP, Phone (925)766-7014
INC.
B) 210 Days * 10000 1647 WILLOW PASS ROAD CSLB# 01027700
A+B) $15,198,288.00 CONCORD CA 94520 Fax (925)265-9101
4 A) $13,661,954.26 VC0000003985 MERCER FRASER COMPANY Phone (707)443-6371
B) 210 Days * 10000 200 DINSMORE DR CSLB# 00105709
A+B) $15,761,954.26 FORTUNA CA 95540 Fax (707)443-0277
Bid Opening Date: 11/15/2022 Page 2
Contract Number: 01-0H20U4 11/15/2022
5 A) $15,396,278.00 VC0000000213 GRANITE CONSTRUCTION COMPANY Phone (831)728-7561
B) 210 Days * 10000 585 W BEACH STREET CSLB# 00000089
A+B) $17,496,278.00 WATSONVILLE CA 95076
Bid Opening Date: 11/15/2022 Page 3
Contract Number: 01-0H20U4 11/15/2022
'''
lines = re.split(r'(?=^\d)', text, flags=re.MULTILINE)
# list of bids
bids = []
# loop through each line in the bid rank bid ID data table
for i in (0, len(lines)-1):
    l = lines[i]
    ok = re.findall(r"(?ms)(^\d+)\s*(.*)(VC\d+)\s+(.*)([\s\S]*?)(A\+B\)\s+(\$\d{1,3}(,\d{3})*(\.\d+)?))", str(l))
    # skip this chunk if nothing matched
    if len(ok) == 0:
        continue
    else:
        ok = ok[0]
    # first group is bid_rank, third group is bid_id, fourth group is bidder_info, seventh group is bid_total
    bidder_rank = ok[0]
    bidder_id = ok[2]
    bidder_info = ok[3]
    bid_total = ok[6]
    # create a tuple of the bid rank, bid ID, bidder info, and bid total
    bid_tuple = (bidder_rank, bidder_id, bidder_info, bid_total)
    # append the tuple to the list of bids
    bids.append(bid_tuple)
    print(bid_tuple)
# create a dataframe of the bids
biddf = pd.DataFrame(bids, columns=['bidder_rank', 'bidder_id', 'bidder_info', 'bid_total'])
print(biddf)
After some digging, it seems to work only for bidder_rank = 5.
>>> print(biddf)
bidder_rank bidder_id bidder_info bid_total
0 5 VC0000000213 GRANITE CONSTRUCTION COMPANY Phone (831)728-... $17,496,278.00
However, according to regex101, it should work for all bidder IDs. Am I missing something?
2 Answers
af7jpaap1#
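Your regex works fine for me, as long as you don't use the s flag. I modified it slightly: I removed the unnecessary capture groups and turned the groups that are still needed for matching into non-capturing groups, so they no longer appear in the output. Regex demo on regex101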
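To illustrate why the s flag matters, here is a minimal sketch. The sample string is made up for the demonstration (it only mimics the layout of the bid data), and the two patterns are simplified stand-ins, not the pattern from the question.

```python
import re

# Hypothetical two-bid sample that mimics the layout of the real data.
sample = (
    "1 A) $10.00 VC001 ACME INC. Phone (555)000-0000\n"
    "B) 210 Days\n"
    "A+B) $12.00 TOWN CA 90000\n"
)

# Without the s flag, '.' stops at the end of the line.
no_dotall = re.search(r"(?m)(VC\d+)\s+(.*)", sample)

# With the s flag (DOTALL), '.' also matches '\n', so a greedy '.*'
# runs across line boundaries and swallows the following lines.
with_dotall = re.search(r"(?ms)(VC\d+)\s+(.*)", sample)

print(repr(no_dotall.group(2)))
print(repr(with_dotall.group(2)))
```

In the first case the captured bidder info is a single line; in the second it also contains the B) and A+B) lines, which is why the capture groups come out bloated or misaligned when the flag is set.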
You can then apply re.findall to the entire text and pass its output directly to pd.DataFrame.
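A rough sketch of that approach follows. The pattern here is my own reconstruction under the description above (no s flag, only four capturing groups) and is not necessarily the exact pattern from the linked demo. The text is shortened to the first three bids from the question, and the pandas import is guarded only so the regex part can run on its own.

```python
import re

try:
    import pandas as pd
except ImportError:  # the regex part below still runs without pandas
    pd = None

text = '''
1 A) $11,644,939.00 VC0000007181 S.T. RHOADES CONSTRUCTION, INC. Phone (530)223-9322
B) 210 Days * 10000 8585 COMMERCIAL WAY CSLB# 00930684
A+B) $13,744,939.00 REDDING CA 96002
2 A) $12,561,053.00 VC0000007021 GR SUNDBERG, INC. Phone (707)825-6565
B) 210 Days * 10000 5211 BOYD ROAD CSLB# 00732695
A+B) $14,661,053.00 ARCATA CA 95521 Fax (707)825-6563
3 A) $13,098,288.00 VC1800001127 CALIFORNIA HIGHWAY CONSTRUCTION GROUP, Phone (925)766-7014
INC.
B) 210 Days * 10000 1647 WILLOW PASS ROAD CSLB# 01027700
A+B) $15,198,288.00 CONCORD CA 94520 Fax (925)265-9101
'''

# Assumed pattern: m flag only, so '.' never crosses a newline.
# Groups: rank, bidder id, rest of the first line, A+B total.
pattern = re.compile(
    r"^(\d+)\s+A\).*?(VC\d+)\s+(.*)"      # rank, id, bidder info
    r"[\s\S]*?"                            # lazily skip to the A+B) line
    r"A\+B\)\s+(\$\d{1,3}(?:,\d{3})*(?:\.\d+)?)",
    re.MULTILINE,
)

# findall returns one 4-tuple per bid, in capture-group order.
rows = pattern.findall(text)

if pd is not None:
    biddf = pd.DataFrame(
        rows, columns=['bidder_rank', 'bidder_id', 'bidder_info', 'bid_total']
    )
    print(biddf)
else:
    print(rows)
```

Because only the four wanted values are captured, the tuples line up with the column names and no index juggling (ok[0], ok[2], ...) is needed.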
xjreopfe2#
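There are a few things we have to change in your code. First, your for loop iterates over the tuple (0, len(lines)-1), which means it only checks the first and last items in lines. In addition, your regex pattern is more complicated than it needs to be, and you are not splitting the input string into lines the right way.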
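As a small illustration of the first point, the sketch below compares iterating over the tuple with iterating over range(), and then shows one possible way to split the text into per-bid chunks. The sample text and the split pattern are assumptions made for the demonstration, not code taken from the question.

```python
import re

lines = ["", "bid 1", "bid 2", "bid 3", "bid 4", "bid 5"]

# A tuple literal yields only its two elements: the first and last index.
visited_by_tuple = [i for i in (0, len(lines) - 1)]
# range() yields every index from 0 to len(lines) - 1.
visited_by_range = [i for i in range(len(lines))]

print(visited_by_tuple)  # [0, 5]
print(visited_by_range)  # [0, 1, 2, 3, 4, 5]

# Hypothetical sample with the same layout as the bid table.
sample = (
    "1 A) $10.00 VC001 ACME\nB) 210 Days\nA+B) $12.00\n"
    "2 A) $11.00 VC002 BETA\nB) 210 Days\nA+B) $13.00\n"
)

# Split before every line that starts a new bid ("<rank> A)"),
# and drop empty chunks. Splitting on an empty match needs Python 3.7+.
chunks = [c for c in re.split(r"(?m)^(?=\d+\s+A\))", sample) if c.strip()]

# Iterating over the chunks directly avoids index arithmetic altogether.
for chunk in chunks:
    print(chunk.splitlines()[0])
```

With the tuple, only index 0 (an empty chunk) and index 5 are visited, which matches the single row for rank 5 in the question's output.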