I am trying to create variables from text imported from a .doc file. For the given text:
10 5,476,326.00 6 GRANITE CONSTRUCTION COMPANY 831 724-1011
00000089
P O BOX 50085 FAX 831 768-4021
WATSONVILLE CA 95077-5085
08-0C8104 BID245
08-SBD-15-4 PAGE 3
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
01 C AND W CONSTRUCTION SPECIALTIES INC ITEMS 26, 27, 58 AND 59, STRIPING (PARTIAL)
2419 PALMA DRIVE
VENTURA CA 93003
CAL STRIPE INC ITEMS 15, 66 AND 67
375 SOUTH G STREET
SAN BERNARDINO CA 92410
INTEGRITY REBAR PLACERS ITEMS 60 THRU 65 (PARTIAL)
23811 WASHINGTON AVE 110 317
MURRIETA CA 92562
J F L ELECTRIC INC ITEMS 68 AND 69
8257 COMPTON
LOS ANGELES CA 90001
MURPHY INDUSTRIAL COATING INC ITEM 47
2704 GUNERLY AVENUE
SIGNAL HILL C 90755
08-0C8104 BID245
08-SBD-15-4 PAGE 4
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
03 C W CONSTRUCTION SPECIALTY INC ITEMS 26, 27, 58 AND 59
VENTURA CA
J F L ELECTRIC INC ITEMS 68 AND 69
LOS ANGELES CA
LUNDENE PAINTING ITEM 47
FONTANA CA
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
10 C AND W CONSTRUCTION SPECIALTY INC ITEMS 26, 27, 58 AND 59 (PARTIAL)
VENTURA CA
FFB VANGUARD CONSTRUCTION ITEMS 60 THRU 65 (PARTIAL)
LIVERMORE CA
J F L ELECTRIC INC ITEMS 68 AND 69 (PARTIAL)
LOS ANGELES CA
PAVEMENT RECYCLING SYSTEM INC ITEM 28 (PARTIAL)
RIVERSIDE CA
VISUAL POLLUTION TECHNOLOGIES INC ITEM 47 (PARTIAL)
SCOTTSDALE AZ
08-0C8104 BID245
08-SBD-15-4 PAGE 5
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
09 INTEGRITY REBAR PLACERS ITEMS 60 THRU 65 (PARTIAL)
MURIETTA CA
J F L ELECTRIC INC ITEMS 68 THRU 69 (PARTIAL)
LOS ANELES CA
MARINA LANDSCAPE INC EROSION CONTROL (PARTIAL)
ANAHEIM CA
PAVEMENT RECYCLING SYSTEMS INC ITEM 28 (PARTIAL)
RIVERSIDE CA
STERNDAHL ENTERPRISES INC STRIPING (PARTIAL)
SUN VALLEY CA
TOOMEY INDUSTRIES TRAFFIC CONTROL (PARTIAL)
LONG BEACH CA
VISUAL POLLUTION TECHNOLOGIES INC ITEM 47 (PARTIAL)
SCOTTSDALE AZ
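I am trying to build a dataset of the following form (covering every bidder ID in the text):
| Bidder ID | Number of subcontractors | Items |
| --- | --- | --- |
| 01 | 5 | 26, 27, 58, 59, STRIPING (PARTIAL), 15, 66, 67, 60 THRU 65, 68, 69, 47 |
| 03 | 3 | 26, 27, 58, 59, 68, 69, 47 |
Many thanks to @Andrej Keseley: the code below captures most of what we want in the dataset.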
import re
import pandas as pd
document = """
10 5,476,326.00 6 GRANITE CONSTRUCTION COMPANY 831 724-1011
00000089
P O BOX 50085 FAX 831 768-4021
WATSONVILLE CA 95077-5085
08-0C8104 BID245
08-SBD-15-4 PAGE 3
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
01 C AND W CONSTRUCTION SPECIALTIES INC ITEMS 26, 27, 58 AND 59, STRIPING (PARTIAL)
2419 PALMA DRIVE
VENTURA CA 93003
CAL STRIPE INC ITEMS 15, 66 AND 67
375 SOUTH G STREET
SAN BERNARDINO CA 92410
INTEGRITY REBAR PLACERS ITEMS 60 THRU 65 (PARTIAL)
23811 WASHINGTON AVE 110 317
MURRIETA CA 92562
J F L ELECTRIC INC ITEMS 68 AND 69
8257 COMPTON
LOS ANGELES CA 90001
MURPHY INDUSTRIAL COATING INC ITEM 47
2704 GUNERLY AVENUE
SIGNAL HILL C 90755
08-0C8104 BID245
08-SBD-15-4 PAGE 4
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
03 C W CONSTRUCTION SPECIALTY INC ITEMS 26, 27, 58 AND 59
VENTURA CA
J F L ELECTRIC INC ITEMS 68 AND 69
LOS ANGELES CA
LUNDENE PAINTING ITEM 47
FONTANA CA
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
10 C AND W CONSTRUCTION SPECIALTY INC ITEMS 26, 27, 58 AND 59 (PARTIAL)
VENTURA CA
FFB VANGUARD CONSTRUCTION ITEMS 60 THRU 65 (PARTIAL)
LIVERMORE CA
J F L ELECTRIC INC ITEMS 68 AND 69 (PARTIAL)
LOS ANGELES CA
PAVEMENT RECYCLING SYSTEM INC ITEM 28 (PARTIAL)
RIVERSIDE CA
VISUAL POLLUTION TECHNOLOGIES INC ITEM 47 (PARTIAL)
SCOTTSDALE AZ
08-0C8104 BID245
08-SBD-15-4 PAGE 5
11/21/08 11/26/08
L I S T O F S U B C O N T R A C T O R S
BIDDER ID NAME AND ADDRESS DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
_________ ____________________________________________________________ ____________________________________________________________
09 INTEGRITY REBAR PLACERS ITEMS 60 THRU 65 (PARTIAL)
MURIETTA CA
J F L ELECTRIC INC ITEMS 68 THRU 69 (PARTIAL)
LOS ANELES CA
MARINA LANDSCAPE INC EROSION CONTROL (PARTIAL)
ANAHEIM CA
PAVEMENT RECYCLING SYSTEMS INC ITEM 28 (PARTIAL)
RIVERSIDE CA
STERNDAHL ENTERPRISES INC STRIPING (PARTIAL)
SUN VALLEY CA
TOOMEY INDUSTRIES TRAFFIC CONTROL (PARTIAL)
LONG BEACH CA
VISUAL POLLUTION TECHNOLOGIES INC ITEM 47 (PARTIAL)
SCOTTSDALE AZ
"""
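data = []
# Each match is one bidder block: the ID that follows the column header, then
# everything up to the next header, a dashed rule, or the end of the text.
for id_, group in re.findall(
    r"(?s)BIDDER ID\D+DESCRIPTION OF PORTION OF WORK SUBCONTRACTED\D+(\d+)(.*?)(?=BIDDER ID|-{5,}|\Z)",
    document,
):
    # Only descriptions that start with ITEM or ITEMS are captured here.
    items = re.findall(r"ITEMS? (.*)", group)
    data.append(
        {
            "bidder-id": id_,
            # Counts the blank lines that separate entries in the raw text.
            "number_subcontractors": group.count('\n\n'),
            "items": ", ".join(
                i.replace(" (PARTIAL)", "").replace(" AND", ",").strip() for i in items
            ),
        }
    )
df = pd.DataFrame(data)
print(df)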
Prints:
bidder-id number_subcontractors items
0 01 5 26, 27, 58, 59, 15, 66, 67, 60 THRU 65, 68, 69, 47
1 03 3 26, 27, 58, 59, 68, 69, 47
2 10 5 26, 27, 58, 59, 60 THRU 65, 68, 69, 28, 47
3 09 7 60 THRU 65, 68 THRU 69, 28, 47
4 04 3 26, 27, 57 THRU 59
5 04 3 60, 61
6 08 6 57 THRU 59, 15, 22, 23, 66, 67, 68, 69, 38, 40 THRU 43, 47
7 02 7 26, 27, 58, 59, 60 THRU 65, 2, 68 THRU 70, 28, 31, 46, 51, 56, 12, 14, 16, 19, 57, 47
8 07 1 26, 27, 58, 59
9 07 5 60 THRU 65, 68, 69, 29, 15, 66, 67, 69
10 05 5 60 THRU 65, 68, 69, 28, 12, 13, 15, 66, 67
11 05 1 40 THRU 45
12 06 3 26, 27, 57 THRU 59, 60 THRU 65, 47
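However, it does not capture descriptions that do not start with ITEMS (such as STRIPING (PARTIAL)). I am not sure whether this can be achieved with regex alone in the current code. Maybe splitting the text would help? I am not sure and am still trying to work it out.
Any help or leads would be greatly appreciated! Thank you very much!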
2 Answers
Answer 1
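Sure, I think this might work.
In the code, you only need to replace one line with another and then make a small change to the data.append call. This in turn changes how the items column for bidder ID 09 comes out.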
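As an illustration only, and not the answerer's actual code, here is a minimal sketch of one way to pick up descriptions that do not start with ITEMS. It assumes the raw document separates the name/address column from the description column with two or more spaces, which the text quoted in the question no longer shows. The `group` excerpt and the helper names `extract_descriptions` and `clean` are hypothetical.

```python
import re

# Hypothetical excerpt of one bidder block. The wide gap between the two
# columns is an assumption about the raw document.
group = (
    "   INTEGRITY REBAR PLACERS            ITEMS 60 THRU 65 (PARTIAL)\n"
    "   MURIETTA CA\n"
    "\n"
    "   MARINA LANDSCAPE INC               EROSION CONTROL (PARTIAL)\n"
    "   ANAHEIM CA\n"
    "\n"
    "   STERNDAHL ENTERPRISES INC          STRIPING (PARTIAL)\n"
    "   SUN VALLEY CA\n"
)


def extract_descriptions(group):
    """Return the right-hand column of every line that has exactly two columns."""
    descriptions = []
    for line in group.splitlines():
        columns = re.split(r"\s{2,}", line.strip())
        if len(columns) == 2:  # name line; address lines have a single column
            descriptions.append(columns[1])
    return descriptions


def clean(description):
    """Drop a leading ITEM/ITEMS and turn ' AND' into ',' as the question's code does."""
    description = re.sub(r"^ITEMS?\s+", "", description)
    return description.replace(" AND", ",")


items = ", ".join(clean(d) for d in extract_descriptions(group))
print(items)
```

Splitting on the column gap rather than matching the word ITEMS means any description is kept, whatever it starts with. If the raw text uses single spaces between columns, this approach would not work and a different delimiter would be needed.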
Answer 2
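It might be interesting to load all the tabular data from the document text into a DataFrame (including the subcontractors' names and addresses), so that you can then retrieve whatever you need from it.
You can then use pandas' groupby method to extract the summary information. For the sample document, the summary DataFrame would contain:
| Bidder ID | Items | Number of subcontractors |
| --- | --- | --- |
| 01 | 26, 27, 58, 59, STRIPING (PARTIAL), 15, 66, 67, 60 THRU 65 (PARTIAL), 68, 69, 47 | 5 |
| 03 | 26, 27, 58, 59, 68, 69, 47 | 3 |
| 09 | 60 THRU 65 (PARTIAL), 68 THRU 69 (PARTIAL), EROSION CONTROL (PARTIAL), 28 (PARTIAL), STRIPING (PARTIAL), TRAFFIC CONTROL (PARTIAL), 47 (PARTIAL) | 7 |
| 10 | 26, 27, 58, 59 (PARTIAL), 60 THRU 65 (PARTIAL), 68, 69 (PARTIAL), 28 (PARTIAL), 47 (PARTIAL) | 5 |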
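The groupby step could look roughly like the sketch below. It builds a small per-subcontractor frame by hand, using the rows for bidders 03 and 10, and then aggregates it. The column names `bidder_id`, `items`, `name` and `number_subcontractors` are assumptions chosen for illustration.

```python
import pandas as pd

# Hand-built stand-in for the per-subcontractor frame (bidders 03 and 10 only).
# Column names are assumptions, not taken from the answer's code.
subcontractors = pd.DataFrame(
    [
        ("03", "26, 27, 58, 59", "C W CONSTRUCTION SPECIALTY INC"),
        ("03", "68, 69", "J F L ELECTRIC INC"),
        ("03", "47", "LUNDENE PAINTING"),
        ("10", "26, 27, 58, 59 (PARTIAL)", "C AND W CONSTRUCTION SPECIALTY INC"),
        ("10", "60 THRU 65 (PARTIAL)", "FFB VANGUARD CONSTRUCTION"),
        ("10", "68, 69 (PARTIAL)", "J F L ELECTRIC INC"),
        ("10", "28 (PARTIAL)", "PAVEMENT RECYCLING SYSTEM INC"),
        ("10", "47 (PARTIAL)", "VISUAL POLLUTION TECHNOLOGIES INC"),
    ],
    columns=["bidder_id", "items", "name"],
)

# One row per bidder: join the item strings and count the subcontractors.
summary = (
    subcontractors.groupby("bidder_id", sort=False)
    .agg(
        items=("items", ", ".join),
        number_subcontractors=("name", "count"),
    )
    .reset_index()
)
print(summary)
```

Passing `sort=False` keeps the bidders in the order they appear in the document instead of sorting them by ID.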
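For reference, the subcontractors DataFrame looks like this:
| Bidder ID | Items | Name | Address |
| --- | --- | --- | --- |
| 01 | 26, 27, 58, 59, STRIPING (PARTIAL) | C AND W CONSTRUCTION SPECIALTIES INC | 2419 PALMA DRIVE VENTURA CA 93003 |
| 01 | 15, 66, 67 | CAL STRIPE INC | 375 SOUTH G STREET SAN BERNARDINO CA 92410 |
| 01 | 60 THRU 65 (PARTIAL) | INTEGRITY REBAR PLACERS | 23811 WASHINGTON AVE 110 317 MURRIETA CA 92562 |
| 01 | 68, 69 | J F L ELECTRIC INC | 8257 COMPTON LOS ANGELES CA 90001 |
| 01 | 47 | MURPHY INDUSTRIAL COATING INC | 2704 GUNERLY AVENUE SIGNAL HILL C 90755 |
| 03 | 26, 27, 58, 59 | C W CONSTRUCTION SPECIALTY INC | VENTURA CA |
| 03 | 68, 69 | J F L ELECTRIC INC | LOS ANGELES CA |
| 03 | 47 | LUNDENE PAINTING | FONTANA CA |
| 10 | 26, 27, 58, 59 (PARTIAL) | C AND W CONSTRUCTION SPECIALTY INC | VENTURA CA |
| 10 | 60 THRU 65 (PARTIAL) | FFB VANGUARD CONSTRUCTION | LIVERMORE CA |
| 10 | 68, 69 (PARTIAL) | J F L ELECTRIC INC | LOS ANGELES CA |
| 10 | 28 (PARTIAL) | PAVEMENT RECYCLING SYSTEM INC | RIVERSIDE CA |
| 10 | 47 (PARTIAL) | VISUAL POLLUTION TECHNOLOGIES INC | SCOTTSDALE AZ |
| 09 | 60 THRU 65 (PARTIAL) | INTEGRITY REBAR PLACERS | MURIETTA CA |
| 09 | 68 THRU 69 (PARTIAL) | J F L ELECTRIC INC | LOS ANELES CA |
| 09 | EROSION CONTROL (PARTIAL) | MARINA LANDSCAPE INC | ANAHEIM CA |
| 09 | 28 (PARTIAL) | PAVEMENT RECYCLING SYSTEMS INC | RIVERSIDE CA |
| 09 | STRIPING (PARTIAL) | STERNDAHL ENTERPRISES INC | SUN VALLEY CA |
| 09 | TRAFFIC CONTROL (PARTIAL) | TOOMEY INDUSTRIES | LONG BEACH CA |
| 09 | 47 (PARTIAL) | VISUAL POLLUTION TECHNOLOGIES INC | SCOTTSDALE AZ |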
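A frame like the one above could be built with a line-by-line parser. The sketch below is one possible approach under stated assumptions, not the answerer's code: it assumes the bidder ID sits at the start of a line, that columns are separated by two or more spaces, and that page headers have already been removed. The sample `text` and the function name `parse_subcontractors` are hypothetical.

```python
import re
import pandas as pd

# Hypothetical excerpt. The wide column gaps and the blank lines between
# entries are assumptions about the raw document.
text = """\
03        C W CONSTRUCTION SPECIALTY INC      ITEMS 26, 27, 58 AND 59
          VENTURA CA

          J F L ELECTRIC INC                  ITEMS 68 AND 69
          LOS ANGELES CA

09        MARINA LANDSCAPE INC                EROSION CONTROL (PARTIAL)
          ANAHEIM CA
"""


def parse_subcontractors(text):
    """Build one row per subcontractor: bidder_id, items, name, address."""
    rows = []
    bidder_id = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank separator between entries
        match = re.match(r"(\d{2})\s{2,}", line)
        if match:  # a two-digit ID at column 0 starts a new bidder
            bidder_id = match.group(1)
            line = line[match.end():]
        columns = re.split(r"\s{2,}", line.strip())
        if len(columns) == 2:  # name line: name + description
            name, description = columns
            items = re.sub(r"^ITEMS?\s+", "", description).replace(" AND", ",")
            rows.append(
                {"bidder_id": bidder_id, "items": items, "name": name, "address": ""}
            )
        elif rows:  # address line: append to the most recent subcontractor
            rows[-1]["address"] = f'{rows[-1]["address"]} {columns[0]}'.strip()
    return pd.DataFrame(rows)


subcontractors = parse_subcontractors(text)
print(subcontractors)
```

Indented address lines that begin with digits, such as street numbers, are not mistaken for bidder IDs because the ID pattern only matches at the very start of a line.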