Capturing data with regex in Python

amrnrhlw  posted on 2023-03-09 in Python

I am trying to create variables from text imported from a .doc file. Given the following text:

10         5,476,326.00    6              GRANITE CONSTRUCTION COMPANY          831 724-1011
                                                                                                  00000089
                                                            P O BOX 50085                     FAX 831 768-4021
                                                            WATSONVILLE CA  95077-5085
          08-0C8104                                                                                                BID245
          08-SBD-15-4                                                                                              PAGE  3
          11/21/08                                                                                                 11/26/08
                                            L I S T   O F   S U B C O N T R A C T O R S

 BIDDER ID NAME AND ADDRESS                                            DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
 _________ ____________________________________________________________ ____________________________________________________________

    01     C AND W CONSTRUCTION SPECIALTIES INC                         ITEMS 26, 27, 58 AND 59, STRIPING (PARTIAL)
           2419 PALMA DRIVE
           VENTURA CA  93003

           CAL STRIPE INC                                               ITEMS 15, 66 AND 67
           375 SOUTH G STREET
           SAN BERNARDINO CA  92410

           INTEGRITY REBAR PLACERS                                      ITEMS 60 THRU 65 (PARTIAL)
           23811 WASHINGTON AVE 110 317
           MURRIETA CA  92562

           J F L ELECTRIC INC                                           ITEMS 68 AND 69
           8257 COMPTON
           LOS ANGELES CA  90001

           MURPHY INDUSTRIAL COATING INC                                ITEM 47
           2704  GUNERLY AVENUE
           SIGNAL HILL C  90755
          08-0C8104                                                                                                BID245
          08-SBD-15-4                                                                                              PAGE  4
          11/21/08                                                                                                 11/26/08
                                            L I S T   O F   S U B C O N T R A C T O R S

 BIDDER ID NAME AND ADDRESS                                            DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
 _________ ____________________________________________________________ ____________________________________________________________

    03     C W CONSTRUCTION SPECIALTY INC                               ITEMS 26, 27, 58 AND 59
           VENTURA CA

           J F L ELECTRIC INC                                           ITEMS 68 AND 69
           LOS ANGELES CA

           LUNDENE PAINTING                                             ITEM 47
           FONTANA CA

 BIDDER ID NAME AND ADDRESS                                            DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
 _________ ____________________________________________________________ ____________________________________________________________

    10     C AND W CONSTRUCTION SPECIALTY INC                           ITEMS 26, 27, 58 AND 59 (PARTIAL)
           VENTURA CA

           FFB VANGUARD CONSTRUCTION                                    ITEMS 60 THRU 65 (PARTIAL)
           LIVERMORE CA

           J F L ELECTRIC INC                                           ITEMS 68 AND 69 (PARTIAL)
           LOS ANGELES CA

           PAVEMENT RECYCLING SYSTEM INC                                ITEM 28 (PARTIAL)
           RIVERSIDE CA

           VISUAL POLLUTION TECHNOLOGIES INC                            ITEM 47 (PARTIAL)
           SCOTTSDALE AZ
          08-0C8104                                                                                                BID245
          08-SBD-15-4                                                                                              PAGE  5
          11/21/08                                                                                                 11/26/08
                                            L I S T   O F   S U B C O N T R A C T O R S

 BIDDER ID NAME AND ADDRESS                                            DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
 _________ ____________________________________________________________ ____________________________________________________________

    09     INTEGRITY REBAR PLACERS                                      ITEMS 60 THRU 65 (PARTIAL)
           MURIETTA CA

           J F L ELECTRIC INC                                           ITEMS 68 THRU 69 (PARTIAL)
           LOS ANELES CA

           MARINA LANDSCAPE INC                                         EROSION CONTROL (PARTIAL)
           ANAHEIM CA

           PAVEMENT RECYCLING SYSTEMS INC                               ITEM 28 (PARTIAL)
           RIVERSIDE CA

           STERNDAHL ENTERPRISES INC                                    STRIPING (PARTIAL)
           SUN VALLEY CA

           TOOMEY INDUSTRIES                                            TRAFFIC CONTROL (PARTIAL)
           LONG BEACH CA

           VISUAL POLLUTION TECHNOLOGIES INC                            ITEM 47 (PARTIAL)
           SCOTTSDALE AZ

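I am trying to build a dataset of the following form (covering every bidder ID in the text):
| Bidder ID | Number of subcontractors | Items |
| --- | --- | --- |
| 01 | 5 | 26, 27, 58, 59, STRIPING (PARTIAL), 15, 66, 67, 60 THRU 65, 68, 69, 47 |
| 03 | 3 | 26, 27, 58, 59, 68, 69, 47 |

Many thanks to @Andrej Keseley: the code below captures most of what we want in the dataset.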

import re
import pandas as pd

document = """

                  10         5,476,326.00    6              GRANITE CONSTRUCTION COMPANY          831 724-1011
                                                                                                  00000089
                                                            P O BOX 50085                     FAX 831 768-4021
                                                            WATSONVILLE CA  95077-5085
          08-0C8104                                                                                                BID245
          08-SBD-15-4                                                                                              PAGE  3
          11/21/08                                                                                                 11/26/08
                                            L I S T   O F   S U B C O N T R A C T O R S

 BIDDER ID NAME AND ADDRESS                                            DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
 _________ ____________________________________________________________ ____________________________________________________________

    01     C AND W CONSTRUCTION SPECIALTIES INC                         ITEMS 26, 27, 58 AND 59, STRIPING (PARTIAL)
           2419 PALMA DRIVE
           VENTURA CA  93003

           CAL STRIPE INC                                               ITEMS 15, 66 AND 67
           375 SOUTH G STREET
           SAN BERNARDINO CA  92410

           INTEGRITY REBAR PLACERS                                      ITEMS 60 THRU 65 (PARTIAL)
           23811 WASHINGTON AVE 110 317
           MURRIETA CA  92562

           J F L ELECTRIC INC                                           ITEMS 68 AND 69
           8257 COMPTON
           LOS ANGELES CA  90001

           MURPHY INDUSTRIAL COATING INC                                ITEM 47
           2704  GUNERLY AVENUE
           SIGNAL HILL C  90755
          08-0C8104                                                                                                BID245
          08-SBD-15-4                                                                                              PAGE  4
          11/21/08                                                                                                 11/26/08
                                            L I S T   O F   S U B C O N T R A C T O R S

 BIDDER ID NAME AND ADDRESS                                            DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
 _________ ____________________________________________________________ ____________________________________________________________

    03     C W CONSTRUCTION SPECIALTY INC                               ITEMS 26, 27, 58 AND 59
           VENTURA CA

           J F L ELECTRIC INC                                           ITEMS 68 AND 69
           LOS ANGELES CA

           LUNDENE PAINTING                                             ITEM 47
           FONTANA CA

 BIDDER ID NAME AND ADDRESS                                            DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
 _________ ____________________________________________________________ ____________________________________________________________

    10     C AND W CONSTRUCTION SPECIALTY INC                           ITEMS 26, 27, 58 AND 59 (PARTIAL)
           VENTURA CA

           FFB VANGUARD CONSTRUCTION                                    ITEMS 60 THRU 65 (PARTIAL)
           LIVERMORE CA

           J F L ELECTRIC INC                                           ITEMS 68 AND 69 (PARTIAL)
           LOS ANGELES CA

           PAVEMENT RECYCLING SYSTEM INC                                ITEM 28 (PARTIAL)
           RIVERSIDE CA

           VISUAL POLLUTION TECHNOLOGIES INC                            ITEM 47 (PARTIAL)
           SCOTTSDALE AZ
          08-0C8104                                                                                                BID245
          08-SBD-15-4                                                                                              PAGE  5
          11/21/08                                                                                                 11/26/08
                                            L I S T   O F   S U B C O N T R A C T O R S

 BIDDER ID NAME AND ADDRESS                                            DESCRIPTION OF PORTION OF WORK SUBCONTRACTED
 _________ ____________________________________________________________ ____________________________________________________________

    09     INTEGRITY REBAR PLACERS                                      ITEMS 60 THRU 65 (PARTIAL)
           MURIETTA CA

           J F L ELECTRIC INC                                           ITEMS 68 THRU 69 (PARTIAL)
           LOS ANELES CA

           MARINA LANDSCAPE INC                                         EROSION CONTROL (PARTIAL)
           ANAHEIM CA

           PAVEMENT RECYCLING SYSTEMS INC                               ITEM 28 (PARTIAL)
           RIVERSIDE CA

           STERNDAHL ENTERPRISES INC                                    STRIPING (PARTIAL)
           SUN VALLEY CA

           TOOMEY INDUSTRIES                                            TRAFFIC CONTROL (PARTIAL)
           LONG BEACH CA

           VISUAL POLLUTION TECHNOLOGIES INC                            ITEM 47 (PARTIAL)
           SCOTTSDALE AZ
"""

data = []
for id_, group in re.findall(
    r"(?s)BIDDER ID\D+DESCRIPTION OF PORTION OF WORK SUBCONTRACTED\D+(\d+)(.*?)(?=BIDDER ID|-{5,}|\Z)",
    document,
):
    items = re.findall(r"ITEMS? (.*)", group)
    data.append(
        {
            "bidder-id": id_,
            "number_subcontractors": group.count('\n\n'),
            "items": ", ".join(
                i.replace(" (PARTIAL)", "").replace(" AND", ",").strip() for i in items
            ),
        }
    )

df = pd.DataFrame(data)
print(df)

Prints:

bidder-id  number_subcontractors                                                                                  items
0         01                      5                                     26, 27, 58, 59, 15, 66, 67, 60 THRU 65, 68, 69, 47
1         03                      3                                                             26, 27, 58, 59, 68, 69, 47
2         10                      5                                             26, 27, 58, 59, 60 THRU 65, 68, 69, 28, 47
3         09                      7                                                         60 THRU 65, 68 THRU 69, 28, 47
4         04                      3                                                                     26, 27, 57 THRU 59
5         04                      3                                                                                 60, 61
6         08                      6                             57 THRU 59, 15, 22, 23, 66, 67, 68, 69, 38, 40 THRU 43, 47
7         02                      7  26, 27, 58, 59, 60 THRU 65, 2, 68 THRU 70, 28, 31, 46, 51, 56, 12, 14, 16, 19, 57, 47
8         07                      1                                                                         26, 27, 58, 59
9         07                      5                                                 60 THRU 65, 68, 69, 29, 15, 66, 67, 69
10        05                      5                                             60 THRU 65, 68, 69, 28, 12, 13, 15, 66, 67
11        05                      1                                                                             40 THRU 45
12        06                      3                                                     26, 27, 57 THRU 59, 60 THRU 65, 47

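However, it does not capture descriptions that do not start with ITEMS (such as STRIPING (PARTIAL)). I am not sure whether this can be done with regex alone in the current code. Perhaps splitting the text would help? I am not sure and am still trying to work it out.
Any help or leads would be greatly appreciated. Thank you very much!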
Reference regex101
Reference Question
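
To make the gap concrete, here is a minimal sketch using two subcontractor rows copied from bidder 09 in the sample (the name `group` mirrors the loop above, and the column spacing is shortened only for readability). The `ITEMS? (.*)` pattern returns the `ITEM 47` description but skips the `STRIPING (PARTIAL)` row, because that row has no `ITEM`/`ITEMS` prefix to anchor on.

```python
import re

# Two rows from bidder 09 in the sample; column spacing shortened for readability.
group = (
    "           STERNDAHL ENTERPRISES INC          STRIPING (PARTIAL)\n"
    "           SUN VALLEY CA\n"
    "\n"
    "           VISUAL POLLUTION TECHNOLOGIES INC  ITEM 47 (PARTIAL)\n"
    "           SCOTTSDALE AZ\n"
)

# Pattern from the code above: only descriptions that start with ITEM or ITEMS match.
items = re.findall(r"ITEMS? (.*)", group)
print(items)  # ['47 (PARTIAL)']
```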

w6lpcovy1#

Sure, I think this might work.
In your code, just change

items = re.findall(r"ITEMS? (.*)", group)

to

items = re.findall(r"\w {3,}(.*)", group)

and then make a small change to data.append,
which leaves you with

data = []
for id_, group in re.findall(
    r"(?s)BIDDER ID\D+DESCRIPTION OF PORTION OF WORK SUBCONTRACTED\D+(\d+)(.*?)(?=BIDDER ID|-{5,}|\Z)",
    document,
):
    items = re.findall(r"\w {3,}(.*)", group)
    data.append(
        {
            "bidder-id": id_,
            "number_subcontractors": group.count('\n\n'),
            "items": ", ".join(
                i.replace(" AND", ",").replace("ITEMS ", "").replace("ITEM", "").strip() for i in items
            ),
        }
    )

df = pd.DataFrame(data)
print(df)

This makes the items column for bidder ID 09 look like:

60 THRU 65 (PARTIAL), 68 THRU 69 (PARTIAL), EROSION CONTROL (PARTIAL), 28 (PARTIAL), STRIPING (PARTIAL), TRAFFIC CONTROL (PARTIAL), 47 (PARTIAL)
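
One caveat, as far as I can tell from the sample: when a bidder's block runs across a page break, `\w {3,}(.*)` also matches the page-header lines (`BID245`, `PAGE  4`, the date, and part of the `L I S T ...` banner). Below is a minimal sketch of one way to skip them. `PAGE_HEADER` and `descriptions` are hypothetical names, and the header pattern is an assumption based only on the three header lines and the banner shown in the question, so it may need adjusting for the full document.

```python
import re

DESC = re.compile(r"\w {3,}(.*)")  # pattern from the answer above

# Assumption: page headers look like the sample's "08-0C8104 ...",
# "08-SBD-15-4 ...", "11/21/08 ..." lines, plus the spaced-out banner.
PAGE_HEADER = re.compile(r"\s*(?:\d\d-\S+|\d\d/\d\d/\d\d|L I S T)\b")

def descriptions(group):
    """Return description-column text, ignoring page-header lines."""
    found = []
    for line in group.splitlines():
        if PAGE_HEADER.match(line):
            continue  # skip page-header lines before applying DESC
        m = DESC.search(line)
        if m:
            found.append(m.group(1).strip())
    return found

# Tail of bidder 01's block in the sample (spacing shortened for readability).
group = (
    "     MURPHY INDUSTRIAL COATING INC     ITEM 47\n"
    "           2704  GUNERLY AVENUE\n"
    "           SIGNAL HILL C  90755\n"
    "          08-0C8104          BID245\n"
    "          08-SBD-15-4        PAGE  4\n"
    "          11/21/08           11/26/08\n"
    "              L I S T   O F   S U B C O N T R A C T O R S\n"
)

print(DESC.findall(group))   # includes the header fragments
print(descriptions(group))   # ['ITEM 47']
```

In the loop above, `items = descriptions(group)` would stand in for the `re.findall` call.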

6rvt4ljy2#

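It might be worthwhile to load all of the tabular data from the document text into a DataFrame (including the subcontractors' names and addresses), so that you can retrieve whatever you need from it.
You can then use pandas' groupby method to extract the summary information: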

import pandas as pd
import re

def load(document):
    data = []
    # Identify lines of interest by specific column layout
    res = re.findall(r"^ (.{8})  (\S.{0,60}$|\S.{58})(?:  (\S.*$))?", document, re.M)
    for a, b, c in res:
        bidder = a.strip() or bidder
        if c: # new subcontractor
            data.append({
                "bidder-id": bidder, 
                "items": re.sub(r" AND|ITEM(S )?", ",", c).strip(" ,"),
                "name": b.strip(),
                "address": ""
            })
        else: # continuation with address
            data[-1]["address"] = (data[-1]["address"] + "\n").lstrip() + b.strip()
    return pd.DataFrame(data)

subcontractors = load(document)
summary = subcontractors.groupby("bidder-id").agg(
           {"items": ", ".join, "name": "count"}
          ).rename(columns={"name": "number_subcontractors"})
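
The pattern in `load` depends on fixed column positions, which is easier to see on a few lines. The sketch below reuses the same regular expression on a hand-built sample: a table header, a row with a bidder ID, an address continuation line, and a page-header line. The `sample` text and the `ROW` name are made up for illustration, and the column widths are assumed to match the question's layout.

```python
import re

# Same pattern as in load(): one leading space, an 8-character ID column,
# two spaces, then the name column, optionally followed by the description.
ROW = re.compile(r"^ (.{8})  (\S.{0,60}$|\S.{58})(?:  (\S.*$))?", re.M)

sample = "\n".join([
    " BIDDER ID NAME AND ADDRESS" + " " * 44 + "DESCRIPTION OF PORTION OF WORK SUBCONTRACTED",
    "    09     " + "STERNDAHL ENTERPRISES INC".ljust(61) + "STRIPING (PARTIAL)",
    "           SUN VALLEY CA",
    "          08-0C8104" + " " * 40 + "BID245",
])

# Each match is (id column, name column, description column); strip the padding.
rows = [tuple(part.strip() for part in match) for match in ROW.findall(sample)]
for row in rows:
    print(row)
# ('09', 'STERNDAHL ENTERPRISES INC', 'STRIPING (PARTIAL)')
# ('', 'SUN VALLEY CA', '')
```

The table header and the page-header line do not match because their text does not start in the name column, so only subcontractor rows and their address lines come through.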

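For the sample document, the summary DataFrame will contain:
| bidder-id | items | number_subcontractors |
| --- | --- | --- |
| 01 | 26, 27, 58, 59, STRIPING (PARTIAL), 15, 66, 67, 60 THRU 65 (PARTIAL), 68, 69, 47 | 5 |
| 03 | 26, 27, 58, 59, 68, 69, 47 | 3 |
| 09 | 60 THRU 65 (PARTIAL), 68 THRU 69 (PARTIAL), EROSION CONTROL (PARTIAL), 28 (PARTIAL), STRIPING (PARTIAL), TRAFFIC CONTROL (PARTIAL), 47 (PARTIAL) | 7 |
| 10 | 26, 27, 58, 59 (PARTIAL), 60 THRU 65 (PARTIAL), 68, 69 (PARTIAL), 28 (PARTIAL), 47 (PARTIAL) | 5 |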
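
Note that `groupby` sorts the group keys by default, so bidder 09 ends up before bidder 10 in the summary even though bidder 10 comes first in the document. If document order matters, passing `sort=False` should keep the keys in order of first appearance. A small sketch with hand-typed rows (the `rows` frame is illustrative, not the output of `load`):

```python
import pandas as pd

# Hand-typed rows in document order: bidder 10 appears before bidder 09.
rows = pd.DataFrame({
    "bidder-id": ["10", "10", "09"],
    "items": ["28 (PARTIAL)", "47 (PARTIAL)", "STRIPING (PARTIAL)"],
    "name": [
        "PAVEMENT RECYCLING SYSTEM INC",
        "VISUAL POLLUTION TECHNOLOGIES INC",
        "STERNDAHL ENTERPRISES INC",
    ],
})

# sort=False keeps groups in order of first appearance instead of sorting keys.
summary = (
    rows.groupby("bidder-id", sort=False)
        .agg({"items": ", ".join, "name": "count"})
        .rename(columns={"name": "number_subcontractors"})
)
print(summary)
```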
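For reference, the subcontractors DataFrame looks like this:
| bidder-id | items | name | address |
| --- | --- | --- | --- |
| 01 | 26, 27, 58, 59, STRIPING (PARTIAL) | C AND W CONSTRUCTION SPECIALTIES INC | 2419 PALMA DRIVE VENTURA CA  93003 |
| 01 | 15, 66, 67 | CAL STRIPE INC | 375 SOUTH G STREET SAN BERNARDINO CA  92410 |
| 01 | 60 THRU 65 (PARTIAL) | INTEGRITY REBAR PLACERS | 23811 WASHINGTON AVE 110 317 MURRIETA CA  92562 |
| 01 | 68, 69 | J F L ELECTRIC INC | 8257 COMPTON LOS ANGELES CA  90001 |
| 01 | 47 | MURPHY INDUSTRIAL COATING INC | 2704  GUNERLY AVENUE SIGNAL HILL C  90755 |
| 03 | 26, 27, 58, 59 | C W CONSTRUCTION SPECIALTY INC | VENTURA CA |
| 03 | 68, 69 | J F L ELECTRIC INC | LOS ANGELES CA |
| 03 | 47 | LUNDENE PAINTING | FONTANA CA |
| 10 | 26, 27, 58, 59 (PARTIAL) | C AND W CONSTRUCTION SPECIALTY INC | VENTURA CA |
| 10 | 60 THRU 65 (PARTIAL) | FFB VANGUARD CONSTRUCTION | LIVERMORE CA |
| 10 | 68, 69 (PARTIAL) | J F L ELECTRIC INC | LOS ANGELES CA |
| 10 | 28 (PARTIAL) | PAVEMENT RECYCLING SYSTEM INC | RIVERSIDE CA |
| 10 | 47 (PARTIAL) | VISUAL POLLUTION TECHNOLOGIES INC | SCOTTSDALE AZ |
| 09 | 60 THRU 65 (PARTIAL) | INTEGRITY REBAR PLACERS | MURIETTA CA |
| 09 | 68 THRU 69 (PARTIAL) | J F L ELECTRIC INC | LOS ANELES CA |
| 09 | EROSION CONTROL (PARTIAL) | MARINA LANDSCAPE INC | ANAHEIM CA |
| 09 | 28 (PARTIAL) | PAVEMENT RECYCLING SYSTEMS INC | RIVERSIDE CA |
| 09 | STRIPING (PARTIAL) | STERNDAHL ENTERPRISES INC | SUN VALLEY CA |
| 09 | TRAFFIC CONTROL (PARTIAL) | TOOMEY INDUSTRIES | LONG BEACH CA |
| 09 | 47 (PARTIAL) | VISUAL POLLUTION TECHNOLOGIES INC | SCOTTSDALE AZ |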
