Formatting a regex for cleaner output

jtoj6r0c  posted on 2023-06-07 in Other

I have the following regex pattern and sample text:

string pattern = @"Seq No:\s+(\d{4})\s+(\d+)|Purchase Order\n(\d+)|(\d{4}-\d{3}-D\d{3,4})|EA\s+(.*?)\s+Drawing|Due: Requester:\s+(\d{2}/\d{2}/\d{4})\s+[A-Z]{3}|Due:\s+(\d{2}/\d{2}/\d{4})\s+Requester:|Requester:\s([A-Z]{3})|\d.\d{2}\s\d.\d{2}\s(.*?)\sEA";

Here is a sample of the purchase order text:

xx
37764
PO Date: 5/24/2023
To: Ship To:
xxx
Total Purchase Order 0.00
PO Terms
Freight Terms:
Ship Via: FOB:
Terms:
N/A
Net 30 Days Best Way Destination
Line Quantity Item Unit Price Amount
U/M
1 1.00 2772-212-W02 0.00 0.00
EA CATCH PAN WELDMENT
Drawing: OPT
Due: Requester:
07/24/2023 GAF
Order: 2853IF-216-703 Seq No: 4
2 2.00 2853-220-D002 0.00 0.00
EA ROLL CART CONTACT PLATE
Drawing: OPT
Due: 07/24/2023 Requester: GAF
Order: 2853IF-220-000 Seq No: 4
3 4.00 MCRI-0100-D0104 0.00 0.00
ROBOT BASE CORNER JACK PLATE 
EA
Drawing: OPT
Due: 07/24/2023 Requester: GAF
Order: 2853IF-216-702 Seq No: 5
4 4.00 MCRI-0100-D0105 0.00 0.00
EA ROBOT BASE CENTER JACK PLATE
Drawing: OPT
Due: 07/24/2023 Requester: GAF
Order: 2853IF-216-702 Seq No: 6
5 2.00 MCRI-0450-D0103 0.00 0.00
EA LIGHT CURTAIN POST JACK PLATE
Drawing: OPT
Due: 07/24/2023 Requester: GAF
Order: 2853IF-240-000 Seq No: 6
xxxxx(page ended here)
xxxxx(start of next page here)
37764
PO Date: 5/24/2023
To: Ship To:
Purchaser:
xx
USA
Total Purchase Order 0.00
PO Terms
Freight Terms:
Ship Via: FOB:
Terms:
N/A
Net 30 Days Best Way Destination
Line Quantity Item Unit Price Amount
U/M
6 1.00 MCRI-0650-D0202 0.00 0.00
EA 2.5" SCH40 DRESS OUT TUBE
Drawing:
OPT
Due: 07/24/2023 Requester: GAF
Order: 2853IF-216-702 Seq No: 10
7 2.00 2799-216-D005 0.00 0.00
EA ALUMINUM TUBE
Drawing: OPT
Due: 07/24/2023 Requester: GAF
Order: 2853IF-218-000 Seq No: 11
8 1.00 MCRI-0750-D0101 0.00 0.00
EA TEACH PENDANT MOUNT
Drawing: OPT
Due: 07/24/2023 Requester: GAF
Order: 2853IF-216-702 Seq No: 12
9 1.00 2853-217-D001 0.00 0.00
EA MOUNTING PLATE
Drawing: OPT
Due: 07/24/2023 Requester: GAF
Order: 2853IF-217-000 Seq No: 12
10 1.00 2799-216-D006 0.00 0.00
ALUMINUM TUBE
EA
Drawing: OPT
Due: 07/24/2023 Requester: GAF
xx(end of page 2 here in middle of line item 10)
Page 2 of 3
xxx
37764
PO Date: 5/24/2023
To: Ship To:
Purchaser:
xx
USA
Total Purchase Order 0.00
PO Terms
Freight Terms:
Ship Via: FOB:
Terms:
N/A
Net 30 Days Best Way Destination
Line Quantity Item Unit Price Amount
U/M
Order: 2853IF-218-000 Seq No: 12
11 4.00 2799-216-D007 0.00 0.00
EA VACUUM CUP MOUNT
Drawing: OPT
Due: 07/24/2023 Requester: GAF
Order: 2853IF-218-000 Seq No: 13
12 7.00 2799-216-D008 0.00 0.00
EA BEAM CROSS CLAMP
Drawing:
OPT
Due: 07/24/2023 Requester: GAF
Order: 2853IF-218-000 Seq No: 14
13 2.00 2853-219-D001 0.00 0.00
EA CYLINDER MOUNT PLATE
Drawing: OPT
Due: 07/24/2023 Requester: GAF
Order: 2853IF-219-000 Seq No: 14
Total 0.00
 xx
 Page 3 of 3

https://regex101.com/r/Q52mJs/1
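I have a purchase order from a company here. I am extracting some basic information, but the data is not coming out correctly. When viewed in Excel, each line item on the PO should be one row, but the regex seems to break the items up into one match per field. I am sure that is exactly what I told it to do. How can I get the desired result? I am fairly certain the cause is the many "|" in my pattern, which mean "or"; I just don't know how to structure the pattern so that it extracts the data accurately.
Side quest: the quantity only shows up on one entry, because it appears in two different layouts and therefore ends up in two groups. How can I capture both in a single group?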

Screenshot of the original purchase order (information redacted)

mepcadol #1

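I would suggest doing it in several steps:
1. Extract the relevant part of the content, which appears to sit between U/M and Total, with a regex such as @"\r?\nU/M\r?\n(?<data>.*)\r?\nTotal\s[\d.]+" used with the s flag. It may need changes to handle multi-page output; here I only see Page 1 of 1.
Test it here: https://regex101.com/r/nf0N7z/1
2. Then, for the second part, we can see that each item always starts with the line number, quantity, item number, unit price and amount. The only things that vary a little are the position of the item name and the order of the due date and requester. So, to make it clearer, I added some comments to the regex using the x flag. I also use the m flag so that ^ and $ match the start and end of a line respectively.
Pattern: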

@"# Match a line such as 1 2.00 2814-212-D003 0.00 0.00
^
(?<line>\d+)\s+
(?<quantity>[\d.]+)\s+
(?<item_unit>\d+-\d+-D\d+)\s+ # Will it always be D*** ? If not then change this.
(?<price>[\d.]+)\s+
(?<amount>[\d.]+)\s+

# Match the item name (before or after EA)
(?:
  EA\s(?<item_name_v1>[^\r\n]*)\s+
  |
  (?<item_name_v2>[^\r\n]*)\nEA\s+
)

# Match the drawing
Drawing:\s+(?<drawing>[^\r\n]+)\s+

# Match the due date and requester, in several ways
Due:\s+
(?:
  Requester:\s+(?<req_date_v1>\d{2}/\d{2}/\d{4})\s+(?<req_name_v1>[^\r\n]+)
  |
  (?<req_date_v2>\d{2}/\d{2}/\d{4})\sRequester:\s+(?<req_name_v2>[^\r\n]+)
)
\s+
Order:\s+(?<order>\w+-\d+-\d+)\s+
Seq\sNo:\s+(?<seq_no>\d+)
$
"gxm

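Test it here: https://regex101.com/r/hEjKt6/2
Because the same information appears in more than one layout (by which I mean a different order of elements), there are several capture groups for the same piece of information. For clarity I named the groups instead of relying on numeric indexes. You can then test their contents and pick whichever one is not empty, as in the following JavaScript example: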
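Step 1 of the list above (isolating the line-item block between U/M and Total) might look like this in JavaScript. This is only a minimal sketch under assumptions: the `page` sample is a shortened, made-up text in the same layout, `extractItemBlock` is a hypothetical helper name, and only single-page input is handled.

```javascript
// Step 1 sketch: isolate the line-item block between "U/M" and "Total".
// The `s` flag lets `.` match newlines; `\r?\n` tolerates Windows line endings.
const blockRegex = /\r?\nU\/M\r?\n(?<data>.*)\r?\nTotal\s[\d.]+/s;

// Hypothetical, shortened sample in the same layout as the purchase order.
const page = [
  "Total Purchase Order 0.00",
  "Line Quantity Item Unit Price Amount",
  "U/M",
  "1 2.00 2814-212-D003 0.00 0.00",
  "EA LONG JACK PAD",
  "Drawing: OPT",
  "Due: 05/19/2023 Requester: NMB",
  "Order: 2843HR-213-703 Seq No: 9002",
  "Total 0.00",
  "Page 1 of 1",
].join("\n");

// Returns the text of the line-item block, or null when no block is found.
function extractItemBlock(text) {
  const match = text.match(blockRegex);
  return match ? match.groups.data : null;
}

const itemBlock = extractItemBlock(page);
console.log(itemBlock);
```

Because `.*` is greedy, on multi-page text this would capture everything from the first U/M to the last Total, repeated page headers included, which is why the multi-page case needs extra handling.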

// In JavaScript, the x flag isn't handled, so you end up with a messy long unreadable regex like this:
const regex = /^(?<line>\d+)\s+(?<quantity>[\d.]+)\s+(?<item_unit>\d+-\d+-D\d+)\s+(?<price>[\d.]+)\s+(?<amount>[\d.]+)\s+(?:EA\s(?<item_name_v1>[^\r\n]*)\s+|(?<item_name_v2>[^\r\n]*)\nEA\s+)Drawing:\s+(?<drawing>[^\r\n]+)\s+Due:\s+(?:Requester:\s+(?<req_date_v1>\d{2}\/\d{2}\/\d{4})\s+(?<req_name_v1>[^\r\n]+)|(?<req_date_v2>\d{2}\/\d{2}\/\d{4})\sRequester:\s+(?<req_name_v2>[^\r\n]+))\s+Order:\s+(?<order>\w+-\d+-\d+)\s+Seq\sNo:\s+(?<seq_no>\d+)$/gm;

const input = `Purchase Order
37224
(omitted details)
Total Purchase Order 0.00
PO Terms
Freight Terms:
Ship Via: FOB:
Terms:
N/A
Net 30 Days Best Way Destination
Line Quantity Item Unit Price Amount
U/M
1 2.00 2814-212-D003 0.00 0.00
EA LONG JACK PAD
Drawing: OPT
Due: Requester:
05/19/2023 NMB
Order: 2843HR-213-703 Seq No: 9002
2 2.00 2814-212-D003 0.00 0.00
EA LONG JACK PAD
Drawing: OPT
Due: 05/19/2023 Requester: NMB
Order: 2843HR-214-703 Seq No: 9002
3 2.00 2814-212-D004 0.00 0.00
SHORT JACK PAD
EA
Drawing: OPT
Due: 05/19/2023 Requester: NMB
Order: 2843HR-213-703 Seq No: 9003
4 2.00 2814-212-D004 0.00 0.00
EA SHORT JACK PAD
Drawing: OPT
Due: 05/19/2023 Requester: NMB
Order: 2843HR-214-703 Seq No: 9003
Total 0.00
(omitted details)
Page 1 of 1`;

// Loop over all matches.
for (const match of input.matchAll(regex)) {
  // An object to store all the captured groups.
  let foundData = {};
  // Loop over all captured groups.
  for (const group of Object.entries(match.groups)) {
    let name = group[0];
    let value = group[1];
    // A regex to test if it's a group with a _v1 or _v2 at the end.
    let matchMultiNamedGroup = name.match(/^(.*?)_(v\d)$/)
    if (matchMultiNamedGroup) {
      // If it's _v2, we've already handled it in the step before for _v1.
      if (matchMultiNamedGroup[2] === "v2") continue;
      // Remove the _v1 for the capturing group name.
      name = matchMultiNamedGroup[1];
      // Take the value if it's not undefined or take the capturing group _v2 instead.
      value = value !== undefined ? value : match.groups[matchMultiNamedGroup[1] + '_v2'];
    }
    // Add the group to the data object.
    foundData[name] = value;
  }
  console.log(foundData);
}
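The missing x flag mentioned in the comment at the top of the snippet can be worked around by writing the pattern as a commented multi-line string and stripping comments and whitespace before compiling it. The following is a rough sketch, not a full emulation of the x flag: `verboseRegex` is a hypothetical helper, and it assumes the pattern contains no literal space or `#` (both would be stripped, so they would have to be written as `\s`, `\x20` or `\x23`).

```javascript
// Rough emulation of the x flag: drop "# comment" tails and layout whitespace,
// then compile what is left. Assumes no literal space or "#" in the pattern.
function verboseRegex(source, flags) {
  const compact = source
    .split(/\r?\n/)
    .map((line) => line.replace(/(^|\s)#.*$/, "")) // remove "# ..." comments
    .join("")
    .replace(/\s+/g, ""); // remove indentation and line breaks
  return new RegExp(compact, flags);
}

// String.raw keeps backslashes as written, so \d, \s and \n reach the regex engine.
const itemRegex = verboseRegex(String.raw`
  # Header line such as: 1 2.00 2814-212-D003 0.00 0.00
  ^
  (?<line>\d+)\s+
  (?<quantity>[\d.]+)\s+
  (?<item_unit>\d+-\d+-D\d+)\s+
  (?<price>[\d.]+)\s+
  (?<amount>[\d.]+)\s+

  # Item name, before or after EA
  (?:
    EA\s(?<item_name_v1>[^\r\n]*)\s+
    |
    (?<item_name_v2>[^\r\n]*)\nEA\s+
  )

  Drawing:\s+(?<drawing>[^\r\n]+)\s+

  # Due date and requester, in either order
  Due:\s+
  (?:
    Requester:\s+(?<req_date_v1>\d{2}/\d{2}/\d{4})\s+(?<req_name_v1>[^\r\n]+)
    |
    (?<req_date_v2>\d{2}/\d{2}/\d{4})\sRequester:\s+(?<req_name_v2>[^\r\n]+)
  )
  \s+
  Order:\s+(?<order>\w+-\d+-\d+)\s+
  Seq\sNo:\s+(?<seq_no>\d+)
  $
`, "gm");

const sampleItem = [
  "3 2.00 2814-212-D004 0.00 0.00",
  "SHORT JACK PAD",
  "EA",
  "Drawing: OPT",
  "Due: 05/19/2023 Requester: NMB",
  "Order: 2843HR-213-703 Seq No: 9003",
].join("\n");

const verboseMatches = [...sampleItem.matchAll(itemRegex)];
console.log(verboseMatches[0].groups.item_name_v2, verboseMatches[0].groups.seq_no);
```

The compiled pattern is intended to be equivalent to the one-line regex above; the readable form only exists in the source string.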

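The good news is that C# appears to accept the same named capture group more than once in a pattern, which removes the need to merge the _v1 and _v2 groups in this case: https://dotnetfiddle.net/M8zHBj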
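For the JavaScript version, the _v1/_v2 merge can be shortened with the `??` operator, and each merged match then maps to one spreadsheet row, which is what the question asks for. The sketch below reuses the one-line regex from the snippet above; `mergeVariants`, `toCsvField` and `toCsv` are hypothetical helper names, the sample text is shortened, and CSV is only assumed as a convenient format for opening the result in Excel.

```javascript
// One-line regex copied from the JavaScript snippet in this answer.
const lineItemRegex = /^(?<line>\d+)\s+(?<quantity>[\d.]+)\s+(?<item_unit>\d+-\d+-D\d+)\s+(?<price>[\d.]+)\s+(?<amount>[\d.]+)\s+(?:EA\s(?<item_name_v1>[^\r\n]*)\s+|(?<item_name_v2>[^\r\n]*)\nEA\s+)Drawing:\s+(?<drawing>[^\r\n]+)\s+Due:\s+(?:Requester:\s+(?<req_date_v1>\d{2}\/\d{2}\/\d{4})\s+(?<req_name_v1>[^\r\n]+)|(?<req_date_v2>\d{2}\/\d{2}\/\d{4})\sRequester:\s+(?<req_name_v2>[^\r\n]+))\s+Order:\s+(?<order>\w+-\d+-\d+)\s+Seq\sNo:\s+(?<seq_no>\d+)$/gm;

// Collapse name_v1/name_v2 pairs into a single "name" field,
// keeping the first variant that actually captured something.
function mergeVariants(groups) {
  const merged = {};
  for (const [name, value] of Object.entries(groups)) {
    const base = name.replace(/_v\d+$/, "");
    merged[base] = merged[base] ?? value;
  }
  return merged;
}

// Quote a field when it contains a comma, a double quote or a line break.
function toCsvField(value) {
  const text = value ?? "";
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// One header line, then one CSV line per line item.
function toCsv(rows) {
  if (rows.length === 0) return "";
  const header = Object.keys(rows[0]);
  const lines = rows.map((row) => header.map((key) => toCsvField(row[key])).join(","));
  return [header.join(","), ...lines].join("\n");
}

// Shortened sample covering both layouts of the name and due/requester fields.
const items = [
  "1 2.00 2814-212-D003 0.00 0.00",
  "EA LONG JACK PAD",
  "Drawing: OPT",
  "Due: Requester:",
  "05/19/2023 NMB",
  "Order: 2843HR-213-703 Seq No: 9002",
  "3 2.00 2814-212-D004 0.00 0.00",
  "SHORT JACK PAD",
  "EA",
  "Drawing: OPT",
  "Due: 05/19/2023 Requester: NMB",
  "Order: 2843HR-213-703 Seq No: 9003",
].join("\n");

const rows = [...items.matchAll(lineItemRegex)].map((match) => mergeVariants(match.groups));
const csv = toCsv(rows);
console.log(csv);
```

Note that the sample in the question also contains item numbers such as MCRI-0100-D0104 and 2772-212-W02, which `\d+-\d+-D\d+` does not match, and one item split across a page break; both would need the pattern or the pre-processing to be adjusted, in line with the comment in the pattern.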
