How do I extract a Jira ID from a GitHub squash commit message using bash regex, sed, grep, or Groovy?

2ledvvac posted on 2023-10-15 in Git

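I want to extract the Jira ID from the commit message string produced by squash-merging a GitHub PR. If there are multiple Jira IDs, I want to extract the first one.
How can I do this with bash regex, sed, grep, Groovy, or any other available command-line tool?
Test cases: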

ABCD-1231 dummy title XYZ-566 (#423)
=> ABCD-1231

[ABCD-1232, XYZ-566] dummy title (#424)
=> ABCD-1232

[ABCD-1233] dummy title (#425)
=> ABCD-1233

ABCD-1234: dummy title (#426)
=> ABCD-1234

XYZ-567 dummy title (#427)
=> XYZ-567

(XYZ-568) dummy title (#428)
=> XYZ-568

"XYZ-569" dummy title (#429)
=> XYZ-569

dummy title XYZ-570 dummy title (#430)
=> XYZ-570

DUMMY title XYZ-571 dummy title (#431)
=> XYZ-571

'feature/XYZ-572' dummy title (#432)
=> XYZ-572

FEATURE|XYZ-573 dummy title (#433)
=> XYZ-573

<Feature\XYZ-574> dummy title (#434)
=> XYZ-574

dummy title FAKE-XYZ-575 dummy title (#435)
=> <nothing>

dummy title abcdXYZ-576 dummy title (#436)
=> <nothing>

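In bash, "[^A-Z]*(([A-Z]+-[0-9]+)*.*) \(#([0-9]+)\)" fails for:

  • "FEATURE|XYZ-573 dummy title (#433)"
  • "<Feature\XYZ-574> dummy title (#434)"
  • ...

Bash also does not seem to support a negative lookbehind such as: ".*[^A-Z]*((?<!([A-Z]+)-?)[A-Z]+-[0-9]+).*"
Can anyone suggest a solution, or the right tool for this task?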

oprakyz7 (answer #1)

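You can try enabling the PCRE engine in grep with the -P option, which lets you use lookarounds. Whether it is available depends on the grep version and build on your server.
You can also pass -m 1, which makes grep stop after the first matching line. That does not help much here, since each commit message is a single line, but I use it in case a commit message spans several lines; it may also save a few CPU cycles.
The -o option outputs only the matched text, one match per line. We can then pipe the output to head to keep only the first match.
For the pattern, I would try (?<=^|[^\w-])[A-Z]+-\d+. It uses a positive lookbehind that matches either the start of the line or any character that is not a word character or a hyphen.
I tested all the commit messages with the bash script below: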
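To see what -o and head each contribute, here is a small sketch on a single message. It assumes GNU grep built with PCRE support (-P); the function names all_jira_ids and first_jira_id_grep are made up for illustration and are not part of any tool.

```shell
#!/bin/bash
# Assumption: grep is GNU grep with -P (PCRE) support.
# Both function names below are hypothetical helpers.

# Print every Jira-like ID in the message, one per line.
# -o makes grep print each match instead of the whole line.
all_jira_ids() {
    printf '%s\n' "$1" | grep -oP '(?<=^|[^\w-])[A-Z]+-\d+'
}

# Keep only the first match by piping the list through head.
first_jira_id_grep() {
    all_jira_ids "$1" | head -n1
}

# Prints both IDs found in the message.
all_jira_ids 'ABCD-1231 dummy title XYZ-566 (#423)'
# Prints only the first of them.
first_jira_id_grep 'ABCD-1231 dummy title XYZ-566 (#423)'
```

The first call lists ABCD-1231 and XYZ-566 on separate lines, which is why head -n1 is needed to get only ABCD-1231.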

#!/bin/bash

commits=(
    "ABCD-1231 dummy title XYZ-566 (#423)"
    "[ABCD-1232, XYZ-566] dummy title (#424)"
    "[ABCD-1233] dummy title (#425)"
    "ABCD-1234: dummy title (#426)"
    "XYZ-567 dummy title (#427)"
    "(XYZ-568) dummy title (#428)"
    '"XYZ-569" dummy title (#429)'
    "dummy title XYZ-570 dummy title (#430)"
    "DUMMY title XYZ-571 dummy title (#431)"
    "'feature/XYZ-572' dummy title (#432)"
    "FEATURE|XYZ-573 dummy title (#433)"
    "<Feature\XYZ-574> dummy title (#434)"
    "dummy title FAKE-XYZ-575 dummy title (#435)"
    "dummy title abcdXYZ-576 dummy title (#436)"
)

for commit in "${commits[@]}"
do
    # Quote the expansion so that messages such as "[ABCD-1233] ..." are not
    # treated as glob patterns; printf also leaves backslashes untouched.
    printf '%s\n' "$commit"
    # A) My first attempt, using head to only get the first match.
    printf '%s\n' "$commit" | grep -P -m 1 -o '(?<=^|[^\w-])[A-Z]+-\d+' | head -n1

    # B) InSync's more sophisticated solution to match only the first
    # occurrence with the help of \K, which resets the starting point
    # of the reported match. This is a good way to consume characters
    # which we don't want in the output. \K is needed because a positive
    # lookbehind cannot do this job: it has to be a fixed-length pattern,
    # which the lazy .*? is not.
    # My positive lookbehind (?<=^|[^\w-]) can also be replaced by a
    # shorter negative lookbehind (?<![\w-])
    printf '%s\n' "$commit" | grep -P -m 1 -o '^.*?\K(?<![\w-])[A-Z]+-\d+'
done

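Edit

Thanks, @InSync, for your clever solution, which avoids head by using the PCRE \K escape to reset the start of the reported match. It lazily consumes all characters before the first Jira ID. I have added it to the script above as point B).
Output (each message is followed by the result of A, then the result of B):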

ABCD-1231 dummy title XYZ-566 (#423)
ABCD-1231
ABCD-1231
[ABCD-1232, XYZ-566] dummy title (#424)
ABCD-1232
ABCD-1232
[ABCD-1233] dummy title (#425)
ABCD-1233
ABCD-1233
ABCD-1234: dummy title (#426)
ABCD-1234
ABCD-1234
XYZ-567 dummy title (#427)
XYZ-567
XYZ-567
(XYZ-568) dummy title (#428)
XYZ-568
XYZ-568
"XYZ-569" dummy title (#429)
XYZ-569
XYZ-569
dummy title XYZ-570 dummy title (#430)
XYZ-570
XYZ-570
DUMMY title XYZ-571 dummy title (#431)
XYZ-571
XYZ-571
'feature/XYZ-572' dummy title (#432)
XYZ-572
XYZ-572
FEATURE|XYZ-573 dummy title (#433)
XYZ-573
XYZ-573
<Feature\XYZ-574> dummy title (#434)
XYZ-574
XYZ-574
dummy title FAKE-XYZ-575 dummy title (#435)
dummy title abcdXYZ-576 dummy title (#436)
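
If the installed grep lacks -P, the same boundary rule can be approximated in pure bash, which has no lookbehind: capture the preceding character in one group and the ID in a second one, then read the second group from BASH_REMATCH. This is only a minimal sketch, assuming bash 3.2 or later; the function name first_jira_id is hypothetical.

```shell
#!/bin/bash
# first_jira_id (hypothetical helper): print the first Jira ID found in $1,
# or print nothing if the message contains none.
first_jira_id() {
    local LC_ALL=C   # keep [A-Z] limited to ASCII uppercase letters
    # Group 1: start of line, or one character that is neither a word
    #          character nor a hyphen (stands in for the lookbehind).
    # Group 2: the Jira ID itself.
    local re='(^|[^[:alnum:]_-])([A-Z]+-[0-9]+)'
    if [[ $1 =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[2]}"
    fi
}

# Usage: prints ABCD-1232
first_jira_id '[ABCD-1232, XYZ-566] dummy title (#424)'
```

POSIX regex matching reports the leftmost match, so the first ID in the message is the one captured, and no head or \K is needed.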
