shell: extract the capture group if it exists, otherwise just extract the original string

Posted by wh6knrhe on 2023-02-05 in Shell


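Given a string, I want to use a regular expression so that:
1. If the given string does not match the regex, the whole string is returned
2. If the given string matches the regex, only the capture group is returned
Suppose I have the following regex: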

hello\s*([a-z]+)

Here are the inputs and the outputs I am looking for:

"well hello" --> "well hello" (regex did not match)
"well hello world extra words" --> "world"
"well hello   world!!!" --> "world"
"well hello \n \n world\n\n\n" --> "world" (should ignore all newlines)
"this string doesn't match at all" --> "this string doesn't match at all"

Limitation: I can only use grep, sed, and awk. egrep and gawk are not available.

> print "world hello something else\n" | sed -rn "s/hello ([a-z]+)/\1/p"
world something else

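This is the closest I have gotten, but there are a few issues:

It returns other parts of the string
I can't get \s* to match, although a regular space works
I'm not entirely sure, but the /p at the end of the sed command seems to print a newline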

shell

Source: https://stackoverflow.com/questions/75342696/extract-capture-group-if-it-exists-otherwise-just-extract-the-original-string

4 Answers


5vf7fwbs1#

Use alternation:

hello\s*([a-z]+)|(.*)

Then extract groups 1 and 2:

sed -rn "s/.*hello ([a-z]+).*|(.*)/\1\2/p"

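Alternation is tried from left to right, so if the first branch does not match, the second branch matches the whole input; one of group 1 or group 2 will be empty.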
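
A variant that avoids alternation altogether is to lean on sed's default behavior: without -n, every line is printed, and a line on which s/// finds no match passes through unchanged. The following is only a minimal sketch under that assumption; the function name extract_word is hypothetical, and [[:space:]] is used in place of \s because \s is not supported by every sed.

```shell
# Hypothetical helper: print the word after "hello", or the input unchanged.
extract_word() {
    # The surrounding .* make the match span the whole line, so a successful
    # substitution leaves only the capture group. When nothing matches, sed
    # auto-prints the untouched line because -n is not given.
    printf '%s\n' "$1" | sed -E 's/.*hello[[:space:]]*([a-z]+).*/\1/'
}

extract_word 'well hello'                        # prints: well hello
extract_word 'well hello world extra words'      # prints: world
extract_word 'well hello   world!!!'             # prints: world
extract_word "this string doesn't match at all"  # prints the input unchanged
```

Because the leading .* is greedy, the last hello on a line is the one that counts if the word appears more than once.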

Answered 2023-02-05

dced5bon2#

This might work for you (GNU sed):

sed -E 's/\\n/\n/g;/^well hello\s*([a-z]+).*/s//\1/;s/\n/\\n/g' file

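Turn each literal \n into a real newline.
Match a line that starts with well hello, followed by zero or more whitespace characters, followed by one or more characters a to z, followed by anything else. If it matches, return the a-to-z characters; otherwise return the original string. Finally, turn any remaining real newlines back into literal \n.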
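
If the input contains real newline characters rather than the two-character sequence \n, one option is to collect the whole input in the pattern space before substituting. This is a sketch only, assuming a sed where -E is available and where . and [[:space:]] match an embedded newline (GNU sed behaves this way); the function name extract_multiline is hypothetical.

```shell
# Hypothetical helper for input that spans several lines.
extract_multiline() {
    # :a ... ba is a loop: while the current line is not the last one ($!),
    # N appends the next line to the pattern space. The substitution then
    # runs once over the whole input, embedded newlines included.
    printf '%s\n' "$1" |
        sed -E -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' \
               -e 's/.*hello[[:space:]]*([a-z]+).*/\1/'
}

multi=$(printf 'well hello \n \n world\n\n\n')
extract_multiline "$multi"        # prints: world
extract_multiline 'well hello'    # prints: well hello
```

Note that the command substitution used to build multi already strips the trailing newlines; the embedded ones are handled by the loop.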

Answered 2023-02-05

mzaanser3#

This addresses only the question of *why parts of the string that shouldn't be printing are printing*.
Example:

printf "world hello something else\n" | sed -rn "s/hello ([a-z]+)/\1/p"

Actual output : world something else
Desired output:       something

From the sed man page:

-n, --quiet, --silent
                suppress automatic printing of pattern space

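The pattern space is the buffer that holds the current input line, and -n only suppresses the automatic printing of that buffer at the end of each cycle. In the example script, the s command replaces just the part of the line matched by hello ([a-z]+); nothing in the regex addresses the leading/trailing characters of the input line, so they stay in the pattern space and the p flag prints them along with the replacement, which is why the unwanted world and else appear.
To drop those characters, the regex needs to be extended so that it covers the whole line; consider: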

hello ([a-z]+)             # does not cover leading/trailing characters
.*hello ([a-z]+)             # covers leading characters; does not cover trailing characters
  hello ([a-z]+).*           # does not cover leading characters; covers trailing characters
.*hello ([a-z]+).*           # covers leading/trailing characters

Updating the script to cover all leading/trailing characters (i.e. the whole input line):

printf "world hello something else\n" | sed -rn "s/.*hello ([a-z]+).*/\1/p"
                                                   ^^              ^^
Actual output: something

Answered 2023-02-05

irtuqstp4#

Since you have *strings* rather than a file, consider doing this entirely in Bash:

#!/bin/bash

strings=( 'well hello' 
    'well hello world extra words' 
    'well hello   world!!!' 
    'well hello \n\n  world\n\n' 
    "this string doesn't match at all" )

re='hello[[:space:]][[:space:]]*([a-z][a-z]*)'

for x in "${strings[@]}"; do 
    s=$(printf "$x")               # force interpretation of \n
    if [[ $s =~ $re ]]; then 
        printf \""$x"\""=> \"%s\"\n" "${BASH_REMATCH[1]}"   # $re has a single capture group
    else
        printf "No match: \"%s\"\n" "$s"
    fi  
done

Prints:

No match: "well hello"
"well hello world extra words"=> "world"
"well hello   world!!!"=> "world"
"well hello 

  world

"=> "world"
No match: "this string doesn't match at all"

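(Note: depending on the platform, you can use a word-boundary assertion in Bash/zsh so that the regex matches only the complete word 'hello' and not the 'hello' inside 'phellogen' or 'Othello'. A platform-independent version of the word boundary is re='(^|[^[:alnum:]_])hello[[:space:]][[:space:]]*([a-z][a-z]*)', with the captured word in "${BASH_REMATCH[2]}".)
You can also use perl: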

for s in "${strings[@]}"; do 
    perl -0777 -nE '/\bhello\s+([a-z]+)/;say $1 ? "\"$_\" => \"$1\"" : "No match: \"$_\""' <<<$(printf "$s")
done

Prints:

No match: "well hello"
"well hello world extra words" => "world"
"well hello   world!!!" => "world"
"well hello 

  world

" => "world"
No match: "this string doesn't match at all"

Or you can use GNU grep:

for s in "${strings[@]}"; do 
    r=$(ggrep -zoP '\bhello\s+\K([a-z]+)' <<<$(printf "$s") | tr -d '\0' )
    [[ -z "$r" ]] && printf "No match: \"$s\"\n" || printf "\"$s\" => \"$r\"\n"
done

Or any awk:

for s in "${strings[@]}"; do 
    awk '{s = s $0 ORS}
    END{
    sub(ORS "$", "", s)
    n = split(s,fields,"[^[:alpha:]]+")   # split() returns the element count; length(array) is not in every awk
    for(i=1;i<n;i++){
        if(fields[i]=="hello" && fields[i+1]~/[a-z]+/) {
            printf "\"%s\" => %s\n", s, fields[i+1]
            found=1
            break
        }
    }
    if (!found) printf "Not Found: \"%s\"\n", s
    }' <<<$(printf "$s")
done
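
For output in the exact shape the question asks for (only the captured word, or the untouched input), awk's match() can be used as well. This is a rough sketch under a few assumptions: the function name extract_awk is hypothetical, whitespace is spelled [ \t\n] so that awks without POSIX character classes also accept it, and the capture is emulated by stripping the hello prefix from the matched text, since POSIX awk has no capture groups.

```shell
# Hypothetical helper: print the word after "hello", or the input unchanged.
extract_awk() {
    printf '%s\n' "$1" | awk '
        # Rebuild the complete input, keeping the newlines between lines.
        { s = (NR == 1 ? "" : s "\n") $0 }
        END {
            if (match(s, /hello[ \t\n]*[a-z]+/)) {
                # RSTART and RLENGTH delimit the matched text.
                m = substr(s, RSTART, RLENGTH)
                sub(/^hello[ \t\n]*/, "", m)   # keep only the word
                print m
            } else {
                print s                        # no match: echo the input
            }
        }'
}

extract_awk 'well hello'                        # prints: well hello
extract_awk 'well hello world extra words'      # prints: world
extract_awk "this string doesn't match at all"  # prints the input unchanged
```

Unlike the sed variants with a greedy leading .*, match() reports the leftmost match, so the first hello in the input is the one that counts.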

Answered 2023-02-05
