Shell: extract the capture group if the regex matches, otherwise return the original string

wh6knrhe  posted on 2023-02-05  in Shell

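Given a string, I want to apply a regular expression so that:
1. If the string does not match the regex, the whole string is returned.
2. If the string matches the regex, only the capture group is returned.
Suppose I have the following regex: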

hello\s*([a-z]+)

Here are the inputs and the outputs I am looking for:

"well hello" --> "well hello" (regex did not match)
"well hello world extra words" --> "world"
"well hello   world!!!" --> "world"
"well hello \n \n world\n\n\n" --> "world" (should ignore all newlines)
"this string doesn't match at all" --> "this string doesn't match at all"

Limitation: I can only use grep, sed, and awk. egrep and gawk are not available.

> print "world hello something else\n" | sed -rn "s/hello ([a-z]+)/\1/p"
world something else

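This is the closest I have come, with a few problems:

  • It returns other parts of the string
  • I can't get \s* to match, although a literal space works
  • I'm not entirely sure, but the /p at the end of the sed command seems to print a newline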

5vf7fwbs  #1

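Use alternation:

.*hello\s*([a-z]+).*|(.*)

Then emit groups 1 and 2:

sed -rn "s/.*hello[[:space:]]*([a-z]+).*|(.*)/\1\2/p"

sed takes the leftmost match, so the first alternative is wrapped in .* to make it start at the beginning of the line; without that, (.*) would match from column 1 every time. If the first alternative does not match, the second one matches the whole input. Either way, one of group 1 or group 2 is empty. [[:space:]]* is used in place of \s* because not every sed understands \s.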
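
One caveat worth checking by hand: sed picks the leftmost match, and (.*) can always match at the very start of the line, so an alternation whose first branch does not begin with .* falls through to the second branch whenever hello is not the first word. The sketch below compares the two forms; the helper names unanchored and anchored are made up for illustration, and -E is used as the more widely accepted spelling of -r.

```shell
# Illustrative helpers; the names are not part of the answer.
unanchored() { printf '%s\n' "$1" | sed -En 's/hello ([a-z]+)|(.*)/\1\2/p'; }
anchored()   { printf '%s\n' "$1" | sed -En 's/.*hello[[:space:]]*([a-z]+).*|(.*)/\1\2/p'; }

# (.*) matches from column 1, so the whole line comes back unchanged.
unanchored 'well hello world extra words'

# Both branches can now start at column 1; compare the results.
anchored 'well hello world extra words'
anchored 'well hello'
```
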


dced5bon  #2

This might work for you (GNU sed):

sed -E 's/\\n/\n/g;/^well hello\s*([a-z]+).*/s//\1/;s/\n/\\n/g' file

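First, each literal \n is turned into a real newline.
Then the line is matched against: well hello at the start, followed by zero or more whitespace characters, followed by one or more characters in a-z, followed by anything. If it matches, the line is replaced by the captured a-z characters; otherwise the original string is kept. Finally, any remaining newlines are turned back into literal \n.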
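
As a rough check, the command can be wrapped in a function and fed the question's sample inputs. This is only a sketch: it assumes GNU sed (for \s and for \n in the replacement), the function name extract_gnu is invented here, and the inputs contain a literal backslash followed by n, as in the question.

```shell
# Hypothetical wrapper around the GNU sed command from this answer.
extract_gnu() {
    printf '%s\n' "$1" |
        sed -E 's/\\n/\n/g;/^well hello\s*([a-z]+).*/s//\1/;s/\n/\\n/g'
}

extract_gnu 'well hello'
extract_gnu 'well hello world extra words'
extract_gnu 'well hello   world!!!'
extract_gnu 'well hello \n \n world\n\n\n'
```

Because the pattern starts with ^well hello, this only handles inputs that begin with those words.
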


mzaanser  #3

Addressing only the question of why parts of the string that shouldn't be printing are printing.
Example:

printf "world hello something else\n" | sed -rn "s/hello ([a-z]+)/\1/p"

Actual output : world something else
Desired output:       something

From the sed man page:

-n, --quiet, --silent
                suppress automatic printing of pattern space

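In the sample script, the regex hello ([a-z]+) matches only part of the input line. The s command replaces just that matched part, and the p flag then prints the entire pattern space (the whole line); -n only suppresses the automatic print. Nothing in the regex accounts for the leading or trailing characters of the line, so they are left in place, which is where the unwanted world and else come from.
To replace the whole line, extend the regex so that it covers the whole line; consider: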

hello ([a-z]+)             # does not cover leading/trailing characters
.*hello ([a-z]+)             # covers leading characters; does not cover trailing characters
  hello ([a-z]+).*           # does not cover leading characters; covers trailing characters
.*hello ([a-z]+).*           # covers leading/trailing characters

Updating the script to cover all leading/trailing characters (i.e. the entire input line):

printf "world hello something else\n" | sed -rn "s/.*hello ([a-z]+).*/\1/p"
                                                   ^^              ^^
Actual output: something
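
To also get the fallback the question asks for (print the input unchanged when nothing matches), one option is to drop -n and the p flag: s leaves a non-matching line untouched and sed's automatic printing emits it. A minimal sketch, assuming single-line input; the function name extract is invented for illustration, and [[:space:]]* stands in for \s*, which not every sed supports.

```shell
# Hypothetical helper: capture group on a match, original line otherwise.
extract() {
    printf '%s\n' "$1" | sed -E 's/.*hello[[:space:]]*([a-z]+).*/\1/'
}

extract 'world hello something else'        # something
extract 'well hello   world!!!'             # world
extract 'well hello'                        # well hello
extract "this string doesn't match at all"  # printed unchanged
```

Input that spans several lines would have to be joined first, since sed applies the script to one line at a time.
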

irtuqstp  #4

Since you have strings rather than a file, consider doing this entirely in Bash:

#!/bin/bash

strings=( 'well hello' 
    'well hello world extra words' 
    'well hello   world!!!' 
    'well hello \n\n  world\n\n' 
    "this string doesn't match at all" )

re='hello[[:space:]][[:space:]]*([a-z][a-z]*)'

for x in "${strings[@]}"; do 
    s=$(printf "$x")               # force interpretation of \n
    if [[ $s =~ $re ]]; then 
        printf \""$x"\""=> \"%s\"\n" "${BASH_REMATCH[1]}"   # re has one group, so the word is in [1]
    else
        printf "No match: \"%s\"\n" "$s"
    fi  
done

Prints:

No match: "well hello"
"well hello world extra words"=> "world"
"well hello   world!!!"=> "world"
"well hello 

  world

"=> "world"
No match: "this string doesn't match at all"

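(Note: depending on the platform, you can use a word boundary assertion in Bash/zsh, such as '\bhello', so that the regex matches only the complete word 'hello' and not 'phellogen' or 'Othello'. A platform-independent word boundary version is re='(^|[^[:alnum:]_])hello[[:space:]][[:space:]]*([a-z][a-z]*)'; with that version the captured word is in "${BASH_REMATCH[2]}".)
You can also use perl: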
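
The same boundary idea can be carried over to sed, which matters if only grep, sed, and awk are available. This is a sketch rather than part of the answer: the function name extract_word is invented, the optional leading group plays the role of the word boundary, and input is assumed to be a single line.

```shell
# Hypothetical helper: treat hello as a whole word, using POSIX ERE only.
extract_word() {
    printf '%s\n' "$1" |
        sed -E 's/^(.*[^[:alnum:]_])?hello[[:space:]]+([a-z]+).*/\2/'
}

extract_word 'well hello world extra words'   # world
extract_word 'hello   world!!!'               # world
extract_word 'Othello world'                  # printed unchanged: hello is not a whole word
```
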

for s in "${strings[@]}"; do 
    perl -0777 -nE '/\bhello\s+([a-z]+)/;say $1 ? "\"$_\" => \"$1\"" : "No match: \"$_\""' <<<$(printf "$s")
done

Prints:

No match: "well hello"
"well hello world extra words" => "world"
"well hello   world!!!" => "world"
"well hello 

  world

" => "world"
No match: "this string doesn't match at all"

Or you can use GNU grep:

for s in "${strings[@]}"; do 
    r=$(ggrep -zoP '\bhello\s+\K([a-z]+)' <<<$(printf "$s") | tr -d '\0' )
    [[ -z "$r" ]] && printf "No match: \"$s\"\n" || printf "\"$s\" => \"$r\"\n"
done

Or any awk:

for s in "${strings[@]}"; do 
    awk '{s = s $0 ORS}
    END{
    sub(ORS "$", "", s)
    n = split(s,fields,"[^[:alpha:]]+")   # split() returns the element count; length(array) is not portable
    for(i=1;i<n;i++){
        if(fields[i]=="hello" && fields[i+1]~/[a-z]+/) {
            printf "\"%s\" => %s\n", s, fields[i+1]
            found=1
            break
        }
    }
    if (!found) printf "Not Found: \"%s\"\n", s
    }' <<<$(printf "$s")
done
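
A variant that returns just the word, or the original string when nothing matches, can be sketched with match(), RSTART and RLENGTH, which POSIX awk provides (it has no capture groups). The function name extract_awk is invented for illustration. Newlines are treated as spaces for matching only, and [ \t] is used instead of [[:space:]] because some older awk builds lack character classes.

```shell
# Hypothetical helper: print the word after hello, or the input unchanged.
extract_awk() {
    printf '%s\n' "$1" | awk '
    {
        orig   = (NR == 1 ? $0 : orig "\n" $0)    # original text, newlines kept
        joined = (NR == 1 ? $0 : joined " " $0)   # same text, newlines as spaces
    }
    END {
        if (match(joined, /hello[ \t]*[a-z]+/)) {
            word = substr(joined, RSTART, RLENGTH)
            sub(/^hello[ \t]*/, "", word)          # keep only the captured part
            print word
        } else {
            print orig
        }
    }'
}

extract_awk 'well hello world extra words'
extract_awk 'well hello'
```
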
