Is it possible to escape regex metacharacters reliably with sed

Posted by 3mpgtkmj on 2022-12-24 in Other


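I'm wondering whether it is possible to write a 100% reliable sed command that escapes any regex metacharacters in an input string, so that the string can be used in a subsequent sed command. For example: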

#!/bin/bash
# Trying to replace one regex by another in an input file with sed

search="/abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3"
replace="/xyz\n\t[0-9]\+\([^ ]\)\{2,3\}\3"

# Sanitize input
search=$(sed 'script to escape' <<< "$search")
replace=$(sed 'script to escape' <<< "$replace")

# Use it in a sed command
sed "s/$search/$replace/" input

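I know that there are better tools for working with fixed strings instead of patterns, for example awk, perl or python. I would just like to prove whether it is possible or not with sed. I would say let's concentrate on basic POSIX regexes to have even more fun! :)
I have tried a lot of things, but every time I could find an input that broke my attempt. I thought that keeping it abstract as 'script to escape' would not lead anybody in the wrong direction.
By the way, the discussion came up here. I thought this could be a good place to collect solutions and possibly break and/or elaborate on them.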

regex

Source: https://stackoverflow.com/questions/29613304/is-it-possible-to-escape-regex-metacharacters-reliably-with-sed

4 Answers


kqlmhetl1#

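Note:

If you are looking for prepackaged functionality based on the techniques discussed in this answer:
bash functions that enable robust escaping, even in *multi-line* substitutions, can be found at the bottom of this post (plus a perl solution that uses perl's built-in support for such escaping).
@EdMorton's answer contains a tool (a bash script) that robustly performs *single-line* substitutions.
Ed's answer now has an *improved* version of the sed command used below, corrected in calestyo's answer, which is needed if you want to escape string literals for potential use with *other* regex-processing tools, such as awk and perl. In short: for cross-tool use, \ must be escaped as \\ rather than as [\], which means that instead of the

sed 's/[^^]/[&]/g; s/\^/\\^/g'

command used below, you must use the corrected form given in calestyo's answer:

sed 's/[^\^]/[&]/g; s/[\^]/\\&/g'

All snippets below assume bash as the shell (POSIX-compliant reformulations are possible):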

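Single-line solutions

Escaping a string literal for use as a *regex* in `sed`:

To give credit where credit is due: I found the regex used below in this answer.
Assuming that the search string is a *single*-line string: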

search='abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3'  # sample input containing metachars.

searchEscaped=$(sed 's/[^^]/[&]/g; s/\^/\\^/g' <<<"$search") # escape it.

sed -n "s/$searchEscaped/foo/p" <<<"$search" # Echoes 'foo'

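Every character except ^ is placed in its own character set [...] expression so that it is treated as a literal.
Note that ^ is the one character you *cannot* represent as [^], because it has special meaning in that position (negation).
Then, ^ characters are escaped as \^.
Note that you cannot simply escape every character by putting a \ in front of it, because that can turn a literal character into a metacharacter: for example, \< and \b are word boundaries in some tools, \n is a newline, \{ is the start of an RE interval such as \{1,3\}, and so on.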
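To make the last point concrete, here is a minimal sketch (the variable names are mine, not from the answer) that contrasts naive backslash-prefixing with the bracket approach for the one-character string n. It is written in POSIX style with printf pipes instead of <<< so that it also runs under plain sh:

```shell
# Naive approach: put a backslash in front of every character.
naive=$(printf '%s\n' 'n' | sed 's/./\\&/g')
printf '%s\n' "$naive"                        # \n

# As a regex, \n means "newline", so the literal n is no longer matched.
printf 'n\n' | sed -n "s/$naive/MATCH/p"      # prints nothing

# Bracket approach: the n stays a literal.
bracketed=$(printf '%s\n' 'n' | sed 's/[^^]/[&]/g; s/\^/\\^/g')
printf 'n\n' | sed -n "s/$bracketed/MATCH/p"  # MATCH
```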

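The approach is robust, but not efficient.

The robustness comes from *not* trying to anticipate all special regex characters, which vary across regex dialects, but from focusing on only two features shared by all regex dialects:
the ability to specify literal characters inside a character set;
the ability to escape a literal ^ as \^.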
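As an illustration of what the escaped form looks like, the following sketch wraps the command in a helper (escape_re is a hypothetical name, not part of the answer) and checks that a metacharacter such as . stops acting as a wildcard:

```shell
# escape_re is a hypothetical helper name wrapping the command shown above.
escape_re() { printf '%s\n' "$1" | sed 's/[^^]/[&]/g; s/\^/\\^/g'; }

escaped=$(escape_re 'a.b^c*')
printf '%s\n' "$escaped"                              # [a][.][b]\^[c][*]

printf '%s\n' 'a.b^c*' | sed -n "s/$escaped/MATCH/p"  # MATCH
printf '%s\n' 'axb^c*' | sed -n "s/$escaped/MATCH/p"  # prints nothing: . is literal now
```

Every output character comes from one of the two features listed above, which is why the result does not depend on knowing which characters a given dialect treats as special.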

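Escaping a string literal for use as the *replacement string* in `sed`'s `s///` command:

The replacement string in a sed s/// command is not a regex, but it recognizes *placeholders* that refer either to the entire string matched by the regex (&) or to specific capture-group results by index (\1, \2, ...), so these must be escaped, along with the (customary) regex delimiter, /.
Assuming that the replacement string is a *single*-line string: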

replace='Laurel & Hardy; PS\2' # sample input containing metachars.

replaceEscaped=$(sed 's/[&/\]/\\&/g' <<<"$replace") # escape it

sed -n "s/.*/$replaceEscaped/p" <<<"foo" # Echoes $replace as-is
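For comparison, this small sketch shows what happens without escaping and what the escaped string looks like. The helper name escape_subst is an assumption of mine; the sed command inside it is the one from the snippet above:

```shell
# escape_subst is a hypothetical helper name wrapping the command shown above.
escape_subst() { printf '%s\n' "$1" | sed 's/[&/\]/\\&/g'; }

# Unescaped, & is expanded to the matched text:
printf 'foo\n' | sed 's/.*/Laurel & Hardy/'   # Laurel foo Hardy

replace='Laurel & Hardy; PS\2'
escaped=$(escape_subst "$replace")
printf '%s\n' "$escaped"                      # Laurel \& Hardy; PS\\2
printf 'foo\n' | sed "s/.*/$escaped/"         # Laurel & Hardy; PS\2
```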

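Multi-line solutions

Escaping a MULTI-LINE string literal for use as a *regex* in `sed`:

Note: This only makes sense if *multiple input lines* (possibly all) have been read before attempting to match.

Since tools such as sed and awk operate on a *single* line at a time by default, extra steps are needed to make them read more than one line at a time.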

# Define sample multi-line literal.
search='/abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3
/def\n\t[A-Z]\+\([^ ]\)\{3,4\}\4'

# Escape it.
searchEscaped=$(sed -e 's/[^^]/[&]/g; s/\^/\\^/g; $!a\'$'\n''\\n' <<<"$search" | tr -d '\n')           #'

# Use in a Sed command that reads ALL input lines up front.
# If ok, echoes 'foo'
sed -n -e ':a' -e '$!{N;ba' -e '}' -e "s/$searchEscaped/foo/p" <<<"$search"

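The newlines in multi-line input strings must be translated to '\n' *strings*, which is how newlines are encoded in a regex.
$!a\'$'\n''\\n' appends the *string* '\n' to every output line but the last (the last newline is ignored, because it was added by <<<).
tr -d '\n' then removes all *actual* newlines from the string (sed adds one whenever it prints its pattern space), effectively replacing all newlines in the input with '\n' strings.
-e ':a' -e '$!{N;ba' -e '}' is the POSIX-compliant form of a sed idiom that reads *all* input lines in a loop, therefore leaving subsequent commands to operate on all input lines at once.
If you are using *GNU* sed (only), you can use its -z option to simplify reading all input lines at once: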

sed -z "s/$searchEscaped/foo/" <<<"$search"  # GNU sed only
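To see the intermediate result, here is a sketch of the same pipeline on a short two-line sample. It is a POSIX-style restatement (my assumption, not the answer's exact command): the a\ command and its text are passed as two separate -e fragments instead of using bash's $'\n' quoting.

```shell
search='a.b
c^d'

# Escape each line, append the two-character string \n after every line but the
# last, then delete the real newlines so the result is a single-line regex.
escaped=$(printf '%s\n' "$search" |
  sed -e 's/[^^]/[&]/g; s/\^/\\^/g' -e '$!a\' -e '\\n' | tr -d '\n')
printf '%s\n' "$escaped"      # [a][.][b]\n[c]\^[d]

# Read all input lines into the pattern space before matching.
printf '%s\n' "$search" |
  sed -n -e ':a' -e '$!{N;ba' -e '}' -e "s/$escaped/foo/p"   # foo
```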

Escaping a MULTI-LINE string literal for use as the *replacement string* in `sed`'s `s///` command:

# Define sample multi-line literal.
replace='Laurel & Hardy; PS\2
Masters\1 & Johnson\2'

# Escape it for use as a Sed replacement string.
IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' <<<"$replace")
replaceEscaped=${REPLY%$'\n'}

# If ok, outputs $replace as is.
sed -n "s/\(.*\) \(.*\)/$replaceEscaped/p" <<<"foo bar"

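Newlines in the input string must be retained as actual newlines, but \-escaped.
-e ':a' -e '$!{N;ba' -e '}' is the POSIX-compliant form of a sed idiom that reads *all* input lines in a loop.
's/[&/\]/\\&/g escapes all &, \ and / instances, as in the single-line solution.
s/\n/\\&/g' then \-prefixes all actual newlines.
IFS= read -d '' -r is used to read the sed command's output *as is* (to avoid the automatic removal of trailing newlines that a command substitution ($(...)) would perform).
${REPLY%$'\n'} then removes a *single* trailing newline, which the <<< has implicitly appended to the input.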
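The escaped form is easier to picture with a concrete value. This sketch uses a plain command substitution, which is only safe here because the sample string does not end in a newline; the sample text and variable names are mine:

```shell
replace='You & I
A\1/B'

# Slurp all lines, escape & / \, then put a backslash before each real newline.
escaped=$(printf '%s\n' "$replace" |
  sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g')

# The escaped text is still two lines; the first now ends in a backslash.
printf '%s\n' "$escaped"

# Substituting it back should reproduce $replace unchanged.
printf 'foo\n' | sed "s/.*/$escaped/"
```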

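`bash` functions based on the above (for `sed`):

quoteRe() quotes (escapes) for use in a *regex*.
quoteSubst() quotes for use in the *substitution string* of a s/// call.
Both handle *multi-line* input correctly.
Note that because sed reads a *single* line at a time by default, use of quoteRe() with multi-line strings only makes sense in sed commands that explicitly read multiple (or all) lines at once.
Also, using command substitutions ($(...)) to call the functions won't work for strings that have *trailing* newlines; in that case, use something like IFS= read -d '' -r escapedValue < <(quoteSubst "$value")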

# SYNOPSIS
#   quoteRe <text>
quoteRe() { sed -e 's/[^^]/[&]/g; s/\^/\\^/g; $!a\'$'\n''\\n' <<<"$1" | tr -d '\n'; }

# SYNOPSIS
#  quoteSubst <text>
quoteSubst() {
  IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' <<<"$1")
  printf %s "${REPLY%$'\n'}"
}

Example:

from=$'Cost\(*):\n$3.' # sample input containing metachars. 
to='You & I'$'\n''eating A\1 sauce.' # sample replacement string with metachars.

# Should print the unmodified value of $to
sed -e ':a' -e '$!{N;ba' -e '}' -e "s/$(quoteRe "$from")/$(quoteSubst "$to")/" <<<"$from"

Note the use of -e ':a' -e '$!{N;ba' -e '}' to read all input at once, so that the multi-line substitution works.
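The caveat about *trailing* newlines and command substitution can be reproduced in a few lines. The sketch below is not the answer's read -d '' technique; it swaps in a common POSIX workaround (appending a sentinel character inside the command substitution), and all variable names are hypothetical:

```shell
value='You & I
'   # the string ends with a newline

escape_cmd() { sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g'; }

# Plain $(...) strips every trailing newline, leaving a dangling backslash,
# so "s/.*/$lossy/" would no longer be a valid sed command.
lossy=$(printf '%s\n' "$value" | escape_cmd)
printf '%s\n' "$lossy"        # You \& I\

# Sentinel trick: emit an extra x so nothing is stripped, then remove the x
# and the single newline that printf '%s\n' added.
kept=$(printf '%s\n' "$value" | escape_cmd; printf x)
kept=${kept%x}
kept=${kept%?}

# The replacement now ends with an escaped newline, as intended.
result=$(printf 'foo\n' | sed "s/.*/$kept/"; printf x)
```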

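`perl` solution:

Perl has built-in support for escaping arbitrary strings for literal use in a regex: the quotemeta() function or its equivalent \Q...\E quoting.

The approach is the same for both single- and multi-line strings; for example: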

from=$'Cost\(*):\n$3.' # sample input containing metachars.
to='You owe me $1/$& for'$'\n''eating A\1 sauce.' # sample replacement string w/ metachars.

# Should print the unmodified value of $to.
# Note that the replacement value needs NO escaping.
perl -s -0777 -pe 's/\Q$from\E/$to/' -- -from="$from" -to="$to" <<<"$from"

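Note the use of -0777 to read all input at once, so that the multi-line substitution works.
The -s option allows placing -<var>=<val>-style Perl variable definitions after --, following the script and before any filename operands.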


qhhrdooz2#

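Building on @mklement0's answer in this thread, the following tool will replace any single-line string (as opposed to a regexp) with any other single-line string, using sed and bash: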

$ cat sedstr
#!/bin/bash
old="$1"
new="$2"
file="${3:--}"
escOld=$(sed 's/[^^\\]/[&]/g; s/\^/\\^/g; s/\\/\\\\/g' <<< "$old")
escNew=$(sed 's/[&/\]/\\&/g' <<< "$new")
sed "s/$escOld/$escNew/g" "$file"

To illustrate the need for this tool, consider trying to replace a.*/b{2,}\nc with d&e\1f by calling sed directly:

$ cat file
a.*/b{2,}\nc
axx/bb\nc

$ sed 's/a.*/b{2,}\nc/d&e\1f/' file  
sed: -e expression #1, char 16: unknown option to `s'
$ sed 's/a.*\/b{2,}\nc/d&e\1f/' file
sed: -e expression #1, char 23: invalid reference \1 on `s' command's RHS
$ sed 's/a.*\/b{2,}\nc/d&e\\1f/' file
a.*/b{2,}\nc
axx/bb\nc
# .... and so on, peeling the onion ad nauseam until:
$ sed 's/a\.\*\/b{2,}\\nc/d\&e\\1f/' file
d&e\1f
axx/bb\nc

or use the above tool:

$ sedstr 'a.*/b{2,}\nc' 'd&e\1f' file  
d&e\1f
axx/bb\nc

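The reason this is useful is that it can easily be extended to use word delimiters for replacing whole words if necessary, for example in GNU sed syntax: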

sed "s/\<$escOld\>/$escNew/g" "$file"

whereas tools that actually operate on strings (e.g. awk's index()) cannot use word delimiters.
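A quick sketch of the word-delimiter idea, assuming GNU sed (for \< and \>) and using the bracket-escaping command from the first answer; the sample text is made up:

```shell
old='a.b'
escOld=$(printf '%s\n' "$old" | sed 's/[^^]/[&]/g; s/\^/\\^/g')

# Only the standalone word a.b is replaced; xa.b and axb are left alone.
printf '%s\n' 'a.b xa.b axb' | sed "s/\<$escOld\>/NEW/g"   # NEW xa.b axb
```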
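NOTE: the reason for not wrapping \ in a bracket expression is that if you use a tool that accepts [\]] as a literal ] inside a bracket expression (e.g. perl and most awk implementations) to do the actual final substitution (i.e. instead of sed "s/$escOld/$escNew/g"), then you could not use the approach of:

sed 's/[^^]/[&]/g; s/\^/\\^/g'

to escape \ by enclosing it in [], because then \x would become [\][x], which means \ or ] or [ or x. Instead you would need: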

sed 's/[^^\\]/[&]/g; s/\^/\\^/g; s/\\/\\\\/g'

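So while [\] is probably OK for all current sed implementations, we know that \\ will work for all sed, awk, perl, etc. implementations, and so this form of escaping is used.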


z0qdvdin3#

One should note that the regular expression used in some of the answers above (in this one and that one):

's/[^^\\]/[&]/g; s/\^/\\^/g; s/\\/\\\\/g'

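seems to be wrong.

Doing s/\^/\\^/g first, followed by s/\\/\\\\/g, is an error, because any ^ that was first escaped to \^ will then have its \ escaped again.

A better way seems to be: 's/[^\^]/[&]/g; s/[\^]/\\&/g;'.

[^^\\] with sed (BRE/ERE) should be just [^\^] (or [^^\]). \ has no special meaning inside a bracket expression and need not be quoted.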
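The double-escaping problem is easy to reproduce. In this sketch, buggy and fixed are hypothetical function names wrapping the two commands quoted in this answer:

```shell
buggy() { printf '%s\n' "$1" | sed 's/[^^\\]/[&]/g; s/\^/\\^/g; s/\\/\\\\/g'; }
fixed() { printf '%s\n' "$1" | sed 's/[^\^]/[&]/g; s/[\^]/\\&/g'; }

b=$(buggy 'a^b')
f=$(fixed 'a^b')
printf '%s\n' "$b"    # [a]\\^[b]  (the backslash added for ^ was doubled)
printf '%s\n' "$f"    # [a]\^[b]

printf '%s\n' 'a^b' | sed -n "s/$b/MATCH/p"                # prints nothing
printf '%s\n' 'a^b' | sed -n "s/$f/MATCH/p"                # MATCH
printf '%s\n' 'a\b' | sed -n "s/$(fixed 'a\b')/MATCH/p"    # MATCH
```

Because the fixed version escapes \ and ^ in a single pass, no character is ever processed twice.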


muk1a3rh4#

Bash parameter expansion can be used to escape a string for use as a sed replacement string:

# Define a sample multi-line literal. Includes a trailing newline to test corner case
replace='a&b;c\1
d/e
'

# Escape it for use as a Sed replacement string.
: "${replace//\\/\\\\}"
: "${_//&/\\\&}"
: "${_//\//\\\/}"
: "${_//$'\n'/\\$'\n'}"
replaceEscaped=$_

# Output should match "$replace"
sed -n "s/.*/$replaceEscaped/p" <<<''

In bash 5.2+ this can be simplified further:

# Define a sample multi-line literal. Includes a trailing newline to test corner case
replace='a&b;c\1
d/e
'

# Escape it for use as a Sed replacement string.
shopt -s extglob
shopt -s patsub_replacement # An & in the replacement will expand to what matched. bash 5.2+
: "${replace//@(&|\\|\/|$'\n')/\\&}"
replaceEscaped=$_

# Output should match "$replace"
sed -n "s/.*/$replaceEscaped/p" <<<''

Wrapped in a bash function:

##
# escape_replacement -v var replacement
#
# Escape special characters in _replacement_ so that it can be
# used as the replacement part in a sed substitute command.
# Store the result in _var_.
escape_replacement() {
  if ! [[ $# = 3 && $1 = '-v' ]]; then
    echo "escape_replacement: invalid usage" >&2
    echo "escape_replacement: usage: escape_replacement -v var replacement" >&2
    return 1
  fi
  local -n var=$2 # nameref (requires Bash 4.3+)
  # We use the : command (true builtin) as a dummy command as we  
  # trigger a sequence of parameter expansions
  # We exploit that the $_ variable (last argument to the previous command
  # after expansion) contains the result of the previous parameter expansion
  : "${3//\\/\\\\}" # Backslash-escape any existing backslashes
  : "${_//&/\\\&}"  # Backslash-escape &
  : "${_//\//\\\/}" # Backslash-escape the delimiter (we assume /)
  : "${_//$'\n'/\\$'\n'}" # Backslash-escape newline
  var=$_ # Assign to the nameref
  # To support Bash older than 4.3, the following can be used instead of nameref
  #eval "$2=\$_" # Use eval instead of nameref https://mywiki.wooledge.org/BashFAQ/006
}

# Test the function
# =================

# Define a sample multi-line literal. Include a trailing newline to test corner case
replace='a&b;c\1
d/e
'

escape_replacement -v replaceEscaped "$replace"

# Output should match "$replace"
sed -n "s/.*/$replaceEscaped/p" <<<''
