PowerShell regex: extracting the text between two strings

Posted by inn6fuwd on 2022-11-18 in Shell

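I know this question has been asked before, but I could not get any of the answers I found to work. I have a JSON file that is several thousand lines long, and I want to extract the text between two strings every time they occur (which is a lot).
As a simple example, my JSON looks like this: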

"customfield_11300": null,
    "customfield_11301": [
      {
        "self": "xxxxxxxx",
        "value": "xxxxxxxxx",
        "id": "10467"
      }
    ],
    "customfield_10730": null,
    "customfield_11302": null,
    "customfield_10720": 0.0,
    "customfield_11300": null,
    "customfield_11301": [
      {
        "self": "zzzzzzzzzzzzz",
        "value": "zzzzzzzzzzz",
        "id": "10467"
      }
    ],
    "customfield_10730": null,
    "customfield_11302": null,
    "customfield_10720": 0.0,

So I want to output everything between "customfield_11301" and "customfield_10730":

{
        "self": "xxxxxxxx",
        "value": "xxxxxxxxx",
        "id": "10467"
      }
    ],
      {
        "self": "zzzzzzzzzzzzz",
        "value": "zzzzzzzzzzz",
        "id": "10467"
      }
    ],

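I'm trying to keep this as simple as possible, so never mind the brackets shown in the output.
This is what I have (it outputs far more than I want):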

$importPath = "todays_changes.txt"
$pattern = "customfield_11301(.*)customfield_10730"

$string = Get-Content $importPath
$result = [regex]::match($string, $pattern).Groups[1].Value
$result

Source: https://stackoverflow.com/questions/36746272/powershell-extract-text-between-two-strings

5 Answers

q7solyqu1#

The quick answer: change the greedy capture (.*) to a non-greedy one, (.*?). That should do it.

customfield_11301(.*?)customfield_10730

Otherwise the capture eats as much as it can, which makes it carry on until the last customfield_10730.
Regards
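
To see the difference in isolation, here is a minimal sketch that runs both patterns against a shortened, single-line stand-in for the sample data. The `$text` value is made up for illustration and is not the real file content.

```powershell
# Hypothetical one-line stand-in for the sample data
$text = 'customfield_11301 AAA customfield_10730 customfield_11301 BBB customfield_10730'

# Greedy: (.*) runs on to the LAST customfield_10730
$greedy = [regex]::Match($text, 'customfield_11301(.*)customfield_10730').Groups[1].Value

# Non-greedy: (.*?) stops at the FIRST customfield_10730
$lazy = [regex]::Match($text, 'customfield_11301(.*?)customfield_10730').Groups[1].Value

$greedy   # ' AAA customfield_10730 customfield_11301 BBB '
$lazy     # ' AAA '
```

Note that [regex]::Match() returns only the first match; getting every occurrence needs [regex]::Matches().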

wf82jlnq2#

Here is a PowerShell function that finds the text between two strings.

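function GetStringBetweenTwoStrings($firstString, $secondString, $importPath){

    # Read the whole file as one string (-Raw), not as an array of lines
    $file = Get-Content -Raw $importPath

    # Non-greedy pattern; [regex]::Escape treats both delimiters as literal text,
    # and (?s) lets "." match newlines so the match can span lines
    $pattern = "(?s)$([regex]::Escape($firstString))(.*?)$([regex]::Escape($secondString))"

    # Perform the operation (first match only)
    $result = [regex]::Match($file, $pattern).Groups[1].Value

    # Return result
    return $result

}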

You can then run the function like this:

GetStringBetweenTwoStrings -firstString "Lorem" -secondString "is" -importPath "C:\Temp\test.txt"

My test.txt file contains the following text:
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
So my result is:
Ipsum
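
The function above returns only the first match. If every occurrence is needed, as in the question, a variant along the following lines might work. The name `Get-AllStringsBetween` and its parameters are invented for this sketch, and it takes the text directly instead of a file path so it can be tried without a file.

```powershell
function Get-AllStringsBetween {
    param(
        [string] $Text,
        [string] $FirstString,
        [string] $SecondString
    )

    # Escape both delimiters so they are matched literally;
    # (?s) lets "." match newlines, (.*?) keeps each match as short as possible
    $pattern = '(?s)' + [regex]::Escape($FirstString) + '(.*?)' + [regex]::Escape($SecondString)

    # Emit the captured group of every match
    foreach ($m in [regex]::Matches($Text, $pattern)) {
        $m.Groups[1].Value
    }
}

# Made-up two-line input with two occurrences
$found = @(Get-AllStringsBetween -Text "a<1>b`na<2>b" -FirstString 'a<' -SecondString '>b')
$found   # 1 and 2
```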

z18hc3ub3#

You need to make your regex lazy:

customfield_11301(.*?)customfield_10730

Live Demo on Regex101
Your regex is greedy. That means it finds customfield_11301 and then carries on until it reaches the last customfield_10730.
Here is a simpler example of greedy vs. lazy regex:

# Regex (Greedy): \[(.*)\]
# Input:          [foo]and[bar]
# Output:         foo]and[bar

# Regex (Lazy):   \[(.*?)\]
# Input:          [foo]and[bar]
# Output:         "foo" and "bar" separately

Your regex is very similar to the first one: it captures too much. The new, lazy regex captures as little as possible, so it works as you intended.
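
The bracket example can be run as-is in PowerShell. This is only a small sketch; the variable names are arbitrary, and the literal brackets are escaped because an unescaped [ starts a character class.

```powershell
$inputText = '[foo]and[bar]'

# Greedy: a single match that runs from the first "[" to the last "]"
$greedy = [regex]::Matches($inputText, '\[(.*)\]') | ForEach-Object { $_.Groups[1].Value }

# Lazy: one match per bracket pair
$lazy = [regex]::Matches($inputText, '\[(.*?)\]') | ForEach-Object { $_.Groups[1].Value }

$greedy   # foo]and[bar
$lazy     # foo, bar
```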

vmdwslir4#

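The first problem is that Get-Content sends the file down the pipeline line by line rather than as one block of text. You can pipe Get-Content to Out-String to get the whole content as a single string, and then run the regex against that.
A working solution for your problem is:
Get-Content .\todays_changes.txt | Out-String | % {[Regex]::Matches($_, "(?<=customfield_11301)((.|\n)*?)(?=customfield_10730)")} | % {$_.Value}
The output is: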

": [
  {
    "self": "xxxxxxxx",
    "value": "xxxxxxxxx",
    "id": "10467"
  }
],
"

": [
  {
    "self": "zzzzzzzzzzzzz",
    "value": "zzzzzzzzzzz",
    "id": "10467"
  }
],
"
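
The same pipeline can be tried without a file. In this sketch the `$lines` array is a made-up stand-in for what Get-Content returns, and Trim() is added only to strip the line breaks that surround each match.

```powershell
# Stand-in for Get-Content output: one array element per line (made-up data)
$lines = 'customfield_11301', 'first', 'customfield_10730',
         'customfield_11301', 'second', 'customfield_10730'

# Out-String joins the lines into a single string, each line followed by a newline
$joined = $lines | Out-String

# Same look-around pattern as above; (.|\n)*? lets the match cross line breaks
$found = [regex]::Matches($joined, '(?<=customfield_11301)((.|\n)*?)(?=customfield_10730)') |
    ForEach-Object { $_.Value.Trim() }

$found   # first, second
```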

6mzjoqzu5#

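As an aside: since your input appears to be JSON, it is usually best to parse it into an object graph with ConvertFrom-Json, which is easy to query; however, your JSON appears to be nonstandard in that it contains duplicate property names.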
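
If the data can be obtained as well-formed JSON, querying the parsed objects could look roughly like this. The `issues` wrapper and the record layout are assumptions made for this sketch; they are not taken from the question's file.

```powershell
# Made-up, well-formed JSON: each record sits in its own object
$json = @'
{
  "issues": [
    { "customfield_11301": [ { "value": "xxxxxxxxx", "id": "10467" } ], "customfield_10730": null },
    { "customfield_11301": [ { "value": "zzzzzzzzzzz", "id": "10467" } ], "customfield_10730": null }
  ]
}
'@

$data = $json | ConvertFrom-Json

# Walk the object graph instead of pattern-matching the text
$values = $data.issues |
    ForEach-Object { $_.customfield_11301 } |
    ForEach-Object { $_.value }

$values   # xxxxxxxxx, zzzzzzzzzzz
```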
The existing answers contain good pointers, but let me try to cover all aspects in one answer:

tl;dr

# * .Matches() (plural) is used to get *all* matches
# * Get-Content -Raw reads the file *as a whole*, into a single, multiline string
# * Inline regex option (?s) makes "." match newlines too, to match *across lines*
# * (.*?) rather than (.*) makes the matching *non-greedy*.
# * Look-around assertions - (?<=...) and (?=...) - to avoid the need for capture groups.
[regex]::Matches(
  (Get-Content -Raw todays_changes.txt),
  '(?s)(?<="customfield_11301":).*?(?="customfield_10730")'
).Value

Output with the sample input:

[
      {
        "self": "xxxxxxxx",
        "value": "xxxxxxxxx",
        "id": "10467"
      }
    ],
    
 [
      {
        "self": "zzzzzzzzzzzzz",
        "value": "zzzzzzzzzzz",
        "id": "10467"
      }
    ],

For an explanation of the regex and the ability to experiment with it, see this regex101.com page.
As for what you tried:
$pattern = "customfield_11301(.*)customfield_10730"
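As noted, the main problem with that regex is that (.*) is greedy and keeps matching until the last occurrence of customfield_10730 is found; making it non-greedy, (.*?), solves that problem.
In addition, this regex does not match across multiple lines, because . does not match newlines (\n) by default. The easiest way to change that is to place the inline regex option (?s) at the start of the pattern, as shown above.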
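
The effect of (?s) can be checked with a short string that contains line breaks. This is a sketch with a made-up input; the RegexOptions overload in the last call is the .NET equivalent of the inline option.

```powershell
$text = "start`nmiddle`nend"

# Without (?s), "." stops at a newline, so nothing can match across lines
$plain = [regex]::Match($text, 'start(.*?)end').Success

# With (?s), "." also matches newlines
$single = [regex]::Match($text, '(?s)start(.*?)end').Groups[1].Value

# Same effect via the RegexOptions argument instead of the inline option
$viaOption = [regex]::Match(
    $text, 'start(.*?)end',
    [System.Text.RegularExpressions.RegexOptions]::Singleline
).Groups[1].Value

$plain    # False
$single   # newline + 'middle' + newline
```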
It was only a lucky accident that your attempt still matched across lines, as explained below:
$string = Get-Content $importPath
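This stores an array of strings in $string, with each element representing one line of the input file.
To read the file's content as a whole into a single, multiline string, use Get-Content's -Raw switch: $string = Get-Content -Raw $importPath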
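
The difference between the two read modes is easy to observe with a throwaway file. The file content here is made up for the sketch, and the temporary file is deleted at the end.

```powershell
# Write a small two-line temporary file to compare both read modes
$tmp = [System.IO.Path]::GetTempFileName()
Set-Content -Path $tmp -Value 'line one', 'line two'

$asLines = Get-Content $tmp        # array of strings, one per line
$asText  = Get-Content -Raw $tmp   # one string that still contains the line breaks

$asLines.Count                 # 2
$asText -is [string]           # True
$asText -match 'one\r?\nline'  # True: the line break survives in the raw string

Remove-Item $tmp
```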
$result = [regex]::match($string, $pattern).Groups[1].Value
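Because your $string variable contains an array of strings, PowerShell implicitly stringifies it when passing it to the [string]-typed input parameter of the [regex]::Match() method, which in effect creates a single-line representation, because the array elements are joined with spaces (by default; you can specify a different separator with $OFS, but that is rarely done in practice).
For instance, the following two calls are, surprisingly, equivalent: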

[regex]::Match('one two', 'e t').Value # -> 'e t'

# !! Ditto, because array @('one', 'two') stringifies to 'one two'
[regex]::Match(@('one', 'two'), 'e t').Value # -> 'e t'
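
The separator behavior can also be seen directly with string interpolation. In this sketch $OFS is set inside a script block so the change stays local to that scope; the variable names are arbitrary.

```powershell
$lines = @('one', 'two')

# Default: elements are joined with a single space when the array is stringified
$default = "$lines"

# With $OFS set in a child scope, the separator changes for that scope only
$custom = & { $OFS = "`n"; "$lines" }

# An explicit -join avoids relying on $OFS at all
$joined = $lines -join "`n"

$default   # one two
$custom    # 'one' and 'two' separated by a newline
```

Joining explicitly with -join, or reading the file with -Raw in the first place, makes the intent visible instead of depending on implicit stringification.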
