PowerShell: extract URLs from multiple web pages

h6my8fg2 · asked 12 months ago · tagged Shell

I want to extract URLs from several domains and save the unique values to one txt file. The URLs come in different formats: some have http, some https, some a leading 127.0.0.1. I only want the URL itself with the prefix removed, especially the "127.0.0.1". I tried the PowerShell script below, but it gives me no results at all. Any help fixing it is appreciated.

$threatFeedUrls = @("https://raw.githubusercontent.com/DandelionSprout/adfilt/master/Alternate%20versions%20Anti-Malware%20List/AntiMalwareHosts.txt",
                    "https://osint.digitalside.it/Threat-Intel/lists/latestdomains.txt")

#Initialize an array to store all extracted URLs
$allUrls = @()

#Loop through the lists of URLs
foreach ($url in $threatFeedUrls) {

# Download the threat feed data
$threatFeedData = Invoke-RestMethod -Uri $threatFeedUrl

# Define a regular expression pattern to match URLs starting with '127.0.0.1'
$pattern = '127\.0\.0\.1 ([^\s]+)'

# Use the regular expression to find matches in the threat feed data
$matchList = [regex]::Matches($threatFeedData, $pattern)

# Create and populate the list with matched URLs
$urlList = foreach ($match in $matchList) {
    $match.Groups[1].Value
}

# Specify the output file path
$outputFilePath = 'output250.txt'
  
# Save the URLs to the output file
$urlList | Out-File -FilePath $outputFilePath

Write-Host "URLs starting with '127.0.0.1' extracted from threat feed have been saved to $outputFilePath."
}

I wrote this PS script to extract all the URLs, but the output is not what I expected. I want to extract all URLs from the listed domains, remove duplicates, and save them to a single txt file.

kuuvgm7e #1

You can try this:

# Define the URLs to get
$threatFeedUrls = @(
    "https://raw.githubusercontent.com/DandelionSprout/adfilt/master/Alternate%20versions%20Anti-Malware%20List/AntiMalwareHosts.txt",
    "https://osint.digitalside.it/Threat-Intel/lists/latestdomains.txt"
)
# Get all the raw files
$Result = $threatFeedUrls | foreach {Irm -Uri $_ -UseBasicParsing} 

# Filter out comments and empty lines
$OnlyInterestingLines = $Result -split "`n" | where {$_ -notmatch "^(#|\s|$)" }

# Remove "127.0.0.1" plus any following whitespace at the beginning of a line,
# then strip everything up to a ":" and up to two "/" when they match
# (i.e. any scheme prefix), sort, and return only unique addresses
$urlList = $OnlyInterestingLines -replace "^127\.0\.0\.1\s*" -replace "^.*:/?/?" | Sort-Object -Unique

# Specify the output file path
$outputFilePath = 'output250.txt'

# Save the URLs to the output file
$urlList | Out-File -FilePath $outputFilePath

The result is 21,780 lines of hostnames.
Edit: maybe you also want to remove any kind of protocol prefix, so I now strip those as well. Lines such as https:youaddress.com, http:youaddress.com, https://youaddress.com, ftp:youaddress.com, socks:youaddress.com and yourtest:/youaddress.com will all come back as youaddress.com.
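For readers who want to check the replace chain in isolation, here is a minimal Python sketch of the same two regexes (the sample lines and the `extract_host` helper are made up for illustration; note that PowerShell's -replace is case-insensitive by default while Python's re.sub is not):

```python
import re

# Made-up sample lines covering both feed formats:
# hosts-file style, scheme-prefixed, and bare domains (with one duplicate)
lines = [
    "127.0.0.1 ads.example.com",
    "https://tracker.example.net",
    "ftp:mirror.example.org",
    "ads.example.com",
]

def extract_host(line: str) -> str:
    # Mirrors -replace "^127\.0\.0\.1\s*": drop a leading hosts-file address
    line = re.sub(r"^127\.0\.0\.1\s*", "", line)
    # Mirrors -replace "^.*:/?/?": drop everything up to a ":" plus up to two "/"
    line = re.sub(r"^.*:/?/?", "", line)
    return line

# Mirrors Sort-Object -Unique: deduplicate, then sort
unique_hosts = sorted({extract_host(line) for line in lines})
print(unique_hosts)
# → ['ads.example.com', 'mirror.example.org', 'tracker.example.net']
```

Because `.*` in the second pattern is greedy, everything up to the last ":" on the line is removed, which is why schemes with zero, one, or two slashes all collapse to the bare hostname.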
