Reducing processing time when handling a large CSV

rnmwe5a2  posted on 2023-07-31  in  Other

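The script works fine and outputs exactly what I need.
My problem arises when I have a large CSV file to process (about 500 MB, roughly 6 million lines).
The script takes a very long time to run. I know that processing this much data takes a while, but I would like to know whether there is a way to speed it up. Here is the condensed script: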

# Param must be the first statement in a script, so the default value is set here
Param([string]$DnsFilePath = "C:\dns.log")
If (Test-Path $DnsFilePath) 
    { 
        $FileInfo = Get-ChildItem -Path $DnsFilePath
        $Ans = Read-Host "Do you want to continue(y/n)?"
        
        If ($Ans -eq 'y')
            {
                If (!($SkipLines)) { Write-Host "Processing..."; }
                $i = 0; ## Set to count the number of records;
                $Timer= [Diagnostics.Stopwatch]::StartNew() ## Start the timer
                $ArrayOfStrings = [System.Collections.ArrayList]@()

                Switch -regex ([System.IO.File]::ReadLines($FileInfo.fullname)) {
                    ' UDP Rcv ' {
                        $Datetime = [regex]::matches($switch.current,'\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{1,2}:\d{1,2} (AM|PM)').Value
                        $IP = [regex]::matches($switch.current,'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b').Value
                        $FQDN = [regex]::matches($switch.current,"\)[A-z0-9-_]*\(").Value  -replace "\)|\(","" -join "."
                        [void]$ArrayOfStrings.Add("$Datetime,$IP,$FQDN")
                        $i++;
                    }
                }
                $OutFilePath = "$($FileInfo.DirectoryName)\$($FileInfo.BaseName)_Parsed.txt"
                [System.IO.File]::WriteAllLines($OutFilePath, $ArrayOfStrings)
                $Timer.stop()
                Write-host "Total time elapsed: $($Timer.Elapsed.ToString('hh\:mm\:ss\.ff'))"
                Write-Host "Number of Record Processed: $i"
                Write-Host "Parsed File created successfully at $OutFilePath"
            }
        else
            { Write-Host "Script exits." }
    }
Else
    {
    Write-Host -fore Red "File does not exist in the following location: $DnsFilePath. Script exits."
    }

Sample DNS log:

DNS Server log file creation at 7/10/2023 10:55:42 AM
Log file wrap at 7/10/2023 10:55:42 AM

Message logging key (for packets - other items use a subset of these fields):
    Field #  Information         Values
    -------  -----------         ------
       1     Date
       2     Time
       3     Thread ID
       4     Context
       5     Internal packet identifier
       6     UDP/TCP indicator
       7     Send/Receive indicator
       8     Remote IP
       9     Xid (hex)
      10     Query/Response      R = Response
                                 blank = Query
      11     Opcode              Q = Standard Query
                                 N = Notify
                                 U = Update
                                 ? = Unknown
      12     [ Flags (hex)
      13     Flags (char codes)  A = Authoritative Answer
                                 T = Truncated Response
                                 D = Recursion Desired
                                 R = Recursion Available
      14     ResponseCode ]
      15     Question Type
      16     Question Name

7/10/2023 10:55:42 AM 1B7C PACKET  000001D9D88C68D0 UDP Rcv 8.8.8.8         5fb1 R Q [8381   DR NXDOMAIN] A      (3)www(12)autodiscover(5)st1ad(4)emea(15)microsoftonline(3)com(0)

7/10/2023 10:55:42 AM 1B7C PACKET  000001D9D775F890 UDP Snd 10.x.x.x     92cb R Q [8381   DR NXDOMAIN] A      (3)www(12)autodiscover(5)st1ad(4)emea(15)microsoftonline(3)com(0)

7/10/2023 10:55:42 AM 1B7C PACKET  000001D9E4E338D0 UDP Rcv 10.x.x.x  a9bd   Q [0001   D   NOERROR] A      (18)addinsinstallation(5)store(6)office(3)com(0)

7/10/2023 10:55:42 AM 1B7C PACKET  000001D9D775F890 UDP Snd 8.8.8.8         afda   Q [0001   D   NOERROR] A      (23)prod-addinsinstallation(15)omexexternallfb(6)office(3)net(6)akadns(3)net(0)

7/10/2023 10:55:42 AM 1B78 PACKET  000001D9E182BB80 UDP Rcv 10.x.x.x  d229   Q [0001   D   NOERROR] SOA    (15)pc_host01(7)contoso(5)local(0)

7/10/2023 10:55:42 AM 1B78 PACKET  000001D9E182BB80 UDP Snd 10.x.x.x  d229 R Q [8085 A DR  NOERROR] SOA    (15)pc_host02(7)contoso(5)local(0)

7/10/2023 10:55:42 AM 1B78 PACKET  000001D9E2A2D670 UDP Rcv 8.8.8.8         c95c R Q [8081   DR  NOERROR] A      (9)dtr-a-ncu(2)na(8)azurerms(3)com(0)

7/10/2023 10:55:42 AM 1B78 PACKET  000001D9E1998D80 UDP Snd 10.x.x.x     2047 R Q [8081   DR  NOERROR] A      (6)portal(8)azurerms(3)com(0)

7/10/2023 10:55:42 AM 1B78 PACKET  000001D9E2D07D00 UDP Rcv 10.x.x.x   788e   Q [0001   D   NOERROR] A      (2)tr(11)c1182306347(12)ip4-58f0802d(4)wgcs(7)skyhigh(5)cloud(0)

7/10/2023 10:55:42 AM 1B78 PACKET  000001D9E1998D80 UDP Snd 8.8.8.8         1c22   Q [0001   D   NOERROR] A      (2)tr(11)c1182306347(12)ip4-58f0802d(4)wgcs(7)skyhigh(5)cloud(0)

rqenqsqc #1

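Don't collect all the results in $ArrayOfStrings; write each result straight to the file instead!
To avoid having to close and reopen the file handle every time, reuse the same handle: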

$counter = 0
$OutFilePath = Join-Path $FileInfo.DirectoryName "$($FileInfo.BaseName)_Parsed.txt"

# create output file if it doesn't already exist
$OutFile = if (-not (Test-Path $OutFilePath -PathType Leaf)){
    New-Item -Path $OutFilePath -ItemType File
}
else {
    $OutFilePath |Get-Item
}

# create writable filestream object and wind the cursor to the end of the file (so we don't overwrite any existing data)
$fileStream = $OutFile.OpenWrite()
$fileStream.Seek(0, 'End') |Out-Null

# create a writer
$fileWriter = [System.IO.StreamWriter]::new($fileStream)

try {
    Switch -regex ([System.IO.File]::ReadLines($FileInfo.fullname)) {
        ' UDP Rcv ' {
            $Datetime = [regex]::matches($switch.current, '\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{1,2}:\d{1,2} (AM|PM)').Value
            $IP = [regex]::matches($switch.current, '\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b').Value
            $FQDN = [regex]::matches($switch.current, "\)[A-z0-9-_]*\(").Value -replace "\)|\(", "" -join "."
            # write straight to the file, no in-memory storage of the string
            $fileWriter.WriteLine("$Datetime,$IP,$FQDN")
            $counter++
        }
    }

    $fileWriter.Flush()
    $fileWriter.Close()
    $Timer.stop()
    Write-host "Total time elapsed: $($Timer.Elapsed.ToString('hh\:mm\:ss\.ff'))"
    Write-Host "Number of Record Processed: $counter"
    Write-Host "Parsed File created successfully at $OutFilePath"                       
}
finally {
    # clean up: dispose the writer first so it can still flush into an open stream
    $fileWriter, $fileStream |ForEach-Object Dispose
}
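
As a possible simplification of the handle setup above, a StreamWriter can also be opened directly in append mode, which creates the file if it is missing and otherwise positions writes at the end. The following is only a minimal sketch: the temp-file path ($demoPath) and the two records are hypothetical stand-ins, not part of the original script, and it has not been measured against the 500 MB log.

```powershell
# Hypothetical output path used only for this demonstration. .NET resolves
# relative paths against the process working directory, so use a full path.
$demoPath = Join-Path ([System.IO.Path]::GetTempPath()) 'dns_parsed_demo.txt'
if (Test-Path $demoPath) { Remove-Item $demoPath }

# Second argument $true = append: the file is created if it does not exist,
# otherwise new text is written after the existing content.
$writer = [System.IO.StreamWriter]::new($demoPath, $true)
try {
    # Stand-in records; in the real script these would come from the Switch block.
    foreach ($record in '7/10/2023 10:55:42 AM,8.8.8.8,www.example.com',
                        '7/10/2023 10:55:43 AM,8.8.4.4,portal.example.com') {
        $writer.WriteLine($record)
    }
}
finally {
    # Dispose flushes any buffered text and closes the underlying file handle.
    $writer.Dispose()
}
```

With this form there is a single object to dispose, so the cleanup order of stream and writer no longer matters.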
