提高处理大型CSV文件时的处理时间

huangapple go评论168阅读模式
英文:

Improve processing time when dealing with large CSV

问题

脚本运行正常,并输出我需要的内容。

当我有一个大型CSV文件要处理时(约500MB,约6M行),出现问题。

脚本运行时间很长。我明白处理这么多数据会花费一些时间,但我想知道是否有改进的方法!以下是精简的脚本部分:

$DnsFilePath = "C:\dns.log"

Param([string]$DnsFilePath)
If (Test-Path $DnsFilePath) 
{ 
    $FileInfo = Get-ChildItem -Path $DnsFilePath
    $Ans = Read-Host "Do you want to continue(y/n)?"
    
    If ($Ans -eq 'y')
    {
        If (!($SkipLines)) { Write-Host "Processing..."; }
        $i = 0; ## 设置用于计算记录数量的变量;
        $Timer= [Diagnostics.Stopwatch]::StartNew()	## 启动计时器
        $ArrayOfStrings = [System.Collections.ArrayList]@()

        Switch -regex ([System.IO.File]::ReadLines($FileInfo.fullname)) {
        ' UDP Rcv '	{
            $Datetime = [regex]::matches($switch.current,'\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{1,2}:\d{1,2} (AM|PM)').Value
            $IP = [regex]::matches($switch.current,'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b').Value
            $FQDN = [regex]::matches($switch.current,"\)[A-z0-9-_]*\(").Value  -replace "\)|\(","" -join "."
            [void]$ArrayOfStrings.Add("$Datetime,$IP,$FQDN")
            $i++;
                    }
        }
        $OutFilePath = "$($FileInfo.DirectoryName)\$($FileInfo.BaseName)_Parsed.txt"
        [System.IO.File]::WriteAllLines($OutFilePath, $ArrayOfStrings)
        $Timer.stop()
        Write-host "Total time elapsed: $($Timer.Elapsed.ToString('hh:mm:ss.ff'))"
        Write-Host "Number of Record Processed: $i"
        Write-Host "Parsed File created successfully at $OutFilePath"						
    }
    else
    { Write-Host "Script exits." }
}
Else
{
Write-Host -fore Red "File does not exist in the following location: $DnsFilePath. Script exits."
}

示例DNS日志:

DNS服务器日志文件创建于2023年7月10日上午10:55:42
日志文件在2023年7月10日上午10:55:42时包装
消息记录键(用于数据包 - 其他项使用这些字段的子集):
字段#  信息             值
-------  -----------         ------
1     日期
2     时间
3     线程ID
4     上下文
5     内部数据包标识符
6     UDP/TCP指示器
7     发送/接收指示器
8     远程IP
9     Xid(十六进制)
10     查询/响应      R = 响应
空白 = 查询
11     操作码          Q = 标准查询
N = 通知
U = 更新
? = 未知
12     [标志(十六进制)
13     标志(字符代码)  A = 授权回答
T = 截断响应
D = 期望递归
R = 递归可用
14     响应码]
15     问题类型
16     问题名称
2023年7月10日上午10:55:42 1B7C PACKET  000001D9D88C68D0 UDP Rcv 8.8.8.8         5fb1 R Q [8381   DR NXDOMAIN] A      (3)www(12)autodiscover(5)st1ad(4)emea(15)microsoftonline(3)com(0)
2023年7月10日上午10:55:42 1B7C PACKET  000001D9D775F890 UDP Snd 10.x.x.x     92cb R Q [8381   DR NXDOMAIN] A      (3)www(12)autodiscover(5)st1ad(4)emea(15)microsoftonline(3)com(0)
2023年7月10日上午10:55:42 1B7C PACKET  000001D9E4E338D0 UDP Rcv 10.x.x.x  a9bd   Q [0001   D   NOERROR] A      (18)addinsinstallation(5)store(6)office(3)com(0)
2023年7月10日上午10:55:42 1B7C PACKET  000001D9D775F890 UDP Snd 8.8.8.8         afda   Q [0001   D   NOERROR] A      (23)prod-addinsinstallation(15)omexexternallfb(6)office(3)net(6)akadns(3)net(0)
2023年7月10日上午10:55:42 1B78 PACKET  000001D9E182BB80 UDP Rcv 10.x.x.x  d229   Q [0001   D   NOERROR] SOA    (15)pc_host01(7)contoso(5)local(0)
2023年7月10日上午10:55:42 1B78 PACKET  000001D9E182BB80 UDP Snd 10.x.x.x  d229 R Q [8085 A DR  NOERROR] SOA    (15)pc_host02(7)contoso(5)local(0)
2023年7月10日上午10:55:42 1B78 PACKET  000001D9E2A2D670 UDP Rcv 8.8.8.8         c95c R Q [8081   DR  NOERROR] A      (9)dtr-a-ncu(2)na(8)azurerms(3)com(0)
2023年7月
<details>
<summary>英文:</summary>
The script works fine, and outputs exactly what I need it to output.
My issue comes when I have a large CSV file to work with (approx 500Mb with ~6M lines).
The script takes long time to run. I appreciate that it&#39;s going to take a while to process that much data, but I&#39;m wondering if there is a way to improve it! Here is the condensed script:
$DnsFilePath = &quot;C:\dns.log&quot;
Param([string]$DnsFilePath)
If (Test-Path $DnsFilePath) 
{ 
$FileInfo = Get-ChildItem -Path $DnsFilePath
$Ans = Read-Host &quot;Do you want to continue(y/n)?&quot;
If ($Ans -eq &#39;y&#39;)
{
If (!($SkipLines)) { Write-Host &quot;Processing...&quot;; }
$i = 0; ## Set to count the number of records;
$Timer= [Diagnostics.Stopwatch]::StartNew()	## Start the timer
$ArrayOfStrings = [System.Collections.ArrayList]@()
Switch -regex ([System.IO.File]::ReadLines($FileInfo.fullname)) {
&#39; UDP Rcv &#39;	{
$Datetime = [regex]::matches($switch.current,&#39;\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{1,2}:\d{1,2} (AM|PM)&#39;).Value
$IP = [regex]::matches($switch.current,&#39;\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b&#39;).Value
$FQDN = [regex]::matches($switch.current,&quot;\)[A-z0-9-_]*\(&quot;).Value  -replace &quot;\)|\(&quot;,&quot;&quot; -join &quot;.&quot;
[void]$ArrayOfStrings.Add(&quot;$Datetime,$IP,$FQDN&quot;)
$i++;
}
}
$OutFilePath = &quot;$($FileInfo.DirectoryName)\$($FileInfo.BaseName)_Parsed.txt&quot;
[System.IO.File]::WriteAllLines($OutFilePath, $ArrayOfStrings)
$Timer.stop()
Write-host &quot;Total time elapsed: $($Timer.Elapsed.ToString(&#39;hh\:mm\:ss\.ff&#39;))&quot;
Write-Host &quot;Number of Record Processed: $i&quot;
Write-Host &quot;Parsed File created successfully at $OutFilePath&quot;						
}
else
{ Write-Host &quot;Script exits.&quot; }
}
Else
{
Write-Host -fore Red &quot;File does not exist in the following location: $DnsFilePath. Script exits.&quot;
}
Sample DNS Log:
DNS Server log file creation at 7/10/2023 10:55:42 AM
Log file wrap at 7/10/2023 10:55:42 AM
Message logging key (for packets - other items use a subset of these fields):
Field #  Information         Values
-------  -----------         ------
1     Date
2     Time
3     Thread ID
4     Context
5     Internal packet identifier
6     UDP/TCP indicator
7     Send/Receive indicator
8     Remote IP
9     Xid (hex)
10     Query/Response      R = Response
blank = Query
11     Opcode              Q = Standard Query
N = Notify
U = Update
? = Unknown
12     [ Flags (hex)
13     Flags (char codes)  A = Authoritative Answer
T = Truncated Response
D = Recursion Desired
R = Recursion Available
14     ResponseCode ]
15     Question Type
16     Question Name
7/10/2023 10:55:42 AM 1B7C PACKET  000001D9D88C68D0 UDP Rcv 8.8.8.8         5fb1 R Q [8381   DR NXDOMAIN] A      (3)www(12)autodiscover(5)st1ad(4)emea(15)microsoftonline(3)com(0)
7/10/2023 10:55:42 AM 1B7C PACKET  000001D9D775F890 UDP Snd 10.x.x.x     92cb R Q [8381   DR NXDOMAIN] A      (3)www(12)autodiscover(5)st1ad(4)emea(15)microsoftonline(3)com(0)
7/10/2023 10:55:42 AM 1B7C PACKET  000001D9E4E338D0 UDP Rcv 10.x.x.x  a9bd   Q [0001   D   NOERROR] A      (18)addinsinstallation(5)store(6)office(3)com(0)
7/10/2023 10:55:42 AM 1B7C PACKET  000001D9D775F890 UDP Snd 8.8.8.8         afda   Q [0001   D   NOERROR] A      (23)prod-addinsinstallation(15)omexexternallfb(6)office(3)net(6)akadns(3)net(0)
7/10/2023 10:55:42 AM 1B78 PACKET  000001D9E182BB80 UDP Rcv 10.x.x.x  d229   Q [0001   D   NOERROR] SOA    (15)pc_host01(7)contoso(5)local(0)
7/10/2023 10:55:42 AM 1B78 PACKET  000001D9E182BB80 UDP Snd 10.x.x.x  d229 R Q [8085 A DR  NOERROR] SOA    (15)pc_host02(7)contoso(5)local(0)
7/10/2023 10:55:42 AM 1B78 PACKET  000001D9E2A2D670 UDP Rcv 8.8.8.8         c95c R Q [8081   DR  NOERROR] A      (9)dtr-a-ncu(2)na(8)azurerms(3)com(0)
7/10/2023 10:55:42 AM 1B78 PACKET  000001D9E1998D80 UDP Snd 10.x.x.x     2047 R Q [8081   DR  NOERROR] A      (6)portal(8)azurerms(3)com(0)
7/10/2023 10:55:42 AM 1B78 PACKET  000001D9E2D07D00 UDP Rcv 10.x.x.x   788e   Q [0001   D   NOERROR] A      (2)tr(11)c1182306347(12)ip4-58f0802d(4)wgcs(7)skyhigh(5)cloud(0)
7/10/2023 10:55:42 AM 1B78 PACKET  000001D9E1998D80 UDP Snd 8.8.8.8         1c22   Q [0001   D   NOERROR] A      (2)tr(11)c1182306347(12)ip4-58f0802d(4)wgcs(7)skyhigh(5)cloud(0)
</details>
# 答案1
**得分**: 1
不要收集所有结果到`$ArrayOfStrings`,直接将结果写入文件!
为了避免每次都关闭和重新打开文件句柄,重复使用同一个句柄:
```powershell
$counter = 0
$OutFilePath = Join-Path $FileInfo.DirectoryName "$($FileInfo.BaseName)_Parsed.txt"
# 如果输出文件不存在,则创建它
$OutFile = if (-not (Test-Path $OutFilePath -PathType Leaf)){
New-Item -Path $OutFilePath -ItemType File
}
else {
$OutFilePath | Get-Item
}
# 创建可写的文件流对象并将光标移动到文件末尾(这样我们就不会覆盖任何现有数据)
$fileStream = $OutFile.OpenWrite()
$fileStream.Seek(0, 'End') | Out-Null
# 创建一个写入器
$fileWriter = [System.IO.StreamWriter]::new($fileStream)
try {
Switch -regex ([System.IO.File]::ReadLines($FileInfo.fullname)) {
' UDP Rcv ' {
$Datetime = [regex]::matches($switch.current, '\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{1,2}:\d{1,2} (AM|PM)').Value
$IP = [regex]::matches($switch.current, '\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b').Value
$FQDN = [regex]::matches($switch.current, "\)[A-z0-9-_]*\(").Value -replace "\)|\(", "" -join "."
# 直接写入文件,不在内存中存储字符串
$fileWriter.WriteLine("$Datetime,$IP,$FQDN")
$counter++
}
}
$fileWriter.Flush()
$fileWriter.Close()
$Timer.stop()
Write-Host "总耗时: $($Timer.Elapsed.ToString('hh\:mm\:ss\.ff'))"
Write-Host "处理的记录数: $counter"
Write-Host "已成功创建解析后的文件: $OutFilePath"                       
}
finally {
# 清理
$fileStream, $fileWriter | ForEach-Object Dispose
}

希望这对你有所帮助!如果需要进一步的翻译或有其他问题,请随时提出。

英文:

Instead of collecting all the results into $ArrayOfStrings, write the results straight to the file immediately!

In order to avoid having to close and re-open the file handle everytime, re-use the same one:

$counter = 0
$OutFilePath = Join-Path $FileInfo.DirectoryName &quot;$($FileInfo.BaseName)_Parsed.txt&quot;
# create output file if it doesn&#39;t already exist
$OutFile = if (-not (Test-Path $OutFilePath -PathType Leaf)){
New-Item -Path $OutFilePath -ItemType File
}
else {
$OutFilePath |Get-Item
}
# create writable filestream object and wind the cursor to the end of the file (so we don&#39;t overwrite any existing data)
$fileStream = $OutFile.OpenWrite()
$fileStream.Seek(0, &#39;End&#39;) |Out-Null
# create a writer
$fileWriter = [System.IO.StreamWriter]::new($fileStream)
try {
Switch -regex ([System.IO.File]::ReadLines($FileInfo.fullname)) {
&#39; UDP Rcv &#39; {
$Datetime = [regex]::matches($switch.current, &#39;\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{1,2}:\d{1,2} (AM|PM)&#39;).Value
$IP = [regex]::matches($switch.current, &#39;\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b&#39;).Value
$FQDN = [regex]::matches($switch.current, &quot;\)[A-z0-9-_]*\(&quot;).Value -replace &quot;\)|\(&quot;, &quot;&quot; -join &quot;.&quot;
# write straight to the file, no in-memory storage of the string
$fileWriter.WriteLine(&quot;$Datetime,$IP,$FQDN&quot;)
$counter++
}
}
$fileWriter.Flush()
$fileWriter.Close()
$Timer.stop()
Write-host &quot;Total time elapsed: $($Timer.Elapsed.ToString(&#39;hh\:mm\:ss\.ff&#39;))&quot;
Write-Host &quot;Number of Record Processed: $counter&quot;
Write-Host &quot;Parsed File created successfully at $OutFilePath&quot;                       
}
finally {
# clean up
$fileStream, $fileWriter |ForEach-Object Dispose
}

huangapple
  • 本文由 发表于 2023年7月10日 16:06:29
  • 转载请务必保留本文链接:https://go.coder-hub.com/76651830.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定