英文:
Improve processing time when dealing with large CSV
问题
脚本运行正常,并输出我需要的内容。
当我有一个大型CSV文件要处理时(约500MB,约6M行),出现问题。
脚本运行时间很长。我明白处理这么多数据会花费一些时间,但我想知道是否有改进的方法!以下是精简的脚本部分:
$DnsFilePath = "C:\dns.log"
Param([string]$DnsFilePath)
If (Test-Path $DnsFilePath)
{
$FileInfo = Get-ChildItem -Path $DnsFilePath
$Ans = Read-Host "Do you want to continue(y/n)?"
If ($Ans -eq 'y')
{
If (!($SkipLines)) { Write-Host "Processing..."; }
$i = 0; ## 设置用于计算记录数量的变量;
$Timer= [Diagnostics.Stopwatch]::StartNew() ## 启动计时器
$ArrayOfStrings = [System.Collections.ArrayList]@()
Switch -regex ([System.IO.File]::ReadLines($FileInfo.fullname)) {
' UDP Rcv ' {
$Datetime = [regex]::matches($switch.current,'\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{1,2}:\d{1,2} (AM|PM)').Value
$IP = [regex]::matches($switch.current,'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b').Value
$FQDN = [regex]::matches($switch.current,"\)[A-z0-9-_]*\(").Value -replace "\)|\(","" -join "."
[void]$ArrayOfStrings.Add("$Datetime,$IP,$FQDN")
$i++;
}
}
$OutFilePath = "$($FileInfo.DirectoryName)\$($FileInfo.BaseName)_Parsed.txt"
[System.IO.File]::WriteAllLines($OutFilePath, $ArrayOfStrings)
$Timer.stop()
Write-host "Total time elapsed: $($Timer.Elapsed.ToString('hh:mm:ss.ff'))"
Write-Host "Number of Record Processed: $i"
Write-Host "Parsed File created successfully at $OutFilePath"
}
else
{ Write-Host "Script exits." }
}
Else
{
Write-Host -fore Red "File does not exist in the following location: $DnsFilePath. Script exits."
}
示例DNS日志:
DNS服务器日志文件创建于2023年7月10日上午10:55:42
日志文件在2023年7月10日上午10:55:42时包装
消息记录键(用于数据包 - 其他项使用这些字段的子集):
字段# 信息 值
------- ----------- ------
1 日期
2 时间
3 线程ID
4 上下文
5 内部数据包标识符
6 UDP/TCP指示器
7 发送/接收指示器
8 远程IP
9 Xid(十六进制)
10 查询/响应 R = 响应
空白 = 查询
11 操作码 Q = 标准查询
N = 通知
U = 更新
? = 未知
12 [标志(十六进制)
13 标志(字符代码) A = 授权回答
T = 截断响应
D = 期望递归
R = 递归可用
14 响应码]
15 问题类型
16 问题名称
2023年7月10日上午10:55:42 1B7C PACKET 000001D9D88C68D0 UDP Rcv 8.8.8.8 5fb1 R Q [8381 DR NXDOMAIN] A (3)www(12)autodiscover(5)st1ad(4)emea(15)microsoftonline(3)com(0)
2023年7月10日上午10:55:42 1B7C PACKET 000001D9D775F890 UDP Snd 10.x.x.x 92cb R Q [8381 DR NXDOMAIN] A (3)www(12)autodiscover(5)st1ad(4)emea(15)microsoftonline(3)com(0)
2023年7月10日上午10:55:42 1B7C PACKET 000001D9E4E338D0 UDP Rcv 10.x.x.x a9bd Q [0001 D NOERROR] A (18)addinsinstallation(5)store(6)office(3)com(0)
2023年7月10日上午10:55:42 1B7C PACKET 000001D9D775F890 UDP Snd 8.8.8.8 afda Q [0001 D NOERROR] A (23)prod-addinsinstallation(15)omexexternallfb(6)office(3)net(6)akadns(3)net(0)
2023年7月10日上午10:55:42 1B78 PACKET 000001D9E182BB80 UDP Rcv 10.x.x.x d229 Q [0001 D NOERROR] SOA (15)pc_host01(7)contoso(5)local(0)
2023年7月10日上午10:55:42 1B78 PACKET 000001D9E182BB80 UDP Snd 10.x.x.x d229 R Q [8085 A DR NOERROR] SOA (15)pc_host02(7)contoso(5)local(0)
2023年7月10日上午10:55:42 1B78 PACKET 000001D9E2A2D670 UDP Rcv 8.8.8.8 c95c R Q [8081 DR NOERROR] A (9)dtr-a-ncu(2)na(8)azurerms(3)com(0)
2023年7月
<details>
<summary>英文:</summary>
The script works fine, and outputs exactly what I need it to output.
My issue comes when I have a large CSV file to work with (approx 500Mb with ~6M lines).
The script takes long time to run. I appreciate that it's going to take a while to process that much data, but I'm wondering if there is a way to improve it! Here is the condensed script:
$DnsFilePath = "C:\dns.log"
Param([string]$DnsFilePath)
If (Test-Path $DnsFilePath)
{
$FileInfo = Get-ChildItem -Path $DnsFilePath
$Ans = Read-Host "Do you want to continue(y/n)?"
If ($Ans -eq 'y')
{
If (!($SkipLines)) { Write-Host "Processing..."; }
$i = 0; ## Set to count the number of records;
$Timer= [Diagnostics.Stopwatch]::StartNew() ## Start the timer
$ArrayOfStrings = [System.Collections.ArrayList]@()
Switch -regex ([System.IO.File]::ReadLines($FileInfo.fullname)) {
' UDP Rcv ' {
$Datetime = [regex]::matches($switch.current,'\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{1,2}:\d{1,2} (AM|PM)').Value
$IP = [regex]::matches($switch.current,'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b').Value
$FQDN = [regex]::matches($switch.current,"\)[A-z0-9-_]*\(").Value -replace "\)|\(","" -join "."
[void]$ArrayOfStrings.Add("$Datetime,$IP,$FQDN")
$i++;
}
}
$OutFilePath = "$($FileInfo.DirectoryName)\$($FileInfo.BaseName)_Parsed.txt"
[System.IO.File]::WriteAllLines($OutFilePath, $ArrayOfStrings)
$Timer.stop()
Write-host "Total time elapsed: $($Timer.Elapsed.ToString('hh\:mm\:ss\.ff'))"
Write-Host "Number of Record Processed: $i"
Write-Host "Parsed File created successfully at $OutFilePath"
}
else
{ Write-Host "Script exits." }
}
Else
{
Write-Host -fore Red "File does not exist in the following location: $DnsFilePath. Script exits."
}
Sample DNS Log:
DNS Server log file creation at 7/10/2023 10:55:42 AM
Log file wrap at 7/10/2023 10:55:42 AM
Message logging key (for packets - other items use a subset of these fields):
Field # Information Values
------- ----------- ------
1 Date
2 Time
3 Thread ID
4 Context
5 Internal packet identifier
6 UDP/TCP indicator
7 Send/Receive indicator
8 Remote IP
9 Xid (hex)
10 Query/Response R = Response
blank = Query
11 Opcode Q = Standard Query
N = Notify
U = Update
? = Unknown
12 [ Flags (hex)
13 Flags (char codes) A = Authoritative Answer
T = Truncated Response
D = Recursion Desired
R = Recursion Available
14 ResponseCode ]
15 Question Type
16 Question Name
7/10/2023 10:55:42 AM 1B7C PACKET 000001D9D88C68D0 UDP Rcv 8.8.8.8 5fb1 R Q [8381 DR NXDOMAIN] A (3)www(12)autodiscover(5)st1ad(4)emea(15)microsoftonline(3)com(0)
7/10/2023 10:55:42 AM 1B7C PACKET 000001D9D775F890 UDP Snd 10.x.x.x 92cb R Q [8381 DR NXDOMAIN] A (3)www(12)autodiscover(5)st1ad(4)emea(15)microsoftonline(3)com(0)
7/10/2023 10:55:42 AM 1B7C PACKET 000001D9E4E338D0 UDP Rcv 10.x.x.x a9bd Q [0001 D NOERROR] A (18)addinsinstallation(5)store(6)office(3)com(0)
7/10/2023 10:55:42 AM 1B7C PACKET 000001D9D775F890 UDP Snd 8.8.8.8 afda Q [0001 D NOERROR] A (23)prod-addinsinstallation(15)omexexternallfb(6)office(3)net(6)akadns(3)net(0)
7/10/2023 10:55:42 AM 1B78 PACKET 000001D9E182BB80 UDP Rcv 10.x.x.x d229 Q [0001 D NOERROR] SOA (15)pc_host01(7)contoso(5)local(0)
7/10/2023 10:55:42 AM 1B78 PACKET 000001D9E182BB80 UDP Snd 10.x.x.x d229 R Q [8085 A DR NOERROR] SOA (15)pc_host02(7)contoso(5)local(0)
7/10/2023 10:55:42 AM 1B78 PACKET 000001D9E2A2D670 UDP Rcv 8.8.8.8 c95c R Q [8081 DR NOERROR] A (9)dtr-a-ncu(2)na(8)azurerms(3)com(0)
7/10/2023 10:55:42 AM 1B78 PACKET 000001D9E1998D80 UDP Snd 10.x.x.x 2047 R Q [8081 DR NOERROR] A (6)portal(8)azurerms(3)com(0)
7/10/2023 10:55:42 AM 1B78 PACKET 000001D9E2D07D00 UDP Rcv 10.x.x.x 788e Q [0001 D NOERROR] A (2)tr(11)c1182306347(12)ip4-58f0802d(4)wgcs(7)skyhigh(5)cloud(0)
7/10/2023 10:55:42 AM 1B78 PACKET 000001D9E1998D80 UDP Snd 8.8.8.8 1c22 Q [0001 D NOERROR] A (2)tr(11)c1182306347(12)ip4-58f0802d(4)wgcs(7)skyhigh(5)cloud(0)
</details>
# 答案1
**得分**: 1
不要收集所有结果到`$ArrayOfStrings`,直接将结果写入文件!
为了避免每次都关闭和重新打开文件句柄,重复使用同一个句柄:
```powershell
$counter = 0
$OutFilePath = Join-Path $FileInfo.DirectoryName "$($FileInfo.BaseName)_Parsed.txt"
# 如果输出文件不存在,则创建它
$OutFile = if (-not (Test-Path $OutFilePath -PathType Leaf)){
New-Item -Path $OutFilePath -ItemType File
}
else {
$OutFilePath | Get-Item
}
# 创建可写的文件流对象并将光标移动到文件末尾(这样我们就不会覆盖任何现有数据)
$fileStream = $OutFile.OpenWrite()
$fileStream.Seek(0, 'End') | Out-Null
# 创建一个写入器
$fileWriter = [System.IO.StreamWriter]::new($fileStream)
try {
Switch -regex ([System.IO.File]::ReadLines($FileInfo.fullname)) {
' UDP Rcv ' {
$Datetime = [regex]::matches($switch.current, '\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{1,2}:\d{1,2} (AM|PM)').Value
$IP = [regex]::matches($switch.current, '\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b').Value
$FQDN = [regex]::matches($switch.current, "\)[A-z0-9-_]*\(").Value -replace "\)|\(", "" -join "."
# 直接写入文件,不在内存中存储字符串
$fileWriter.WriteLine("$Datetime,$IP,$FQDN")
$counter++
}
}
$fileWriter.Flush()
$fileWriter.Close()
$Timer.stop()
Write-Host "总耗时: $($Timer.Elapsed.ToString('hh\:mm\:ss\.ff'))"
Write-Host "处理的记录数: $counter"
Write-Host "已成功创建解析后的文件: $OutFilePath"
}
finally {
# 清理
$fileStream, $fileWriter | ForEach-Object Dispose
}
希望这对你有所帮助!如果需要进一步的翻译或有其他问题,请随时提出。
英文:
Instead of collecting all the results into $ArrayOfStrings
, write the results straight to the file immediately!
In order to avoid having to close and re-open the file handle everytime, re-use the same one:
$counter = 0
$OutFilePath = Join-Path $FileInfo.DirectoryName "$($FileInfo.BaseName)_Parsed.txt"
# create output file if it doesn't already exist
$OutFile = if (-not (Test-Path $OutFilePath -PathType Leaf)){
New-Item -Path $OutFilePath -ItemType File
}
else {
$OutFilePath |Get-Item
}
# create writable filestream object and wind the cursor to the end of the file (so we don't overwrite any existing data)
$fileStream = $OutFile.OpenWrite()
$fileStream.Seek(0, 'End') |Out-Null
# create a writer
$fileWriter = [System.IO.StreamWriter]::new($fileStream)
try {
Switch -regex ([System.IO.File]::ReadLines($FileInfo.fullname)) {
' UDP Rcv ' {
$Datetime = [regex]::matches($switch.current, '\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{1,2}:\d{1,2} (AM|PM)').Value
$IP = [regex]::matches($switch.current, '\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b').Value
$FQDN = [regex]::matches($switch.current, "\)[A-z0-9-_]*\(").Value -replace "\)|\(", "" -join "."
# write straight to the file, no in-memory storage of the string
$fileWriter.WriteLine("$Datetime,$IP,$FQDN")
$counter++
}
}
$fileWriter.Flush()
$fileWriter.Close()
$Timer.stop()
Write-host "Total time elapsed: $($Timer.Elapsed.ToString('hh\:mm\:ss\.ff'))"
Write-Host "Number of Record Processed: $counter"
Write-Host "Parsed File created successfully at $OutFilePath"
}
finally {
# clean up
$fileStream, $fileWriter |ForEach-Object Dispose
}
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论