Find duplicate files by hash and length, but using a different algorithm

Question

I'm trying to find duplicate files on my computer. I am using file length and hash to speed up the process.

Someone told me I can improve the speed of my code by changing the hashing algorithm to MD5, but I don't know where I have to write that. Here is my code to show what I'm trying to do:

$srcDir = "C:\Users\Dell\Documents"
Measure-Command {
  Get-ChildItem -Path $srcDir -File -Recurse | Group-Object -Property Length | 
  Where-Object { $_.Count -gt 1 } | Select-Object -ExpandProperty Group | 
  Get-FileHash -Algorithm MD5 | 
  Group-Object -Property Hash | Where-Object { $_.Count -gt 1 } | 
  ForEach-Object { $_.Group | Select-Object Path, Hash }
}

Answer 1

Score: 2

Doing the hashing in parallel may improve your current code, as iRon pointed out in the comments. After some testing, it does indeed improve efficiency. Here is an implementation that hashes in parallel, is compatible with Windows PowerShell 5.1, and needs no modules.

$srcDir = 'C:\Users\Dell\Documents'
$maxThreads = 6 # Tweak this value for more or fewer threads
$rs = [runspacefactory]::CreateRunspacePool(1, $maxThreads)
$rs.Open()

$tasks = Get-ChildItem -Path $srcDir -File -Recurse | Group-Object Length |
    Where-Object Count -GT 1 | ForEach-Object {
        $ps = [powershell]::Create().AddScript({
            $args[0] | Get-FileHash -Algorithm MD5 |
                Group-Object Hash |
                Where-Object Count -GT 1
        }).AddArgument($_.Group)

        $ps.RunspacePool = $rs
        
        @{ ps = $ps; iasync = $ps.BeginInvoke() }
    }

$tasks | ForEach-Object {
    try {
        $_.ps.EndInvoke($_.iasync)
    }
    finally {
        if($_.ps) {
            $_.ps.Dispose()
        }
    }
}

if($rs) {
    $rs.Dispose()
}
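
As a rough sketch of how the runspace pool approach above could be packaged for reuse, the same steps can be wrapped in a function that also flattens the result to Path and Hash, as in the question. The function name `Get-DuplicateFileParallel` and its `-MaxThreads` parameter are made up for illustration and are not part of the answer above.

```powershell
# Hypothetical wrapper (name and parameters are illustrative only).
function Get-DuplicateFileParallel {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [int] $MaxThreads = 4
    )

    $pool = [runspacefactory]::CreateRunspacePool(1, $MaxThreads)
    $pool.Open()
    try {
        # One task per group of same-length files; files with a unique
        # length are never hashed.
        $tasks = Get-ChildItem -Path $Path -File -Recurse |
            Group-Object Length |
            Where-Object Count -GT 1 |
            ForEach-Object {
                $ps = [powershell]::Create().AddScript({
                    $args[0] | Get-FileHash -Algorithm MD5 |
                        Group-Object Hash |
                        Where-Object Count -GT 1
                }).AddArgument($_.Group)
                $ps.RunspacePool = $pool
                @{ ps = $ps; iasync = $ps.BeginInvoke() }
            }

        # Wait for each task, then flatten the hash groups to Path and Hash.
        $tasks | ForEach-Object {
            try { $_.ps.EndInvoke($_.iasync) }
            finally { $_.ps.Dispose() }
        } | ForEach-Object { $_.Group | Select-Object Path, Hash }
    }
    finally {
        $pool.Dispose()
    }
}
```

It could then be called as `Get-DuplicateFileParallel -Path 'C:\Users\Dell\Documents' -MaxThreads 6`. Disposing the pool in a `finally` block is a design choice of this sketch so that the pool is released even if enumeration fails partway through.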

Answer 2

Score: 0

Computing a file hash always takes time, so you will have to check whether the following is any faster:

$srcDir = "C:\Users\Dell\Documents"
$files  = Get-ChildItem -Path $srcDir -File -Recurse | Group-Object -Property Length | Where-Object { $_.Count -gt 1 } |
          ForEach-Object { $_.Group | Select-Object FullName, Length, @{Name = 'Hash'; Expression = {($_ | Get-FileHash -Algorithm MD5).Hash}}}
$files | Group-Object Hash | Where-Object { $_.Count -gt 1 } | ForEach-Object {$_.Group}
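
One way to run that check is to time both variants with `Measure-Command`, as the question already does. The sketch below is only an illustration: the function name `Measure-DuplicateScan` and its output property names are invented here. Whichever variant runs second may benefit from the operating system's file cache, so it is probably worth running the comparison more than once before drawing conclusions.

```powershell
# Hypothetical timing harness (name and output properties are illustrative).
function Measure-DuplicateScan {
    param([Parameter(Mandatory)] [string] $Path)

    # Variant 1: pipe the same-length files straight into Get-FileHash.
    $piped = Measure-Command {
        Get-ChildItem -Path $Path -File -Recurse |
            Group-Object -Property Length |
            Where-Object { $_.Count -gt 1 } |
            Select-Object -ExpandProperty Group |
            Get-FileHash -Algorithm MD5 |
            Group-Object -Property Hash |
            Where-Object { $_.Count -gt 1 }
    }

    # Variant 2: add the hash as a calculated property, then group on it.
    $hashProperty = @{
        Name       = 'Hash'
        Expression = { ($_ | Get-FileHash -Algorithm MD5).Hash }
    }
    $calculated = Measure-Command {
        Get-ChildItem -Path $Path -File -Recurse |
            Group-Object -Property Length |
            Where-Object { $_.Count -gt 1 } |
            ForEach-Object { $_.Group | Select-Object FullName, Length, $hashProperty } |
            Group-Object Hash |
            Where-Object { $_.Count -gt 1 }
    }

    # Report both timings side by side.
    [pscustomobject]@{
        PipedSeconds      = $piped.TotalSeconds
        CalculatedSeconds = $calculated.TotalSeconds
    }
}
```

For example, `Measure-DuplicateScan -Path 'C:\Users\Dell\Documents'` returns one object with the two timings in seconds.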

Posted by huangapple on May 13, 2023 at 20:23:47
Source: https://go.coder-hub.com/76242708.html