Loop through all files in all folders recursively as fast as possible in GOLANG

Question

I'm facing a problem that I still can't fully understand or solve, even after spending a day on the forums.

I wrote a function that loops over all folders and their subfolders, and it does two things:
- For each file found, print the name of the file.
- For each folder found, call the same function again to find that folder's files and subfolders.

Put simply, the program lists every file in a tree using recursion. My goal is to do this as fast as possible, so I start a new goroutine every time I come across a new folder.

PROBLEM:
When the tree is too large (too many nested folders and subfolders), the program spawns too many threads and fails with an error. I raised the thread limit, but then it is the PC itself that can no longer cope :/

So my question is: how can I build a worker system (with a pool size) that fits my code?
No matter how much I look, I don't see how to express, for example, "start new goroutines up to a certain limit, then wait until the backlog drains".


Source code:
https://github.com/LaM0uette/FilesDIR/tree/V0.5

main:

    package main

    import (
        "FilesDIR/globals"
        "FilesDIR/task"
        "fmt"
        "log"
        "runtime/debug"
        "sync"
        "time"
    )

    func main() {
        timeStart := time.Now()
        debug.SetMaxThreads(5 * 1000)
        var wg sync.WaitGroup
        // task.DrawStart()
        /*
            err := task.LoopDir(globals.SrcPath)
            if err != nil {
                log.Print(err.Error())
            }
        */
        err := task.LoopDirsFiles(globals.SrcPath, &wg) // globals.SrcPath = my path with ~2,000,000 files (a server at my company)
        if err != nil {
            log.Print(err.Error())
        }
        wg.Wait()
        fmt.Println("FINI: Nb Fichiers: ", task.Id)
        timeEnd := time.Since(timeStart)
        fmt.Println(timeEnd)
    }

task:

    package task

    import (
        "fmt"
        "io/ioutil"
        "log"
        "os"
        "path/filepath"
        "strings"
        "sync"
        "time"
    )

    var Id = 0

    // LoopDir TODO: Code à supprimer / Code to delete
    func LoopDir(path string) error {
        var wg sync.WaitGroup
        countDir := 0
        err := filepath.Walk(path, func(path string, info os.FileInfo, err error) error {
            if err != nil {
                return err
            }
            if info.IsDir() {
                wg.Add(1)
                countDir++
                go func() {
                    err := loopFiles(path, &wg)
                    if err != nil {
                        log.Println(err.Error())
                    }
                }()
            }
            return nil
        })
        if err != nil {
            return err
        }
        wg.Wait()
        fmt.Println("Finished", countDir, Id)
        return nil
    }

    // loopFiles TODO: Code à supprimer / Code to delete
    func loopFiles(path string, wg *sync.WaitGroup) error {
        files, err := ioutil.ReadDir(path)
        if err != nil {
            wg.Done()
            return err
        }
        for _, file := range files {
            if !file.IsDir() {
                go fmt.Println(file.Name())
                Id++
            }
        }
        wg.Done()
        return nil
    }

    func LoopDirsFiles(path string, wg *sync.WaitGroup) error {
        wg.Add(1)
        defer wg.Done()
        files, err := ioutil.ReadDir(path)
        if err != nil {
            return err
        }
        for _, file := range files {
            if !file.IsDir() && !strings.Contains(file.Name(), "~") {
                fmt.Println(file.Name(), Id)
                Id++
            } else if file.IsDir() {
                go func() {
                    err = LoopDirsFiles(filepath.Join(path, file.Name()), wg)
                    if err != nil {
                        log.Print(err)
                    }
                }()
                time.Sleep(20 * time.Millisecond)
            }
        }
        return nil
    }

Answer 1

Score: 2


If you don't want to use any external package, you can create a separate worker routine for file processing and start as many workers as you want. Then walk the tree recursively in your main goroutine and send the jobs to the workers. Whenever a worker is free, it picks up the next job from the jobs channel and processes it.

    package main

    import (
        "fmt"
        "log"
        "os"
        "path/filepath"
        "sync"

        "FilesDIR/globals"
    )

    var (
        wg   sync.WaitGroup // a value rather than a nil *sync.WaitGroup, so it is usable as declared
        jobs = make(chan string)
    )

    // loopFilesWorker prints the files of every directory it receives on jobs.
    func loopFilesWorker() {
        for path := range jobs {
            files, err := os.ReadDir(path)
            if err != nil {
                // Log and continue so one unreadable directory does not stop the worker.
                log.Print(err)
                wg.Done()
                continue
            }
            for _, file := range files {
                if !file.IsDir() {
                    fmt.Println(file.Name())
                }
            }
            wg.Done()
        }
    }

    func LoopDirsFiles(path string) error {
        files, err := os.ReadDir(path)
        if err != nil {
            return err
        }
        // Add this path as a job for the workers.
        // Count the job before starting the goroutine so that wg.Wait cannot return early.
        wg.Add(1)
        // Send from a goroutine: if every worker is busy, the send blocks until one is free.
        go func() {
            jobs <- path
        }()
        for _, file := range files {
            if file.IsDir() {
                // Recursively go further in the tree.
                if err := LoopDirsFiles(filepath.Join(path, file.Name())); err != nil {
                    log.Print(err)
                }
            }
        }
        return nil
    }

    func main() {
        // Start as many workers as you want; here, 10.
        for w := 1; w <= 10; w++ {
            go loopFilesWorker()
        }
        // Start the recursion.
        if err := LoopDirsFiles(globals.SrcPath); err != nil {
            log.Print(err)
        }
        wg.Wait()
        close(jobs) // every job has been processed, so the workers can exit
    }

huangapple
  • Posted on April 6, 2022 at 20:32:49
  • When reposting, please keep the link to this article: https://go.coder-hub.com/71766816.html