Loop through all files in all folders recursively as fast as possible in GOLANG


Question

I'm facing a problem that I still can't fully understand or solve, even after spending a day on the forums.

I wrote a function that loops over all folders and their subfolders. It does two things:
- For each file found, it prints the file's name.
- For each folder found, it calls itself again to find that folder's files and subfolders.

In short, the function lists all files in a tree recursively. My goal is to do this as fast as possible, so I start a new goroutine every time I come across a new folder.

PROBLEM:
When the tree is too large (too many nested folders and subfolders), the program creates too many threads and fails with an error. I raised the thread limit, but then the PC itself can no longer cope :/

So my question is: how can I build a worker system (with a pool size) that fits my code?
No matter how much I look, I don't see how to express, for example, "spawn new goroutines only up to a certain limit, then wait until the buffer empties".


Source code:
https://github.com/LaM0uette/FilesDIR/tree/V0.5

main:

package main

import (
	"FilesDIR/globals"
	"FilesDIR/task"
	"fmt"
	"log"
	"runtime/debug"
	"sync"
	"time"
)

func main() {
	timeStart := time.Now()
	debug.SetMaxThreads(5 * 1000)

	var wg sync.WaitGroup

	// task.DrawStart()

	/*
		err := task.LoopDir(globals.SrcPath)
		if err != nil {
			log.Print(err.Error())
		}
	*/

	err := task.LoopDirsFiles(globals.SrcPath, &wg) // globals.SrcPath = my path with ~2,000,000 files (a server at my company)
	if err != nil {
		log.Print(err.Error())
	}

	wg.Wait()

	fmt.Println("FINI: Nb Fichiers: ", task.Id)

	timeEnd := time.Since(timeStart)
	fmt.Println(timeEnd)
}

task:

package task

import (
	"fmt"
	"io/ioutil"
	"log"
	"os"
	"path/filepath"
	"strings"
	"sync"
	"time"
)

var Id = 0

// LoopDir TODO: Code à supprimer / Code to delete
func LoopDir(path string) error {
	var wg sync.WaitGroup

	countDir := 0

	err := filepath.Walk(path, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}

		if info.IsDir() {
			wg.Add(1)
			countDir++

			go func() {
				err := loopFiles(path, &wg)
				if err != nil {
					log.Println(err.Error())
				}
			}()
		}

		return nil
	})
	if err != nil {
		return err
	}

	wg.Wait()
	fmt.Println("Finished", countDir, Id)
	return nil
}

// loopFiles TODO: Code à supprimer / Code to delete
func loopFiles(path string, wg *sync.WaitGroup) error {

	files, err := ioutil.ReadDir(path)
	if err != nil {
		wg.Done()
		return err
	}

	for _, file := range files {
		if !file.IsDir() {
			go fmt.Println(file.Name())
			Id++
		}
	}

	wg.Done()
	return nil
}

func LoopDirsFiles(path string, wg *sync.WaitGroup) error {
	wg.Add(1)
	defer wg.Done()

	files, err := ioutil.ReadDir(path)
	if err != nil {
		return err
	}

	for _, file := range files {
		if !file.IsDir() && !strings.Contains(file.Name(), "~") {
			fmt.Println(file.Name(), Id)
			Id++
		} else if file.IsDir() {
			go func() {
				err = LoopDirsFiles(filepath.Join(path, file.Name()), wg)
				if err != nil {
					log.Print(err)
				}
			}()
			time.Sleep(20 * time.Millisecond)
		}
	}
	return nil
}

Answer 1

Score: 2

If you don't want to use any external package, you can create a separate worker routine for file processing and start as many workers as you want. Then walk the tree recursively in the main goroutine and send jobs to the workers. Whenever a worker is free, it picks up the next job from the jobs channel and processes it.

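package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"sync"

	"FilesDIR/globals"
)

var (
	wg   sync.WaitGroup // a value, not a nil pointer: calling Add on a nil *sync.WaitGroup panics
	jobs = make(chan string)
)

// loopFilesWorker reads directory paths from jobs and prints the files in each one.
func loopFilesWorker() {
	for path := range jobs {
		files, err := os.ReadDir(path)
		if err != nil {
			// Log and continue so one unreadable directory does not stop the worker.
			log.Println(err)
			wg.Done()
			continue
		}
		for _, file := range files {
			if !file.IsDir() {
				fmt.Println(file.Name())
			}
		}
		wg.Done()
	}
}

// LoopDirsFiles walks the tree in the calling goroutine and queues every directory as a job.
func LoopDirsFiles(path string) error {
	files, err := os.ReadDir(path)
	if err != nil {
		return err
	}
	// Add this path as a job for the workers.
	// wg.Add must run before the goroutine starts, otherwise wg.Wait could return too early.
	// The send itself runs in a goroutine because it blocks while every worker is busy.
	wg.Add(1)
	go func() {
		jobs <- path
	}()
	for _, file := range files {
		if file.IsDir() {
			// Recursively go further into the tree.
			if err := LoopDirsFiles(filepath.Join(path, file.Name())); err != nil {
				log.Println(err)
			}
		}
	}
	return nil
}

func main() {
	// Start as many workers as you want; here, 10 workers.
	for w := 1; w <= 10; w++ {
		go loopFilesWorker()
	}
	// Start the recursion.
	if err := LoopDirsFiles(globals.SrcPath); err != nil {
		log.Println(err)
	}
	wg.Wait()
}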

huangapple
  • Posted on April 6, 2022 at 20:32:49
  • When reposting, please keep the link to this article: https://go.coder-hub.com/71766816.html