golang os *File.Readdir using lstat on all files. Can it be optimised?
Question
I am writing a program that uses os.File.Readdir to find all sub-directories of a parent directory that contains a huge number of files. Running strace to count the system calls showed that the Go version calls lstat() on every file and directory in the parent directory. (I am testing this with /usr/bin for now.)
Go code:
    package main

    import (
        "fmt"
        "os"
    )

    func main() {
        x, err := os.Open("/usr/bin")
        if err != nil {
            panic(err)
        }
        y, err := x.Readdir(0)
        if err != nil {
            panic(err)
        }
        for _, i := range y {
            fmt.Println(i)
        }
    }
Strace on the program (without following threads):
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     93.62    0.004110           2      2466           write
      3.46    0.000152           7        22           getdents64
      2.92    0.000128           0      2466           lstat    // this grows with the number of files
      0.00    0.000000           0        11           mmap
      0.00    0.000000           0         1           munmap
      0.00    0.000000           0       114           rt_sigaction
      0.00    0.000000           0         8           rt_sigprocmask
      0.00    0.000000           0         1           sched_yield
      0.00    0.000000           0         3           clone
      0.00    0.000000           0         1           execve
      0.00    0.000000           0         2           sigaltstack
      0.00    0.000000           0         1           arch_prctl
      0.00    0.000000           0         1           gettid
      0.00    0.000000           0        57           futex
      0.00    0.000000           0         1           sched_getaffinity
      0.00    0.000000           0         1           openat
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.004390                  5156           total
I ran the same test with C's readdir() and did not see this behaviour.
C code:
    #include <stdio.h>
    #include <dirent.h>

    int main (void) {
        DIR* dir_p;
        struct dirent* dir_ent;

        dir_p = opendir ("/usr/bin");
        if (dir_p != NULL) {
            // The readdir() function returns a pointer to a dirent structure representing the next
            // directory entry in the directory stream pointed to by dirp.
            // It returns NULL on reaching the end of the directory stream or if an error occurred.
            while ((dir_ent = readdir (dir_p)) != NULL) {
                // printf("%s", dir_ent->d_name);
                // printf("%d", dir_ent->d_type);
                if (dir_ent->d_type == DT_DIR) {
                    printf("%s is a directory", dir_ent->d_name);
                } else {
                    printf("%s is not a directory", dir_ent->d_name);
                }
                printf("\n");
            }
            (void) closedir(dir_p);
        } else {
            perror ("Couldn't open the directory");
        }
        return 0;
    }
Strace on the program:
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.000128           0      2468           write
      0.00    0.000000           0         1           read
      0.00    0.000000           0         3           open
      0.00    0.000000           0         3           close
      0.00    0.000000           0         4           fstat
      0.00    0.000000           0         8           mmap
      0.00    0.000000           0         3           mprotect
      0.00    0.000000           0         1           munmap
      0.00    0.000000           0         3           brk
      0.00    0.000000           0         3         3 access
      0.00    0.000000           0         1           execve
      0.00    0.000000           0         4           getdents
      0.00    0.000000           0         1           arch_prctl
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.000128                  2503         3 total
I am aware that the only fields in the dirent structure that are mandated by POSIX.1 are d_name and d_ino, but I am writing this for a specific filesystem.
I tried *File.Readdirnames(), which does not call lstat and returns the names of all files and directories, but finding out whether each returned name is a file or a directory would end up calling lstat again.
- Is it possible to rewrite the Go program so that it avoids the unnecessary lstat() on every file? I can see that the C program uses the following syscalls:

    open("/usr/bin", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
    fstat(3, {st_mode=S_IFDIR|0755, st_size=69632, ...}) = 0
    brk(NULL) = 0x1098000
    brk(0x10c1000) = 0x10c1000
    getdents(3, /* 986 entries */, 32768) = 32752

- Is this a premature optimisation that I shouldn't worry about? I ask because the directory being monitored will contain a huge number of small archived files, and the Go version makes almost twice as many system calls as the C version, all of which will hit the disk.
Answer 1
Score: 7
The dirent package (github.com/EricLagergren/go-gnulib/dirent) looks like it does what you are looking for. Below is your C example written in Go:
    package main

    import (
        "bytes"
        "fmt"
        "io"

        "github.com/EricLagergren/go-gnulib/dirent"
        "golang.org/x/sys/unix"
    )

    // int8ToString converts a NUL-terminated []int8 name into a Go string.
    func int8ToString(s []int8) string {
        var buff bytes.Buffer
        for _, chr := range s {
            if chr == 0x00 {
                break
            }
            buff.WriteByte(byte(chr))
        }
        return buff.String()
    }

    func main() {
        stream, err := dirent.Open("/usr/bin")
        if err != nil {
            panic(err)
        }
        defer stream.Close()
        for {
            entry, err := stream.Read()
            if err != nil {
                if err == io.EOF {
                    break
                }
                panic(err)
            }
            name := int8ToString(entry.Name[:])
            if entry.Type == unix.DT_DIR {
                fmt.Printf("%s is a directory\n", name)
            } else {
                fmt.Printf("%s is not a directory\n", name)
            }
        }
    }
Answer 2
Score: 2
https://golang.org/pkg/os#ReadDir
Starting with Go 1.16 (Feb 2021), a good option is os.ReadDir:
    package main

    import "os"

    func main() {
        files, e := os.ReadDir(".")
        if e != nil {
            panic(e)
        }
        for _, file := range files {
            println(file.Name())
        }
    }
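os.ReadDir returns fs.DirEntry values instead of fs.FileInfo, so Size and ModTime are not fetched up front, which makes the listing cheaper. Name, IsDir and Type are available on each entry directly; call Info() on an entry only when the full fs.FileInfo is actually needed.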