最有效的方式是按名称查找文件并打印其属性是什么?

huangapple go评论52阅读模式
英文:

What is the most efficient way to look for a file by name and print its attributes?

问题

  1. 这个问题涉及到文件系统缓存的工作方式。当你去除了 lstat 调用后,执行时间减少的原因是你不再主动访问文件的元数据(如最后修改时间),这意味着文件系统缓存可以更有效地工作。在连续调用中,文件系统可能会将一些目录和文件的信息缓存起来,而你的程序再次运行时可以从缓存中获取这些信息,而不必再次访问磁盘。这解释了为什么连续调用速度更快。

  2. 要优化你的代码并继续打印最后修改日期,你可以考虑以下几点:

    • 使用多线程:考虑使用多线程来并行地搜索目录树,这样可以更有效地利用多核处理器,提高性能。
    • 批量读取目录项:不要每次都读取一个目录项,而是尝试一次性读取多个目录项,以减少文件系统调用的次数。
    • 考虑使用平台特定的API:如果你愿意使代码与特定平台相关,你可以使用操作系统提供的特定API来提高性能,例如,macOS上的fts函数。

这些建议可能会提高程序的性能,但请注意,性能优化可能会涉及复杂性和平台依赖性的增加。因此,你需要权衡性能和代码的复杂性之间的权衡。

英文:

Sorry if the title isn't very fitting, I have actually multiple question, and I wasn't able to separate them.

I wrote a simplified version of the command line "find" program running on macOS that simply looks for every instance of a file name in a specified path and prints the corresponding file paths if any, and date of last modification. It's a single thread program. I used readdir and lstat to get the necessary file info.

Then I compared the performance of my program, let's call it "myfind"(printing as I said both file path and last modified time), to "find"(used with "find path -name filename", to print only the resulting file paths). "myfind" took 1m 40s and "find" just 40s. Both times were consistent during consecutive tries.

Unhappy about the result, I made some "optimizations" that didn't do anything at all (like removing recursion and using a simple stack holding some variables instead).

This is the code I wrote, error checking removed (executing with or without error checking doesn't change execution time):

struct recstruct{
	DIR *st_dir;
	char pathname[512];
};
//usage: ./execname path filename
int main(int argc, char *argv[]){
	chdir(argv[1]);
	myfind(argv[2]);
	return 0;
}
int myfind(char *name){
	DIR *curr_dir = opendir(".");
	int stk_i = 0;
	struct recstruct stk[128];
	struct dirent *next_entry;
	struct stat info;
	char *entry_name;
	char path[512];
	errno = 0;
	//ignores "." and ".."
	readdir(curr_dir);
	readdir(curr_dir);


    //gets next directory entry until it has fully explored the path
	while((next_entry = readdir(curr_dir)) != NULL || stk_i != 0){
		//goes back to previous directory if there is any and it's done with the current one
		if(next_entry == NULL){
			closedir(curr_dir);
			stk_i--;
			chdir(stk[stk_i].pathname);
			curr_dir = stk[stk_i].st_dir;
			continue;
		}
		//gets next directory entry and checks if its name is the one we are looking for
		entry_name = next_entry->d_name;
		lstat(entry_name, &info);
		if(strncmp(name, entry_name, 127) == 0){
			realpath(entry_name, path);
			printf("path:%s\nlast modified:%s", path, ctime(&(info.st_mtime)));
		}
		//if next_entry is a dir, save current environment and explore the new dir
		if(S_ISDIR(info.st_mode)){ 
			getcwd(stk[stk_i].pathname, 512);
			stk[stk_i].st_dir = curr_dir;
			stk_i++;
			chdir(entry_name);
			curr_dir = opendir(".");
			readdir(curr_dir);
			readdir(curr_dir);
		}
	}
	closedir(curr_dir);
	return 0;
}

Then I removed from "myfind" the part where I printed the last modified date by removing all lstat calls. The execution time dropped to 30 seconds, and dropped to about 3-6 seconds on consecutive calls (program calls from terminal, not function calls. The improvement stayed there even after closing and reopening the terminal).

This behavior also improved execution time on consecutive calls that looked for files close in the directory tree, which seems a likely use scenario for this type of program, so I would like to be able to understand it more.

So what I'm actually asking is:

  1. How does this improvement on execution time on consecutive calls work exactly? Why isn't it there in neither "find", nor "myfind" when using lstat?

  2. Is there any other thing I could change in my code to make it more optimized, while making it also print the last modified date? I'm ok with making it platform dependent, too.

I'm operating on an SSD, and only looking for 1 filename every execution

答案1

得分: 1

这里是使用 fts_ 实现的示例,如 @EricPostpischil 建议。它假定用户提供了一个目录,如果您想要验证,请在循环之前执行第一个 fts_read() 并检查 fts_info。它还假定文件名是基本名称(不包含任何斜杠)。您也可以检查一下。

查看 fts_open() 的选项,确保这是您想要的行为。

它仅检查文件名是否与普通文件匹配;find -name 也会与目录名称匹配。在 for-loop 之前很容易解决这个问题。

#include <fts.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

int myfind(char *const path, const char *filename) {
	FTS *ftsp = fts_open(
		(char *const []) { path, NULL},
		FTS_LOGICAL | FTS_NOCHDIR | FTS_NOSTAT,
		NULL
	);
	if(!ftsp) {
		perror("");
		goto err;
	}
	for(FTSENT *ftsent = fts_read(ftsp); ftsent; ftsent = fts_read(ftsp)) {
		if(ftsent->fts_info & FTS_D)
			for(ftsent = fts_children(ftsp, FTS_NAMEONLY); ftsent; ftsent = ftsent->fts_link)
				if(!strcmp(filename, ftsent->fts_name))
					printf("%s%s\n", ftsent->fts_path, ftsent->fts_name);
	}
	if(errno) {
		perror("");
		goto err;
	}
	fts_close(ftsp);
	return EXIT_SUCCESS;
err:
	if(ftsp)
		fts_close(ftsp);
	return EXIT_FAILURE;
}

int main(int argc, char *argv[]){
	if(argc != 3) {
		printf("usage: path filename\n");
		return 0;
	}
	return myfind(argv[1], argv[2]);
}

这里是一些示例运行(使用温热的文件缓存以便比较):

$ time find . -name non-existing_file
real    0m5.296s
user    0m0.865s

$ time ./myfind . non-existing_file
real    0m4.259s
user    0m0.942s
$ time find  . -name AnsiballZ_file.py
./.ansible/tmp/ansible-tmp-1670391304.3514712-709042-260172808017632/AnsiballZ_file.py
./.ansible/tmp/ansible-tmp-1658890524.6326995-109370-280029288970992/AnsiballZ_file.py
./.ansible/tmp/ansible-tmp-1567991519.7013745-230514371888007/AnsiballZ_file.py
./.ansible/tmp/ansible-tmp-1658888271.8413315-108112-177019459387621/AnsiballZ_file.py

real    0m5.445s
user    0m0.847s
sys     0m0.961s

$ time ./myfind . AnsiballZ_file.py
./.ansible/tmp/ansible-tmp-1670391304.3514712-709042-260172808017632/AnsiballZ_file.py
./.ansible/tmp/ansible-tmp-1658890524.6326995-109370-280029288970992/AnsiballZ_file.py
./.ansible/tmp/ansible-tmp-1567991519.7013745-230514371888007/AnsiballZ_file.py
./.ansible/tmp/ansible-tmp-1658888271.8413315-108112-177019459387621/AnsiballZ_file.py

real    0m7.481s
user    0m0.882s
sys     0m2.713s
英文:

Here is an implementation using fts_ as suggested by @EricPostpischil. It assumes user supplied a directory, if you want to verify that do the first fts_read() before the loop and check fts_info. It also assumes that the filename is a basename (doesn't contain any slashes). You could check that, too.

Look over the options to fts_open() to ensure this is the behavior you want.

It's only checking that the filename matches a regular file; find -name will also match against directory names. It's easy to address that just before the for-loop.

#include &lt;fts.h&gt;
#include &lt;errno.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
#include &lt;sys/stat.h&gt;
#include &lt;sys/types.h&gt;

int myfind(char *const path, const char *filename) {
	FTS *ftsp = fts_open(
		(char *const []) { path, NULL},
		FTS_LOGICAL | FTS_NOCHDIR | FTS_NOSTAT,
		NULL
	);
	if(!ftsp) {
		perror(&quot;&quot;);
		goto err;
	}
	for(FTSENT *ftsent = fts_read(ftsp); ftsent; ftsent = fts_read(ftsp)) {
		if(ftsent-&gt;fts_info &amp; FTS_D)
			for(ftsent = fts_children(ftsp, FTS_NAMEONLY); ftsent; ftsent = ftsent-&gt;fts_link)
				if(!strcmp(filename, ftsent-&gt;fts_name))
					printf(&quot;%s%s\n&quot;, ftsent-&gt;fts_path, ftsent-&gt;fts_name);
	}
	if(errno) {
		perror(&quot;&quot;);
		goto err;
	}
	fts_close(ftsp);
	return EXIT_SUCCESS;
err:
	if(ftsp)
		fts_close(ftsp);
	return EXIT_FAILURE;
}

int main(int argc, char *argv[]){
	if(argc != 3) {
		printf(&quot;usage: path filename\n&quot;);
		return 0;
	}
	return myfind(argv[1], argv[2]);
}

and here are some example runs (with a warm file cache to be comparable):

$ time find . -name non-existing_file
real    0m5.296s
user    0m0.865s

$ time ./myfind . non-existing_file
real    0m4.259s
user    0m0.942s
$ time find  . -name AnsiballZ_file.py
./.ansible/tmp/ansible-tmp-1670391304.3514712-709042-260172808017632/AnsiballZ_file.py
./.ansible/tmp/ansible-tmp-1658890524.6326995-109370-280029288970992/AnsiballZ_file.py
./.ansible/tmp/ansible-tmp-1567991519.7013745-230514371888007/AnsiballZ_file.py
./.ansible/tmp/ansible-tmp-1658888271.8413315-108112-177019459387621/AnsiballZ_file.py

real    0m5.445s
user    0m0.847s
sys     0m0.961s

$ time ./myfind . AnsiballZ_file.py
./.ansible/tmp/ansible-tmp-1670391304.3514712-709042-260172808017632/AnsiballZ_file.py
./.ansible/tmp/ansible-tmp-1658890524.6326995-109370-280029288970992/AnsiballZ_file.py
./.ansible/tmp/ansible-tmp-1567991519.7013745-230514371888007/AnsiballZ_file.py
./.ansible/tmp/ansible-tmp-1658888271.8413315-108112-177019459387621/AnsiballZ_file.py

real    0m7.481s
user    0m0.882s
sys     0m2.713s

I see ~3.9 to ~7.6 real time runs on my system so should be comparable to find.

huangapple
  • 本文由 发表于 2023年5月31日 23:45:02
  • 转载请务必保留本文链接:https://go.coder-hub.com/76375252.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定