Python: fastest way of checking if there are more than x files in a folder
Question
I am looking for a very fast way to check whether a folder contains more than 2 files.
I worry that len(os.listdir('/path/')) > 2 may become very slow if there are a lot of files in /path/, especially since this function will be called frequently by multiple processes at the same time.
Answer 1
Score: 10
There is indeed another function, introduced by PEP 471: os.scandir(path).
Because it returns an iterator, no list is built, so even the worst case (a huge directory) stays lightweight.
Its higher-level interface, os.walk(path), is built on top of it and is meant for walking a whole directory tree.
Here is a code example for your specific case:
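import os

MINIMUM_SIZE = 3  # "more than 2 files" means at least 3

file_count = 0
with os.scandir('.') as entries:
    for entry in entries:
        if entry.is_file():
            file_count += 1
            if file_count == MINIMUM_SIZE:
                break  # stop scanning as soon as enough files are found

enough_files = (file_count == MINIMUM_SIZE)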
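The loop above can be wrapped in a reusable function. This is only a minimal sketch and not part of the original answer; the name has_more_than and its parameter n are made up for illustration.

```python
import os


def has_more_than(path, n):
    """Return True if `path` contains more than `n` regular files.

    Hypothetical helper: scanning stops as soon as file number n + 1 is seen.
    """
    count = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file():
                count += 1
                if count > n:
                    return True
    return False
```

For the question as asked, the call would be has_more_than('/path/', 2).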
Answer 2
Score: 6
The fastest solution is probably something hacky.
My guess was:
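from pathlib import Path

def iterdir_approach(path):
    iter_of_files = (x for x in Path(path).iterdir() if x.is_file())
    try:
        # Three successful next() calls mean there are more than 2 files
        next(iter_of_files)
        next(iter_of_files)
        next(iter_of_files)
        return True
    except StopIteration:
        return False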
We create a generator and try to pull three items from it, catching the StopIteration that is raised if it runs out early.
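The repeated next() calls can be generalized to any threshold with itertools.islice. The following is a sketch under that assumption, not the author's code; has_at_least and more_than_two_files are hypothetical names.

```python
import itertools
from pathlib import Path


def has_at_least(iterable, k):
    """Return True if `iterable` yields at least `k` items, consuming at most `k`."""
    return sum(1 for _ in itertools.islice(iterable, k)) == k


def more_than_two_files(path):
    # Lazily filter directory entries; nothing is listed beyond the third file
    files = (p for p in Path(path).iterdir() if p.is_file())
    return has_at_least(files, 3)
```

Because islice stops after k items, the cost stays bounded by the threshold rather than by the size of the directory.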
To profile the approaches, we create a bunch of directories with a bunch of files in them:
import itertools
import os
import random
import shutil
import tempfile
import timeit
from pathlib import Path

import matplotlib.pyplot as plt

def create_temp_directory(num_directories):
    # Create num_directories subdirectories; subdir_i gets between 0 and i files
    temp_dir = tempfile.mkdtemp()
    for i in range(num_directories):
        dir_path = os.path.join(temp_dir, f"subdir_{i}")
        os.makedirs(dir_path)
        for j in range(random.randint(0, i)):
            file_path = os.path.join(dir_path, f"file_{j}.txt")
            with open(file_path, 'w') as file:
                file.write("Sample content")
    return temp_dir
We define the various approaches (the others are copied from the other answers to the question):
def iterdir_approach(path):
    #@swozny
    iter_of_files = (x for x in Path(path).iterdir() if x.is_file())
    try:
        next(iter_of_files)
        next(iter_of_files)
        next(iter_of_files)
        return True
    except StopIteration:
        return False

def len_os_dir_approach(path):
    #@bluppfisk
    return len(os.listdir(path)) > 2

def check_files_os_scandir_approach(path):
    #@PoneyUHC
    MINIMUM_SIZE = 3
    file_count = 0
    for entry in os.scandir(path):
        if entry.is_file():
            file_count += 1
            if file_count == MINIMUM_SIZE:
                return True
    return False

def path_resolve_approach(path):
    #@matleg
    directory_path = Path(path).resolve()
    nb_files = 0
    for file_path in directory_path.glob("*"):
        if file_path.is_file():
            nb_files += 1
            if nb_files > 2:
                return True
    return False

def dilettant_approach(path):
    #@dilettant
    gen = os.scandir(path)  # OP states only files in folder /path/
    enough = 3  # At least 3 entries, i.e. more than 2
    has_enough = len(list(itertools.islice(gen, enough))) >= enough
    return has_enough

def adrian_ang_approach(path):
    #@adrian_ang
    count = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file():
                count += 1
                if count > 2:
                    return True
    return False
Then we profile the code using timeit.timeit and plot the execution times for various numbers of directories:
num_directories_list = [10, 50, 100, 200, 500, 1000]
approach1_times = []
approach2_times = []
approach3_times = []
approach4_times = []
approach5_times = []
approach6_times = []

for num_directories in num_directories_list:
    temp_dir = create_temp_directory(num_directories)
    # Reuse the directory created above so that it is the one removed below
    subdir_paths = [str(p) for p in Path(temp_dir).iterdir()]
    approach1_time = timeit.timeit(lambda: [iterdir_approach(path) for path in subdir_paths], number=5)
    approach2_time = timeit.timeit(lambda: [check_files_os_scandir_approach(path) for path in subdir_paths], number=5)
    approach3_time = timeit.timeit(lambda: [path_resolve_approach(path) for path in subdir_paths], number=5)
    approach4_time = timeit.timeit(lambda: [len_os_dir_approach(path) for path in subdir_paths], number=5)
    approach5_time = timeit.timeit(lambda: [dilettant_approach(path) for path in subdir_paths], number=5)
    approach6_time = timeit.timeit(lambda: [adrian_ang_approach(path) for path in subdir_paths], number=5)
    approach1_times.append(approach1_time)
    approach2_times.append(approach2_time)
    approach3_times.append(approach3_time)
    approach4_times.append(approach4_time)
    approach5_times.append(approach5_time)
    approach6_times.append(approach6_time)
    shutil.rmtree(temp_dir)
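A single timeit.timeit call per approach can be noisy, and file-system caching may favour whichever approach is timed later. One possible refinement, sketched here as an assumption rather than as part of the original benchmark, is to repeat each measurement with timeit.repeat and keep the minimum; best_time is a hypothetical helper name.

```python
import timeit


def best_time(func, repeat=5, number=5):
    """Return the fastest of `repeat` timings, each running `func` `number` times."""
    return min(timeit.repeat(func, repeat=repeat, number=number))


# Stand-in workload; in the benchmark, `func` would be one of the lambdas
# that are passed to timeit.timeit.
elapsed = best_time(lambda: sum(range(1000)))
print(f"best of 5: {elapsed:.6f} s")
```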
Visualization of the results:
plt.plot(num_directories_list, approach1_times, label='iterdir_approach')
plt.plot(num_directories_list, approach2_times, label='check_files_os_scandir_approach')
plt.plot(num_directories_list, approach3_times, label='path_resolve_approach')
plt.plot(num_directories_list, approach4_times, label='len_os_dir_approach')
plt.plot(num_directories_list, approach5_times, label='dilettant_approach')
plt.plot(num_directories_list, approach6_times, label='adrian_ang_approach')
plt.xlabel('Number of Directories')
plt.ylabel('Execution Time (seconds)')
plt.title('Performance Comparison')
plt.legend()
plt.show()
Answer 3
Score: 4
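For anyone wanting to try the C approach, here's a module you can import from Python (it scans a single directory and does not recurse into subdirectories; note that it counts every entry whose name does not start with a dot):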
#define PY_SSIZE_T_CLEAN
#include <Python.h>  /* should come before the standard headers */
#include <stdio.h>
#include <dirent.h>
#include <stdlib.h>

static PyObject *
method_dircnt(PyObject *self, PyObject *args)
{
    DIR *dir;
    const char *dirname;
    long min_count, count = 0;
    struct dirent *ent;

    if (!PyArg_ParseTuple(args, "sl", &dirname, &min_count)) {
        return NULL;
    }

    dir = opendir(dirname);
    if (dir == NULL) {
        /* Raise OSError instead of crashing in readdir() */
        return PyErr_SetFromErrnoWithFilename(PyExc_OSError, dirname);
    }

    while ((ent = readdir(dir))) {
        /* Skip ".", ".." and hidden entries */
        if (ent->d_name[0] != '.') {
            ++count;
            if (count >= min_count) {
                closedir(dir);
                Py_RETURN_FALSE;
            }
        }
    }
    closedir(dir);
    Py_RETURN_TRUE;
}

static char dircnt_docs[] = "dircnt(dir, min_count): Returns False if dir contains at least min_count non-hidden entries.\n";

static PyMethodDef dircnt_methods[] = {
    {"dircnt", (PyCFunction)method_dircnt, METH_VARARGS, dircnt_docs},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef dircnt_module_def =
{
    PyModuleDef_HEAD_INIT,
    "dircnt",
    "Check if there are more than N files in dir",
    -1,
    dircnt_methods
};

PyMODINIT_FUNC PyInit_dircnt(void){
    // Py_Initialize();
    return PyModule_Create(&dircnt_module_def);
}
build:
gcc -I /usr/include/python3.11 dircnt.c -v -shared -fPIC -o dircnt.so
(or wherever your headers from the python-dev package are)
usage:
from dircnt import dircnt
dircnt(path, min_count)
It is a fair bit faster, especially for higher min_count values.
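Since dircnt returns False once the threshold is reached, a thin wrapper can make call sites easier to read. The sketch below is an assumption layered on top of the answer: has_at_least is a hypothetical name, and the pure-Python branch is only a fallback for when the extension has not been built.

```python
import os

try:
    from dircnt import dircnt  # the compiled extension, if it has been built
except ImportError:
    dircnt = None


def has_at_least(path, min_count):
    """Return True if `path` holds at least `min_count` non-hidden entries."""
    if dircnt is not None:
        # The extension returns False once min_count entries have been seen
        return not dircnt(path, min_count)
    # Fallback mirroring the C loop: skip names that start with '.'
    count = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if not entry.name.startswith('.'):
                count += 1
                if count >= min_count:
                    return True
    return False
```

With this wrapper, "more than 2 files" becomes has_at_least(path, 3).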
Answer 4
Score: 3
If you want something more explicit using pathlib, you can try:
from pathlib import Path

directory_path = Path('/path/').resolve()
nb_files = 0
enough_files = False
for file_path in directory_path.glob("*"):
    if file_path.is_file():
        nb_files += 1
        if nb_files > 2:  # more than 2 files, as asked in the question
            enough_files = True
            break
print(enough_files)
Answer 5
Score: 3
Since the OP knows that /path/ contains only files, one optimization is to skip the per-entry file check.
This version takes advantage of that prior knowledge:
import itertools
import os

enough = 3  # more than 2 files means at least 3

# OP states there are only files in folder /path/
with os.scandir('/path/') as gen:
    # Build an iterator that only returns the first `enough` elements,
    # measure the length of the resulting list (at most `enough` elements)
    # and apply the criterion to get the boolean result
    has_enough = len(list(itertools.islice(gen, enough))) >= enough
print(has_enough)
Placing this in a shell script and using hyperfine to take a rough performance measurement (folder with 500+ files):
❯ hyperfine ./ssssss.sh
Benchmark #1: ./ssssss.sh
Time (mean ± σ): 77.6 ms ± 0.6 ms [User: 29.9 ms, System: 31.8 ms]
Range (min … max): 76.3 ms … 79.4 ms 36 runs
... and, since the directory size should not really matter, the same system on a folder with more than 100k files:
❯ ls -l |wc -l
100204
~
❯ hyperfine ./ssssss.sh
Benchmark #1: ./ssssss.sh
Time (mean ± σ): 79.6 ms ± 1.1 ms [User: 31.9 ms, System: 33.5 ms]
Range (min … max): 76.8 ms … 82.1 ms 35 runs
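The hyperfine runs above start a new Python interpreter each time, which probably accounts for most of the measured time. To look at the directory check alone, it could be timed in-process. The sketch below assumes the current directory as a stand-in for /path/ and wraps the snippet in a hypothetical has_enough function.

```python
import itertools
import os
import time


def has_enough(path, enough=3):
    # Read at most `enough` entries, however large the directory is
    with os.scandir(path) as gen:
        return len(list(itertools.islice(gen, enough))) >= enough


start = time.perf_counter()
result = has_enough('.')
elapsed = time.perf_counter() - start
print(f"has_enough={result}, took {elapsed * 1e6:.1f} µs")
```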
Answer 6
Score: 2
You can use the os.scandir function instead. For example, to check whether a folder contains more than 2 files, the function below stops iterating over the directory entries as soon as it has seen a third file and returns True; it returns False if the directory has 2 files or fewer:
import os

def has_more_than_two_files(path):
    count = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file():
                count += 1
                if count > 2:
                    return True
    return False
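By default, entry.is_file() follows symbolic links, so a symlink that points to a file is counted as a file. If symlinks should not count, follow_symlinks=False can be passed. This variant is a sketch of that option, not part of the original answer, and the function name is made up.

```python
import os


def has_more_than_two_regular_files(path):
    """Like has_more_than_two_files, but symlinks are not counted as files."""
    count = 0
    with os.scandir(path) as entries:
        for entry in entries:
            # follow_symlinks=False: a symlink to a file is not counted
            if entry.is_file(follow_symlinks=False):
                count += 1
                if count > 2:
                    return True
    return False
```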