Python: fastest way of checking if there are more than x files in a folder

Question

I am looking for a very rapid way to check whether a folder contains more than 2 files.

I worry that len(os.listdir('/path/')) > 2 may become very slow if there are a lot of files in /path/, especially since this function will be called frequently by multiple processes at a time.

Answer 1

Score: 10

There is indeed another function, introduced by PEP 471: os.scandir(path)

As it returns an iterator, no list is created, so even the worst case (a huge directory) stays lightweight.

os.walk(path) is built on top of it, but os.walk collects every entry of a directory before yielding it, so os.scandir itself is the better fit for stopping early.

Here is a code example for your specific case:

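import os

MINIMUM_SIZE = 3  # "more than 2 files" means at least 3

file_count = 0
# The context manager closes the directory handle once we stop iterating.
with os.scandir('.') as entries:
    for entry in entries:
        if entry.is_file():
            file_count += 1
        if file_count == MINIMUM_SIZE:
            break

enough_files = (file_count == MINIMUM_SIZE)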
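The same early-exit loop can be wrapped in a reusable function for an arbitrary threshold. The sketch below is an illustration rather than part of the original answer; the name has_more_than and its parameters are assumptions.

```python
import os


def has_more_than(path, x):
    """Return True if `path` holds more than `x` regular files.

    Hypothetical helper; assumes x >= 0.
    """
    file_count = 0
    # The context manager closes the directory handle even on early return.
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file():
                file_count += 1
                if file_count > x:
                    return True
    return False
```

Because the function returns as soon as the count exceeds x, at most x + 1 files are ever counted, regardless of how large the directory is.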

Answer 2

Score: 6

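The fastest solution is probably something hacky.

My guess was:

from pathlib import Path


def iterdir_approach(path):
    # Lazily yield only the regular files in the directory.
    iter_of_files = (x for x in Path(path).iterdir() if x.is_file())
    try:
        # Three successful next() calls mean more than 2 files exist.
        next(iter_of_files)
        next(iter_of_files)
        next(iter_of_files)
        return True
    except StopIteration:
        return False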

We create a generator and try to pull three files from it, catching the StopIteration raised if it runs out first.
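The three explicit next() calls can be generalised with itertools.islice, which stops pulling from the generator after a fixed number of items. This is only a sketch; the name iterdir_islice_approach and the parameter x are made up for illustration.

```python
from itertools import islice
from pathlib import Path


def iterdir_islice_approach(path, x=2):
    """Return True if `path` contains more than `x` regular files."""
    files = (p for p in Path(path).iterdir() if p.is_file())
    # islice yields at most x + 1 items, so the directory scan stops early.
    return sum(1 for _ in islice(files, x + 1)) > x
```

No exception handling is needed here, because islice simply ends when the underlying generator is exhausted.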

To profile the approaches, we create a bunch of directories with a bunch of files in them:

import itertools
import os
import random
import shutil
import tempfile
import timeit
import matplotlib.pyplot as plt
from pathlib import Path


def create_temp_directory(num_directories):
    # Build num_directories subdirectories; subdir_i gets between 0 and i files.
    temp_dir = tempfile.mkdtemp()
    for i in range(num_directories):
        dir_path = os.path.join(temp_dir, f"subdir_{i}")
        os.makedirs(dir_path)
        for j in range(random.randint(0, i)):
            file_path = os.path.join(dir_path, f"file_{j}.txt")
            with open(file_path, 'w') as file:
                file.write("Sample content")
    return temp_dir

We define the various approaches (the others are copied from the other answers to the question):


def iterdir_approach(path):
    #@swozny
    iter_of_files = (x for x in Path(path).iterdir() if x.is_file())
    try:
        next(iter_of_files)
        next(iter_of_files)
        next(iter_of_files)
        return True
    except StopIteration:
        return False

def len_os_dir_approach(path):
    #@bluppfisk
    return len(os.listdir(path)) > 2


def check_files_os_scandir_approach(path):
    #@PoneyUHC
    MINIMUM_SIZE = 3
    file_count = 0
    for entry in os.scandir(path):
        if entry.is_file():
            file_count += 1
        if file_count == MINIMUM_SIZE:
            return True
    return False


def path_resolve_approach(path):
    #@matleg
    directory_path = Path(path).resolve()
    nb_files = 0
    for file_path in directory_path.glob("*"):
        if file_path.is_file():
            nb_files += 1
        if nb_files > 2:
            return True
    return False

def dilettant_approach(path):
    #@dilettant
    gen = os.scandir(path)  # OP states only files in folder /path/
    enough = 3  # more than 2 files means at least 3

    has_enough = len(list(itertools.islice(gen, enough))) >= enough

    return has_enough


def adrian_ang_approach(path):
    #@adrian_ang
    count = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file():
                count += 1
                if count > 2:
                    return True
    return False

Then we profile the code using timeit.timeit and plot the execution times for various amounts of directories:


num_directories_list = [10, 50, 100, 200, 500, 1000]
approach1_times = []
approach2_times = []
approach3_times = []
approach4_times = []
approach5_times = []
approach6_times = []


for num_directories in num_directories_list:
    temp_dir = create_temp_directory(num_directories)
    # Time every approach on the subdirectories of the tree we just created.
    subdir_paths = [str(p) for p in Path(temp_dir).iterdir()]
    approach1_time = timeit.timeit(lambda: [iterdir_approach(path) for path in subdir_paths], number=5)
    approach2_time = timeit.timeit(lambda: [check_files_os_scandir_approach(path) for path in subdir_paths], number=5)
    approach3_time = timeit.timeit(lambda: [path_resolve_approach(path) for path in subdir_paths], number=5)
    approach4_time = timeit.timeit(lambda: [len_os_dir_approach(path) for path in subdir_paths], number=5)
    approach5_time = timeit.timeit(lambda: [dilettant_approach(path) for path in subdir_paths], number=5)
    approach6_time = timeit.timeit(lambda: [adrian_ang_approach(path) for path in subdir_paths], number=5)

    approach1_times.append(approach1_time)
    approach2_times.append(approach2_time)
    approach3_times.append(approach3_time)
    approach4_times.append(approach4_time)
    approach5_times.append(approach5_time)
    approach6_times.append(approach6_time)

    shutil.rmtree(temp_dir)
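The six near-identical timing lines could also be driven from a dictionary of functions. The helper below is a hypothetical sketch (time_approaches and its parameters are not from the original answer); it returns one total time per approach label.

```python
import timeit


def time_approaches(approaches, paths, number=5):
    """Time each callable in `approaches` over every path in `paths`.

    `approaches` maps a label to a function taking a single path.
    Returns a dict mapping each label to the total seconds for `number` runs.
    """
    results = {}
    for name, func in approaches.items():
        # Bind func as a default argument so each lambda times its own function.
        results[name] = timeit.timeit(
            lambda f=func: [f(p) for p in paths], number=number
        )
    return results
```

For example, time_approaches({'len_os_dir_approach': len_os_dir_approach}, subdir_paths) would time a single approach on the current set of subdirectories.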

Visualization of the results:


plt.plot(num_directories_list, approach1_times, label='iterdir_approach')
plt.plot(num_directories_list, approach2_times, label='check_files_os_scandir_approach')
plt.plot(num_directories_list, approach3_times, label='path_resolve_approach')
plt.plot(num_directories_list, approach4_times, label='len_os_dir_approach')
plt.plot(num_directories_list, approach5_times, label='dilettant_approach')
plt.plot(num_directories_list, approach6_times, label='adrian_ang_approach')

plt.xlabel('Number of Directories')
plt.ylabel('Execution Time (seconds)')
plt.title('Performance Comparison')
plt.legend()
plt.show()

[Figure: Performance Comparison, execution time (seconds) against number of directories for the six approaches]

Closeup of best 3 solutions:
[Figure: close-up of the three fastest approaches]

Answer 3

Score: 4

For anyone wanting to try the C approach, here's a module you can import from Python (it only counts entries directly inside the directory and does not recurse into subdirectories):

#define PY_SSIZE_T_CLEAN
#include <Python.h>   /* must come before the standard headers */
#include <stdio.h>
#include <dirent.h>
#include <stdlib.h>

static PyObject *
method_dircnt(PyObject *self, PyObject *args)
{
    DIR *dir;
    const char *dirname;
    long min_count, count = 0;
    struct dirent *ent;

    if (!PyArg_ParseTuple(args, "sl", &dirname, &min_count))
    {
        return NULL;
    }

    dir = opendir(dirname);
    if (dir == NULL)
    {
        /* Raise OSError instead of passing NULL to readdir(). */
        return PyErr_SetFromErrnoWithFilename(PyExc_OSError, dirname);
    }

    /* Count entries whose name does not start with '.', stopping early. */
    while ((ent = readdir(dir)))
        if (ent->d_name[0] != '.') {
            ++count;
            if (count >= min_count) {
                closedir(dir);
                Py_RETURN_FALSE;
            }
        }

    closedir(dir);

    Py_RETURN_TRUE;
}

static char dircnt_docs[] = "dircnt(dir, min_count): Returns False if dir contains at least min_count non-hidden entries, True otherwise.\n";

static PyMethodDef dircnt_methods[] = {
    {"dircnt", (PyCFunction)method_dircnt, METH_VARARGS, dircnt_docs},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef dircnt_module_def =
{
    PyModuleDef_HEAD_INIT,
    "dircnt",
    "Check if there are more than N files in dir",
    -1,
    dircnt_methods
};

PyMODINIT_FUNC PyInit_dircnt(void){
    // Py_Initialize();

    return PyModule_Create(&dircnt_module_def);
}

Build (adjust the include path to wherever the headers from your python-dev package are):

gcc -I /usr/include/python3.11 dircnt.c -v -shared -fPIC -o dircnt.so

Usage:

from dircnt import dircnt
dircnt(path, min_count)
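The return value is easy to misread: the C loop returns False once it has seen min_count entries whose names do not start with a dot, and True otherwise. As a reading aid, the pure-Python sketch below mirrors that behaviour; dircnt_py is a hypothetical name and not part of the compiled module.

```python
import os


def dircnt_py(path, min_count):
    """Pure-Python mirror of the C loop (illustrative only).

    Returns False as soon as `min_count` non-hidden entries are seen,
    True if the directory runs out of entries first.
    """
    count = 0
    with os.scandir(path) as entries:
        for entry in entries:
            # Like the C code, skip names starting with '.' and do not
            # distinguish files from subdirectories.
            if not entry.name.startswith("."):
                count += 1
                if count >= min_count:
                    return False
    return True
```

Under this reading, "more than 2 entries" corresponds to not dircnt_py(path, 3).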

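It is a fair bit faster, especially for higher min_count values:

min_count = 2
[Figure: benchmark plot for min_count = 2][1]

min_count = 200
[Figure: benchmark plot for min_count = 200][2]

[1]: https://i.stack.imgur.com/ggI88.png
[2]: https://i.stack.imgur.com/KWvLL.png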

Answer 4

Score: 3

If you want something more explicit using pathlib, you can try:

from pathlib import Path

directory_path = Path('/path/').resolve()
nb_files = 0
enough_files = False
for file_path in directory_path.glob("*"):
    if file_path.is_file():
        nb_files += 1
    if nb_files > 2:  # "more than 2 files"
        enough_files = True
        break
print(enough_files)

Answer 5

Score: 3

As the OP knows that there are only files within /path/, one optimization is to skip the file-type test.

This version should profit from that prior knowledge / constraint:

import itertools
import os

gen = os.scandir('/path/')  # OP states only files in folder /path/
enough = 3  # more than 2 files means at least 3

# Build an iterator that only returns the first enough elements,
# measure the length of the resulting list (at most enough elements),
# and apply the criterion to get the boolean result
has_enough = len(list(itertools.islice(gen, enough))) >= enough

print(has_enough)

Placing this in a shell script and using hyperfine to get a rough performance measurement (folder with 500+ files):

❯ hyperfine ./ssssss.sh
Benchmark #1: ./ssssss.sh
  Time (mean ± σ):      77.6 ms ±   0.6 ms    [User: 29.9 ms, System: 31.8 ms]
  Range (min … max):    76.3 ms …  79.4 ms    36 runs

... and, as the directory size should not really matter, the same system on a folder with more than 100k files:

❯ ls -l |wc -l
  100204

~
❯ hyperfine ./ssssss.sh
Benchmark #1: ./ssssss.sh
  Time (mean ± σ):      79.6 ms ±   1.1 ms    [User: 31.9 ms, System: 33.5 ms]
  Range (min … max):    76.8 ms …  82.1 ms    35 runs
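Both hyperfine runs time a whole process, so interpreter start-up is included in the roughly 78 ms and may well dominate it. To isolate the directory check itself, it can be timed in-process with timeit. The following is a sketch on a throwaway directory; the helper names and the file count are assumptions chosen for illustration, and no timings are claimed here.

```python
import itertools
import os
import tempfile
import timeit


def has_enough(path, enough=3):
    """True if `path` has at least `enough` entries (no file-type test)."""
    with os.scandir(path) as gen:
        return len(list(itertools.islice(gen, enough))) >= enough


def time_check(num_files=1000, number=100):
    """Create `num_files` empty files and return seconds per call."""
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(num_files):
            open(os.path.join(tmp, f"file_{i}.txt"), "w").close()
        total = timeit.timeit(lambda: has_enough(tmp), number=number)
    return total / number


if __name__ == "__main__":
    print(f"{time_check():.6f} s per call")
```

Running time_check with different num_files values gives a direct view of whether the check depends on directory size.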

Answer 6

Score: 2

You can use the os.scandir function instead. For example, to check if a folder contains more than 2 files, the function below stops iterating over the directory entries as soon as it has seen a third file, and returns True only when the directory holds more than 2 files:

import os

def has_more_than_two_files(path):
    count = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file():
                count += 1
                if count > 2:
                    return True
    return False
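One possible refinement, not part of the original answer: DirEntry.is_file() follows symbolic links by default, which can require an extra system call for each symlink. If symlinks should not count as files, passing follow_symlinks=False avoids that. The function name below is hypothetical.

```python
import os


def has_more_than_two_regular_files(path):
    """Like the function above, but symlinks are never counted as files."""
    count = 0
    with os.scandir(path) as entries:
        for entry in entries:
            # follow_symlinks=False answers from the directory entry itself
            # where the platform provides the file type.
            if entry.is_file(follow_symlinks=False):
                count += 1
                if count > 2:
                    return True
    return False
```

Whether this matters depends on the directory contents; for a folder known to hold only regular files the two versions behave the same.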

huangapple
  • Posted on July 4, 2023, 22:33:11
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76613672.html