2023-07-04 22:33:11

Python: fastest way of checking if there are more than x files in a folder

Question

I am looking for a very rapid way to check whether a folder contains more than 2 files.

I worry that len(os.listdir('/path/')) > 2 may become very slow if there are a lot of files in /path/, especially since this function will be called frequently by multiple processes at a time.

Answer 1

Score: 10

PEP 471 introduced another function for this: os.scandir(path)

It returns a lazy iterator instead of building a list, so even the worst case (a huge directory) stays lightweight as long as you stop iterating early.

os.walk(path) is built on top of os.scandir, but it reads every entry of a directory before yielding it, so it does not help with an early exit.

Here is a code example for your specific case:

import os

MINIMUM_SIZE = 3  # "more than 2 files" means at least 3

file_count = 0
with os.scandir('.') as entries:
    for entry in entries:
        if entry.is_file():
            file_count += 1
            if file_count == MINIMUM_SIZE:
                break  # stop scanning as soon as enough files are found

enough_files = (file_count == MINIMUM_SIZE)
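
Since the title asks about "more than x files" in general, the same idea can be wrapped in a function. This is only a minimal sketch: the name has_more_than and its parameters are illustrative and do not come from the answer.

```python
import os
import tempfile
from itertools import islice


def has_more_than(path, x):
    """Return True if `path` contains more than `x` regular files."""
    with os.scandir(path) as entries:
        files = (entry for entry in entries if entry.is_file())
        # Pull at most x + 1 files; the scan stops as soon as they are found.
        return sum(1 for _ in islice(files, x + 1)) > x


# Small demonstration on a throwaway directory holding three files.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        open(os.path.join(tmp, f"file_{i}.txt"), "w").close()
    print(has_more_than(tmp, 2))
    print(has_more_than(tmp, 3))
```

Because islice stops after x + 1 matches, the work done depends on x and on how many non-file entries are encountered first, not on the total size of the directory.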

Answer 2

Score: 6

The fastest solution is probably something hacky.

My guess was:


def iterdir_approach(path):
    iter_of_files = (x for x in Path(path).iterdir() if x.is_file())
    try:
        # The third next() only succeeds if there are more than 2 files
        next(iter_of_files)
        next(iter_of_files)
        next(iter_of_files)
        return True
    except StopIteration:
        return False

We create a generator and try to pull three files from it, catching the StopIteration that is raised if it runs out early.

To profile the approaches, we create a set of directories, each holding a random number of files:

import itertools
import os
import random
import shutil
import tempfile
import timeit
import matplotlib.pyplot as plt
from pathlib import Path


def create_temp_directory(num_directories):
    # Create num_directories subdirectories; subdir_i gets between 0 and i files
    temp_dir = tempfile.mkdtemp()
    for i in range(num_directories):
        dir_path = os.path.join(temp_dir, f"subdir_{i}")
        os.makedirs(dir_path)
        for j in range(random.randint(0, i)):
            file_path = os.path.join(dir_path, f"file_{j}.txt")
            with open(file_path, 'w') as file:
                file.write("Sample content")
    return temp_dir

We define the various approaches (the others are copied from the other answers to this question):


def iterdir_approach(path):
    #@swozny
    iter_of_files = (x for x in Path(path).iterdir() if x.is_file())
    try:
        next(iter_of_files)
        next(iter_of_files)
        next(iter_of_files)
        return True
    except StopIteration:
        return False


def len_os_dir_approach(path):
    #@bluppfisk
    return len(os.listdir(path)) > 2


def check_files_os_scandir_approach(path):
    #@PoneyUHC
    MINIMUM_SIZE = 3
    file_count = 0
    for entry in os.scandir(path):
        if entry.is_file():
            file_count += 1
        if file_count == MINIMUM_SIZE:
            return True
    return False


def path_resolve_approach(path):
    #@matleg
    directory_path = Path(path).resolve()
    nb_files = 0
    for file_path in directory_path.glob("*"):
        if file_path.is_file():
            nb_files += 1
        if nb_files > 2:
            return True
    return False


def dilettant_approach(path):
    #@dilettant
    gen = os.scandir(path)  # OP states only files in folder /path/
    enough = 3  # More than 2 entries

    has_enough = len(list(itertools.islice(gen, enough))) >= enough

    return has_enough


def adrian_ang_approach(path):
    #@adrian_ang
    count = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file():
                count += 1
                if count > 2:
                    return True
    return False

Then we profile the code using timeit.timeit and plot the execution times for various numbers of directories:


num_directories_list = [10, 50, 100, 200, 500, 1000]
approach1_times = []
approach2_times = []
approach3_times = []
approach4_times = []
approach5_times = []
approach6_times = []


for num_directories in num_directories_list:
    temp_dir = create_temp_directory(num_directories)
    # Reuse the directory created above instead of creating a second one
    subdir_paths = [str(p) for p in Path(temp_dir).iterdir()]
    approach1_time = timeit.timeit(lambda: [iterdir_approach(path) for path in subdir_paths], number=5)
    approach2_time = timeit.timeit(lambda: [check_files_os_scandir_approach(path) for path in subdir_paths], number=5)
    approach3_time = timeit.timeit(lambda: [path_resolve_approach(path) for path in subdir_paths], number=5)
    approach4_time = timeit.timeit(lambda: [len_os_dir_approach(path) for path in subdir_paths], number=5)
    approach5_time = timeit.timeit(lambda: [dilettant_approach(path) for path in subdir_paths], number=5)
    approach6_time = timeit.timeit(lambda: [adrian_ang_approach(path) for path in subdir_paths], number=5)

    approach1_times.append(approach1_time)
    approach2_times.append(approach2_time)
    approach3_times.append(approach3_time)
    approach4_times.append(approach4_time)
    approach5_times.append(approach5_time)
    approach6_times.append(approach6_time)

    shutil.rmtree(temp_dir)

Visualization of the results:


plt.plot(num_directories_list, approach1_times, label='iterdir_approach')
plt.plot(num_directories_list, approach2_times, label='check_files_os_scandir_approach')
plt.plot(num_directories_list, approach3_times, label='path_resolve_approach')
plt.plot(num_directories_list, approach4_times, label='len_os_dir_approach')
plt.plot(num_directories_list, approach5_times, label='dilettant_approach')
plt.plot(num_directories_list, approach6_times, label='adrian_ang_approach')


plt.xlabel('Number of Directories')
plt.ylabel('Execution Time (seconds)')
plt.title('Performance Comparison')
plt.legend()
plt.show()
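
The benchmark above spreads the files over many small directories. The question's worry is a single very large directory, so a complementary measurement could look roughly like the sketch below. The function names listdir_check, scandir_check and time_on_single_directory are made up for this illustration, and the timings will depend on the file system.

```python
import os
import tempfile
import timeit


def listdir_check(path):
    # Baseline from the question: builds the full list of names first.
    return len(os.listdir(path)) > 2


def scandir_check(path):
    # Early-exit variant: stops after the third regular file.
    count = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file():
                count += 1
                if count > 2:
                    return True
    return False


def time_on_single_directory(num_files, number=20):
    """Time both checks on one directory holding `num_files` empty files."""
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(num_files):
            open(os.path.join(tmp, f"file_{i}.txt"), "w").close()
        return {
            func.__name__: timeit.timeit(lambda: func(tmp), number=number)
            for func in (listdir_check, scandir_check)
        }


if __name__ == "__main__":
    for n in (10, 1000, 10000):
        print(n, time_on_single_directory(n))
```

Running it with increasing num_files shows how each check scales with the size of one directory, which is the case the question is concerned about.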

Closeup of the best 3 solutions: (plot not preserved)

Answer 3

Score: 4

[1]: https://i.stack.imgur.com/ggI88.png
[2]: https://i.stack.imgur.com/KWvLL.png

For anyone wanting to try the C approach, here is a module you can import from Python. It counts the non-hidden entries directly inside a directory, without checking their type, and does not descend into subdirectories.

#define PY_SSIZE_T_CLEAN
#include <Python.h>  /* included first, as the Python docs recommend */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

static PyObject *
method_dircnt(PyObject *self, PyObject *args)
{
    DIR *dir;
    const char *dirname;
    long min_count, count = 0;
    struct dirent *ent;

    if (!PyArg_ParseTuple(args, "sl", &dirname, &min_count))
    {
        return NULL;
    }

    dir = opendir(dirname);
    if (dir == NULL)
    {
        /* Raise OSError instead of passing NULL to readdir() */
        return PyErr_SetFromErrnoWithFilename(PyExc_OSError, dirname);
    }

    /* Count entries whose name does not start with '.', stopping early */
    while ((ent = readdir(dir)))
        if (ent->d_name[0] != '.') {
            ++count;
            if (count >= min_count) {
                closedir(dir);
                Py_RETURN_FALSE;
            }
        }

    closedir(dir);

    Py_RETURN_TRUE;
}

static char dircnt_docs[] = "dircnt(dir, min_count): Returns False if dir contains at least min_count non-hidden entries, True otherwise.\n";

static PyMethodDef dircnt_methods[] = {
    {"dircnt", (PyCFunction)method_dircnt, METH_VARARGS, dircnt_docs},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef dircnt_module_def =
{
    PyModuleDef_HEAD_INIT,
    "dircnt",
    "Check if there are more than N files in dir",
    -1,
    dircnt_methods
};

PyMODINIT_FUNC PyInit_dircnt(void){
    return PyModule_Create(&dircnt_module_def);
}

build (adjust the include path to wherever the headers from your python-dev package are):

gcc -I /usr/include/python3.11 dircnt.c -v -shared -fPIC -o dircnt.so

usage:

from dircnt import dircnt
dircnt(path, min_count)

It is a fair bit faster, especially for higher min_count values:

Plot for min_count = 2 [1]

Plot for min_count = 200 [2]
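
Because dircnt returns False once the threshold is reached, a thin Python wrapper can make call sites easier to read. The sketch below is an assumption-laden illustration: has_at_least is a hypothetical name, and the pure-Python fallback is meant to mirror the C logic (skip names starting with a dot, stop at min_count) for machines where the extension has not been built.

```python
import os

try:
    from dircnt import dircnt  # the compiled extension from this answer
except ImportError:
    dircnt = None  # extension not built; use the fallback below


def has_at_least(path, min_count):
    """Return True if `path` holds at least `min_count` non-hidden entries."""
    if dircnt is not None:
        # dircnt() returns False once it has counted min_count entries.
        return not dircnt(path, min_count)
    count = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if not entry.name.startswith("."):
                count += 1
                if count >= min_count:
                    return True
    return False
```

For the question's "more than 2 files" check, the call would be has_at_least(path, 3).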

Answer 4

Score: 3

If you want something more explicit using pathlib, you can try:

from pathlib import Path

directory_path = Path('/path/').resolve()
nb_files = 0
enough_files = False
for file_path in directory_path.glob("*"):
    if file_path.is_file():
        nb_files += 1
        if nb_files > 2:  # more than 2 files
            enough_files = True
            break
print(enough_files)

Answer 5

Score: 3

Since the OP knows that /path/ contains only files, one optimization is to skip the per-entry file test.

This version takes advantage of that prior knowledge:

import itertools
import os

enough = 3  # "more than 2 files" means at least 3 entries

# Take at most `enough` entries from the lazy scandir iterator,
# measure the length of the resulting list (at most `enough` elements)
# and apply the criterion to get the boolean result
with os.scandir('/path/') as gen:  # OP states only files in folder /path/
    has_enough = len(list(itertools.islice(gen, enough))) >= enough

print(has_enough)

Placing this in a shell script and using hyperfine to get a rough performance measurement (folder with 500+ files):

❯ hyperfine ./ssssss.sh
Benchmark #1: ./ssssss.sh
  Time (mean ± σ):      77.6 ms ±   0.6 ms    [User: 29.9 ms, System: 31.8 ms]
  Range (min … max):    76.3 ms …  79.4 ms    36 runs

As expected, the directory size barely matters. On the same system, for a folder with more than 100k files:

❯ ls -l |wc -l
  100204

~
❯ hyperfine ./ssssss.sh
Benchmark #1: ./ssssss.sh
  Time (mean ± σ):      79.6 ms ±   1.1 ms    [User: 31.9 ms, System: 33.5 ms]
  Range (min … max):    76.8 ms …  82.1 ms    35 runs

Answer 6

Score: 2

You can use the os.scandir function instead. For example, to check whether a folder contains more than 2 files, the function below iterates over the directory entries lazily and returns True as soon as it has seen a third file:

import os

def has_more_than_two_files(path):
    count = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file():
                count += 1
                if count > 2:
                    return True
    return False
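
One detail that may matter in practice: entry.is_file() follows symbolic links by default, so a link pointing at a file is counted as a file. If links should not count, os.DirEntry.is_file accepts follow_symlinks=False. The variant below is only a sketch with a hypothetical function name; whether links should count depends on your use case.

```python
import os


def has_more_than_two_regular_files(path):
    """Like the function above, but symbolic links are not counted as files."""
    count = 0
    with os.scandir(path) as entries:
        for entry in entries:
            # follow_symlinks=False: a symlink pointing at a file yields False
            if entry.is_file(follow_symlinks=False):
                count += 1
                if count > 2:
                    return True
    return False
```

According to the os.scandir documentation, the file-type check can usually be answered from data the directory scan already returned, and not following links avoids the extra lookup that resolving a symlink would need.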
