Why does os.walk() (Python) ignore a OneDrive directory depending on the number of files in it?

Question

I am running os.walk() over a certain part of my OneDrive-synced folder structure. It all worked fine until recently. Now ALL files from one specific directory are ignored.
I tested several possible causes and narrowed it down to this: the directory that is ignored is the one that holds the most files (897 at this point).

If I remove two of the files from that directory (it does not matter which two), it works and all files are recognized. When I add the files again, the problem returns: no files from that directory turn up in my os.walk() result list.

I did check Microsoft's "Restrictions and limitations in OneDrive and SharePoint", but I am far from any of the file size and file count limits mentioned there.

My code looks like this:

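import os

# Collect into a separate list so the loop variable `files` does not shadow it
matched_files = []
for root, dirs, files in os.walk(mainDirectory):
    if 'Common part' in root:
        for f in files:
            matched_files.append(os.path.join(root, f))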

'Common part' is a text string that all relevant folders in mainDirectory have in common.

The directory itself is recognized every time; only its files are not added to my list.
So I tried another approach using glob.glob(). Here, the results are a bit different but still not satisfactory:

import glob
import os

folders = []
for root, dirs, files in os.walk(mainDirectory):
    for d in dirs:
        if d.startswith('Common part'):
            folders.append(os.path.join(root, d))

# One list of matching .xlsx paths per folder
files = [glob.glob(os.path.join(f, '*.xlsx')) for f in folders]

This does give me approximately half the files from the problematic folder. Again, when I remove two files, it gives me the full list.

When I copy/move the files to a local (not OneDrive synced) path, it works. So I guess it does have to do with OneDrive.
Having the files outside of OneDrive is not an option.

The directory in question is not directly in my OneDrive but is a "Sync"/"Shortcut" folder from SharePoint.

All files can be opened; they are downloaded, not on-demand. I have removed the sync and re-synced the folder. I have restarted OneDrive (and my machine) several times.

I am really at a loss here. Any hints welcome!

Update: Thanks to the help of @GordonAitchJay, we could establish that beyond a threshold number of files (or total file size?), functions like os.listdir() and win32file.FindFilesW() stop returning their usual output and instead raise OSError: [WinError 87] The parameter is incorrect.

In the meantime, we also reproduced the same behaviour on another machine within the same organization, after a full reset of my OneDrive had not brought any improvement.

Answer 1

Score: 1

Though I can't prove it, it seems that OneDrive is up to some sort of tomfoolery that causes Win32's FindNextFileW to fail with an ERROR_INVALID_PARAMETER error, but apparently only when it is called by Python's os.walk, os.listdir, or win32file.FindFilesW, and only when some files have been deleted from the OneDrive directory syncing a SharePoint folder. Utterly bizarre. My guess is that OneDrive hooks FindNextFileW, and that the hook stays in place even after the OneDrive process and services are ended with Task Manager.
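One thing that may explain why the directory itself shows up while its files silently go missing: os.walk() ignores errors raised while listing a directory unless an onerror callback is passed. A minimal sketch for making such a failure visible follows; walk_reporting_errors is a name made up for this illustration, not something from the question.

```python
import os

def walk_reporting_errors(top):
    """Collect file paths under `top` and record directories that failed to list."""
    found, failures = [], []
    # Without onerror, os.walk() swallows the OSError and simply skips the directory.
    for root, dirs, files in os.walk(top, onerror=failures.append):
        found.extend(os.path.join(root, name) for name in files)
    return found, failures
```

Called as found, failures = walk_reporting_errors(mainDirectory), each item in failures is the OSError instance; its filename attribute names the directory that could not be listed, and on Windows its winerror attribute carries the code (87 in the update above).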

A workaround is to use ctypes to call the lower level NtQueryDirectoryFile function (which is ultimately what FindNextFileW calls anyway).
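If you would rather keep os.listdir for the directories where it still works, the workaround can be wrapped as a fallback. This is only a sketch under assumptions: listdir_with_fallback and its fallback parameter are hypothetical names, and fallback stands for any callable that lists a directory through the lower-level route.

```python
import os

ERROR_INVALID_PARAMETER = 87  # the WinError reported in the question's update

def listdir_with_fallback(path, fallback, primary=os.listdir):
    """List `path` with `primary`; switch to `fallback` only on WinError 87."""
    try:
        return list(primary(path))
    except OSError as exc:
        # winerror exists only on Windows, hence getattr with a default.
        if getattr(exc, "winerror", None) != ERROR_INVALID_PARAMETER:
            raise
        return list(fallback(path))
```

Any other OSError (permissions, missing path) is re-raised unchanged, so the fallback does not hide unrelated problems.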

Eryk Sun's answer to another question has a working example. I have copied it below and changed only the last couple of lines:

import os
import msvcrt
import ctypes

from ctypes import wintypes

ntdll = ctypes.WinDLL('ntdll')
kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

def NtError(status):
    err = ntdll.RtlNtStatusToDosError(status)
    return ctypes.WinError(err)

NTSTATUS = wintypes.LONG
STATUS_BUFFER_OVERFLOW = NTSTATUS(0x80000005).value
STATUS_NO_MORE_FILES = NTSTATUS(0x80000006).value
STATUS_INFO_LENGTH_MISMATCH = NTSTATUS(0xC0000004).value

ERROR_DIRECTORY = 0x010B
INVALID_HANDLE_VALUE = wintypes.HANDLE(-1).value
GENERIC_READ = 0x80000000
FILE_SHARE_READ = 1
OPEN_EXISTING = 3
FILE_FLAG_BACKUP_SEMANTICS = 0x02000000
FILE_ATTRIBUTE_DIRECTORY = 0x0010

FILE_INFORMATION_CLASS = wintypes.ULONG
FileDirectoryInformation = 1
FileBasicInformation = 4

LPSECURITY_ATTRIBUTES = wintypes.LPVOID
PIO_APC_ROUTINE = wintypes.LPVOID
ULONG_PTR = wintypes.WPARAM

class UNICODE_STRING(ctypes.Structure):
    _fields_ = (('Length',        wintypes.USHORT),
                ('MaximumLength', wintypes.USHORT),
                ('Buffer',        wintypes.LPWSTR))

PUNICODE_STRING = ctypes.POINTER(UNICODE_STRING)

class IO_STATUS_BLOCK(ctypes.Structure):
    class _STATUS(ctypes.Union):
        _fields_ = (('Status',  NTSTATUS),
                    ('Pointer', wintypes.LPVOID))
    _anonymous_ = '_Status',
    _fields_ = (('_Status',     _STATUS),
                ('Information', ULONG_PTR))

PIO_STATUS_BLOCK = ctypes.POINTER(IO_STATUS_BLOCK)

ntdll.NtQueryInformationFile.restype = NTSTATUS
ntdll.NtQueryInformationFile.argtypes = (
    wintypes.HANDLE,        # In  FileHandle
    PIO_STATUS_BLOCK,       # Out IoStatusBlock
    wintypes.LPVOID,        # Out FileInformation
    wintypes.ULONG,         # In  Length
    FILE_INFORMATION_CLASS) # In  FileInformationClass

ntdll.NtQueryDirectoryFile.restype = NTSTATUS
ntdll.NtQueryDirectoryFile.argtypes = (
    wintypes.HANDLE,        # In     FileHandle
    wintypes.HANDLE,        # In_opt Event
    PIO_APC_ROUTINE,        # In_opt ApcRoutine
    wintypes.LPVOID,        # In_opt ApcContext
    PIO_STATUS_BLOCK,       # Out    IoStatusBlock
    wintypes.LPVOID,        # Out    FileInformation
    wintypes.ULONG,         # In     Length
    FILE_INFORMATION_CLASS, # In     FileInformationClass
    wintypes.BOOLEAN,       # In     ReturnSingleEntry
    PUNICODE_STRING,        # In_opt FileName
    wintypes.BOOLEAN)       # In     RestartScan

kernel32.CreateFileW.restype = wintypes.HANDLE
kernel32.CreateFileW.argtypes = (
    wintypes.LPCWSTR,      # In     lpFileName
    wintypes.DWORD,        # In     dwDesiredAccess
    wintypes.DWORD,        # In     dwShareMode
    LPSECURITY_ATTRIBUTES, # In_opt lpSecurityAttributes
    wintypes.DWORD,        # In     dwCreationDisposition
    wintypes.DWORD,        # In     dwFlagsAndAttributes
    wintypes.HANDLE)       # In_opt hTemplateFile

class FILE_BASIC_INFORMATION(ctypes.Structure):
    _fields_ = (('CreationTime',   wintypes.LARGE_INTEGER),
                ('LastAccessTime', wintypes.LARGE_INTEGER),
                ('LastWriteTime',  wintypes.LARGE_INTEGER),
                ('ChangeTime',     wintypes.LARGE_INTEGER),
                ('FileAttributes', wintypes.ULONG))

class FILE_DIRECTORY_INFORMATION(ctypes.Structure):
    _fields_ = (('_Next',          wintypes.ULONG),
                ('FileIndex',      wintypes.ULONG),
                ('CreationTime',   wintypes.LARGE_INTEGER),
                ('LastAccessTime', wintypes.LARGE_INTEGER),
                ('LastWriteTime',  wintypes.LARGE_INTEGER),
                ('ChangeTime',     wintypes.LARGE_INTEGER),
                ('EndOfFile',      wintypes.LARGE_INTEGER),
                ('AllocationSize', wintypes.LARGE_INTEGER),
                ('FileAttributes', wintypes.ULONG),
                ('FileNameLength', wintypes.ULONG),
                ('_FileName',      wintypes.WCHAR * 1))

    @property
    def FileName(self):
        addr = ctypes.addressof(self) + type(self)._FileName.offset
        size = self.FileNameLength // ctypes.sizeof(wintypes.WCHAR)
        return (wintypes.WCHAR * size).from_address(addr).value

class DirEntry(FILE_DIRECTORY_INFORMATION):
    def __repr__(self):
        return '<{} {!r}>'.format(self.__class__.__name__, self.FileName)

    @classmethod
    def listbuf(cls, buf):
        result = []
        base_size = ctypes.sizeof(cls) - ctypes.sizeof(wintypes.WCHAR)
        offset = 0
        while True:
            fdi = cls.from_buffer(buf, offset)
            if fdi.FileNameLength and fdi.FileName not in ('.', '..'):
                cfdi = cls()
                size = base_size + fdi.FileNameLength
                ctypes.resize(cfdi, size)
                ctypes.memmove(ctypes.byref(cfdi), ctypes.byref(fdi), size)
                result.append(cfdi)
            if fdi._Next:
                offset += fdi._Next
            else:
                break
        return result

def isdir(path):
    if not isinstance(path, int):
        return os.path.isdir(path)
    try:
        hFile = msvcrt.get_osfhandle(path)
    except IOError:
        return False
    iosb = IO_STATUS_BLOCK()
    info = FILE_BASIC_INFORMATION()
    status = ntdll.NtQueryInformationFile(hFile, ctypes.byref(iosb),
                ctypes.byref(info), ctypes.sizeof(info),
                FileBasicInformation)
    return bool(status >= 0 and info.FileAttributes & FILE_ATTRIBUTE_DIRECTORY)

def ntlistdir(path=None):
    result = []

    if path is None:
        path = os.getcwd()

    if isinstance(path, int):
        close = False
        fd = path
        hFile = msvcrt.get_osfhandle(fd)
    else:
        close = True
        hFile = kernel32.CreateFileW(path, GENERIC_READ, FILE_SHARE_READ,
                    None, OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, None)
        if hFile == INVALID_HANDLE_VALUE:
            raise ctypes.WinError(ctypes.get_last_error())
        fd = msvcrt.open_osfhandle(hFile, os.O_RDONLY)

    try:
        if not isdir(fd):
            raise ctypes.WinError(ERROR_DIRECTORY)
        iosb = IO_STATUS_BLOCK()
        info = (ctypes.c_char * 4096)()
        while True:
            status = ntdll.NtQueryDirectoryFile(hFile, None, None, None,
                        ctypes.byref(iosb), ctypes.byref(info),
                        ctypes.sizeof(info), FileDirectoryInformation,
                        False, None, False)
            if (status == STATUS_BUFFER_OVERFLOW or
                iosb.Information == 0 and status >= 0):
                info = (ctypes.c_char * (ctypes.sizeof(info) * 2))()
            elif status == STATUS_NO_MORE_FILES:
                break
            elif status >= 0:
                sublist = DirEntry.listbuf(info)
                result.extend(sublist)
            else:
                raise NtError(status)
    finally:
        if close:
            os.close(fd)

    return result

for entry in ntlistdir(r"C:\Users\UserName\OneDriveFolder\BigFolder"):
    print(entry.FileName)
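Since the question walks a whole tree rather than a single folder, the listing above can be combined with a small os.walk-style generator. The following is a rough sketch, not part of Eryk Sun's code: walk_with, scandir_lister and make_nt_lister are names invented here, and a "lister" is assumed to be any callable that returns (name, is_dir) pairs for a path.

```python
import os

def scandir_lister(path):
    """Default lister: (name, is_dir) pairs via os.scandir()."""
    with os.scandir(path) as it:
        return [(e.name, e.is_dir()) for e in it]

def make_nt_lister(ntlistdir, dir_flag=0x0010):
    """Adapt an ntlistdir()-style function; 0x0010 is FILE_ATTRIBUTE_DIRECTORY."""
    def lister(path):
        return [(e.FileName, bool(e.FileAttributes & dir_flag))
                for e in ntlistdir(path)]
    return lister

def walk_with(top, lister=scandir_lister):
    """Yield (root, dirnames, filenames) top-down, reading directories with `lister`."""
    stack = [top]
    while stack:
        root = stack.pop()
        dirnames, filenames = [], []
        for name, is_dir in lister(root):
            (dirnames if is_dir else filenames).append(name)
        yield root, dirnames, filenames
        # Push in reverse so subdirectories are visited in listing order.
        stack.extend(os.path.join(root, d) for d in reversed(dirnames))
```

With the code above pasted in first, for root, dirs, files in walk_with(mainDirectory, make_nt_lister(ntlistdir)) would then take the place of the os.walk() loop from the question.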
