How to sort os.walk(path) in alphanumeric order, with duplicates coming after the original file, using Python 3?


Question


In Python 3 (specifically 3.10.6), how can you change the way that os.walk(path) sorts the files it finds? Given this list of files:

IMG0001.jpg
IMG0002.jpg
IMG0002(1).jpg
IMG0002(2).jpg
IMG0003.jpg

How would you sort it in that order, with each (n) duplicate file coming after the original file? Currently, os.walk(path) is sorting this list like this:

IMG0001.jpg
IMG0002(1).jpg
IMG0002(2).jpg
IMG0002.jpg
IMG0003.jpg

I suppose the main issue is that the default sort method gives a different "sort value" to the ( (and also to -) than it does to the . in the extension. If that is what's happening here, how would you change which special characters come before others?

I've tried sorted(files), but that gives the same order os.walk(path) already produces. If I try sorted(files, reverse=True), the originals do come before the duplicates, but multiple duplicates are now sorted backward and all the originals are backward too, i.e.:

IMG0003.jpg
IMG0002.jpg
IMG0002(2).jpg
IMG0002(1).jpg
IMG0001.jpg
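
The guess about character ordering can be checked directly. A minimal sketch, using only built-ins: Python compares strings code point by code point, and both ( and - have lower code points than ., which is why IMG0002(1).jpg lands ahead of IMG0002.jpg in a plain sort. As far as I know, os.walk builds its lists from os.scandir, which yields entries in arbitrary order, so any ordering seen there comes from the filesystem rather than from a sort.

```python
# Code points decide the order: '(' and '-' both sort before '.'
for ch in "(-.":
    print(repr(ch), ord(ch))

names = ["IMG0002.jpg", "IMG0002(1).jpg", "IMG0002(2).jpg"]
plain = sorted(names)  # default lexicographic comparison
print(plain)
```

Python has no setting that reorders individual characters for string comparison, so the practical route is a key function that maps each name to something that sorts the desired way.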

Answer 1

Score: 3

String ordering is lexicographic, so you'll need a custom sort key if you want something different. It's a little trickier than expected, but something like this should work:

import os
import re

def key(fname):
    basename, ext = os.path.splitext(fname)
    v = 0
    if m := re.match(r"(.*)\((\d+)\)$", basename):
        basename, v = m.groups()
        v = int(v)
    return basename, ext, v

Now you should be able to use something like files.sort(key=key).
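
Since the question is about os.walk, here is a rough sketch of how a key like this might be plugged into a walk. The key function restates the idea above; walk_sorted is a hypothetical helper name, not something from the standard library. Sorting dirs in place relies on the documented behaviour that, with the default topdown=True, os.walk descends into whatever is left in that list, in that order.

```python
import os
import re


def key(fname):
    # Same idea as the key above: (base name, extension, duplicate number).
    basename, ext = os.path.splitext(fname)
    v = 0
    if m := re.match(r"(.*)\((\d+)\)$", basename):
        basename, v = m.group(1), int(m.group(2))
    return basename, ext, v


def walk_sorted(top):
    """Yield file paths under `top`, originals before their (n) duplicates."""
    for root, dirs, files in os.walk(top):
        dirs.sort()  # in-place edit controls the order of descent
        for name in sorted(files, key=key):
            yield os.path.join(root, name)
```

Usage would look something like `for path in walk_sorted("your_directory_path"): print(path)`, where the directory name is a placeholder.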


Answer 2

Score: 1


Use pathlib.Path for more awareness of filename semantics: build a tuple for each filename, with the special-case fields as leading elements and the filename itself at the end. Sort the list of tuples, then keep only the last element of each.

def test():
    from pathlib import Path

    def filenamesort(inp: list[str]):
        """Build a list of custom tuples from the filename list, sort it and
        return the rightmost field, which is the filename.
        """

        def tupleize(v):
            """Return a sort tuple based on Path.stem for the filename.

            Special case: split into the part before the last `(` and what comes after
            IMG0002(1).jpg => ('IMG0002', 1, 'IMG0002(1).jpg')

            Normal case: return the stem and a zero
            IMG0002.jpg  => ('IMG0002', 0, 'IMG0002.jpg')

            The last element, least significant to the sort, is the filename.
            To be more solid, foo(xxx).jpg should be treated as a normal name,
            as xxx is not numeric (not handled here).
            """
            pa = Path(v)
            stem = pa.stem
            if stem.endswith(")"):
                lead, seq = stem.rsplit("(", maxsplit=1)
                return (lead, int(seq.rstrip(")")), v)
            else:
                # 0 sorts before any duplicate number
                return (stem, 0, v)

        li = [tupleize(v) for v in inp]
        # sort the list, then return the last position in each tuple: the filename proper
        return [v[-1] for v in sorted(li)]

    def fmt(sin: str):
        # one filename per non-empty line, surrounding whitespace stripped
        res = [v for line in sin.splitlines() if (v := line.strip())]
        return res

    inp = fmt("""
    IMG0001.jpg
    IMG0002(1).jpg
    IMG0002(2).jpg
    IMG0002(11).jpg
    IMG0002.jpg
    IMG0003.jpg
    """)

    exp = fmt("""
    IMG0001.jpg
    IMG0002.jpg
    IMG0002(1).jpg
    IMG0002(2).jpg
    IMG0002(11).jpg
    IMG0003.jpg
    """)

    dataexp = [
        (inp, exp),
    ]

    for inp, exp in dataexp:
        for f in [filenamesort]:
            got = f(inp)
            msg = f"\n{f.__name__} for {str(inp):100.100} \nexp :{exp}:\ngot :{got}:\n"
            if exp == got:
                print(f"✅! {msg}")
            else:
                print(f"❌!  {msg}")

test()

Output:

✅! 
filenamesort for ['IMG0001.jpg', 'IMG0002(1).jpg', 'IMG0002(2).jpg', 'IMG0002(11).jpg', 'IMG0002.jpg', 'IMG0003.jpg'] 
exp :['IMG0001.jpg', 'IMG0002.jpg', 'IMG0002(1).jpg', 'IMG0002(2).jpg', 'IMG0002(11).jpg', 'IMG0003.jpg']:
got :['IMG0001.jpg', 'IMG0002.jpg', 'IMG0002(1).jpg', 'IMG0002(2).jpg', 'IMG0002(11).jpg', 'IMG0003.jpg']:

I've checked that tupleize (the function that builds the tuple) can also be used directly as the key parameter to a sort, i.e. this works as well: sorted(inp, key=tupleize)

Wim was right: this failed when files with the same stem had different extensions. Adding the suffix to the tuple in the body of tupleize fixes it:

pa = Path(v)
stem = pa.stem
if stem.endswith(")"):
    lead, seq = stem.rsplit("(", maxsplit=1)
    return (lead, pa.suffix, int(seq.rstrip(")")), v)
else:
    # 0 sorts before any duplicate number
    return (stem, pa.suffix, 0, v)
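
For what it's worth, the two loose ends mentioned above (mixed extensions, and names like foo(xxx).jpg where the parenthesised part is not a number) could be covered in one standalone key function. This is only a sketch under my own assumptions: dup_key and _DUP are hypothetical names, and treating a non-numeric (xxx) as part of an ordinary name is one possible choice, not the only one.

```python
import re
from pathlib import Path

_DUP = re.compile(r"(.*)\((\d+)\)")  # a stem that ends in "(digits)"


def dup_key(name):
    """Sort key: (stem without "(n)", suffix, n, original name)."""
    pa = Path(name)
    m = _DUP.fullmatch(pa.stem)
    if m:
        return (m.group(1), pa.suffix, int(m.group(2)), name)
    # No numeric "(n)" tail, e.g. "foo(xxx).jpg": treat it as an original.
    return (pa.stem, pa.suffix, 0, name)


names = [
    "IMG0002(1).png",
    "IMG0002.png",
    "IMG0002(1).jpg",
    "IMG0002.jpg",
    "foo(xxx).jpg",
]
print(sorted(names, key=dup_key))
```

Putting the suffix ahead of the duplicate number keeps each (n) copy next to the original with the same extension, and the regex check avoids the ValueError that int() would raise on a non-numeric tail.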

huangapple
  • Posted on March 1, 2023, 11:49:23
  • When reposting, please keep this link: https://go.coder-hub.com/75599415.html