How to sort os.walk(path) in alphanumeric order, with duplicates coming after the original file, using Python 3?
Question
In Python 3 (specifically 3.10.6), how can you change the order in which os.walk(path) returns the files it finds? Given this list of files:
IMG0001.jpg
IMG0002.jpg
IMG0002(1).jpg
IMG0002(2).jpg
IMG0003.jpg
How would you sort it in that order, with each (n) duplicate file coming after the original file? Currently, os.walk(path) gives me the list in this order:
IMG0001.jpg
IMG0002(1).jpg
IMG0002(2).jpg
IMG0002.jpg
IMG0003.jpg
I suppose the main issue is that the default sort puts "(" (and also "-") ahead of the "." in the extension. If that is what's happening here, how would you change which special characters come before others?
I've tried sorted(files), but that produces the same order as os.walk(path) already does. If I try sorted(files, reverse=True), the originals do come before the duplicates, but multiple duplicates are now sorted backward and all the originals are backward too, i.e.:
IMG0003.jpg
IMG0002.jpg
IMG0002(2).jpg
IMG0002(1).jpg
IMG0001.jpg
If you want to sort the files the way you describe, you can use a custom sort key. Here is an example:
import os
def custom_sort(file_name):
    # Assumes any "(" in the name starts a numeric "(n)" duplicate suffix;
    # a name such as "foo(bar).jpg" would make int() raise ValueError.
    if "(" in file_name:
        base_name, extension = file_name.rsplit("(", 1)
        # e.g. "IMG0002(1).jpg" -> ("IMG0002", 1, "1).jpg")
        return (base_name, int(extension.split(")")[0]), extension)
    else:
        base_name, extension = os.path.splitext(file_name)
        # Originals get duplicate number 0, so they sort before (1), (2), ...
        return (base_name, 0, extension)
path = "your_directory_path"
files = os.listdir(path)
sorted_files = sorted(files, key=custom_sort)
for file in sorted_files:
    print(file)
For the file names listed above, this produces the order you want.
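On why plain sorted(files) puts the duplicates first: Python compares strings character by character using code points, and "(" has a lower code point than ".". The snippet below is only a small illustration of that, using the file names from the question; the variable names are made up for the example.

```python
names = ["IMG0001.jpg", "IMG0002.jpg", "IMG0002(1).jpg",
         "IMG0002(2).jpg", "IMG0003.jpg"]

# Code points of the characters that follow the shared "IMG0002" prefix.
for ch in "(-.":
    print(repr(ch), ord(ch))  # '(' is 40, '-' is 45, '.' is 46

# Plain sorted() compares code points, so "IMG0002(" sorts before "IMG0002.".
default_order = sorted(names)
print(default_order)
```

Because the comparison is fixed to code points, there is no setting that reorders individual characters; the usual route is a key function that turns each name into something that sorts the way you want.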
Answer 1
Score: 3
String ordering is lexicographic, so you'll need a custom sort key if you want something different. It's a little trickier than expected, but something like this should work:
import os
import re
def key(fname):
    basename, ext = os.path.splitext(fname)
    v = 0
    if m := re.match(r"(.*)\((\d+)\)$", basename):
        basename, v = m.groups()
        v = int(v)
    return basename, ext, v
Now you should be able to use something like files.sort(key=key).
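To connect this back to os.walk: a minimal sketch, assuming you want every directory's file list in that order while walking. The helper name walk_sorted and the temporary demo directory are made up for illustration; the key function is the one from this answer. Sorting the lists in place works because os.walk (with the default topdown=True) lets the caller modify dirnames before it descends.

```python
import os
import re
import tempfile

def key(fname):
    """Sort key from the answer: (base name, extension, duplicate number)."""
    basename, ext = os.path.splitext(fname)
    v = 0
    if m := re.match(r"(.*)\((\d+)\)$", basename):
        basename, v = m.groups()
        v = int(v)
    return basename, ext, v

def walk_sorted(top):
    """Hypothetical helper: like os.walk, but with sorted names."""
    for dirpath, dirnames, filenames in os.walk(top):
        dirnames.sort()          # in place, so os.walk descends in this order
        filenames.sort(key=key)  # originals first, then (1), (2), ...
        yield dirpath, filenames

# Demo on a throwaway directory holding the file names from the question.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["IMG0003.jpg", "IMG0002(2).jpg", "IMG0002.jpg",
                 "IMG0002(1).jpg", "IMG0001.jpg"]:
        open(os.path.join(tmp, name), "w").close()
    result = [files for _, files in walk_sorted(tmp)]

print(result[0])
```

Putting the extension ahead of the duplicate number in the key keeps IMG0002.jpg and IMG0002(1).jpg together even when a directory also contains, say, IMG0002.png.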
Answer 2
Score: 1
Use pathlib.Path, which is more aware of filename semantics, to build a tuple for each name, with the special-case fields as leading elements and the filename itself at the end. Sort the list of tuples, then keep only the last element of each.
def test():
    from pathlib import Path
    def filenamesort(inp: list[str]):
        """Build a list of custom tuples from the filename list, sort it and
        return the rightmost field, which is the filename.
        """
        def tupleize(v):
            """Return a sort tuple based on Path.stem for the filename.
            Special case: split into the part before the last `(` and what comes after.
                IMG0002(1).jpg => ('IMG0002', 1, 'IMG0002(1).jpg')
            Normal case: return the stem and 0.
                IMG0002.jpg => ('IMG0002', 0, 'IMG0002.jpg')
            The last element, least significant for sorting, is the filename.
            To be more robust, foo(xxx).jpg should be ignored, as xxx is not numeric
            (as written, int() raises ValueError for it).
            """
            pa = Path(v)
            stem = pa.stem
            if stem.endswith(")"):
                lead, seq = stem.rsplit("(", maxsplit=1)
                return (lead, int(seq.rstrip(")")), v)
            else:
                # 0 sorts before any duplicate number such as 1
                return (stem, 0, v)
        li = [tupleize(v) for v in inp]
        # sort the list, then return the last position in each tuple: the filename proper
        return [v[-1] for v in sorted(li)]
    def fmt(sin: str):
        # one filename per non-empty line, surrounding whitespace stripped
        res = [v for line in sin.splitlines() if (v := line.strip())]
        return res
    inp = fmt("""IMG0001.jpg
    IMG0002(1).jpg
    IMG0002(2).jpg
    IMG0002(11).jpg
    IMG0002.jpg
    IMG0003.jpg
    """)
    exp = fmt("""IMG0001.jpg
    IMG0002.jpg
    IMG0002(1).jpg
    IMG0002(2).jpg
    IMG0002(11).jpg
    IMG0003.jpg""")
    dataexp = [
        (inp, exp),
    ]
    for inp, exp in dataexp:
        for f in [filenamesort]:
            got = f(inp)
            msg = f"\n{f.__name__} for {str(inp):100.100} \nexp :{exp}:\ngot :{got}:\n"
            if exp == got:
                print(f"✅! {msg}")
            else:
                print(f"❌! {msg}")
test()
output:
✅!
filenamesort for ['IMG0001.jpg', 'IMG0002(1).jpg', 'IMG0002(2).jpg', 'IMG0002(11).jpg', 'IMG0002.jpg', 'IMG0003.jpg']
exp :['IMG0001.jpg', 'IMG0002.jpg', 'IMG0002(1).jpg', 'IMG0002(2).jpg', 'IMG0002(11).jpg', 'IMG0003.jpg']:
got :['IMG0001.jpg', 'IMG0002.jpg', 'IMG0002(1).jpg', 'IMG0002(2).jpg', 'IMG0002(11).jpg', 'IMG0003.jpg']:
I've checked that tupleize (the function that builds the tuple) can also be used directly as the key parameter when sorting, i.e. sorted(inp, key=tupleize) works as well.
Wim was right, this was failing when the files have different extensions. Fixed with the adjustment below:
            pa = Path(v)
            stem = pa.stem
            if stem.endswith(")"):
                lead, seq = stem.rsplit("(", maxsplit=1)
                # the suffix now comes before the duplicate number
                return (lead, pa.suffix, int(seq.rstrip(")")), v)
            else:
                # 0 sorts before any duplicate number such as 1
                return (stem, pa.suffix, 0, v)
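For reference, here is a standalone sketch of that adjusted key, written so it can be passed straight to sorted. It also handles the foo(xxx).jpg case mentioned in the docstring by treating a non-numeric parenthesis as part of the stem. The names dup_key and _DUP are made up for this example, and swapping in a regular expression for rsplit is my own substitution rather than part of the original answer.

```python
import re
from pathlib import Path

# Hypothetical pattern: a stem that ends in "(digits)".
_DUP = re.compile(r"(.*)\((\d+)\)")

def dup_key(name):
    """Return (base stem, suffix, duplicate number, original name)."""
    pa = Path(name)
    m = _DUP.fullmatch(pa.stem)
    if m:
        return (m.group(1), pa.suffix, int(m.group(2)), name)
    # No numeric "(n)" suffix: treat the whole stem as the base name.
    return (pa.stem, pa.suffix, 0, name)

names = ["IMG0002(1).png", "foo(xxx).jpg", "IMG0002(11).jpg",
         "IMG0002.png", "IMG0002(2).jpg", "IMG0002.jpg"]
ordered = sorted(names, key=dup_key)
print(ordered)
```

With the suffix ahead of the duplicate number, each original is followed by its own duplicates, grouped by extension, and foo(xxx).jpg sorts as an ordinary name instead of raising ValueError.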